
Text mining on COVID-19 datasets - Terminology extraction


Formal Metadata

Title: Text mining on COVID-19 datasets - Terminology extraction
Number of Parts: 45
Author: Mathieu Roche
License: CC Attribution 3.0 Germany
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Language: English
Production Year: 2021
Production Place: Wageningen

Content Metadata

Abstract
Mathieu Roche, Senior Research Scientist and currently co-leader of the MISCA group (i.e. Spatial Information, Modelling, Data Mining, and Knowledge Extraction) at TETIS (CIRAD - France) presented the results of his latest analysis on how to use terminology and text-mining for event-based surveillance systems (i.e. disease-based and symptom-based surveillance). In this presentation Mathieu discussed the use of different datasets related to COVID-19, e.g. scientific publications, news data (PADI-web, MedISys), social media data (Twitter). The extracted terminology has been used (i) for surveillance systems (i.e. web crawling and information extraction tasks) and (ii) for spatio-temporal analysis of tweets dealing with COVID-19.
Transcript: English (auto-generated)
My talk is about text mining on COVID-19 datasets, although the topic is larger than my presentation. I would like to focus on terminology extraction, and on how we can take terminology extraction into account in the context of MOOD and of event-based surveillance systems, based on text mining of informal sources like the press, social media, and blogs. The main challenge I would like to discuss, and at the end it would be nice to have this discussion, is how to use terminology and text mining for event-based surveillance systems. I would like to summarize the different systems we developed at CIRAD and INRAE, in particular PADI-web. You know this system, but I think it could be interesting to summarize its pipeline. As input we have different textual data, news data, and the objective of PADI-web is to extract epidemiological events from these textual data: the host, the location, deaths, the number of cases, the symptoms. This software is dedicated to animal disease surveillance, and it is important to keep this aspect in mind, and to ask how we can have a generic approach in the One Health context.
In the context of PADI-web, and in order to summarize its pipeline, there are different steps. The first one is data collection, to collect news data from newspapers, for instance; currently we collect data from Google News. The second step is dedicated to data processing: cleaning the text, language detection, and translation into English, because currently PADI-web works on English data, so the algorithms work in English. The following step is dedicated to data classification using machine learning techniques, in order to identify automatically whether a document is relevant or irrelevant from an epidemiological point of view. And at the end, we extract information from these textual data, like the disease, the host, the location, and so on.
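To make the data processing step concrete, here is a minimal sketch of the language-detection part, assuming the langdetect Python package; it is only an illustration, not the PADI-web processing chain, and the example articles are invented.

```python
# Sketch of language detection in data processing: keep English articles,
# flag the others for translation. Illustration only, not PADI-web code.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make detection deterministic

articles = [
    "Outbreak of avian influenza confirmed on a poultry farm.",
    "Un foyer de grippe aviaire a été confirmé dans un élevage.",  # French
]

for text in articles:
    lang = detect(text)  # ISO 639-1 code, e.g. 'en' or 'fr'
    if lang == "en":
        print("keep:", text)
    else:
        print(f"translate from {lang}:", text)
```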
In this context, terminology extraction and the use of terminology are crucial for all the different steps. More precisely, for data collection, the choice of the keywords used to collect data is crucial: for syndromic surveillance and for disease-based surveillance, the name of the disease, the names of the symptoms, their different synonyms, and so on. Terminology extraction is crucial for classification too: classification is based on a textual representation called bag of words, and the choice of the words used to predict whether a document is relevant or irrelevant is very important for machine learning. And at the end, in order to extract information, the different keywords are important: for instance, a symptom can be described with different formulations and variations, and it is important to handle this aspect automatically.
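As an illustration of the classification step, here is a minimal bag-of-words relevance classifier using scikit-learn. It is a sketch of the general technique only, not the PADI-web implementation, and the documents and labels are invented.

```python
# Minimal sketch of bag-of-words document relevance classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "Outbreak of avian influenza reported, 200 birds dead on a farm",
    "New cases of swine fever confirmed in the region, pigs culled",
    "Local football team wins championship after a close final",
    "Stock markets rise as quarterly earnings beat expectations",
]
labels = [1, 1, 0, 0]  # 1 = epidemiologically relevant, 0 = irrelevant

# Bag of words: each document becomes a vector of word counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["Dozens of cattle show symptoms of foot-and-mouth disease"]))
```

The vocabulary retained by the vectorizer is exactly where terminology matters: terms absent from the bag of words cannot influence the prediction.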
In the context of COVID-19, we investigated different terminology extraction strategies, with two different objectives. The first one is terminology extraction for surveillance systems and information extraction; this corresponds to the first step and the last step of PADI-web, for example. The second one, dedicated to the work conducted in the context of work package 3 of MOOD, is a trend analysis of COVID-19 terminology per period and location, and it is the main subject of the paper we can discuss today. In this work we used different kinds of textual data, and it is important to keep this aspect in mind because the methods and the results are very different. The first kind of textual data we took into consideration is scientific publications dealing with COVID-19; a lot of publications are currently available for applying text mining methods, and this is important in all domains. The second one is media data, and more specifically we would like to discuss the use of media data in this context; we use PADI-web data as well. And finally, we use social media like tweets, and we are currently investigating other kinds of social media, like YouTube data, and how we can consider this new social media in the context of MOOD and of surveillance systems in general.
The first objective is dedicated to terminology extraction. Let me explain the global pipeline used to extract terminology in the context of text mining. We currently use a software called BioTex, implemented in the context of a PhD in Montpellier. For terminology extraction, the input of the process is a corpus, in this context scientific publications dealing with COVID-19, and the output is the relevant terms extracted automatically from the texts. There are different steps to extract relevant terms. The first one is part-of-speech tagging, which gives a grammatical label to each word of the text: adjective, noun, preposition, verb, and so on. The second step is to extract candidate terms according to syntactic patterns like noun-noun, noun-preposition-noun, adjective-noun, and so on. There are generic patterns used in different languages, but we can also use specific patterns dedicated to specific domains, like the biomedical domain. At the end of this step, we have a large set of candidate terms.
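A minimal sketch of this pattern-based candidate extraction, using NLTK's part-of-speech tagger and chunker for illustration (BioTex has its own implementation; the pattern below is only one of the classic ones):

```python
# Sketch of candidate term extraction via part-of-speech patterns.
import nltk

# Resource names vary across NLTK versions ('punkt_tab' and
# 'averaged_perceptron_tagger_eng' in recent releases).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Severe acute respiratory syndrome causes fever and dry cough."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# One classic pattern family: optional adjectives followed by nouns.
chunker = nltk.RegexpParser("TERM: {<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)

candidates = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "TERM")
]
print(candidates)  # e.g. ['Severe acute respiratory syndrome', 'dry cough', ...]
```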
We then want to highlight and rank the candidate terms, using ranking measures based on statistical criteria. On this slide there is a measure with three components, as an example of the measures used. The first component is dedicated to extracting terms with patterns relevant for the domain, in this context the biomedical domain. The second component is dedicated to extracting discriminative terms; I will come back to this aspect at the end of my talk. And the last component is dedicated to extracting relevant multi-word terms: in this example, we would like to favor the extraction of 'African swine fever' rather than 'African swine'. This is a good example because in the biomedical domain specific terms are more relevant, and we use information about embedded terms in order to highlight a specific term like 'African swine fever'.
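The transcript does not reproduce the slide's formula, but a classic measure built on this embedded-term idea is the C-value: a candidate's frequency is discounted by the frequency of the longer candidates that contain it, so 'African swine' is demoted when it mostly occurs inside 'African swine fever'. A minimal sketch with invented frequencies:

```python
# Sketch of a C-value-style score for ranking multi-word candidate terms.
import math

def c_value(candidates):
    """candidates maps each candidate term to its corpus frequency."""
    scores = {}
    for term, freq in candidates.items():
        # Longer candidates in which this term is embedded.
        containers = [cf for other, cf in candidates.items()
                      if other != term and f" {term} " in f" {other} "]
        if containers:
            # Discount occurrences explained by the longer terms.
            freq = freq - sum(containers) / len(containers)
        # +1 keeps single-word terms above zero (a small deviation from
        # the textbook formula, which uses log2 of the term length alone).
        scores[term] = math.log2(len(term.split()) + 1) * freq
    return scores

freqs = {"african swine fever": 40, "african swine": 42, "swine fever": 55}
for term, score in sorted(c_value(freqs).items(), key=lambda x: -x[1]):
    print(f"{term}: {score:.1f}")  # 'african swine fever' ranks first
```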
In the context of MOOD, we had different measures and different strategies for extracting terms. For example, in publications we can extract terms from the titles, from the abstracts, or from the full content of the papers, and we do not obtain exactly the same results; the combination of these extractions can be very attractive. We can also use linguistic patterns in order to extract variations: on this slide we have different variations of terms. This is very important, for example for syndromic surveillance, to use the different variations of a term, and this was the objective of our work. We evaluated these different combinations of methods for two tasks: the first one is COVID-19 surveillance and the second one is syndromic surveillance, which could be used, for instance, for an unknown disease. We evaluated the relevance of the terms and of the variations extracted with all the strategies.
The second objective, for the work conducted in the context of work package 2 and work package 3, is a trend analysis of COVID-19 terminology per period and location; this is the main subject of the paper. We investigated the use of different corpora. The first one is media data, MedISys data, and I am going to summarize the use of this corpus; the other one is social media data. For the MedISys data, we collected data in 2020 for different periods, March, May and July, in different countries, UK, Spain and France, with different languages. We adapted BioTex in order to extract terms using different strategies. One example deals with the word 'mask', that is, the multi-word terms built on the word 'mask'. This was a main discussion in the different countries at that period, and the goal is to compare the use of this vocabulary according to the period and according to the country. To finish, I would like to present new work currently being conducted in the context of MOOD.
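As a toy illustration of this comparison (invented terms; the real study used MedISys news in several languages), one can group extracted multi-word terms by country and period and filter on 'mask':

```python
# Sketch: compare 'mask' multi-word terms across countries and periods.
extracted_terms = {
    ("UK", "2020-03"): ["face mask", "mask shortage", "hand sanitizer"],
    ("UK", "2020-07"): ["mask mandate", "face covering", "travel quarantine"],
    ("FR", "2020-05"): ["port du masque", "nouvelle normalité"],
}

for (country, period), terms in extracted_terms.items():
    # Crude substring match; the real analysis works per language.
    with_mask = [t for t in terms if "mask" in t or "masque" in t]
    print(country, period, "->", with_mask)
```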
Currently in text mining we can use different strategies to extract discriminative terms, and popular measures exist for this. We adapted such a measure in order to extract discriminative terms per period and per country automatically. Discriminativeness in text mining is usually based on the number of documents; in our context we also use spatial information and temporal information in order to extract discriminative terms. The objective is to monitor the vocabulary used in the different countries and during the different periods. This is the main objective of this work.
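The specific measure is not detailed in the talk, but the general idea can be sketched as a TF-IDF-style weighting computed over (country, period) groups of documents instead of individual documents, with invented toy data:

```python
# Sketch: terms discriminative of one (country, period) group.
import math
from collections import Counter

# Each group = the tokenized documents for one country and period.
groups = {
    ("UK", "2020-03"): "mask shortage lockdown mask school closure".split(),
    ("UK", "2020-07"): "mask mandate travel quarantine pub reopening".split(),
    ("FR", "2020-03"): "confinement attestation mask hospital".split(),
}

def discriminative_terms(groups, top_n=3):
    n_groups = len(groups)
    # In how many groups does each term appear at least once?
    group_freq = Counter(t for tokens in groups.values() for t in set(tokens))
    scores = {}
    for key, tokens in groups.items():
        tf = Counter(tokens)
        scores[key] = sorted(
            ((t, tf[t] * math.log(n_groups / group_freq[t])) for t in tf),
            key=lambda x: -x[1],
        )[:top_n]
    return scores

for key, terms in discriminative_terms(groups).items():
    print(key, terms)  # 'mask' scores 0 everywhere: it appears in all groups
```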
We implemented a new method, and now we are going to evaluate it on different case studies. We would also like to integrate sentiment analysis in this context, in order to monitor the sentiment vocabulary used, and so on; this is the current work. To conclude my short presentation, we have different challenges in the context of MOOD and of surveillance systems in general. The first one is about the genericity of the approach. Currently PADI-web has been developed for animal disease surveillance, and the main question is how we can have a One Health approach, in order to adapt PADI-web to the One Health context: for human diseases, but for plant diseases too.
What is generic? What is specific? This is an attractive and important challenge today. The second challenge is to extract information from social media and from noisy data like YouTube. Currently we investigate the use of Twitter, but we also study the use of YouTube transcriptions in order to extract information: spatial information, temporal information, and thematic information related to the epidemiological event. But these textual data are noisy; for instance, there is no punctuation, and natural language processing approaches do not give good results in this context. The third aspect is about quality criteria. In event-based surveillance systems we use informal sources, and the question is how we can take into account the quality of the sources, the quality of the textual data, and the quality of the text mining methods used to deal with these textual data. It is important to have a global approach that takes quality criteria into account in order to highlight relevant events. And the last aspect is how to connect, how to link, the different kinds of information we have: YouTube data, social media data, covariate data, other textual data, official information, and so on.
This is another challenge of work package 3 and work package 2, dedicated to text mining.

I understand the work on text mining and how you can extract information and data, but how do you link that, at the end, to what we are looking at, outbreaks or events or sentiments? How do you see that, I suppose, the next step in the analysis?

An important question. Currently we focus on the extraction of information in order to define an event, and we focus on the extraction of locations and thematic information. The key point is how we can define an event. Yesterday we had a discussion with Renaud about which epidemiological information we can extract. For instance, sentiment analysis could be used to extract a weak signal, and the question is how we can connect this weak signal with an event. For the connection with other kinds of data, like environmental data and official data, I think we can connect through biomedical entities, and we have to keep in mind that a normalization of the data is needed.

I would like to understand better whether the only way to search, using the terms and the different synonyms, is by listing them, or whether there are alternatives that ensure a higher effectiveness of the search, so as not to miss any potentially interesting topic or piece of information. Thank you.

This is an important question.
We can take an input text and use different statistical measures in order to extract terms, but we can also adopt a driven extraction, with a list of given terms, in order to extract their variations using a linguistic approach. It is interesting to adopt this strategy because the frequency of a variation is low: with a statistical approach we lose this information, and such terms are not extracted. So using a linguistic approach, with a list of terms and the extraction of their variations, is crucial. We need information about terminology at different steps. At the beginning of the process, in order to collect relevant documents based on the choice of the keywords, the terminology is crucial for syndromic surveillance and for disease-based surveillance, with the name of the disease and so on. At the end of the process, the terminology is important as well, in order to extract events and epidemiological information like symptoms: we can have different variations of terms, so automatic terminology extraction methods are important in this context.
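A minimal sketch of this list-driven extraction, using fuzzy string matching as a crude stand-in for the linguistic patterns mentioned in the talk (the seed terms and the text are invented):

```python
# Sketch: list-driven extraction of variations of seed terms from text.
import re
from difflib import SequenceMatcher

def find_variations(text, seeds, threshold=0.8):
    words = re.findall(r"[a-z-]+", text.lower())
    hits = []
    for seed in seeds:
        # Slide windows up to one word longer than the seed over the text;
        # overlapping windows can yield duplicate hits.
        for size in range(1, len(seed.split()) + 2):
            for i in range(len(words) - size + 1):
                window = " ".join(words[i:i + size])
                ratio = SequenceMatcher(None, seed, window).ratio()
                if ratio >= threshold:
                    hits.append((seed, window, round(ratio, 2)))
    return hits

seeds = ["avian influenza", "foot and mouth disease"]
text = "Officials confirmed avian influenzas cases and foot-and-mouth disease."
print(find_variations(text, seeds))
```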
From a computer science point of view, an important challenge is to mine noisy data, to apply text mining to Twitter but also to YouTube transcriptions. This is an important question because we have noisy data, without punctuation and so on, and text mining approaches in general have a lot of problems there. Another challenge is to extract and identify weak signals in this kind of data, and to define what a weak signal is. Another challenge is how we can consider the quality of the data in order to highlight relevant information. The next one is how we can link all these data in the context of surveillance systems. And the main challenge is how we can use this kind of method in a One Health context, with the genericity of the approach: we can have specific approaches for specific diseases, but it is important to consider the genericity of the approach in the One Health context. I think another challenge is the multidisciplinary aspect: how the experts can be integrated at the beginning of the system, in order to collect data, to identify relevant sources, and to adapt the different algorithms and their parameters. An expert has to know the behavior of the different algorithms and, at the end, has to evaluate and carry out a retrospective analysis of the results. This is a research project, so it is a combination of new approaches, theoretical approaches and practical approaches. In this kind of project, like MOOD, it is important to consider new ranking measures, new algorithms, and real applications we can use directly in the systems. We have to find a combination of both, and I think this is very challenging, because in the context of COVID-19 we have a lot of needs, and it is important to keep in mind that we need time to find new algorithms in the research context.