Recipe for text analysis in social media: a linguistic approach
Formal Metadata
Title: Recipe for text analysis in social media: a linguistic approach
Number of Parts: 132
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
Identifier: 10.5446/44960 (DOI)
Transcript: English (auto-generated)
00:05
Hello everybody. Before I start, I want you to know that this is not a technical talk. I'm a computational linguist, so I prepare the data. I pre-process the data for modeling later.
00:23
So this is a practical approach on how to prepare and how to pre-process data in order to be able to use it when modeling. So, just for you to know, I'm a computational linguist, as she said.
00:42
First, for this recipe, you need to gather the corpus, then you pre-process it, then you do the text modeling, and then I'm going to tell you about the pros and cons of supervised learning. So, I'm going to talk about pre-processing in social media, pre-processing in text.
01:05
So we need to know first what social media is. And basically, there are four things that you need to take into account. First, it needs to be a web-based app. Also, it needs to be user-generated content.
01:21
Users must be able to create profiles and to be able to connect with other users. And with this, we have the development of social networks, which are basically social media. So what is social media? With the definition we had before, we can think of Twitter, Facebook, Instagram, Pinterest maybe, LinkedIn,
01:51
but also Amazon, because on Amazon you have a lot of review content. Booking, the same, TripAdvisor, and also Wikipedia, if we think of it as a content site where you can share ideas.
02:09
So, the type of content we find in social media is text, as on Twitter, for instance, or any other web-based application. Also images on Instagram, and videos on YouTube, for instance, or Vimeo.
02:28
Then the steps for text analysis, as I said before, are the ones there. And we start with gathering the corpus. To gather the corpus, you can take corpora from free online sources or from web scraping.
02:46
Those are the two main, basic places where you can find data. So, for free online corpora, you have the ones there, and for instance, you have the first one. So you can type import nltk. I know, I'm not writing any code, because it's a video.
03:13
You write nltk.download() and then you get this graphical user interface where you can download all the data.
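A minimal sketch of that step, assuming NLTK is installed; nltk.download() with no arguments opens the graphical downloader, and passing a resource name fetches just that package:

```python
# Minimal sketch of the NLTK download step described above.
import nltk

nltk.download()             # opens the GUI to browse and download corpora/models
nltk.download("punkt")      # or fetch a single resource, e.g. the tokenizer models
nltk.download("stopwords")  # stop word lists for several languages
```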
03:24
It's just so that you know the tools that you have. I'm not coding, but if you have a look at this, you can download a lot of packages which are really useful when you're starting out, especially for NLP-based applications, because they have labels
03:43
and a lot of resources that are really useful for starters. Also, Brigham Young University has really good corpora in English, and also in Spanish and Portuguese, but other languages are difficult to find.
04:04
The British National Corpus you can access online or you can download. And there's this guy called Martin Weisser, and he also has really good corpora of online English, and he's got really good resources there too.
04:24
And then you can do web scraping of social media and information resources. Social media, you know that: Facebook, Twitter, whatever. And then the information resources; in my case, for instance, I wrote a script that was able to retrieve information from the Spanish Academy's web page,
04:45
and it was useful because I needed to check whether the words on a list really exist or not according to the Spanish Academy, and I could do it automatically, so it was really helpful. So the kinds of text we find in social media are tricky ones.
05:06
For some reasons I'll tell you about later, but it determines the way we're going to analyse the data. So we have posts, we have tweets, we have those tags and comments on the posts. I mean, you see there are a lot of comments.
05:22
And in the tweets you can also have the hashtags. All of this is really valuable information for text analysis, because all those tags and comments and whatever are going to be really helpful when organising and classifying text and doing all these tasks we want to perform.
05:43
So now we have the corpus, and we need to go to pre-processing. When pre-processing, you can do a lot of tasks, but I'm going to explain these three, because they are the most important ones and also the most useful ones. Tokenisation is separating a text into smaller units.
06:04
So you can separate into sentences or words or whatever unit you need. And you might think this is very easy, apparently, but you might find some examples like those, like the city of Bombay. You have to decide whether you want to keep this as a whole unit
06:21
or you want to separate each word as a separate unit. So this is a lot of work you need to think about before you do the processing. Also, you will have problems such as ex-Malaysian prime minister. You need to decide whether you want to keep those units as one unit or separate ones.
06:43
So this is it. Also, negative contractions like won't: you need to decide whether you want to keep the verb and the negative form together or you want to separate them, because the negatives especially are tricky in sentiment analysis and text analysis.
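A small sketch of these tokenisation decisions, using NLTK as one possible tool; the example sentence and multi-word expressions are made up for illustration:

```python
# Sketch of tokenisation with NLTK (assumes nltk and its "punkt" resource are installed).
from nltk.tokenize import word_tokenize, MWETokenizer

text = "The ex-Malaysian prime minister won't visit the city of Bombay."

tokens = word_tokenize(text)
# Note how the negative contraction is split: "won't" -> ["wo", "n't"].
print(tokens)

# If you decide some sequences should stay as a single unit,
# a multi-word-expression tokenizer can re-join them after the fact.
mwe = MWETokenizer([("prime", "minister"), ("city", "of", "Bombay")], separator="_")
print(mwe.tokenize(tokens))  # ... 'prime_minister' ... 'city_of_Bombay' ...
```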
07:03
So all of these you need to think about beforehand so that you can have the proper data you need. Also, it depends on the language; it's not only based on the text itself, but on the language. You see, Japanese and Chinese write everything together,
07:21
so you cannot use spaces as word breakers. And Japanese especially also has those four alphabets. So it's not that easy to decide how you are going to separate the tokens. Stop word removal is also pretty good for removing words that are meaningless.
07:45
That's the case of pronouns, determiners, possessives. I know you can't read those on the slide, but it's basically a list of words you want to remove from your text because they will be noise in your algorithms, in your models.
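A minimal sketch of stop word removal using NLTK's built-in English list, assuming the "stopwords" resource has already been downloaded:

```python
# Remove function words (pronouns, determiners, etc.) using NLTK's English list.
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "roof", "of", "my", "house", "is", "red"]

content_tokens = [t for t in tokens if t.lower() not in stop_words]
print(content_tokens)  # ['roof', 'house', 'red']
```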
08:02
So it's very useful to have a list of words you don't need. And finally, lemmatization and stemming are useful because you can group words by lemma better than if you have all these inflectional endings, which are also noise.
08:21
So, for instance, lemmatization is removing inflectional endings by coming back to the lemma, the original lemma. So in this case, you will basically need a dictionary or a set of rules that help you go back from the inflectional ending,
08:40
I mean the word with the inflectional ending, for instance smiling, to smile. So you need the set of rules or the dictionary to go back to the lemma, which is smile. Stemming is easier because you just chop off word endings, so you don't need the dictionary.
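A short sketch contrasting the two, using NLTK's WordNet lemmatizer and Porter stemmer; it assumes the "wordnet" resource has been downloaded:

```python
# Lemmatization (dictionary/rule based) versus stemming (just chopping endings).
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print(lemmatizer.lemmatize("smiling", pos="v"))  # 'smile'  (back to the lemma)
print(stemmer.stem("smiling"))                   # 'smile'  (ending chopped off)
print(lemmatizer.lemmatize("studies", pos="n"))  # 'study'
print(stemmer.stem("studies"))                   # 'studi'  (stems need not be real words)
```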
09:00
You just have a small list of endings, you chop them off, and it's over. So what are the problems we find in social media text? Basically, the most important one is time sensitivity. As you know, Twitter, Facebook and Amazon, or Amazon reviews I mean,
09:21
and all of these sites have content which is dynamic. It means that it's constantly changing, people are constantly creating new content, and it's really difficult to build a model because you cannot decide on a set of parameters and expect it to work
09:40
because parameters are constantly changing. So this is the main problem we have in social media when analyzing the text. Also the short lengths. If you try to analyze Twitter, you'll see that you have, for instance, I don't know, Superman and Clark Kent,
10:02
and you know that they are the same person, but you might have a tweet about one name and a tweet about the other name. So when you come to cluster those words into different groups, you'll see that they might not end up related in the same cluster, because there's no contextual information telling you they are the same person.
10:22
So it's difficult; this is a problem for text analysis. And this brings you to the semantic gap, which is exactly what I just explained. Also the problem of unstructured data, where there are two issues. First, the variance in content quality.
10:40
You see, there are people who write really well, in a really polite manner, but then there are people who write the way words come out of their minds, and they don't try to make sense of the sentences. So this is a problem. And also acronyms and abbreviations, like u for you or 2 for to in this case, or & for and,
11:00
and the misspellings, like were: it is a word, but how do you distinguish were from we're? So all of these problems you'll find a lot in text analysis. And also you have abundant information: there are tons of data, and you need to cut somewhere in order to be able to process it.
11:23
So, applications in the real world. You can use all those strategies for event detection, for instance, to know which piece of news is very famous or popular, or to predict what kind of information is going to be trending next week or whatever.
11:46
Also you can take advantage of collaborative question and answering on sites like Stack Overflow, for instance. If you scrape the web, you will be able to find really important and specific information, more than if you just Google your search.
12:05
Also you can use Wikipedia to fill in the semantic gap I was talking about before. If you use Wikipedia, which is a trustworthy resource, you might be able to somehow create relations between those words
12:22
that are apparently unrelated, but with Wikipedia you find relations between the two. It's also useful for sentiment analysis; I will show an example later. And also to identify influencers and see if there are a lot of mentions of that person, or also for quality prediction.
12:42
Like on Amazon, you might want to know if a review is trustworthy or not, and these kinds of tasks you can perform with NLP are very useful for deciding whether to trust that user or not. So now we're going to go through text modeling.
13:03
I'm going to go through it very quickly because I'm not an expert, but with the pre-processing we did, we should be able to come up with a proper dataset to be able to perform this.
13:21
So if you go to Amazon and you find those reviews, you might want to separate them. Well, they already do, but if you have your own page, you might want to separate positive and negative comments. So to find positive and negative comments, you need a vocabulary of positive words,
13:50
negative words, and neutral words. So if we go back to this, you first need to define the task. Maybe you want to group the words in clusters,
14:02
and you need to decide which kind of clusters you want. So once we have this task defined, we need to decide the strategy we're going to use for sentiment recognition. There are mainly two ways, supervised and unsupervised, but unsupervised learning in this case is really, really difficult.
14:23
So for supervised learning, you will need a labeled corpus, which is time consuming, and predefined categories, so you need to go through all the data before you decide on the categories, and you can use sentiment lexicons. For unsupervised learning, you will use an unlabeled corpus, which is easy.
14:45
You can use k-means for category discovery, which is also useful, but you'll have the problem I mentioned before with time, because the data is constantly changing. So you never know if your model will remain good.
15:03
As I said before, we need vocabulary lists for positive, negative, and neutral words, and you also need the list of comments you want to analyze. This is a very basic algorithm for this, naive Bayes,
15:21
but you can use any one you want. The input will almost always be the same. So the result you'll get is generally a positive score and a negative score, and then you can classify the text depending on the result you get.
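A minimal sketch of that kind of supervised classifier, here a naive Bayes model from scikit-learn trained on a tiny made-up labelled set; the real input would be your own labelled corpus:

```python
# Naive Bayes sentiment classifier sketch with scikit-learn.
# The tiny labelled corpus below is invented just to show the shape of the input.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "great product, I love it",
    "works perfectly, very happy",
    "terrible quality, broke after a day",
    "waste of money, very disappointed",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["really happy with this purchase"]))        # ['positive']
print(model.predict_proba(["really happy with this purchase"]))  # score per class
```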
15:43
This is a very basic task. In my opinion, the best way is using sentiment lexicons, because you have a word list of positive words and a word list of negative words, and it works in a binary fashion, so it's an easy text classification task.
16:11
These are two really good ones. Also you have WordNet for NLP, and you can download WordNet in Python, and it's very useful.
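A minimal sketch of lexicon-based scoring; the two tiny word sets here are placeholders standing in for a real sentiment lexicon such as the ones just mentioned:

```python
# Lexicon-based sentiment scoring sketch.
# In practice you would load a real positive/negative lexicon instead of these toy sets.
positive_words = {"good", "great", "love", "excellent", "happy"}
negative_words = {"bad", "terrible", "hate", "awful", "disappointed"}

def score(text):
    tokens = text.lower().split()
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    if pos > neg:
        return "positive", pos, neg
    if neg > pos:
        return "negative", pos, neg
    return "neutral", pos, neg

print(score("I love this phone, the camera is great"))  # ('positive', 2, 0)
```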
16:24
I'll add the link after if you want. But those are really simple ones, and they come with this binary fashion, already classified, so it's pretty useful. Another approach, which I thought was really interesting,
16:41
although it's not that new, but it's a really interesting approach: polarity lexicons. What those guys did is they had a small list of positive adjectives, and they assumed that every adjective connected with an adjective in the list by and,
17:05
which is coordination, would necessarily be a synonym. So they decided to scrape the web and find, well, it was more manual back then,
17:21
but they found a lot of pairs of words connected by and, and so all those words were added automatically to their already-built list of positive words. They did the same with negative words, with the word but, or however,
17:42
so they could automatically have a bigger list of negative words. So this is a really good semi-supervised approach for this task, because you get to have the best of both worlds: the unsupervised part is less time-consuming, and it's easier.
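A rough sketch of that seed-expansion idea; the toy sentences stand in for a large corpus, and real systems weight the coordination evidence statistically rather than accepting every pair:

```python
# Expand a seed list of positive adjectives via "X and Y" coordination patterns.
# Toy corpus for illustration; the same idea works with "but"/"however" for opposite polarity.
import re

seed_positive = {"helpful", "nice"}
corpus = [
    "The staff was helpful and friendly.",
    "A nice and cozy little hotel.",
    "The room was clean but noisy.",
]

expanded = set(seed_positive)
for sentence in corpus:
    for left, right in re.findall(r"(\w+) and (\w+)", sentence.lower()):
        if left in expanded:
            expanded.add(right)   # right word inherits the positive polarity
        elif right in expanded:
            expanded.add(left)

print(expanded)  # now also contains 'friendly' and 'cozy'
```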
18:06
This approach comes from an example in Dan Jurafsky's Stanford NLP course. There's a really good book as well; they have a book that they are editing right now, and the third edition, I guess, is going to be ready in a few months,
18:22
and it's a really good course for a lot of NLP tasks, and this is the best approach I've seen so far. So the results basically depend on the task you want to perform.
18:44
As I said before, what you do in NLP depends a lot on the task you have. I mean, if you are analyzing Twitter, you might need strategies really different from the strategies you're going to use if you are analyzing, I don't know, the text of a novel.
19:03
So this is a very important thing you have to bear in mind, because if your task is different, you're going to need a completely different strategy, also a very different algorithm. So k-means is a good one for clustering. If you have a bag of words and you need to group them in clusters,
19:29
but if you want to, I don't know, for instance, build a spell corrector, or you want to know if the grammar is well formed or not,
19:44
I mean, if you want to check the grammar of a sentence, you might need a completely different approach, and you might want to use n-grams or another strategy. So using the lexicons for sentiment recognition,
20:00
for me, is the best approach, as I said before. And there are pros and cons. Topic discovery is a challenge, but if you already have a small set of words, it's really helpful, and you can somehow increase the number of words automatically. And also, the performance is better, almost always.
20:23
I mean, in all the cases I've seen, the performance is always better. On the other hand, dynamic language is difficult. And it's really time consuming to build the lexicons and label the corpora.
20:42
So, well, it depends: if you need to have a task finished in a small amount of time, or if you have more time, you can use one strategy or the other. So this is the bibliography.
21:00
I'm going to upload the slides in case you want to check. Mining Text Data is a really, really good book. They have a lot of algorithms explained for each kind of task. So it's very useful. And just to finish, I'm from Mallorca.
21:21
I'm a co-organizer of PyData there and I'm a computational linguist. I couldn't find anywhere on the web a proper definition of what a computational linguist is. So I found this slide, which I think is awesome, and I decided to copy it here. You also have the source information there; it's from a presentation.
21:41
And, well, thank you very much for watching. Thank you very much. We have a few minutes for questions. How do you determine your stop words?
22:00
Is there a common set of words that tends to work, or is it based on your application? Well, basically anything without lexical information would do. Like, for instance, if you have the word house, you know that it might be necessary to keep the information of house
22:21
because you can find synonyms, or, I don't know, you can find words that are related, such as roof or whatever. But if you have, for instance, a determiner, it has no semantic meaning, which means that you cannot find relative words,
22:42
I mean related words, sorry. So basically this is what you take into account when listing stop words. Also, it depends on your task. You might want to, I don't know, decide that house in your case makes no sense in your bag of words, so you won't use it. So there are lists of stop words which are already built,
23:05
and in NLTK, for instance, the package has some, and I know many packages have lists. But basically this is the main principle when deciding whether a word is a stop word or not.
23:23
No, you tokenize first, and then you remove whatever you don't like, usually. Thanks for the talk. I was wondering how you apply k-means to a bag of words. Like, there must be a step in between, right? Well, as I said before, I don't do a lot of modeling myself,
23:43
but I usually prepare the data, so I know that the problem in that case with k-means was that the clusters that resulted from the first task
24:00
made no sense at all, because as I said in the example, you might have two words that are related because you know it, but when it comes to clustering, somehow the algorithm doesn't find similarities. So my job was to decide how to improve the data
24:23
so that the algorithm worked better. I don't know exactly how they implemented the algorithm itself. Okay, so I'm not so much talking about k-means or anything; it's more about how you represent the data to an algorithm like k-means, because usually you have a set of vectors,
24:43
and then you compare the vectors, if this makes sense to you. And it's difficult, like you need to encode the word into something that is meaningful for a machine, right? Yeah, I understand, but I don't code the algorithm,
25:01
so I don't know if they did a step in the middle. I mean, I know I had a set of data, a list of words in that case, and I got the results and saw that the clusters made no sense, and I had to come up with a solution for how to improve those clusters.
25:22
So from a linguistic point of view, I saw that words that should be related were not related at all, so I started reading a couple of articles, and I found out that the problem with k-means, for instance, is that you need somehow information that relates the words,
25:45
and this was missing. So this is why I was saying that if you use another resource like Wikipedia, you can somehow find relations between the words, but I'm not coding the k-means algorithm in this case.
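To make the question concrete, one common way to get from raw text to vectors for k-means is TF-IDF features per document; this is only an illustration, not necessarily what was done in the project discussed here:

```python
# Turn documents into TF-IDF vectors, then cluster them with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "superman saves the city again",
    "clark kent works at the daily planet",
    "new pasta recipe with tomato and basil",
    "how to cook the perfect risotto",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
# Without extra context, "superman" and "clark kent" share no terms, so they
# may well land in different clusters -- exactly the semantic gap problem above.
```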
26:01
Okay, thank you. Another question? Hi, thanks for the talk. So these days on social media especially,
26:21
if I'm trying to measure sentiment, what I see is that a lot of people respond or review with emojis, or they would use sarcasm, or they would just react with like a GIF of something that they're feeling. So when you're preparing data to measure sentiment, for example,
26:42
do you take these things into account somehow? Yeah, of course. It's really difficult. I mean, most of the time you want to remove all the noise you can. So when it comes to sarcasm, if you find that some words are really not helping and are noise,
27:05
you basically remove them. Also you can try to do another analysis which is deeper, linguistically I mean, and you try to label everything, and with this you improve a lot, but it takes a lot of time.
27:21
It's time consuming, and you also need a lot of people working on that, a lot of people labeling the data and helping with the task. Like you have a list of stop words, is there no such resource out there where you can say, okay, this emoji reflects this emotion? No, no.
27:41
Okay, thank you. Well, not that I know of; so far I've never found such a list. So you mentioned that one of the difficulties is deciding whether to split a combination of words. So in your practical experience,
28:01
what kind of metrics do you use to determine whether you need to split or not? Do you do it manually, or do you use rules? Like what kind of rules? Well, you can use, I mean, of course you can use algorithms and try to do it automatically.
28:24
I mean, you can get good results, but if you want to be really, really specific and make sure that all the words are the way you want them to be, you usually can build a dictionary, or you can use...
28:41
My question is, how do you determine whether you want to split or not? That depends on what you want to do, I mean, the task. If you, I don't know, if you need the names of people, you might want to focus on that task specifically,
29:04
and then you perform that task better, and then you find... I mean, I usually work with lists because the task I did was... I mean, you needed specific words to be found, without any other interference or so,
29:24
but I mean, I don't like doing it this way, you know? There are many ways you can do it better. I mean, I'm a linguist, and my job is to solve the problems that the algorithms cannot solve.
29:43
So my job is not as... I mean, it's not as cool as writing code and having everything work perfectly. I need to find the bugs and resolve them, the problems with the algorithms, which are obviously much more fun, but...
30:02
I work with lists or with rules like regular expressions or whatever I can. Thank you. We have run out of time for questions already, so let's thank Olalia once again for her talk. Thank you.