Introduction to sentiment analysis
Formal Metadata
Number of Parts: 132
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
Identifiers: 10.5446/44876 (DOI)
EuroPython 2018, 129 / 132
Transcript: English (auto-generated)
00:03
Thank you. Hello, and welcome to my talk about introduction to sentiment analysis. Thank you for already removing four or five slides, because it was a perfect definition of sentiment analysis. So we can skip a little bit. That's nice.
00:21
To the microphone, and I should stay inside the circle. Thank you very much. Yeah, sentiment analysis. Who has been here yesterday when there was a linguist talking about sentiment analysis? I think you're going to have a great combination with these two talks, because I'm going to talk about the code.
00:42
So, a developer's view on sentiment analysis. So what are basically the topics I'm talking about? A little background: what kind of data we were analyzing, what I had done before I jumped into this topic. A little bit about what sentiment analysis is, so we can cut it short now.
01:02
Then we look into extracting parts from a text, like sentences, separate words. Then we look at the approach, how we can assign an opinion to these words using a lexicon-based approach. And that already gives some results
01:22
when you want to find out how somebody feels about something. But there's also a little bit more to it when you have to combine certain words, and we're also looking at that. And if there's still time, we'll elaborate on a few special topics, like how to deal with slang terms. But we can easily skip that. It's very simple. It's only one slide per topic, and you can easily look it up.
01:42
But I found it useful to have it there, because it might save you time if you have to do a sentiment analysis on your own. All right, so let's jump into it. Some background about me. I'm a software developer, and basically everything that's related to it. So I also do a little bit of requirements engineering,
02:01
software architecture, testing, if it has to be, a little bit of project work. And I worked in various areas, in hospitals, in banking. And currently, I'm working in e-commerce. I have a master's degree in information processing science. I maintain a few open source projects. So if you ever feel the need to read
02:21
lipstick data from the mainframe, I did a package for that. Yay. So basically, nobody wants that, I guess. And I'm also a co-organizer of a local Python user group in Graz, the city where I come from. And there's a home page where you can find out more. So banking, hospital, e-commerce.
02:41
Why am I talking about sentiment analysis? Well, I needed a topic for a master's thesis. And a friend of mine had just founded a startup, and he said, hey, we have this kind of data: feedbacks for innkeepers from the guests. And it's text, and it's unstructured, and we want to analyze it. Hey, you could do a master's thesis about it. And I did a quick glance.
03:01
Yeah, there's people who write about it. There's books. There's even Python, so we can have regular expressions if there's nothing else. And there seems to be a couple of libraries, like spaCy and NLTK. Yeah, good enough. Let's do it. So that's how you jump into sentiment analysis. You don't need more.
03:20
The company is basically a startup. As I said already, you have guests that go into an inn, and they can give feedback. But different to public platforms like TripAdvisor, the feedback only goes to the innkeeper. So they have a mobile website where the URL is posted in the inn, and if they type the feedback,
03:41
the innkeeper can see it on an application. It uses Django in the background, and the front end is in Angular. And if you want to find out more, there's a home page about it. And the feedback, basically the interesting part of the feedback is unstructured text feedback that answers questions like, how did you enjoy the visit? What did you like about your last visit?
04:02
What can we improve? What else do you want to tell us? So this is very broad, and you don't really know what they are going to answer about. So here's a screenshot. That's how it looks when you enter such a feedback. It's an Austrian company, so this is German, and, well, the food is not really German,
04:22
karebort and knödel, those are Austrian terms. Anyway, technically it's language independent. And when the innkeeper gets lots of feedback, he wants to find out: what are really the pain points in my business? What are the people complaining about most? And also, what are the people happy about the most? So these are my USPs, I have to preserve them.
04:42
And he doesn't want to wade through all the feedbacks, so he just wants to get a quick glance and then maybe manually drill down into specific feedbacks that complain about certain issues. So sentiment analysis also has a couple of related applications. If you have service ticket systems, if you want to pre-process customer email
05:02
and automatically assign it to the appropriate place, if you want to extract information from product reviews, or, as we heard yesterday, social media. So what is sentiment analysis? Yeah, we already heard it. It collects opinions from text written in natural languages and stores it in a structured way.
05:22
That's basically it. Sometimes it's also called sentiment detection. Sometimes it's called opinion mining. There are slight differences, but generally that's the same thing. So you can do it on three levels. Essentially, there is a document level when you have a big customer review and you reduce it to three stars, then you know, okay, the customer's not very happy with it.
05:43
So there is this product and he's not very happy with it. That's the sentiment. But that alone is not really useful, because you want to know what he is unhappy about. You want to find out, is this, yeah, well, examples: when a phone does not have enough power and it quickly runs out of power, then you have to improve something about that.
06:01
And it's different to a phone that's just too slow. You have to take other steps for that. So you want to know where to have to improve it. And to get that, you can start on a sentence level. So you have a document with many sentences and then you split the sentences and you extract an opinion for each sentence. For example, the schnitzel is too small
06:22
for a hungry student. What's a schnitzel? That's a schnitzel. So if you ever come to Austria, you should taste something like that. It's not very healthy, but very tasty if it's done well. And you can go even further and go down on an aspect level. So you say, okay, when I get a feedback like,
06:42
the schnitzel tastes very well, but it's too small. Then he's talking about the schnitzel, but the aspect of the taste, which gets a rating good. And then there's the aspect of size, which gets a rating small or too small or bad, bad. So the taste is good, the size is bad. It's the same schnitzel.
07:02
And what's an opinion? There is a deluxe definition. If you're a developer, you can already see the fields in the table here. Basically, the schnitzel is too small for a hungry student. What does it tell us? There's a target entity, the schnitzel. There's an aspect of it, the size, because it's too small. The sentiment about it is bad.
07:22
The opinion holder was a guest called Hans Meyer, and he gave the sentiment at the end of April. There's a reason for the sentiment. Why is the schnitzel bad? Because it's too small. And there's also a qualifier. It's not generally too small. It's just too small for a hungry student. It might be big enough for an office worker who hardly burns any calories the whole day.
07:40
So it's just for students. And that's from a book by a guy called Liu. And he also has another one, because the first one is really unwieldy and you rarely need that much information. And it's also very hard to extract. Basically, an opinion has a topic. Okay, we talk about food. It doesn't matter if it's schnitzel or spaghetti. It's just the food.
08:01
The sentiment is bad. There's an opinion holder and a date and time. And this is enough to figure out where are the pain points? Where are my USPs? What makes my business special and where do I need to improve? So basic workflow. How do you make a sentiment analysis? And this, if you want to know more about it, check the presentation from yesterday.
08:21
You collect the data. You pre-process the data: remove data you can't process, and clean up data that are in a state that's difficult to analyze. Then you analyze them, and then you interpret the results and act on it. So how does this look in our example? Well, we already have the data. They are in the database.
08:41
Not much to do. Pre-processing, yes, there was a little bit, but it's only towards the end of the presentation and nothing exciting. The main focus is the analysis. This is the main part of this presentation. And when you want to interpret the results and to act on it, in our case, it's the innkeeper. So we just give him the data to make the correct decisions.
09:04
All right, so enough of the pleasantries. I'm a developer and I said we're going to see code, drum rolls.
09:22
So we have sentences and tokens. If you want to split the document into sentences and tokens, yeah, that's easy in Python. You just look for a dot, split it with the string function and now you have the sentences. One large sentence and then you get,
09:42
and I have the labels under here, yeah, and then you split it into three sentences. And if you want to have words, then you split it on the blanks. That's easy, right? That's how everyone would do it. Well, not quite.
10:01
Because what about the sentence like this? The waiter was very rude, e.g., for example, when I accidentally opened the wrong door, he screamed, private. So there's indirect speech, there's abbreviations in there, and if you use this very simple split function, then you end up with a mess of letters and it's not really a sentence anymore.
10:23
So spaCy to the rescue. What does spaCy offer? Well, if you want to use spaCy, you first have to load the language, in this case English, and then it's pretty easy to take a text and split it into sentences, basically these three lines of code.
10:41
You make a document out of your text, which is a spaCy-internal structure, and then you just iterate over the sentences. And that's it. And as you can see, these are very nice sentences now. It even preserves the dot. If you want to do the same with words, yeah, then you can iterate over a sentence and then you get the words from it.
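The splitting just described can be sketched like this. Note this is a minimal sketch, assuming spaCy 3.x: the talk loads a full English model, while here a blank pipeline plus the rule-based `sentencizer` is used so the code runs without downloading model data.

```python
import spacy

# Assumption: spaCy 3.x. The talk loads a full English model; a blank
# pipeline plus the rule-based sentencizer is enough to sketch splitting.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "The schnitzel was tasty. The music was too loud."
doc = nlp(text)  # spaCy's internal document structure
sentences = list(doc.sents)
for sentence in sentences:
    # iterate over tokens instead of a naive str.split():
    # punctuation comes out as separate tokens
    print([token.text for token in sentence])
```

Unlike `text.split(".")`, the sentences keep their punctuation and the tokens are cleanly separated.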
11:02
That is the nasty example sentence from before, so it can even deal with that. For example, the e.g. is still separated. If you want to split it into words, yeah, just iterate over the sentence. So, actually, word is a term, you shouldn't ask a linguist what a word is,
11:21
as far as I know, because they start to get aggressive. I think they have 20 different definitions for it. At least I couldn't figure out one. But spaCy doesn't have words, it has tokens. So what's a token? It's basically a container class with attributes. So you have the word itself,
11:41
the one it found in the text, like tastes, but it also can compute different information. It can find the base word, which is called lemma, and the base word of tastes is taste. And it also knows that it's a verb, so that's the role in the sentence. So if it's a noun, a verb, if it's an adjective, spaCy can find these things out.
12:03
And this concept is called token attributes, and several attributes exist in two forms, like the POS, yeah, why is it called POS, by the way? It's not piece of shit, no? It's part of speech tagging, that's what it's about. So I had to look it up when I saw it the first time.
12:21
So you have a variant with an underscore, which I think is hard to see, but this should be an underscore, but the font somehow destroyed it. And this is the text part, and then you have a variant without an underscore, and this is just a number. The reason for that is the text variant is easy to read, and the number is easy to store and fast to compare,
12:40
so it's performance. Linguists usually use the number variant and only print the text variant. And there's a couple of functions and utilities to convert between the two. So spaCy does have a couple of limitations you have to be aware of. It's not perfect, because what spaCy does here is really difficult.
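The text-versus-number duality can be illustrated with spaCy's `StringStore`, which converts in both directions (assuming spaCy 3.x; this is an illustration, not the talk's code):

```python
import spacy

nlp = spacy.blank("en")

# Attributes like pos_ (text) and pos (number) are two views of the same
# value; the vocabulary's StringStore converts between them.
verb_id = nlp.vocab.strings.add("VERB")   # text -> number (a 64-bit hash)
verb_text = nlp.vocab.strings[verb_id]    # number -> text
print(verb_id, verb_text)
```

The number is cheap to store and compare; the text is what you print for humans.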
13:02
The tokenizer uses probabilistic models, so there's taste as a verb, and there's also taste as a noun, and sometimes it gets it wrong. Yeah, that just happens. And also the lemma and the part of speech can be wrong,
13:23
so it doesn't get the correct lemma. For example, when you have German Tisch, like table, it thinks the lemma is Tischen, which is not really a word, but never mind. So if you really run into trouble, because most of the time it doesn't matter, then you have to build your own model, and spaCy provides tools for that.
13:43
So, but how do we actually find these topics and ratings now? That's what we are looking for, a topic like food, a rating like bad. Topics, yeah, you have to define your own topic system. There's several ways to do that. You can see, you can look what other people are doing. You have topic modeling,
14:02
which is a whole area of expertise of its own, and there are tools like Gensim. In my experience, that only works if you have a lot of data, which we didn't have. You can also build a tag cloud and look at it, or you can just ask a domain expert what is important. In our case, we did all of this and mixed the results together.
14:22
So the topics we used for our innkeepers were the ambience, like the decoration, the music, the light; food and beverages, so everything related to eating and drinking; hygiene, like the toilets, or if something smells bad or is dirty. Then you have the service with the waiters,
14:41
are they polite, are they friendly? There's also the value, so how is the price and are the portions big enough? And with that, we jumped into it. So this is easy to represent in Python. I've used an enum. The topics are just enumerated: ambience, food, hygiene, there you go.
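The topic enum from the slide might look roughly like this (the member values and comments are assumptions; only the topic names come from the talk):

```python
from enum import Enum

class Topic(Enum):
    """Topics relevant to an innkeeper, as listed in the talk."""
    AMBIENCE = "ambience"  # decoration, music, light
    FOOD = "food"          # food and beverages
    HYGIENE = "hygiene"    # toilets, smells, dirt
    SERVICE = "service"    # waiters: polite, friendly
    VALUE = "value"        # price, portion size

print(Topic.FOOD.name)
```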
15:02
With a sentiment, yeah, that's a bit more difficult. Literature says, yeah, you can do whatever you want. You can do a five star system. You can, yes or no, great or suck. You can also do numbers from one to 10 or a number between zero and one. Whatever rocks your boat.
15:21
We decided to go with this, so it's again an enum. And essentially, we have good and bad, but we have three different variants. We have somewhat good, really good, normal good and very good. And the same for bad. So usually, I think in hindsight,
15:40
it would have been enough to have four values, bad and very bad and good and very good, but that's what we ended up with. And yeah, now we want to assign sentiments to words. For that, we need a lexicon. So what's in the lexicon? Basically, it's just a table where you have certain words.
16:02
You can assign a topic to it. You can assign a rating to it, but it doesn't have to have a topic or a rating. So it's okay if it only has a topic or a rating. If it has neither, there's no point to add it to the lexicon, but yeah. And we also added a variant where we said we allowed regular expressions. So if there's schnitzel or there's zur schnitzel,
16:21
both are about food, and it doesn't really matter. So how do you get the lexicon? That also depends, but in our case, you take words that are obvious. For example, from the menu; on the menu, they always talk about food. You find the most common words in existing feedbacks
16:41
and see how they relate to the sentiment. And then, very quickly, you have a first version of your analyzer, and then you just throw it at the text. And if you find a sentence where you cannot extract the topic or a rating, then you look at it. What are the words in this particular sentence? And if there's something interesting, you add it to the lexicon.
17:00
So that's a simple approach to do it. If you want to model such a lexicon entry in Python, yeah, it's again mostly a data container, because we just have a word, a rating, and a topic. So this is essentially what it looked like. We have the lemma, a topic, and the rating
17:22
as a parameter to the constructor, and a little bit of messing for the regular expressions. But what we really want is we want to find out if a lexicon entry matches a certain token we found in our text. So we have a matching function here where we can pass the token. And this essentially compares the token
17:42
with texts it derives from the lexicon. So it looks at the plain token as it is in the text. When it can find it, it compares the lexicon entry with the lemma. Then it does upper-lower case transformation. It looks at the regular expression syntax. And you could go even further. You could do fuzzy logic, whatever. So at the end, this matching function,
18:01
it's almost finished, but the screen is a little bit too small, returns a number between zero and one, and that's it. So for every token, you can check the lexicon entry. And the lexicon itself is just a collection of entries, so basically a list. And what you really want is I want to find
18:20
in all the lexicon entries I have, I want to find the one that matches my token the best. Yeah, so this is even simpler. You have a constructor which just initializes the entries. Then you have an append function where you can append entries. Usually you get them from a database or from a comma-separated values file or from an Excel file, whatever.
18:41
We just used LibreOffice, actually. Not actually, more politically correct. And if you want to find the lexicon entry, you just iterate over all the entries, compute the score for it, and you remember the one with the best score, and if you find one where the score is one, then you can already terminate the loop. Otherwise, in the worst case,
19:01
you have to iterate through the whole lexicon. Performance-wise, this is not very efficient, so this can be improved, but then you have more code, and it was fast enough for us. So this is how you build a small lexicon for our simple example. As I said, we got it from the database. It's not PEP8, the indentation,
19:21
but I think in this case, it's easier to read. Yeah, nothing special. So you have a word, waiter, and the waiter is about the service, and then you have a word like tasty, it's about food, and it also has a rating, good. Or the word quick, which is always nice, doesn't have a topic, but if something is quick, it's good.
19:43
So here we have the first part of code that iterates over the sentences and extracts the lexicon entries and looks at them. So it says, okay, the word there, I don't know anything about that. Music, yes, this is about the ambience. And loud, okay, this is bad.
20:02
So if the music is loud, the ambience is bad. And here we go. That's basically our first simple sentiment analysis. So I have a sentence, the music was very loud, and it says, yeah, the ambience is bad.
20:20
The end? Yeah, not quite. One topic I want to talk about, because I think you have to deal with it one way or the other, is intensifiers, diminishers, and negations. So intensifiers and diminishers are basically words that modify a rating. A diminisher is, for example, slightly: so if something is slightly bad,
20:41
it's not as bad as the normal bad. And intensifiers are, for example: very bad, really bad, terribly bad, terribly loud. And if you find one of these words, then you get a different rating than with the word alone. Loud is bad, very loud is very bad. You can represent this in Python easily
21:02
with simple sets and maybe some uppercase, lowercase transformations. And then you can make a function to check if a token is a diminisher or intensifier. So yeah, that's how you find out if the word very is an intensifier. Yes, it is.
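A minimal sketch of those sets and membership checks (the word lists are taken from the talk's examples; real lists would be longer):

```python
# Word lists from the talk's examples; real lists would be longer.
INTENSIFIERS = {"very", "really", "terribly"}
DIMINISHERS = {"slightly"}

def is_intensifier(word: str) -> bool:
    # lowercase transformation so "Very" matches too
    return word.lower() in INTENSIFIERS

def is_diminisher(word: str) -> bool:
    return word.lower() in DIMINISHERS

print(is_intensifier("Very"))  # case-insensitive membership check
```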
21:20
So it's simple enough. So what do diminishers and intensifiers do? They modify the rating. So you need a couple of functions: in the end, you have a rating, and you want the diminished variant of it. So this is basically a little bit of math with signum and plus and minus. And if you have enums, then it's even less code.
21:42
And that's about it. So you have a function that you can, okay, you can't see that, sorry. That was an example. So if I call diminished for good, I get somewhat good as a result. That's an example for a diminisher.
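The signum arithmetic might look like this, assuming the graded `IntEnum` rating from earlier (redefined here so the sketch is self-contained):

```python
from enum import IntEnum

class Rating(IntEnum):
    VERY_BAD = -3
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 3

def signum(value: int) -> int:
    return 1 if value > 0 else -1

def diminished(rating: Rating) -> Rating:
    # move one step towards zero, but never across it
    return Rating(signum(rating) * max(1, abs(int(rating)) - 1))

def intensified(rating: Rating) -> Rating:
    # move one step away from zero, clamped at the extremes
    return Rating(signum(rating) * min(3, abs(int(rating)) + 1))

print(diminished(Rating.GOOD).name)  # diminished good is somewhat good
```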
22:01
Negations. So negations turn the sentiment to the opposite. For example, a word like not. So if I have tasty, okay, it is good. Not tasty? Is bad. So it turns it around. And it can of course be combined with intensifiers and diminishers. So if you have very good,
22:21
the rating is very good. But if you have not very good, it's not the opposite of very good, because it's not very bad if it's not very good. It's actually somewhat bad. That's something you have to keep in mind. So you don't just turn it from minus two to plus two,
22:40
or in this case minus three to plus three. You have to do a little bit of magic. Negations are easy to represent, similar to diminishers and intensifiers. And if you want to negate the rating, that's just a typical mapping problem. So I get the rating very bad and the negated version is somewhat good.
23:03
And two of those combinations are pretty hypothetical. I don't think you actually use them. Something is not somewhat very bad. You usually don't say that. But if it comes along, you turn it into good. And again, now this fits on the screen. So we have negated rating from good
23:22
and it turns out bad. And the negated rating from very bad turns out somewhat good. Not very bad is somewhat good. All right, so we have building blocks now for words basically. So we can look at a word and we can tell about which topic it is,
23:43
if there's a sentiment in it. We can figure out if it's a diminisher, if it's a negation. But that's not enough yet, because we can't stay on the single-word level. So we need to combine all of this somehow.
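The negation mapping just described can be sketched as a plain dictionary. The two entries marked hypothetical are guesses consistent with the talk; the examples actually given (good becomes bad, not very bad becomes somewhat good) are encoded directly:

```python
from enum import IntEnum

class Rating(IntEnum):
    VERY_BAD = -3
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 3

NEGATIONS = {"not", "no", "never"}  # assumed word list

# Explicit mapping because negation is not a plain sign flip:
# "not very good" is somewhat bad, not very bad.
NEGATED_RATING = {
    Rating.VERY_BAD: Rating.SOMEWHAT_GOOD,
    Rating.BAD: Rating.GOOD,
    Rating.SOMEWHAT_BAD: Rating.SOMEWHAT_GOOD,  # hypothetical in practice
    Rating.SOMEWHAT_GOOD: Rating.SOMEWHAT_BAD,  # hypothetical in practice
    Rating.GOOD: Rating.BAD,
    Rating.VERY_GOOD: Rating.SOMEWHAT_BAD,
}

print(NEGATED_RATING[Rating.VERY_BAD].name)  # not very bad -> somewhat good
```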
24:00
And one approach would be to just use a list of tokens and mess around with it. But spaCy provides something that is very nice for this task. It's called spaCy Pipeline. So what's a pipeline? spaCy basically, when you throw a text at spaCy, it has separate steps where it tries to find out what's a sentence, what's a token,
24:21
when it tries to assign this part of speech tagging and things like that. And it's something you can extend. And what you also can extend, if you have tokens, you can add additional attributes to it. So there's an article explaining this very in-depth. But I'm just going to look at the things
24:41
that are important to us. So we have a token, and you can say token set extension. And we want the rating, a topic, and we want Boolean flags if it's a negation, intensifier, or a diminisher. And then we can work with these new attributes. There's a funny syntax. So you have token dot underscore dot, and all the extensions are in this underscore attribute.
25:04
So that's something you have to get used to. That's spaCy. But that's some, how would you say, some trick to make it easy to extend from the spaCy people. And I can say token dot topic is food. And when I print it, token topic, then I get food.
25:22
So I can say a schnitzel is about food. And I don't have to introduce a new data structure myself. So I just use spaCy's token. So that's just an intermission for the following slides. There's a little debugging function. Nothing interesting, just that we can print what a token contains.
25:40
And now we want to extend the pipeline. So at the end of the pipeline, we want to look at the tokens, compare them with our dictionary, compare them with our lists of diminishers, intensifiers, and negations, and assign this information to the token. And that's quite easy.
26:00
We have a small function which iterates over the sentences, iterates over the tokens, and looks if a token is in the negation dictionary; if so, it sets the flag to true. The same for diminishers and intensifiers. And if it's none of that, it also checks the lexicon. And if it can find the appropriate things,
26:21
the word in the appropriate tables, then it stores the information at the token. And there's one line of code to add our own function to the spaCy pipeline. And we end up with tokens when we iterate over it the next time.
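A sketch of such a pipeline component, assuming spaCy 3.x registration via `@Language.component`. The component name and the tiny negation list are assumptions; a full version would also check diminishers, intensifiers, and the lexicon:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

Token.set_extension("is_negation", default=False, force=True)
NEGATIONS = {"not", "never"}  # assumed word list

@Language.component("sentiment_matcher")  # assumed component name
def sentiment_matcher(doc):
    # runs as a pipeline step and annotates every token; a real version
    # would also flag diminishers/intensifiers and consult the lexicon
    for token in doc:
        if token.lower_ in NEGATIONS:
            token._.is_negation = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentiment_matcher", last=True)  # one line to extend spaCy
doc = nlp("not tasty")
print([token._.is_negation for token in doc])
```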
26:41
And we have a simple function here. So if we have tokens, and we're only interested in the tokens that are essential for our sentiment analysis, we want to look at tokens that have one of our attributes set. And here's a function that just filters all of them. And if we call it with our sentence,
27:02
the schnitzel is not very tasty. It gets reduced to, okay, schnitzel, which is about food, not, which is a negation, very, which is an intensifier, and tasty, which is about food and is a good sentiment. And all that with extending spaCy a little bit.
27:21
So this is very nice. And we can do the same with rating, but we have to apply the diminishes and the intensifiers and the negations on the rating. So what we basically do here is, when we have our four tokens,
27:40
we first look, is there a rating somewhere? Okay, it's good, yes, tasty, is there a rating? And then we walk to the left and we find the intensifier, very, okay. So it turns into very good, but then it's not. So it's not very good. And we have to turn the not very good into somewhat bad. That's basically the function we created before.
28:02
We can pass it a rating. And this is basically Python code to do that. Essentially, it's combining, checking the various conditions, combining them into something. It doesn't really fit on the screen, but it's not much longer, so it's not that difficult to understand. Most of it is handling the loop
28:22
and special conditions when there is no rating and things like that. So here's an example, how to call our combine ratings function. So first we extract the extension tokens from our sentence, like we did before.
28:41
And then we combine it and there are only two tokens that remain: the schnitzel, because it's about food, but there's no rating in it. And the other three tokens got summarized into one; it's still called tasty, because you can't change the name of a token in spaCy. But instead of the rating good, it now has the rating somewhat bad
29:01
and the other two tokens are removed. So the schnitzel is not very tasty, gets reduced to schnitzel somewhat bad. All right, that was the wrong button.
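The combining step can be sketched in pure Python: find the rightmost rated word, then walk left and apply every modifier. The talk's version walks spaCy tokens and handles more special cases; the word lists and the mini-lexicon here are assumptions:

```python
from enum import IntEnum

class Rating(IntEnum):
    VERY_BAD = -3
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 3

INTENSIFIERS = {"very", "really"}
DIMINISHERS = {"slightly"}
NEGATIONS = {"not", "never"}
RATING_LEXICON = {"tasty": Rating.GOOD, "loud": Rating.BAD}  # assumed

def signum(value: int) -> int:
    return 1 if value > 0 else -1

def intensified(rating: Rating) -> Rating:
    return Rating(signum(rating) * min(3, abs(int(rating)) + 1))

def diminished(rating: Rating) -> Rating:
    return Rating(signum(rating) * max(1, abs(int(rating)) - 1))

NEGATED_RATING = {
    Rating.VERY_BAD: Rating.SOMEWHAT_GOOD,
    Rating.BAD: Rating.GOOD,
    Rating.SOMEWHAT_BAD: Rating.SOMEWHAT_GOOD,
    Rating.SOMEWHAT_GOOD: Rating.SOMEWHAT_BAD,
    Rating.GOOD: Rating.BAD,
    Rating.VERY_GOOD: Rating.SOMEWHAT_BAD,
}

def combined_rating(words):
    """Find the rightmost rated word, then walk left and apply every
    intensifier, diminisher, and negation to its rating."""
    for index in range(len(words) - 1, -1, -1):
        rating = RATING_LEXICON.get(words[index])
        if rating is None:
            continue
        for word in reversed(words[:index]):
            if word in INTENSIFIERS:
                rating = intensified(rating)
            elif word in DIMINISHERS:
                rating = diminished(rating)
            elif word in NEGATIONS:
                rating = NEGATED_RATING[rating]
        return rating
    return None

# tasty -> good, "very" -> very good, "not" -> somewhat bad
print(combined_rating(["the", "schnitzel", "is", "not", "very", "tasty"]).name)
```

This reproduces the slide's example: "the schnitzel is not very tasty" collapses to a somewhat-bad rating.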
29:26
Where were we? The function, yeah, combine ratings. Okay, so we can reduce it to that. So from that, we can quite easily build a function
29:42
that extracts topics and ratings from a sentence. So here's an example: print topic and rating of a certain sentence, the schnitzel is not very tasty. And it scrolled out, okay. But here's a better one. When we want to do it with the whole text,
30:03
feedback text, so we have opinions and the long text. The schnitzel was not very tasty, the waiter was polite, and the football game ended two to one. You come up with three sentiments. The topic food and the rating somewhat bad because of the schnitzel. Then there was the service.
30:21
Okay, the waiter was polite, so the service is good. And then there's the football result, and football results are not interesting to us, so both the topic and the rating of the football result are None. So that's essentially it with the code.
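A minimal sketch of such a topic-and-rating extractor, with hypothetical one-word lexicons; anything not covered, like the football result, comes out as None:

```python
# Hypothetical lexicons; sentences about unlisted topics yield (None, None).
TOPICS = {"schnitzel": "food", "waiter": "service"}
RATING_WORDS = {"tasty": "good", "polite": "good"}

def topic_and_rating(sentence):
    """Return (topic, rating) for one sentence, with crude negation handling."""
    topic = rating = None
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    for i, word in enumerate(words):
        if word in TOPICS:
            topic = TOPICS[word]
        if word in RATING_WORDS:
            rating = RATING_WORDS[word]
            if "not" in words[:i]:  # e.g. "not very tasty"
                rating = "somewhat bad"
    return topic, rating
```

For example, "The waiter was polite." maps to ("service", "good"), while "The football game ended two to one." maps to (None, None).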
30:40
So it's a Jupyter notebook. I'm going to publish the link and you can play around with it, so all the code is available for you. And of course, when someone talks about code, you can't really understand everything, I guess, because it goes way too fast. So the main information I wanted to convey is that all of this is made of small pieces of code.
31:04
Most of the functions are just a few lines, and they are basically standard Python functions, so there's nothing really special about them. And you can dig into this, I think, quite easily to come up with something that is at least powerful enough to handle diminishers and negations.
31:24
And in my experience, that is already enough to analyze feedback for restaurants and get about 80 percent, or a little bit more, of it right. And that's enough to find out where my business is hurting and what my USPs are.
31:43
But is there something else? Yes, of course, there's plenty. There are modals like "could" and "should"; they typically also indicate a negative rating. There are idioms that indicate a rating, for example "this leaves a lot to be desired", plenty of words that basically say bad,
32:01
or actually very bad. There are back-references: you have one sentence where I talk about the waiter, and then the next sentence just uses "he", so you have to connect that again. One simple solution: if there's no topic, use the topic from the previous sentence. That works surprisingly well with simple feedbacks. You can add a topic hierarchy.
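The previous-sentence fallback for back-references can be sketched like this; the `topic_and_rating` parameter stands for whatever per-sentence analyzer you already have, so its name and signature are assumptions:

```python
def analyze_feedback(sentences, topic_and_rating):
    """Analyze sentences in order; when a sentence has a rating but no
    topic (e.g. it starts with "he"), reuse the last topic we saw."""
    results, last_topic = [], None
    for sentence in sentences:
        topic, rating = topic_and_rating(sentence)
        if topic is None and rating is not None:
            topic = last_topic  # back-reference: inherit the previous topic
        if topic is not None:
            last_topic = topic
        results.append((topic, rating))
    return results
```

So "The waiter was slow. He was rude too." would yield the topic "service" for both sentences, even though the second one never names the waiter.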
32:21
Like we said before, it's not just about food: the schnitzel is small, the schnitzel has a price, the schnitzel has a taste, so there are different aspects. You can replace my carefully hand-made combine ratings function with abstract rules and a grammar parser. And this is, when you really want
32:40
to get down and dirty, what you will have to do, because then you're much more flexible. And there are lots of linguistic challenges. Still, you should now have the basic building blocks to start your own sentiment analysis and extend what's provided here. Yeah, so as I said, special topics.
33:03
I'm going to skim over this and just give a few comments, because I think you'll find it helpful. So emojis, of course, can also carry a rating: a smiling emoji is good and an angry one is bad. There's an extension for spaCy that
33:22
can handle the different kinds of emojis. There are Western and Eastern emoticons, and there are Unicode emojis, or actually smiley codes. And there's a list of Unicode emojis you might find helpful, because some of them already come with a suggested rating from the Unicode standard. There are also slang terms, if you compare Scottish English and Oxford English,
33:42
or Austrian German and "real" German. These are basically synonyms, so you just look for those special words, and only for the ones that are relevant to your sentiment. So "nix" is like "nothing", a negator, things like that. Unknown abbreviations you can add to spaCy
34:02
or treat them like synonyms. Typos: again, only those that are relevant to you, to the rating. There's fuzzywuzzy, for example, to do fuzzy string matching; there was a talk about it yesterday. We didn't use it. Here are a couple of references.
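The emoji, slang, and typo handling above can share one pre-processing step. This sketch uses the standard library's difflib where the talk mentions fuzzywuzzy, and all three maps (EMOJI_RATINGS, SYNONYMS, LEXICON) are hypothetical mini examples:

```python
import difflib

EMOJI_RATINGS = {"🙂": "good", "😠": "bad", ":-)": "good", ":-(": "bad"}
SYNONYMS = {"nix": "nothing", "+1": "good"}  # slang and shortcut mappings
LEXICON = ["schnitzel", "tasty", "waiter", "polite"]

def preprocess(word):
    """Map emojis, slang, and close typos onto known lexicon words."""
    if word in EMOJI_RATINGS:
        return EMOJI_RATINGS[word]
    word = SYNONYMS.get(word, word)
    # Only fix typos that matter for the sentiment, via fuzzy matching
    # against the lexicon (fuzzywuzzy offers the same idea).
    match = difflib.get_close_matches(word, LEXICON, n=1, cutoff=0.8)
    return match[0] if match else word
```

For example, the typo "schnitzl" is corrected to "schnitzel", and a "+1" comment is translated to the rating "good", exactly the classic pre-processing step mentioned in the Q&A.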
34:20
If you really want to get into sentiment analysis, I recommend the book by Bing Liu; it goes much further than what I showed you. And there are, again, the spaCy extensions, with a very good blog entry about them. So, summary: sentiment analysis is challenging, but it can be done to some extent. Python and spaCy help a lot with the development part.
34:43
And the code complexity, as you could see, is quite manageable if you apply simple basics of software engineering. Thank you.
35:03
That was a great talk, and we have time for questions. Hello, I have a question about, for example, negation: if you have multiple negations in one sentence and only one of them is supposed to be connected with your token,
35:26
how do you handle it? The way it is now, it works nicely for simple sentences; it doesn't work for complex sentences. I looked at about 1,000 feedbacks, and most of our feedbacks were simple sentences, because they had to be input on a mobile interface,
35:42
and people use very nice, short, condensed language for that. If you want to analyze social networks, you have to do more than that, and I recommend the book by Bing Liu; he talks about these topics. Thanks a lot for the talk.
36:01
I was wondering, in your use case it's probably all right, but how do you handle performance? Because you have a lot of for loops, and I guess this gets really slow with a lot of text. It depends on the size of the dictionary. We were happy with about 500 entries in the dictionary, because there are only a few things that people talk about
36:21
when they give feedback to an innkeeper. We also receive each feedback as it arrives and can analyze it right then, and you don't get that many feedbacks, especially if you're an Austrian startup that focuses on regional innkeepers. If you want to improve the performance, as I said before, yes, you have to do something about it.
36:41
But it could analyze the thousand feedbacks I had in, I don't recall exactly, less than a minute at least. Okay, that was fast enough for me. Thank you. Did you think at all about automating the lexicon building, using some sort of machine learning
37:01
or stuff like that? What would you gain with that? Well, you would gain a generalized approach. You could not only analyze some reviews for inns, but also reviews for hotels or reviews for other places.
37:22
You're talking about the rating or the topic now? Yeah, well, you could automate building the topic index, for example. Yeah, you can do that, but in our case, I think, it would have been more trouble, because then you always have to look at the decisions the machine made: when the machine looks at a sentence
37:43
that already contains a couple of negative words, and then it finds words that also come up very often in negative sentences. That's the thing you mean, I guess, yeah. Then you try to learn, ah, these are all the negative words, yeah. And so if it works, it's nice, so you can generate the lexicon, and if it doesn't work, then you have to fool around
38:01
with the machine learning algorithm. And in this case, it was much easier to just maintain a manual lexicon. But of course, it's a valid approach in different scenarios.
38:21
I have lived in Graz, and I have been looking a bit at restaurant reviews, and you know the Austrian humor, so I was wondering, since I have seen many such reviews: how would you handle irony? Because I remember one very specifically: "The food was so good, we ended up in hospital." That sentence. The interesting thing about it is that
38:41
when there is no public forum for the people to present themselves, they stay very factual. And we had very few feedbacks that actually used irony or sarcasm. So it was less than a percent. So we just ignore them.
39:02
But it's a big topic when you're on social media, of course, yeah. I know your program works for the German language, but what if you had comments in Korean, for example,
39:21
or in Arabic text? To what extent would it be possible to adapt spaCy or train a model for it? I unfortunately have no idea; I don't know how these languages work. I know there are different reading directions, and you can combine them. There was a little bit about it in the talk yesterday,
39:42
but the takeaway was also that you have to handle them differently. So that's probably a limitation, yes. This works with Latin-based languages, I guess, so it works with English and German. It probably also works with French, Italian, and Spanish, because as far as I know they have a very similar structure. But if you have radically different languages
40:00
with radically different grammatical systems, then you have to do something else. That's correct. One small question: what if people put, like, "plus one" in their comments? It's common in streams, but in your case, I think, it's not a stream; people just write, press the button,
40:21
and send their comment. A comment that is just "plus one", and not the emoji. Yeah, exactly. Thank you. You can treat it as a synonym, so that would be a classic pre-processing step: you look for "plus one" and translate it to the rating "good", for example. We don't have time for more questions.
40:41
I believe that Thomas is happy to answer questions, but I need to leave people to go to lunch. Please, a round of applause for Thomas again.