I Hate You, NLP... ;)
Formal Metadata
Title: I Hate You, NLP... ;)
Author: Katharine Jarmul
Part Number: 99
Number of Parts: 169
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/21170 (DOI)
Language: English
EuroPython 2016 (talk 99 of 169)
Transcript: English (auto-generated)
00:00
And I'd like you to all join me in welcoming Katharine, because Katharine hates your computer. Give her a hand. I'd like to start by thanking everybody for coming here, thanking EuroPython for having me, and thanking the EuroPython social team
00:20
for hosting a really great event. I've had a wonderful time, and I keep meeting people way smarter than me, so that's always a good sign. My name is Katharine Jarmul. I'm known across the internet as kjam. I run a company called Kjamistan, so you can find me blogging and writing and talking about things at kjamistan.com.
00:43
And I like working with text and data, and if you're ever in Berlin, come by the PyData meetup group. We're a fun group of people, and we'd love to meet you. So I feel like, because I'm talking on machine learning, and I don't have a PhD next to my name, I need to have this disclaimer slide,
01:01
which is, I am not an expert in machine learning. I can't sit here and tell you about which algorithms work better and exactly why. I'm a Python developer, and I'm interested in applying machine learning to text analysis and sentiment analysis. So by all means, if you are a machine learning expert,
01:20
and I say something wrong, feel free to either correct me in the questions, or I will buy you a beer later, and you can teach me something. So here's some assumptions I've made about you. I'm not gonna be going over NLP basics. I'm not gonna be going over machine learning basics. I assume that we already have those covered. If you wanna talk about basics later,
01:41
then again, of course, happy to talk about it later, but I'm assuming that you already have done maybe even a little bit of sentiment analysis, but you already have played around with some natural language processing, and you have some basic understandings of machine learning and what methods to use. So here's a little bit of what we'll cover.
02:00
In the initial description, I said that there was going to be live coding and demos and this and that. Upon further inspection, that becomes a little bit more difficult when we go into deep sentiment analysis and cross-language sentiment analysis. So what I'm going to do is I'm gonna present you some code and some repositories that are good to use, but we don't have time today for live coding.
02:21
Instead, we're gonna cover some of the tools. We're gonna talk a little bit about Watson and how it's being used. We're gonna talk a little bit about sentiment analysis, and we're gonna cover a lot of the research that's being done. Right now, I'm seeing a massive gap in terms of the research that's being done around sentiment analysis and the tools we have available in code, and I'm kind of hoping to push forward the conversation
02:43
in the Python community by giving this talk and kind of covering a little bit of what's happening. So what we won't cover is some magical Python library that you can download today and that works in multiple languages for sentiment analysis. I must say that if you want to take a look,
03:01
please send hate mail to Salesforce. So in case you wanted to send hate mail somewhere, or you want to practice natural language generation and send hate mail somewhere, send it to Salesforce. MetaMind was a really amazing startup using matrix-vector
03:22
recursive neural tensor networks, and they had an open API, and it was fabulous. And then Salesforce bought them, and they shut down the API, and they're currently shutting down the paid API as we speak. I ran into a guy who works for Salesforce who actually gets to work with this team, and I asked him, I begged him,
03:40
please let something be publicly available, and he said, good luck, have fun with that. So again, all the hate mail. And if you work for Salesforce and you can somehow get me inside the internal network, that would be great. So another one that's trying to do some things, and they're one of the only ones that's available via APIs, is MonkeyLearn.
04:03
And so here you can see that they have these different modules available. They have some that are being actively worked on. You can also create a publicly available module here. So if you're thinking, hey, kjam, I absolutely need something that works tomorrow, I would start at some of these places. But we can see just even looking through the API
04:21
that we have very different precision levels. And so I would say, that's why I'm giving this talk, is we're gonna talk about some of the methods and the theory behind how to create a sentiment analysis that works for you. So to begin with, we're gonna talk about what is sentiment analysis,
04:40
and how do we go about it in Python. So what is sentiment analysis really? And when I look at this tweet, and being American, some of my tweets in my Twitter feed are very American references, so forgive me for being that way. But when we look at this tweet, we have a lot of different sentiment, right?
05:00
If we look at the emoji, we see crying. If we look at the photo that we have here, then we see a different emotion of crying. If we look at the text, then we might just see the word like, and determine that this is positive. And so when we're talking about sentiment analysis, we have all of these different things.
05:21
We have stance, how does the author feel towards Jennifer Hudson? How does the author feel towards the BET awards? These are really complex things, and we think we can boil them down into positive, negative, neutral, but I am here to debunk that myth today. So sentiment analysis is used in all different sorts of systems.
05:40
And I've had the pleasure of going to a sentiment analysis symposium recently, and hearing just how many different places it's used. There are even people using it in anomaly detection, and they're charting sentiment across reviews or Twitter, and they're looking at it via release tags. And so there's all different ways that people are using sentiment analysis,
06:02
including in obviously just simple user satisfaction, or brand engagement. So when we look at the sentiment analysis steps, they can basically be broken down into four major steps. And they basically go from left to right, and top to bottom. However, depending on what you choose,
06:21
they could be mixed up. So the first thing is dealing with your corpora. Are you building a lexicon? How are you using your labeling? Next, then you're probably going to choose your algorithm, or you're gonna determine what model or approach to use. Then you're doing parsing and pre-processing. You might even do normalization and standardization
06:40
across your data set. And finally, you're testing, evaluating, and improving. And in that improvement step, you may revisit any of these earlier steps. It's quite possible that you will need to in order to improve your model. So we're going to go through the talk, and we're gonna basically cover all of the steps, and we're gonna talk about what's happening in research, and what's happening in code.
07:02
The first step is choosing your lexicon. So one of the oldest and most tried-and-true ways of building a lexicon, particularly if you're using social media data, is auto-tagging. And it's this idea of distant supervision. So I'm going to go, and I'm going to gather tweets, or I'm going to go and gather Facebook statuses,
07:21
or whatever I'm going to use, and I'm gonna map with emoticons or emoji. And you can think about this also as a good opportunity to use a thesaurus. So in quite a lot of the research, they will also just expand the corpora that they're using by using a thesaurus. So I wanna grab all hashtag happy,
07:40
hashtag joyful, et cetera. And this can also help expand into new languages. If you're working in one language and then you want to apply a different model, as long as you have a very good thesaurus, then you can do that. Then one of the other steps that you're going to need, now that you have all of this data, is: how do you tag it?
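As a rough sketch of this auto-tagging idea in Python (the emoji, emoticons, and hashtags below are just illustrative markers, and the tweet collection itself is assumed to happen elsewhere):

    # Distant supervision: auto-label texts by the emoji / hashtags they contain.
    POSITIVE_MARKERS = {":)", ":-)", "😀", "😂", "#happy", "#joyful"}
    NEGATIVE_MARKERS = {":(", ":-(", "😢", "#sad", "#angry"}

    def auto_label(tweet):
        """Return 'pos', 'neg', or None (mixed or unmarked)."""
        text = tweet.lower()
        has_pos = any(marker in text for marker in POSITIVE_MARKERS)
        has_neg = any(marker in text for marker in NEGATIVE_MARKERS)
        if has_pos and not has_neg:
            return "pos"
        if has_neg and not has_pos:
            return "neg"
        return None  # skip tweets with mixed or no markers

    tweets = ["I love this conference 😂 #happy", "ugh, my build broke again :("]
    labeled = [(t, auto_label(t)) for t in tweets if auto_label(t) is not None]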
08:01
We have this very naive idea that everything can be tagged positive/negative, or positive/negative/neutral, or even positive/negative/indeterminate/neutral. That's really being proven wrong in research. And so what we have to think about here is if you really need a simple model that's simply positive, negative, and neutral,
08:21
then you're gonna have a lot of gray area, and you're gonna have probably more ending up in neutral or more ending up in false positives or false negatives. So when you start moving away from that, and you start thinking about positive and negative as a scale, you start to see that things converge a lot faster, and that you have less of these false positives and negatives.
08:42
This also means that you have to build your own lexicon, and you have to label it yourself. So there's pluses and minuses, and depending on the resources of your team, you may not be able to do that. There's also quite a lot of people looking at stance detection. So this is a mixture of NLP topic identification or entity detection, and then using stance.
09:03
And a lot of that right now is essentially a bag of words model still, and people are trying to evolve that to be more advanced and to try and look at things and say, okay, can I determine that this set of words is actually applied to this organization or this entity, and then I can start to detect the stance.
09:22
So it might be positive towards, I really like this restaurant. It has great food, but the service was total crap. Then that means that I have one idea of the restaurant, one idea of the food, and one idea of the service. And then finally, we have this categorization by emotion. Because we have this wide range of emotions,
09:41
it's really difficult to determine just positive or negative. So in terms of tagging a lexicon, and it looks like one of my emojis is broken, we have a few different methods. So I'll actually start from the bottom. The simple Boolean is zero, one, right? Or maybe negative one, zero, one.
10:02
Then we have this sliding scale, which is going from completely negative to completely positive, and I'm asking people then to just rate on this scale. What they've actually found with research on that is that people tire over time. So if you're sitting there asking me to rate something for five hours, I'm eventually gonna just be like neutral, neutral, neutral, and try to get done with it, right?
10:22
And we know that just from our knowledge of humans. They also have found that psychologically, people avoid the edges. So you'll get a lot more responses like, it's kind of negative, it's kind of positive, because people don't wanna say it's completely negative or completely positive. So one of the best methods that's recently come
10:42
into practice is this best-worst scaling. And you have a list of, say, four or maybe five words or n-grams, and you can use that and you can ask people, say, okay, choose the most positive and the most negative. And that actually is creating some really interesting lexicons, and what they found is that people agree
11:02
after about three examples. So I can show it to three people and move on to my next sample, and I'm in the 90% agreement range, which is pretty massive for tagging lexicons. And this is another simple study, and this is somebody, Saif M. Mohammad,
11:21
who works at the National Research Council Canada, does quite a lot of different talks on SemEval competitions, which is like a sentiment evaluation machine learning competition that happens every year. And he has some really great state-of-the-art models. What he found is that when he switched to using a scale of negative one to one,
11:42
that 90% of people agree within .4 of each other. So that was a really interesting revelation to move away from this zero one, because we hear constantly, oh, well, humans don't even agree 77% of the time, or whatever statistic you look at. And what he was able to find is
12:01
when you give people a scale, and you let people speak on a scale, you can actually find this least perceptible difference. And particularly then, if you even focus on native language speakers, you get an even smaller, a smaller score difference there. So I had the opportunity to chat,
12:20
and I'll hopefully be posting our chat, with William Hamilton, who's currently working at Stanford on his PhD. And he's developed a new model called SocialSent. And what he's done is he's taken subreddits, and he's determined that subreddits and different communities use language differently. So when I'm hanging out in r/sports,
12:41
and I say, oh man, he's really soft, I don't mean that as a compliment, right? I mean that in a really negative way. So again, we have this mentality a lot of times in this code, that we all use the same words to describe positive and negative. And that's just not the case. Especially if you're doing sentiment analysis on a community, and especially a community,
13:02
like for example, us as Python developers, we're gonna have a lot of different ways that we use words. He has all of his data available, and I took a look at r/programming. And I pulled out some of the most positive and the most negative. I also found out that Python has a slightly negative score there.
13:21
So I know you're all talking trash about Python on r/programming, and you need to stop. So here, I found some interesting things. It's funny to me that 200 is right in the middle, like we can tell that we have web developers in r/programming. Spaghetti is really negative, but I love spaghetti. But spaghetti code, maybe not so much.
13:41
And we find that Minecraft has a great positive association, and so it's a really interesting data set. It's only unigrams right now, but he's very open to suggestions. So feel free to send him any suggestions or ask him to do a certain n-gram for you.
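If you want to poke at a community lexicon like this yourself, here is a minimal sketch; it assumes a tab-separated file of word, mean sentiment, and standard deviation, which is roughly how the released subreddit lexicons are laid out, and the filename is hypothetical:

    import csv

    def load_lexicon(path):
        """Load a word -> mean-sentiment mapping from a TSV of (word, mean, std)."""
        lexicon = {}
        with open(path, encoding="utf-8") as f:
            for word, mean, _std in csv.reader(f, delimiter="\t"):
                lexicon[word] = float(mean)
        return lexicon

    # Hypothetical filename for the r/programming lexicon.
    lex = load_lexicon("programming.tsv")
    most_negative = sorted(lex, key=lex.get)[:10]
    most_positive = sorted(lex, key=lex.get, reverse=True)[:10]
    print(most_negative, most_positive)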
14:03
So now that we have our lexicon, or we've chosen a lexicon, we need to choose an approach or an algorithm. So how do we do that? What machine learning systems can we use? And it turns out we can use a whole bunch. And I know this is throwing a lot on the screen, but there are all different ways that people are using machine learning with sentiment analysis. And they're achieving really great results
14:20
with really small data sets. So I'll come to this again later, but if you are dealing with movie reviews, you're good to go. But if you're not using the IMDB model or lexicon, you start to enter this space where there's quite a lot of active research and not a lot of agreement on what's good to use, why or why not.
14:41
And people are achieving very different results with very different systems. Obviously the finely tuned state of the art systems are still the best performing ones right now. And this is these people that spend all of their time researching and pulling out little details. But we don't have time for that, right? We're probably Python developers first and sentiment analysis researchers next.
15:02
So I'll talk a little bit about how these are comparing. First I'm gonna talk about kind of the older approaches and the newer approaches. The old approach used to be bag of words or even continuous bag of words. And I would say, oh yeah, it has mainly positive words, they must be positive. The new idea is to start to use word embeddings
15:21
and start to be able to use those word embeddings in deep learning models. And by doing so, I have a little bit more complexity. And I can say, hey, there are these words bunched together. Maybe they're bunched together because they have a sentiment together. That's not always the case. The old way was term frequency-inverse document frequency (TF-IDF), which is basically a nice way
15:42
of doing bag of words on a document. And the new way is using these doc2vec or these document embedding models. The old way is kind of unigrams. If I say good job, I mean good and I'm talking about job and they're all related. And the new way is talking about skip-grams or even dependency modifications,
16:01
which is starting to look at the parts of speech and label the words with parts of speech, because I may mean very different things if I use a word as a noun versus a verb. And the old way is the supervised state-of-the-art systems, and the new way is kind of moving to a semi-supervised approach.
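For reference, the "old way" baseline is only a few lines in scikit-learn; a minimal sketch with made-up training data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny made-up corpus just to show the shape of the TF-IDF + linear model approach.
    texts = ["great talk, loved it", "what a waste of time",
             "pretty good overall", "terrible and boring"]
    labels = ["pos", "neg", "pos", "neg"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(texts, labels)
    print(model.predict(["loved the whole thing"]))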
16:22
So if you're curious about some of the state-of-the-art sentiment features, and I'll post these slides because they reference the papers across the bottom, this is a good summarization. And this is a summarization of, again, Saif Mohammad's model, which has done really well in competitions. And these are all of the different things that he's pulling out as features and then he's fine-tuning them and analyzing them and pulling out
16:42
probably more sub-features that we don't know about that haven't been released. But you can see that there's a starting place if you're interested in tuning your own model. But let's talk about what we can actually use as Python developers and what we can use easily. We can use Word2vec and Doc2vec. How many people in here have used Gensim?
17:03
Okay, so Gensim is this great library and what it gives you is these word embeddings. Sorry, how many people know about Word2vec and Doc2vec? Okay, super. So it's this idea of representing a word or a document as a vector. So everything is in a vector space and it can be multi-dimensional.
17:22
It is multi-dimensional. And what you're going to do is you're going to use Gensim, and Gensim can load in these multi-dimensional vectors and give you a matrix or vector representation of each of your words and your documents. And what this is gonna give you is a mathematical way to represent text, right?
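A minimal sketch of that with Gensim, assuming you have downloaded the pretrained Google News vectors (the exact load call has moved between Gensim versions):

    from gensim.models import KeyedVectors

    # Assumes the pretrained Google News vectors are on disk; in older Gensim
    # versions the same loader lives on gensim.models.Word2Vec instead.
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)

    print(vectors["python"].shape)              # a 300-dimensional numpy vector
    print(vectors.similarity("good", "great"))  # cosine similarity between words
    print(vectors.most_similar("excellent", topn=5))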
17:40
And this is kind of where text was held back for quite some time: we didn't have a really good mathematical way to represent text. We were using counters and bag of words, and that's not that interesting. So there's also glove-python. So there are two different ways to do word embeddings. Everybody talks really a lot about Word2vec and I feel like there's not enough attention
18:01
given to GloVe. GloVe is a Stanford model and it has a slightly different representation. It uses a weighted model. So Word2vec is basically, does this word appear in a sentence? Then it means it's somewhat related to these other words in the sentence. Whereas GloVe is gonna say, hey, these words are closer, therefore they might have more of a proximity
18:21
in terms of meaning. So depending on what word embedding model you use or method you use, you can get really different results. So here we're comparing word embeddings built on bag of words 5, which is like a five-word window count around each word.
18:41
Then we have bag of words two and then we have a dependency, which again is this part of speech. And what we see is if you take a look at the Florida line, what we see is in Florida with bag of words, we get this mixture of other ways to talk about Florida, cities in Florida. But when we move to dependencies, we get other states.
19:02
And this is maybe significant to whatever you're trying to do. If you want other, similar entities, rather than just related words, you might wanna look at a dependency word embedding rather than a bag of words word embedding. There's an online comparison tool
19:20
where you can look at this. And so I took a look at Python and it looks like there's some of them mixed together. If you use bag of words five, two or the dependency, you get slightly different results. And the really neat thing about this is in this as well, they give you a little bit of a visualization of how those word embeddings are used. And so if you wanna just play around
19:40
and start to make decisions on what you're using for your word embeddings, then I recommend going to this site. So another thing that I feel like needs to be raised: there was a paper recently released, referenced at the bottom, showing that word embeddings are not neutral. They're based on human language and they have the same biases
20:02
that you would expect to see in humans. And to come up with some of these vectors, I was using the Google News vectors, the 300-dimensional ones. And those are used by places all over the place. And I found some pretty atrocious stuff in there.
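A sketch of how you can probe this yourself with Gensim's nearest-neighbour and analogy queries (the specific query terms here are just illustrative):

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)

    # Look at what the pretrained space considers "similar" to identity terms.
    for term in ["woman", "man", "immigrant"]:
        print(term, vectors.most_similar(term, topn=10))

    # Analogy-style query of the kind the bias paper discusses
    # (man : programmer :: woman : ?).
    print(vectors.most_similar(positive=["programmer", "woman"],
                               negative=["man"], topn=5))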
20:22
Maybe not in the first 10 examples, but for example here at the bottom, there was some intense racism going on if I looked at the top 30 words that were within that space. So I forewarn you too, especially if you're using it for something like sentiment, right, where we may have some words mixed in there that are very biased, or if you're using it for natural language generation,
20:43
then you need to be very aware that these biases exist. One interesting thing, however, is that the authors of this paper were able to find vectors capturing the actual misogyny. So they specifically focused on the misogyny in it, and they were actually able to reverse it,
21:02
and reverse the vectors by finding these vectors that were pushing things in a direction where like women had lesser professions. So let's talk about what that actually means. Here I have some great examples used. It is too much code to go over in a short talk,
21:24
so I would recommend looking at them. These are great examples for simple machine learning. So this is using naive Bayes, or simple linear approaches like linear SVM. And what it's doing is it's gonna take the vector, and it's gonna map it to whatever it is, whatever you're using as labels.
21:42
And if you want to do it on a document approach, you're essentially taking a vector of the document model, or the sentence model, it depends, and they have different results. And then you're passing it as a labeled sentence. LabeledSentence is actually a class in Gensim, so if you're using Gensim, you can label your sentences, and then pass them directly in,
22:01
let's say a logistic regression, or whatever it is you'd like to use. This is like a simple approach, right? These are the shallow machine learning methods.
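A rough sketch of that shallow approach with Gensim's Doc2Vec and scikit-learn; newer Gensim versions use TaggedDocument rather than LabeledSentence, and attribute names like dv vary between versions, so treat this as a sketch rather than version-exact code:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression

    # Tiny made-up corpus; in practice this would be your auto-tagged tweets.
    corpus = [("love this so much", "pos"), ("absolutely awful", "neg"),
              ("really enjoyed it", "pos"), ("never again", "neg")]

    docs = [TaggedDocument(words=text.split(), tags=[i])
            for i, (text, _label) in enumerate(corpus)]
    doc_model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

    # doc_model.dv in Gensim 4.x; older versions call this doc_model.docvecs.
    X = [doc_model.dv[i] for i in range(len(corpus))]
    y = [label for _text, label in corpus]

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([doc_model.infer_vector("loved it".split())]))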
22:21
Another thing to be aware of when you're making these decisions is choosing your n-grams, right? So you have to choose unigram or bigram, or however you want to represent these. And when you're choosing word embeddings, you're generally making those decisions as you're choosing the word embedding. And what they found, and this is the research by Stanford which enabled them to come up with the matrix-vector and recursive neural tensor networks, is that the longer your n-grams get, the more mixed sentiment you have,
22:40
which is kind of logical. When you think about it, and I'm gonna go on and on and on and have a big long sentence, I'm probably going to express maybe different viewpoints, or a mixed emotion. And so, if you're choosing unigrams or bigrams, you can see you're way over here. And there's some words that are obviously negative and positive,
23:01
but there's probably a lot of words that are just filler words. And the further you go towards longer n-grams, the more complex sentiment you get, and the more people are able to say, okay, yeah, this phrase is definitely negative, and the whole phrase is negative. So this is based on the Stanford Sentiment Treebank,
23:20
which is available in Java. Another approach that people are using for this that's going pretty well is LSTMs, which are long short-term memory networks. And these are deep learning networks. And what they're using is they're generally using these word embeddings or these document embeddings, and they're pushing them into these deep learning models,
23:42
and the nice thing about LSTM, and why it's been doing some really amazing stuff for also other natural language processing, is that it can forget things. So it can learn things, and it can forget things. So because it has that ability, what it can do is it can change as it sees
24:00
more or less of a representative model. So language, we know, changes, and trends come and go. And because of that, this approach has been really powerful. And you can use it with both Theano and TensorFlow. I have some examples.
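A skeletal sketch of the LSTM-over-embeddings idea with Keras (which runs on top of Theano or TensorFlow); the data here is fake and only shows the shapes, and import paths and argument names have shifted a little across Keras versions:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    VOCAB_SIZE, MAX_LEN = 10000, 40   # hypothetical vocabulary and document length

    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN))  # learned word embeddings
    model.add(LSTM(64))                                          # the "can learn and forget" part
    model.add(Dense(1, activation="sigmoid"))                    # positive vs. negative
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

    # Fake integer-encoded documents and 0/1 labels, just to show the shapes.
    X = np.random.randint(1, VOCAB_SIZE, size=(100, MAX_LEN))
    y = np.random.randint(0, 2, size=(100,))
    model.fit(X, y, epochs=2, batch_size=32)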
24:21
There's also the ability to use convolutional neural networks. And so initially the idea was that convolutional neural networks are really great at images, because you can chunk and label images and then move accordingly through the pixels, right? You can crop them. But now people are using them for sentences or for documents, and they're able to use the word embeddings for each of those and chunk and move through them. And then through the same principles
24:42
of pooling and softmax, you can then reach a conclusion and say, okay, according to my sigmoid softmax, this is positive or negative or neutral. And again, I will be posting these so you can take a look at the links. It's a little bit different when you're dealing with deep learning labeled text.
25:02
So what a lot of times you'll have is you'll have the document, and then you'll be passing an array. And a lot of times people are just using simple arrays like, okay, this is positive, so it's going to point up, and this negative is going to point down. There's different things happening in this, so I would keep an eye on this space in terms of how it performs.
25:21
Because again, right here we're having this, everything is either positive or negative. I'd be curious to see, and I know that some of the academics are working on more complex representations of the labeling. And because of that, this might become an interesting space sooner rather than later. And then here's where a lot of people are kind of using this dual system.
25:41
Maybe I have a simple classifier, and I have a lexicon, and I have auto-tagging. And I'm feeding them both into a deep learning system. So maybe I have a constant Twitter feed or this or that, and I'm feeding it into a simple classifier, and I'm allowing my deep learning system to learn from that classifier. This is where a lot of people are starting to experience some big jumps
26:01
without having to do a lot of work. So if you're like me and you're lazy, this might be the best approach. There's still obviously some open research, but it's an interesting theory, this idea that I can just use my simple state-of-the-art classifier and feed it into my deep learning system, and my deep learning system will eventually become more intelligent and have a much wider lexicon over time.
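That "dual system" idea is essentially self-training, or pseudo-labeling; a minimal sketch with a hypothetical lexicon standing in for the simple classifier:

    def lexicon_score(text, lexicon):
        """Average the lexicon scores of the words we know; crude but cheap."""
        words = text.lower().split()
        known = [lexicon[w] for w in words if w in lexicon]
        return sum(known) / len(known) if known else 0.0

    def pseudo_label(stream, lexicon, threshold=0.5):
        """Keep only texts the simple scorer is confident about, and use its
        verdict as a training label for the bigger (e.g. deep) model."""
        for text in stream:
            score = lexicon_score(text, lexicon)
            if abs(score) >= threshold:
                yield text, "pos" if score > 0 else "neg"

    # Hypothetical lexicon and incoming feed.
    lexicon = {"great": 1.0, "love": 0.8, "broken": -0.9, "awful": -1.0}
    feed = ["this is great, love it", "the build is broken again, awful"]
    training_batch = list(pseudo_label(feed, lexicon))
    # training_batch can now be appended to the deep model's training data.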
26:21
So depending on your system, you might need to handle regularization, pre-processing, parsing. These are the classic NLP problems, right? Is pre-processing essential? It really depends who you ask. There has been some really interesting research
26:41
on sentiment analysis saying that doing too much pre-processing actually is a problem. And what ends up happening is you get rid of some of the way that we speak in slang to one another, or the way that we use colloquialisms, and by getting rid of those,
27:00
you actually are removing part of how we're expressing ourselves. So what I would recommend, and what generally the literature agrees upon, is do minimal pre-processing, like maybe just some simple lemmatization or some simple tagging, and then try it. And then do more pre-processing, and then try that. And see where you hit a nice accuracy in your model.
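A small sketch of that comparison with NLTK (it assumes the wordnet and stopwords data have been downloaded); note how the heavier version silently throws away the smiley and the word "not":

    from nltk.stem import WordNetLemmatizer   # requires nltk.download("wordnet")
    from nltk.corpus import stopwords         # requires nltk.download("stopwords")

    lemmatizer = WordNetLemmatizer()
    STOPWORDS = set(stopwords.words("english"))

    def preprocess_light(text):
        # Keep slang, emoticons, and stopwords; just lowercase and lemmatize.
        return [lemmatizer.lemmatize(tok) for tok in text.lower().split()]

    def preprocess_heavy(text):
        # Aggressive: drop stopwords and anything non-alphabetic, which also
        # drops ":)" and negation words like "not".
        return [lemmatizer.lemmatize(tok) for tok in text.lower().split()
                if tok.isalpha() and tok not in STOPWORDS]

    print(preprocess_light("this talk was not bad at all :)"))
    print(preprocess_heavy("this talk was not bad at all :)"))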
27:25
And this is kind of where sense2vec and spaCy are really making some cool inroads. spaCy is a startup based in Berlin, and they are doing essentially Parsey McParseface in Python.
27:41
And what they're able to do is, they have the English language and the German language corpus available, trained on Reddit data. And what they're able to do is they're tagging each of the vectors. So with sense2vec you have this tagged vector. So when I talk about Google the organization and Google the verb,
28:00
I mean different things. And I might feel differently about them in terms of sentiment. So they're making some great inroads. When I pass in, and this is using spaCy, when I pass in the vector for the winky face, the old-school winky emoticon, then I get back some of these other things. And I can see that I'm getting back interjections,
28:21
which is probably a lot more of how we use the winky emoji in the way we express sentiment. Like I might say something that I don't mean, like I hate you, computer, and then use the winky face to say, I'm just kidding. So here we can see that, heh, spaCy is picking up on the fact that I may actually be negating what I'm saying.
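A minimal spaCy sketch of the kind of per-token information involved; the model name is version-dependent (older releases load with spacy.load('en')), and the similarity query at the end is only really meaningful with the models that ship word vectors:

    import spacy

    # Assumes an English model is installed, e.g.
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("I hate you, computer ;)")
    for token in doc:
        # Part-of-speech and dependency labels per token: this is the kind of
        # tag sense2vec attaches to its vectors (Google the ORG vs. google the VERB).
        print(token.text, token.pos_, token.dep_)

    # The small model only has context-sensitive tensors; the md/lg models
    # ship real word vectors, which makes similarity queries more useful.
    print(doc[1].similarity(nlp("dislike")[0]))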
28:44
And I'd be really curious to see how spaCy gets incorporated into a lot of these deep learning models. It hasn't been incorporated in most of them yet. Finally, we get to kind of like what IBM Watson is doing. And we're talking about emotions here.
29:01
So there's this famous diagram that you see almost at every sentiment analysis talk, and it's this idea that we have all of these emotions, but our emotions can be broken down into just these eight core emotions, and then everything else is just some combination of those eight core emotions. I am no psychologist, so I can't comment on this,
29:23
but you know, it makes some sense. So if you're looking into sentiment analysis and you want to start moving away from just positive and negative, and you want to move into emotions, let's say you're in charge of like customer service channels, and you're trying to say, I really want to know when the rage and loathing are activated,
29:41
because we need to work on that first. But I don't care so much about the positive emotions, I care more about those anger-type emotions. IBM Watson has the Tone Analyzer that's available on the Bluemix API. And it tries to classify into anger, joy, fear, disgust, and sadness. It also has the social tendencies and the language style.
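Calling it from Python is roughly a single authenticated POST; the endpoint URL, version date, and credentials below are placeholders, so check the current Bluemix/IBM Cloud documentation rather than treating this as the exact API:

    import requests

    # Placeholder endpoint, version date, and credentials for the Tone Analyzer
    # service; look up the real values for your Bluemix instance.
    TONE_URL = "https://gateway.watsonplatform.net/tone-analyzer/api/v3/tone"

    response = requests.post(
        TONE_URL,
        params={"version": "2016-05-19"},          # placeholder version date
        auth=("YOUR_USERNAME", "YOUR_PASSWORD"),   # service credentials
        headers={"Content-Type": "application/json"},
        json={"text": "I hate you, computer ;)"},
    )
    print(response.json())   # document- and sentence-level tone scores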
30:02
So one of the interesting things that I was asking some researchers when I was able to chat with them is, hey, could you maybe say, hey, I'm angry, but I'm saying something positive. That means maybe I'm being sarcastic or ironic. And the response I got is that nobody's really doing that yet, but that would be interesting to hear about.
30:22
So I will hopefully be building something that does something similar to this, and taking a look at, can we start to talk about when our words say one thing but our tone, or maybe our tone over time says another? And by building these models, by even necessarily saying like, you as one person have a lexicon,
30:41
and you speak a certain way, can we actually start to understand sentiment at a much deeper level? We'll get into that as we talk about the next portion, which is, what is still unsolved, which is sadly quite a lot. So apologies if you live in the UK and are offended,
31:02
but I found this to be very funny. But yeah, humor. Humor is lost on sentiment analysis models. We don't know how to tell when somebody's being funny, and we a lot of times can't even tell what they're being funny about. So here I have this mixture,
31:22
this social media presence, which is this mixture of images and text. And how can we start to, we know how to understand images now, and we know how to understand text. But there's very few that try to do both. And in sentiment, this is a really key part, because how we're talking online is with this combination.
31:44
So negation is still a really big problem. And here's a unigram and bigram study that says when you negate a unigram or a bigram, the red line is the assumption that you simply create a negative value for that unigram or bigram.
32:00
And what we see here is that the blue dots are the actual values. So when I say, yeah, it was not that bad, then I don't actually mean that it was good, right? And so we can start to look at these and talk about, hey, it's not so simple: just because somebody negates an emotion doesn't mean you get its exact opposite.
32:21
And there's been some really interesting studies about modifiers, and how maybe they can point towards a mathematical representation of negation of a sentiment. So back to the Stanford sentiment trees, which are these matrix-vector recursive neural tensor networks. And they actually have gotten pretty advanced
32:42
at determining negation and overall sentiment of a complex sentence. And what they are doing here is they're creating smaller n-grams, and they're creating sentiment vectors for those. And then they have a series of logic rules. And the logic rules state, okay, well this is more negative than this is positive,
33:03
so overall it is negative. And they're doing some really, really interesting research. Again, they have their Java jar that you can download and play around with. It's still mainly trained on these movie reviews, which we'll get to in a second. So we still have this problem that it's trained on movie reviews.
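If you want to call that from Python, one common route is the pycorenlp wrapper talking to a locally running CoreNLP server; a sketch, with the server setup assumed:

    from pycorenlp import StanfordCoreNLP

    # Assumes a CoreNLP server is already running locally, e.g.
    #   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
    nlp = StanfordCoreNLP("http://localhost:9000")

    result = nlp.annotate(
        "It was not that bad, but I would not watch it again.",
        properties={"annotators": "sentiment", "outputFormat": "json"},
    )
    for sentence in result["sentences"]:
        print(sentence["sentiment"], sentence["sentimentValue"])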
33:23
So there's still quite a lot of things that are hard. We have mixed sentiment, we have complex emotions; sarcasm, irony, and humor are lost, right? We also have speaker intention and personality. And there's been some interesting research in that, in terms of trying to determine who you are as a speaker.
33:40
And if I study you over time, can I determine who you are as a speaker, and then can I get a better idea of what you mean? We also have slang and new and older phrases. So part of William Hamilton's study with SocialSent was looking at sentiment over time. And he was pointing out that if you look at historical documents, you can't use any normal sentiment analysis.
34:01
Because in 1920, people were talking a lot differently about things. And we have these new phrases too, where cool AF means something very different than just cool, right? And so, then we also have just general NLP issues. So I feel like I would be left out of EuroPython
34:20
if I didn't have a Pokemon reference in my slides. So I put one in here, which I really hope people in America do anyways, because I'm very fearful watching from Berlin about what may or may not happen in November. And so yeah, let's put some Pokemon lures and unleash them in the polling places. But here's an idea.
34:41
This is obviously exploring a sentiment, but no sentiment analysis would understand this. So when we're talking about cultural references, we also need an ability to say, hey, right now Pokemon means emphasis. We also have speaking with images and GIFs, right?
35:04
So now that Giphy is part of Twitter, this is important, right? I would not understand the sentiment of this tweet if I couldn't also look at the image and then determine sentiment of the image, right? And the really interesting thing is a lot of these are tagged, right?
35:22
So these are in a tagged database somewhere, and maybe somewhere there's like hashtag hate or hashtag mm-hmm, or whatever it is. And I can apply sentiment to these, and then I can pull them out, pull out the visual representation and put it back into words and use those words to make a decision.
35:45
Another really big problem with these is speed, speed and memory. So if you ever try to make your own word embeddings, like I hope that you have a server somewhere set up with a lot of memory and you have a lot of time on there. Because sometimes compiling these word embeddings,
36:02
GloVe is actually a little bit faster, a lot faster, almost half the time. But Word2Vec, if you're making your own word embeddings, it can take 14, 15 hours. And if you're using GloVe, it can end up eating gigabytes and gigabytes of RAM. So if you're making these embeddings yourself, it can become a real problem
36:22
and a performance bottleneck, both for how do I keep feeding my model and also how do I apply it, let's say, in a real-time situation. So what some folks in Munich have been studying is the ability to create dense word vectors and dense document vectors.
36:42
And they created this idea of Densify. And what they were able to see, I don't know if you can see over in the corner, but they were able to get it from English, where it's four executions per second, to the Densify model at 178 with only a slight difference in accuracy.
37:01
So the more solutions we find for sentiment analysis, the more problems we have. And depending on, especially if you're working in different languages, this can get even more difficult. What some people are doing that is interesting is attempting to do aggregation within a lexicon. So if I aggregate everybody speaking English
37:20
in this particular city, what they're finding is maybe there's more agreement within that community on sentiment and words. So that's something to look at. I also haven't seen a lot of studies on ensemble methods, and I think that that's coming into vogue, as well as character-level embeddings. So character-level embeddings are the idea that we can predict and attune ourselves
37:43
on a character level, given a language. So here is a graphic I made. It's very tongue-in-cheek, and sorry if it's a little small, but this is like my own version of Andreas Müller's scikit-learn flowchart, which basically says if you have movie reviews
38:01
or any reviews, you're set, and if you have anything else and you don't have money and time and samples, you're pretty much screwed. So sorry about that. And I'll post this as well. So, have we solved sentiment analysis? Not really, but if you have time or are willing to work on building a lexicon,
38:22
particularly for your model, what they've found is that there can be some really great wins. So I've found a lot of papers and also a lot of companies that are building their own corpora and their own lexicon, and they're achieving in the 90th percentile. So I want to say thanks.
38:40
You can reach out if you have any questions. I'll be posting the slides so you can read all of the papers and look at the code that I referenced for the LSTM, as well as the CNN and the simple model architecture.
39:03
All right, everyone, we have time for two or three questions, so who would like a first one? If you want to leave desperately and go and get coffee, please do so very quietly and respectfully. Okay, here we go. Hi, thanks for the talk. It was great.
39:21
I wonder if you had a chance to look at the Google Sentiment Analysis API they just released this week. I haven't had a chance to look at that. I do have it bookmarked, but I'll let you know when I do. The problem really is that a lot of these really depend on the lexicon, and Google is likely using their word embeddings,
39:41
and their word embeddings have problems because everybody else is also using their word embeddings for sentiment. So I would doubt it's massively better than what's available from Stanford, which is generally in the high 70s percent, low 80s. But I'd be curious to see, and if you have done any, have you used it yourself? Yeah, it was a little,
40:00
there's so much happening, it was hard to keep up with everything, and also prepare the talk, so. Thanks very much for the talk. I'm a linguist by training, so a lot of the examples you mentioned really rely, like being able to accurately tell what the sentiment is really relies on knowing a lot about the context.
40:21
So how much do you think we can actually expect, how accurate do you think we expect to be just based on the text? Because it seems like people are really, really pouring a lot of effort and time into making the text models really great, and this is wonderful, but maybe there's just a cap at how well we can do only based on text.
40:41
Yeah, I agree to a certain degree with that statement. And basically what I'm curious about is more when we start to look at these dependency models, and we start to be able to detect stance and phrases, that if I'm using a series of phrases directed at a particular object, and particularly if I can do that in a multilingual situation,
41:01
which people are starting to prove that you can do, that that's gonna take sentiment analysis really to the next level. Because if I can start to say, this cluster of words applies to this object, and it's obviously negative or it's obviously positive, then I can start to make a little bit of these leaps and bounds and moving away from just this idea that somehow words surround other words.
41:23
So the better our parsers get, and the more we incorporate the parsers into our models, the better sentiment analysis has gotten over time. There's a question way up at the top. Sorry, thank you.
41:40
Amazing talk, Katharine, that's really interesting. It seems to me, it's a bit like the previous question there, this sentiment analysis is really a problem of hard AI, and I wonder whether it's going to be the thing that helps hard AI, or whether hard AI is going to give you
42:01
the ultimate goal in sentiment analysis. Does that make sense? Yeah, it does. And I was really, really impressed with some of the deeper learning models that have been coming out of this, and the fact that they're actually having really, really good accuracy with very little training. So there's actually some great papers, and I'm happy to reference them
42:21
and post them. There are some great researchers who are saying, ah, yeah, well, we just decided to play around with it for a month, and we were able to nearly get to state-of-the-art, fine-tuned systems. So I really think that deep learning is likely the answer here, and as deep learning evolves, I think sentiment analysis will too, because we've seen what it's done in terms of natural language generation,
42:41
and if it's making these inroads in natural language generation, and we're getting these networks that are able to start to understand what we mean, or what we're trying to say, then we can start to maybe predict how we feel. But the problem is the sentiment analysis field is still very, very much a closed model. Most of the places that are doing it are owned by places where they can't
43:01
open source their code, like MetaMind. And so I think that there's this pressure on us as open source developers to try and keep up with these things that are happening behind closed doors. And that's all we have time for, folks, so I think you'll join me in thanking kjam once more. Thank you.