Baby steps in short-text classification with Python
Formal metadata

Number of parts: 160
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use, modify, and reproduce the work or its content for any legal, non-commercial purpose, distribute it and make it publicly available in unchanged or changed form, provided that you credit the author/rights holder in the manner they specify and that you pass on the work or this content, even in modified form, only under the terms of this license.
Identifiers: 10.5446/33671 (DOI)
EuroPython 2017, 151 / 160
Transcript: English (automatically generated)
00:08
So, hi everybody, I'm Alisa, and today I'm going to present my personal horror story of how I actually started with short-text classification. Just some short information: I work for a startup from Hamburg that aggregates jobs
00:24
and presents them to the end user. So we deal a lot with job descriptions and, believe me, these are the shortest and weirdest-formatted texts you can ever find. What I'm going to talk about is how I actually approached the problem, what information I
00:41
found useful and whether something was missing. Then, what model I chose, for what reasons, and how I trained it. Then, a really funny story: how we actually deployed this, because we are a completely Java-based system. And afterwards, the conclusion: did I learn anything, or can I do something better?
01:03
Spoiler: yes, I can. What can I actually do with a text? Text can be classified; you can generate text, for instance with Markov models, if you just want to have some fun; or you can write a chatbot based on some input that you gave previously.
01:21
You can tag words as parts of speech, you can build syntax models, all that jazz. For our purposes, we also didn't know at first what we wanted to do. We can automatically detect synonyms, for instance 'developer', 'software' and 'Entwickler'. In German, it's kind of natural that English terms can also very often be found on
01:45
the internet as the official title of a job. Something like new words, or generating a better description for a job offer that we already know. We decided to go with classification.
02:02
Profession class or industry are incredibly hard to classify, because there is no canonical way to define those categories, and they kind of differ from country to country, so we had to do something else. Our marketing department suggested that we do binary classification: jobs that require
02:24
education and those that do not. For instance, a babysitter does not normally require any education, while developers and architects all do. It seems easy at first thought, but it's not.
02:43
I started to think: okay, I have this text, what can I do with it? The first thought that appeared in my head was keywords. There are lots of problems with keywords, though. First, as I mentioned, the quality of the texts themselves was awful. Sometimes we would just get 'null'.
03:02
That's a great description; I always wanted to work like a null, or for a null, I don't know. And there are really big descriptions that are apparently written by a human, but they contain lots of topics. Here you can see a very generic example.
03:20
Here you can see keywords for healthcare, for secretary, and the blue ones for work with papers, and so on and so forth. These are three different topics. The industry can be the first, the profession class or seniority level can be the second one, and the three others might be profession keywords.
03:41
They overlap. It is just not possible to humanly define all the keywords for all the classes, for all the items we have in our database, and for all the languages we have, plus count synonyms in. My second thought was: hey, machine learning, it's kind of a hype now. But I don't have
04:04
labels. I don't want to label more than a thousand items I'd have to read through. So unsupervised: let's go unsupervised. I tried several models out and I ended up with LDA. If you want to read about it, I have links at the very end of my presentation.
04:23
Basically, you just train this model on several texts, you give it as input the number of topics you want to generate from all those texts, and in the end you get something like a regression with keywords and their weights. And for each topic you get a different regression.
04:43
In this case, you kind of just throw a text against each regression, you get the scores, and you can compare the scores and say: okay, the highest one is our topic. It wasn't good. Models like LDA do not work well with short text.
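The workflow just described can be sketched as follows; this is a toy reconstruction with scikit-learn rather than the Gensim code actually used in the talk, and the corpus and topic count are invented:

```python
# Toy sketch of the LDA workflow described above: train on a few texts,
# choose the number of topics up front, then score a new text against
# every topic and take the highest.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "php developer wanted for our web team",
    "senior java developer backend services",
    "nurse needed in hospital night shifts",
    "healthcare assistant for elderly care",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The model learns per-topic word weights and, per document, a topic score.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

# "Classify" a new text by scoring it against every topic, highest wins.
new = vectorizer.transform(["looking for a python developer"])
best_topic = lda.transform(new).argmax(axis=1)[0]
```

Each row of `doc_topics` is the per-topic score of one document; picking the argmax is the "highest score wins" comparison mentioned above.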
05:02
Plus, again, we have too much noise inside. Let's see again: maybe I do have labeling. Actually, I did. The month before I got this assignment, we worked with the so-called KldB and ISCO.
05:20
These are standards for defining profession classes: the first one is German and the second one is international. You can see five digits. This is originally defined by the German state and comes from their sources. The first three digits define the highest level of the profession class, and then you kind of go
05:43
down into the depths of what the profession actually does. For example, 434 will stand for: four is scientific, three is informatics, four again is software development. The last two digits actually show exactly where your field is.
06:01
For instance: are you a system administrator? Are you a front-end developer? They don't go into the front-end versus back-end differentiation, but about that level. And the last digit, from 0 to 9, shows how complex the task is. So technically 0, 1 and 2 were all for the titles that didn't, from the human standpoint,
06:23
require any education. Again, someone who helps out in the hospital, someone who helps out at a conference: I'm very sorry, but you are 'no education required' guys, and so on and so forth. We had like 600,000 items labeled with these KldB codes.
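The digit scheme described above lends itself to a trivial labeling rule. A minimal sketch, assuming the convention from the talk (last digit 0, 1 or 2 means no education required); the example codes are made up for illustration:

```python
# Derive the binary education label from a 5-digit KldB-style code, as
# described above: the last digit encodes task complexity 0-9, and
# complexities 0, 1 and 2 were treated as "no education required".
def education_required(kldb_code: str) -> bool:
    return int(kldb_code[-1]) > 2

label_a = education_required("43414")  # hypothetical software-development code
label_b = education_required("43411")  # same field, lowest complexity levels
```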
06:43
By that time, that was about 50% of our German database. The problem was: we had these official titles and we had our titles, and we tried to create a regex that would completely match the titles from our base to the official ones. The problem with that was, first of all, synonyms.
07:04
Those were real German titles. You won't ever find something like 'Python master' or 'administration guard' there, never. And actually, there are plenty of job offers with such titles out there.
07:24
Synonyms were the problem. Then the German language structure, where you can combine words to build a new word, and you can combine words in different orders. Regex is out. And also, the quality of the titles themselves in our database was not as good as we
07:43
would like it to be. Sometimes it was just 'helper', and in the description you could see that it's not even a helper, but a secretary with almost all the tasks of a high-level manager or a CEO or something like that. So it was kind of different.
08:03
Anyhow, I have my labels. Yay! I dug into many tutorials, and I really wanted to get Python in. I had the opportunity to stay with Java; I actually took a look into Scala and Java libraries, but it's such a pain to set up
08:22
those, and I like Python and really wanted to bring Python into our infrastructure. So the most interesting things were NLTK, scikit-learn, and Gensim. Gensim I mostly used for unsupervised learning, but you shouldn't use that part anymore; it's deprecated, and scikit-learn actually uses some of the Gensim libraries.
08:44
So I went with NLTK mostly. How can I evaluate the model? Well, there are lots of tools. Depending on the model you can do different things, but there is something called a confusion matrix. In a confusion matrix, the left side, the vertical axis, is the actual label
09:06
of the text, for instance, and the horizontal axis is the predicted one, what our model says it is. So the true positive is when you've done it right. Imagine we have our binary classification, so the axes are A and B, and A and B again.
09:24
Our model said: no, you're actually B, but it was A. This is a false negative, and you can imagine the rest. The accuracy is how many labels it got right. I actually took the accuracy as the main measure, together with the false negatives.
09:42
In my case, I wanted to minimize the bias towards 'education required', because it turned out we had a really small amount of non-educational job offers, so I didn't want to decrease it even further. I ended up checking four different models.
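The bookkeeping behind that choice of measures can be done by hand. A small sketch with invented labels, counting the confusion matrix cells, the accuracy, and the false negatives to be minimized:

```python
# Hand-rolled evaluation in the spirit described above: a confusion matrix
# for the binary education / no-education task, plus accuracy and false
# negatives. The label lists here are invented toy data.
from collections import Counter

actual    = ["edu", "edu", "no_edu", "edu", "no_edu", "edu"]
predicted = ["edu", "no_edu", "no_edu", "edu", "edu", "edu"]

# Count (actual, predicted) pairs; the diagonal cells are the correct ones.
matrix = Counter(zip(actual, predicted))
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / len(actual)
false_negatives = matrix[("edu", "no_edu")]  # was "edu", predicted "no_edu"
```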
10:03
Bernoulli classification, naive Bayes, support vector machines, and a decision tree. These are the models you usually start with. Bernoulli and the decision tree went out right after the first round, because they yielded coin-flip accuracy results. I trained the models on a data set of about 10,000 items.
10:24
It's not that big of a deal, but I just wanted to know their training time and accuracy out of the box before I invested more time; I didn't have much time. So, second round: support vector machine versus naive Bayes. The problem with the support vector machine was that it took way longer to be trained,
10:45
but it yielded really good results. And the second thing was that the bias towards 'education required' was way higher than with naive Bayes. That was enough for me at the time, and I decided: naive Bayes it is.
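A rough scikit-learn sketch of such a shoot-out; the corpus is an invented stand-in, and the real comparison also measured training time and per-class bias on roughly 10,000 items:

```python
# Compare naive Bayes and a linear SVM on the same bag-of-words features,
# as in the second round described above. Tiny toy corpus, smoke-test only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["babysitter wanted no experience", "cleaning helper flexible hours",
         "senior software architect java", "php developer degree required"]
labels = ["no_edu", "no_edu", "edu", "edu"]

X = CountVectorizer().fit_transform(texts)

results = {}
for model in (MultinomialNB(), LinearSVC()):
    model.fit(X, labels)
    # Training-set accuracy only; a real comparison uses a held-out set.
    results[type(model).__name__] = model.score(X, labels)
```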
11:03
So I took a look at this model trained on 10,000 items, and the accuracy was about seventy-something percent. Not good enough. What can I do about it? I can tweak things on two levels.
11:24
I can tweak the data set itself, to try to balance out the amount of labels presented in the training set. And I can tweak each item independently. Just a disclaimer: a balanced set works, for all models, deep learning or not,
11:41
way better than an unbalanced set, but I underestimated the impact, how big the impact of an unbalanced set really is. So I ended up with a 50-50 label data set, and it couldn't be bigger than 50,000 items, because in general only about 5% of items could carry the 'no education required'
12:04
label. So I went back to the second option: tweak each item in the data set separately. Since it is short text, there are not too many things you can do. You can add information: in my case, not all descriptions had the title inside, and sometimes it was crucial.
12:24
For instance, the title was saying 'no education required' for this job, and there was no other sign of it in the text. You can remove information. As I said previously, these texts have very many different topics inside, and
12:41
some of them, like contact information, the start date of the job, or the salary (salary maybe matters, but not that much). So I would take out, for instance, numbers and dates and emails. And one more thing that I could do is stem the words.
13:01
Stemming is the following. In German, for instance, you have 'Koch' and 'Köchin', cook and cook girl, and they look completely different. In a perfect world, both would yield the same result, like 'koch'.
13:21
So what a stemmer does is shrink the word down to its root, maybe a bit more than that, so that I catch 'running' and 'run' in English together as one word without bloating the feature set. Spoiler alert: German stemmers are so bad.
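NLTK ships Snowball stemmers that make both points above visible: English collapses 'running' and 'run' as promised, while the German stemmer is rough enough that even the simple pair from the example may not meet:

```python
# Stemming as described above, using NLTK's Snowball stemmers
# (no corpus download needed for these).
from nltk.stem.snowball import SnowballStemmer

english = SnowballStemmer("english")
stems = {english.stem(w) for w in ["run", "running"]}  # both become "run"

# The German stemmer mostly strips inflection endings; "Koch" and "Köchin"
# do not necessarily end up on the same stem, illustrating the complaint.
german = SnowballStemmer("german")
koch_stems = (german.stem("Koch"), german.stem("Köchin"))
```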
13:43
So let's go for that transformation.
14:00
Very sorry, really nervous. So I have an already prepared job item, a description together with a title. I'm very sorry it's German, but I really want you to see how bad the stemmers are.
14:22
The title is 'software developer in PHP', whereas the description is about this big. Just believe me, it says: we are looking for really cool PHP developers, we are a really cool team.
14:41
Here are your responsibilities, here is what you're going to get from us. What I wanted to do first was to add information, so: concatenate title and description. I'll just skip this step, because you can imagine how it looked. I put the title in front.
15:00
I actually tried different strategies, putting it at the front, at the back, in the middle of the text: no difference. So I wanted to remove some information. We can remove stop words. These are language-specific filler words; they normally do not yield any information, unless you do something like language detection.
15:27
Naive Bayes works on a bag of words, and what goes in is actually, for each text, the tokenized, normalized list of words together with the label.
15:46
You can see here some really crappy things like '25', this thing, this thing, so let's take out all the noise: all the punctuation and, as I said, dates, times, and digits.
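One possible version of this cleanup step, kept dependency-free: the regexes and the tiny stop-word set are stand-ins of my own (NLTK ships full per-language stop-word lists):

```python
# Sketch of the preprocessing described above: strip emails, dates and
# digits, tokenize, and drop stop words.
import re

STOP_WORDS = {"we", "are", "a", "for", "and", "the"}  # tiny stand-in list

def preprocess(text: str) -> list[str]:
    text = re.sub(r"\S+@\S+", " ", text)               # emails
    text = re.sub(r"\d+[./-]\d+[./-]\d+", " ", text)   # crude date pattern
    # Keep letter runs only, which also drops punctuation and digits.
    tokens = re.findall(r"[a-zA-ZäöüÄÖÜß]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("We are hiring! Start: 01.09.2017, mail jobs@example.com, 25h/week")
```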
16:07
Looks better. At least I don't see too many of the noise words anymore. And stemming. For instance, there are several, five I counted when I started working
16:21
on these slides. There are like five different versions of 'software development', software-something. Even fewer words are now going in. You can actually see this 'innovativ' and this 'personal', and actually this software stem
16:48
that allows it to cover any further compound words. So this is what's going into the model. Now, I'm not kidding.
17:01
This is all the code I had to write to create my model. In scikit-learn, it actually looks much the same. And training it: I split the whole sample into a training set and a validation or testing set. I format the data, both training and testing sets.
17:25
I build my model and I get the estimation. This is a custom function that just reports your accuracy, your confusion matrix, and how long it has taken. I decided to go with pickle, because TensorFlow wasn't the buzzword back then.
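In the same spirit, a few lines with NLTK's naive Bayes: bag-of-words feature dicts in, a trained classifier out, pickled for shipping. The toy data, the feature function, and the file name are my assumptions, not the slide's actual code:

```python
# Train NLTK's naive Bayes on bag-of-words feature dicts, classify a new
# text, and pickle the model, as described above. Toy data only.
import pickle
import nltk

def features(text):
    return {word: True for word in text.lower().split()}

train = [(features("no education required helper"), "no_edu"),
         (features("degree required senior developer"), "edu"),
         (features("helper cleaning no experience"), "no_edu"),
         (features("architect university degree"), "edu")]

model = nltk.NaiveBayesClassifier.train(train)
label = model.classify(features("experienced developer with degree"))

with open("model.pickle", "wb") as fh:   # ship the model as a pickle
    pickle.dump(model, fh)
```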
17:43
It is now; we actually do use TensorFlow these days, but yeah. This is the output of the pre-trained model. It is a model trained on an English dataset containing 5,000 items for each label.
18:01
There are three labels: part-time, full-time, and mixed-time job offers. You can see it is about the accuracy I mentioned in the beginning. Those texts are really not that good at all, and the training dataset was really small.
18:20
The more data you can get in, the better your model will be. Now, I had completely forgotten that we are a Java system. And now I have my pickled model and, okay, what now? There were crazy ideas, like: I can save it as JSON and change the parameters, all the
18:44
names of the model, so that I only preserve the feature set and put it into some Java NLP library. No, no, no, that was just a second of blackout. I could use Jython. Not a good idea: it is only compatible with Python 2, and my whole model was saved and pickled away
19:04
with C libraries and Python 3, so not going to work. I tried starting a Python script from inside the Java code. It was so slow: every time a procedure ran the script, it would start the Python interpreter, do the thing, and stop the interpreter.
19:22
Imagine you have like a million items; it's not that many, but even this would completely destroy our performance in the backend. Should I rewrite it in Java? Nah. Message brokers: for instance, I could connect our Java resource with our Python service
19:42
via Kafka or RabbitMQ. It is possible, but it's not always a good idea, because the versions of those tools are not always in sync on both sides; you have to actually keep an eye on that. And yeah, no. So I took the path of microservices and built a REST application.
20:06
This is it, nothing more than this. We just use Flask as our web server, deployed with Gunicorn, and use a REST client on the Java side. They exchange just a simple JSON that says,
20:23
hey, I have this item with this title and this description, please tell me something. And as the result you get: hey, I'm the Python service, here's my answer, and I've used this model.
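A skeleton of such a service; the route, the field names, and the `predict` stub are assumptions for illustration, not the talk's actual code:

```python
# Minimal Flask microservice in the shape described above: a job item
# comes in as JSON, the model's label goes back as JSON.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(title: str, description: str) -> str:
    # Stand-in for the unpickled classifier.
    return "education_required"

@app.route("/classify", methods=["POST"])
def classify():
    item = request.get_json()
    label = predict(item["title"], item["description"])
    return jsonify({"label": label})
```

Run locally with `app.run()` or, as in the talk, behind Gunicorn (for example `gunicorn app:app`).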
20:50
For instance, sorry, now we are starting our Flask app.
21:15
And here you are. As I said, this model was pre-trained for the German market.
21:21
So what it says: hey, I have a text that is 'work for all helpers and people helping out', language German. What we see: education not required. Yay, we did something right. Let's continue with this and check something different.
21:52
I'm going to just write it.
22:10
Education level. We don't need all of this.
22:26
Oh, yeah, I forgot to mention: when there are no known features in the text, the model will still make a definite decision, but it will be just a coin flip. For instance, for some text it says 'education required'.
22:44
I'm not actually even sure whether that text appears in the German feature set at all. But let's go on with the next job. Oops.
23:14
Next one for us. Whoops. Oh, thank you.
23:22
For some reason, it's 'no education'. The final accuracy I could get out of this model was 95%, so I guess we just found an edge case. Beyond the 95% there were outside restrictions: for instance, in Germany, all the healthcare jobs are education required,
23:43
no matter what they do. And there are several other industries where this rule stands, at least for the German market. So, did I solve the problem? Or could something have been done better?
24:01
It most definitely could be better. For instance, I could have spent a bit more time on research into how to work with the text; maybe I could transform it a bit differently. If you're doing your first text classification, just don't go in too deep. Machine learning, like deep neural networks, convolutional neural networks,
24:20
or recursive networks, and so on and so forth: they are good, they do yield good results, but they have to be done right. You really need to be careful with those, and you have to invest way more time than with this. Actually, another piece of advice of mine is: try graphs first.
24:41
You can map all your words to the nodes of a graph, and then search for cycles, or search for certain subgraphs, for certain topics, for certain synonyms, or even for context. You can do that, and actually it will be a bit faster, I guess. Plus, you can combine those two and just check them against each other.
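A hedged sketch of that graph idea with networkx (the talk names no library): words become nodes, co-occurrence within one short text becomes an edge, and connected components or shared neighbours then point at topics and synonym candidates:

```python
# Build a word co-occurrence graph from short texts, as suggested above:
# nodes are words, edges connect words that appear in the same text.
import itertools
import networkx as nx

texts = [["php", "developer", "team"],
         ["java", "developer", "backend"],
         ["nurse", "hospital", "care"]]

g = nx.Graph()
for words in texts:
    g.add_edges_from(itertools.combinations(set(words), 2))

# Connected components separate unrelated topics; words with many shared
# neighbours (here "developer" links php and java) hint at related terms.
components = list(nx.connected_components(g))
```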
25:03
Don't be afraid to alter the features. It is a bit easier to reverse engineer such a model, to know exactly what the features are that it learned, and you can alter them for a better result. Those models are still not better than humans.
25:22
Another really good idea is monitoring over historical data. You should log the decisions for all items you've classified, or like every tenth item, so that first you can actually check by sampling whether they are true or not, or you have your labeled test data. Like, you have your data.
25:40
You can alter it a bit and then build an even better model. Estimation methods: as I said, there are tons of them. You can use something like pure statistics, or cross-validation; you've probably heard about that, and it's never a bad idea. If you're using a model that constantly relearns,
26:02
which is also a possibility, then you should have at least a minimal quality test, so that if something goes wrong, you will be notified about it. It is also an interesting idea to have a golden standard test. That's the very edge case: if the model can do this, you can sleep well.
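The cross-validation just recommended, in scikit-learn form: score the model on several train/test splits instead of a single one, so a lucky split cannot hide a bad model. The corpus is an invented toy set:

```python
# 5-fold cross-validation of a naive Bayes text classifier, as suggested
# above; each fold is trained on 4/5 of the data and scored on the rest.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["helper no education", "cleaning helper", "babysitter flexible",
         "senior developer degree", "architect university", "engineer degree"] * 5
labels = (["no_edu"] * 3 + ["edu"] * 3) * 5

X = CountVectorizer().fit_transform(texts)
scores = cross_val_score(MultinomialNB(), X, labels, cv=5)
```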
26:23
So, this is pretty much it. Invest your time in machine learning. It's pretty interesting. Thank you.
26:41
Are there any questions? Yeah, wonderful. And please, everyone, do give your feedback on the app for the talk; I think it was very good, so do rate it. Thank you very much for your talk. I have a somewhat unrelated question. So, you showed this job opening for 'Entwickler PHP',
27:01
and in the brackets there was 'm/w'. Yeah, sorry, a little unrelated question. So, there was the vacancy called 'Entwickler PHP', and in the brackets was 'm/w'. What does this m/w mean? Because I'm seeing this a lot in German.
27:23
So, that's so that people don't assume a gender from the grammatical structure, right?
28:04
So, this m/w means that they say: yes, this is for both genders. Okay, great. Thank you.
28:22
Thanks, it was a great presentation. And I have a question about the models you've chosen. You've chosen the support vector machine and naive Bayes, right? There is a lot of information out there about random forests, or other ensemble models,
28:42
which are kind of doing very well. Have you tried those models?
29:12
Thanks for the talk. I wanted to ask about the pre-work that you did with the data, especially the stemming. Because maybe I missed it, or it was not super clear to me:
29:22
How did you actually do it? Is there a pre-built thing that you can use as a beginner to stem words in a language? And then, immediately, the second part: the data you actually use your model on, you have to stem it as well, if you want to use the stemmed data for processing, right?
30:29
Thank you very much, Alisa, for this wonderful talk. We have like five minutes.