Baby steps in short-text classification with python

Video in TIB AV-Portal: Baby steps in short-text classification with python

Formal Metadata

Baby steps in short-text classification with python
Title of Series
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Baby steps in short-text classification with python [EuroPython 2017 - Talk - 2017-07-12 - Anfiteatro 1] [Rimini, Italy] This talk aims to provide information about where and how one could start using simple text-classification models, and additionally shows how a Python classifier can be incorporated into an existing system. The presentation is broken into three topics and a conclusion. First, it gives an overview of how the problem was approached, what information was useful or not, and how the technology stack shown in the second part was decided on. The second part concentrates on using a Naive Bayes model for text classification: how the model was trained, what difficulties were met, and how they were solved. The talk also gives a brief overview of other possible model choices (random forest, SVM). The third part shows how the model was deployed and used in production. One architecture is shown in detail (REST calls between a Java client and a Flask server), while other possibilities are mentioned briefly. As the conclusion, possible improvements for the model in use are suggested, along with short examples of a supervised learning algorithm (CNN) and an unsupervised classification algorithm (LDA) for the same purpose; for each example the pros and cons are named. Technologies mentioned and used: Flask, Green Unicorn vs uWSGI, NLTK, scikit-learn, Python 3, Java 8, Jersey, Docker, Kubernetes
Hi everybody. Today I'm going to present my personal horror story: how I started with short-text classification. A short introduction: I work for a startup from Hamburg that presents job offers to end users, so we do a lot with job descriptions, and believe me, these are some of the shortest and most weirdly formatted texts you will ever find. I'm going to talk about how I actually approached the problem, what information I found useful and what was missing, which model was chosen and for what reasons, how I trained it, then the really funny story of how we deployed it, because we are a completely Java-based system, and afterwards the conclusion: did I learn anything, can I make something better? Yes, I can.

What can you do with text? Text can be classified. You can generate text, for example with Markov models, if you want to have some fun, or write summaries based on data you collected previously. You can tag words as parts of speech, you can build index models, all that jazz. For our purposes, we also didn't know at first what we wanted to do. We could automatically detect synonyms, for instance "Entwickler" versus "Softwareentwickler" in German; English terms are also very often found as official job titles. We could detect new words, or generate better descriptions for job offers we already know. We decided to go with classification. Profession classes and industries are incredibly hard to classify because there is no canonical way to define those categories, and they differ from country to country, so we had to do something else. Our marketing department suggested a binary classification: jobs that require a formal education versus those that do not. For instance, a babysitter does not require formal education, while developers and architects all do. It seems easy at first sight, but it's not.
OK, I have this text, what can I do with it? The first thought that popped into my head was keywords. There are lots of problems with keywords. First, as I mentioned, the quality of the texts themselves is awful: sometimes we would get a description like "great job, you'll like it" and nothing else. And there are really big descriptions, apparently written by a human, that contain lots of topics. Here you can see a very generic example: the green keywords are for "healthcare", the blue ones for "secretary", and so on. So this text alone has three different topics: the industry is the first, the profession class or seniority level can be the second, and those remaining words are my profession keywords. They overlap. It is just not possible for a human to define all the keywords for all the classes, for all the items we have in our database, and for all the languages we support.

My second thought was: OK, machine learning it is. But I don't have labels, and I don't want to hand-label the more than a thousand items I would have to read through, so instead of supervised learning, let's go unsupervised. I tried several models and ended up with LDA; if you want to read about it, I have links at the very end of my presentation. Basically, you train this model on a collection of texts and give it as input the number of topics you want to generate from those texts, and in the end you get, for each topic, something like a weighted list of keywords. You then run a new text against each topic, get a score per topic, and the highest-scoring one is your topic.
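The workflow described above — train a topic model on a corpus, pick a topic count up front, then score new text against the learned topics — can be sketched with scikit-learn's LDA implementation (the talk used gensim for this step; the toy corpus below is invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus; the real data were German job descriptions.
docs = [
    "python developer backend flask microservices",
    "java developer spring backend services",
    "nurse hospital patient care shifts",
    "caregiver elderly care hospital support",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# You choose the number of topics; LDA learns a keyword-weight
# distribution per topic from the bag-of-words counts.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Score a new text against the topics: the highest-scoring topic wins.
scores = lda.transform(vectorizer.transform(["python backend developer"]))
print("topic:", scores.argmax())
```

With gensim the shape is the same: build a dictionary, convert documents to bag-of-words vectors, fit `LdaModel`, then query new documents against it.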
It wasn't good. Models like LDA do not work well with short texts; again, there is too much noise inside. So, once more: maybe I do have labels after all? Actually, I did. A month before I got this assignment we had worked with the so-called KldB and ISCO; the first is a German standard, the second an international standard for defining professions. You can see five digits: this schema is defined by the German state and comes from their sources. The first digits define the highest level of the profession class, and then they go down into the depths: for example one level stands for the scientific area, the next for informatics, the next for software development, and the last digits show exactly which field you are in, for instance IT system administrator versus developer — although they don't go as deep as distinguishing front-end from back-end. And the last digit shows how complex the task is: technically, values 1 and 2 mark titles that, from a human standpoint, don't require a formal education — for example standard helpers in a hospital. So suddenly we had about 600 thousand items labeled with this scale, which at that time was about 50% of our German database.
The problem was matching: we had the official titles and our own titles, and we tried to create regexes that would exactly map the titles from our base to the official ones. The first issue was synonyms: in the official lists you will never find something like "Python master" or "administration god", yet there are plenty of job offers with exactly such titles. Then there is the German language structure, where you can combine words to build a new word, and combine them in different orders. And the quality of the titles in our own database was not as great as we would like it to be: sometimes the title said "household help", while the description made clear it was not a help at all but a secretarial position with almost all the tasks of a manager.

So now I have my labels. I dug into many tutorials. I actually had the opportunity to look into Java — I took a look at Scala and Java libraries — but they are such a pain to set up, and I really wanted to stick to our Python-controlled infrastructure. The most interesting candidates were NLTK and gensim; gensim I mostly used for the unsupervised part, and in the end I went with scikit-learn for the classification.

And how do you evaluate a model? There are lots of tools, and depending on the model you can do different things, but there is something called a confusion matrix. On the vertical axis are the actual labels of the texts, and on the horizontal axis the predicted ones — what the model said. Imagine a binary classification with labels A and B. A true positive is when the model said A and it actually was A; if the model said B but it was actually A, that is a false negative.
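The confusion matrix just described — actual labels on one axis, predicted on the other — can be computed directly with scikit-learn (the labels below are invented):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical binary labels: 1 = education required, 0 = not required.
y_actual    = [1, 1, 1, 0, 0, 1, 0, 1]
y_predicted = [1, 1, 0, 0, 1, 1, 0, 1]

# Rows are actual labels, columns are predicted labels:
# cm[1][0] counts false negatives ("required" predicted as "not required").
cm = confusion_matrix(y_actual, y_predicted)
print(cm)

# Accuracy is the share of labels the model got right.
print(accuracy_score(y_actual, y_predicted))  # 6 of 8 correct -> 0.75
```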
Accuracy is how many labels the model got right overall. I took accuracy as the main measure, and in addition I wanted to minimize false negatives — the bias towards "education required" — because it turned out we had a really small amount of non-educational job offers, and I didn't want to shrink that class even further. I ended up checking four different models: Bernoulli Naive Bayes, Multinomial Naive Bayes, a support vector machine, and a decision tree; these are the models you typically start with. Bernoulli and the decision tree dropped out in the first round because they yielded conflicting, poor results. I trained the models on a dataset of about ten thousand items — not that big of a deal, but I just wanted to know the rough learning time and accuracy before investing more time, and I didn't have much time. Second round: support vector machine versus Naive Bayes. The problem with the support vector machine was that it took way longer to train, although it yielded really good results, and its bias towards "education required" was way higher than Naive Bayes'. That was enough for me at the time, and I decided on Naive Bayes.

So, I trained it and took a look at the model trained on ten thousand items: the accuracy was about seventy-something percent. Not good enough. What can I do about it? I can tweak the dataset itself, to balance out the number of labels represented in the training set, and I can tweak each item independently. A disclaimer: a balanced set works better than an unbalanced one for basically all models, deep learning perhaps excepted, and I had underestimated the impact of imbalance. I ended up with 50/50 labeled data, which could not be bigger than fifty thousand items, because only about 5% of our items could carry the "no education required" label. Then I went to the second option: tweaking each item in the dataset separately.
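The model bake-off described above — same data, compare training time and accuracy across Naive Bayes and an SVM before committing — might look like this sketch (the tiny dataset is invented; the real comparison ran on about ten thousand job items):

```python
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical mini dataset: 1 = education required, 0 = not required.
texts = ["software engineer degree", "surgeon medical degree",
         "babysitter flexible hours", "warehouse helper lifting"] * 10
labels = [1, 1, 0, 0] * 10

X = CountVectorizer().fit_transform(texts)
for model in (MultinomialNB(), LinearSVC()):
    start = time.perf_counter()
    model.fit(X, labels)
    elapsed = time.perf_counter() - start
    # Training-set accuracy only -- good enough for a first sanity check.
    print(type(model).__name__, model.score(X, labels), f"{elapsed:.4f}s")
```

On real, noisy short texts the two models diverge much more than on a toy set like this, both in wall-clock time and in class bias.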
With a short text there are not too many things you can do. You can add information: in my case, not all descriptions had the title inside, and sometimes the title was crucial — for instance, when the title itself said "no education required for this job" and there was no other sign of that in the text. You can remove information: as I said, these texts contain many different topics, and some of them — contact information, the start date of the job, the salary — don't matter that much, so I would take out things like numbers, dates, and e-mail addresses. And one more thing I could do is stem the words. Stemming does the following: in German you have, for instance, "Koch" and "Köchin" — cook and female cook — and they look like completely different words; a perfect tool would recognize them as the same. What a stemmer does is strip a word down to roughly its root, so that it also catches "running" and "runs" in English as one word, without bloating the feature space. I took the German Snowball stemmer. So, let's go for that transformation.
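Stemming as just described — collapsing inflected forms into one root so the feature space stays small — is available through NLTK's Snowball stemmers, including the German one mentioned in the talk:

```python
from nltk.stem.snowball import SnowballStemmer

english = SnowballStemmer("english")
# Inflected forms collapse to one stem, so they become a single feature.
print(english.stem("running"), english.stem("runs"))  # run run

# A German stemmer exists as well; compound-heavy German benefits a lot.
german = SnowballStemmer("german")
print(german.stem("Softwareentwicklung"))
```

Note that a stemmer is rule-based and approximate: the output is a truncated stem, not necessarily a dictionary word.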
Now the tricky part — the demo; I'm really nervous. I prepared a job item with a title and a description, in German, because I really want you to see it. The title is "Softwareentwickler", software developer, and the description is about this big — believe me, it says roughly "really cool PHP developer wanted; we are an agile team; here is your responsibility; here is what we expect from you". The first thing I wanted was to add information, so title and description get merged. You could also skip this step; I actually tried different variants — putting the title at the front, at the back, in the middle of the text — no difference. Then I wanted to remove information. We can remove stop words: the most common words of a language, which normally do not carry any information unless you are doing something like language detection. Naive Bayes works on a bag of words, so what goes in is actually the tokenized, normalized list of words together with the label. You can see here some really great tokens like "25" and similar noise, so let's take out all the punctuation, the dates, and the numbers. That's better: at least I don't see too many noise words anymore. And stemming: at the beginning of this slide there were about five different versions of "software development" — Softwareentwickler, Softwareentwicklung and so on — and after stemming there are even fewer distinct words. German allows you to keep building further compound words, so this helps a lot. This is what goes into the model.
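The steps above — merge the title into the description, strip numbers, dates, e-mails and punctuation, drop stop words, stem the rest — can be sketched as one small function (the stop-word list and the helper name are made up for the sketch; in practice NLTK's stop-word corpus for the target language would be used):

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Hand-picked illustrative stop words; use NLTK's corpus in practice.
STOPWORDS = {"we", "are", "a", "an", "for", "the", "and", "is", "at", "before"}
stemmer = SnowballStemmer("english")

def preprocess(title, description):
    # 1. Add information: prepend the title to the description.
    text = f"{title} {description}".lower()
    # 2. Remove noise: e-mail addresses, then digits and punctuation.
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    # 3. Tokenize, drop stop words, stem what remains.
    return [stemmer.stem(t) for t in text.split() if t not in STOPWORDS]

tokens = preprocess("Software Developer",
                    "We are looking for a PHP developer. Salary: 50000 EUR, "
                    "apply at jobs@example.com before 01.09.2017.")
print(tokens)
```

The output is exactly the kind of tokenized, normalized word list the talk feeds into the bag-of-words model.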
And — I'm not kidding — this is all the code I had to write to create my classifier; the prediction side looks almost exactly the same. For the training, I split the whole sample into a training set and a validation (testing) set, I transformed the data of both sets, I built my model, and I got the estimation: that is a small convenience function that just prints your accuracy, the confusion matrix, and how long the run took. I decided to go with this stack because TensorFlow wasn't the buzzword back then — nowadays we actually do use TensorFlow. This is the output of a pre-trained model, trained on an English dataset containing five thousand items for each label; there are three labels, for part-time, full-time, and mixed-time jobs. You can see what I said in the beginning about accuracy: those texts are really not that good, and the training dataset was too small — the more data you can get, the better your model will be.

Then I remembered that we are a Java-based system, and now I have my pickled model — what now? The first idea: I can serialize the model as JSON, preserve the feature set and parameters, and rebuild it on the Java side — no. The second: Jython — not a good idea; it is only Python 2 compatible, and my whole model was pickled with C-backed libraries under Python 3; it won't work. I tried starting a Python script from inside the Java code: then every single request spawns a Python interpreter that loads everything, does one thing, and dies. Imagine you have a million items — well, not that many, but even far fewer would completely destroy the performance of our backend. The next proposal was to connect the Java side and the Python side through some message pipeline.
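The whole training side — split, vectorize, fit Naive Bayes, report accuracy and the confusion matrix, pickle the result — really does fit in a few lines of scikit-learn. A sketch with invented toy data (the real model was trained on thousands of job items):

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini dataset: 1 = education required, 0 = not required.
texts = ["software engineer degree", "surgeon medical degree",
         "babysitter flexible hours", "warehouse helper lifting"] * 10
labels = [1, 1, 0, 0] * 10

# Split into training and validation (testing) sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# Vectorizer + Naive Bayes in one pipeline object.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# The "estimation": accuracy plus confusion matrix on the held-out set.
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))

# Pickle the fitted pipeline so a service can load it later.
blob = pickle.dumps(model)
```

Pickling the whole pipeline (vectorizer included) is what makes the later service side a one-liner: load, then call `predict` on raw text.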
Something like Kafka, for instance. But that is not always possible, and it was not a good idea here, because the versions of those tools are not always in sync on both sides, and you have to keep an eye on that. So instead I took the path of a REST service. The application is nothing more than this: we just use Flask, and we serve it with Green Unicorn (Gunicorn). On the Java side we use the Jersey client, which sends a simple JSON saying "I have this item with this title and this description, please say something", and as the result you get a small JSON answer back from the Python service. That is the whole service.
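The Flask side of that architecture is tiny. A minimal sketch (the route name and JSON field names are invented; the real service loads the pickled model and runs the same preprocessing as training before predicting):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
# model = pickle.load(open("model.pkl", "rb"))  # the pickled classifier

@app.route("/classify", methods=["POST"])
def classify():
    # The Java (Jersey) client POSTs {"title": ..., "description": ...}.
    item = request.get_json()
    text = f"{item['title']} {item['description']}"
    # label = model.predict([text])[0]   # real call once a model is loaded
    label = "education_required"         # placeholder answer for the sketch
    return jsonify({"label": label})

# To serve locally: app.run(); in production, run behind Gunicorn or uWSGI.
```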
Sorry — now we're starting our Flask server for the demo.
Good, here you are. As I said, these models were trained on the German market. So the client says: hey, I have a text, something about helping others, people helping out — and we see "education not required". You wouldn't expect anything else, right? But let's check something different.
Education level again. The idea is: when none of the model's features appear in the text, the model will not be sure about its decision — it is basically a coin flip. For some texts it answers "education required" without really being sure, simply because a few tokens happen to appear among the German features. But let's move on. The final accuracy I could get out of this model was 95 percent. And part of those 95 percent comes from outside restrictions: in Germany, for instance, all healthcare jobs require a certified education no matter what the work actually is, and there are several other industries where such rules hold, which makes the German data easier.
So, did we solve the problem, and could something have been done better? It definitely could. For instance, I could have spent a bit more time on researching how to work with the text; maybe I could have transformed it differently. But if this is your first text classification, don't go straight into deep learning — convolutional neural networks and all sorts of fancy models. They do yield good results, but they have to be handled really carefully, and they take way more time than the simple models.
Another piece of advice of mine: try graphs first. You can map words to the nodes of a graph and then search for cycles or for certain subgraphs — for certain topics, for synonyms, or even for context — and it can actually be a bit faster. You can also combine the two approaches, or test them against each other. Don't be afraid to alter the features: it is a bit easier to reverse-engineer exactly what features the model learned and to adjust them, because those models are still not better than humans. Another really good idea is monitoring over historical data: log the decisions for all items, or, say, every tenth, so that you can later check whether they were right, or use them as labeled test data — alter the bits and train an even better model. For the estimation methods, as I said, there are tons of them; you can use various statistics, or cross-validation, which you have probably heard about — that is another good idea. If you are using a model that constantly retrains, which is also a possibility, then you should have at least a minimal quality test, so that if something goes wrong you will be notified about it. It is also an interesting idea to have a golden-standard test — the very hard edge cases; if the model handles those, you can relax a bit. So this is pretty much it: invest time in machine learning, it's pretty interesting. Thank you.
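Cross-validation, mentioned above as one of the estimation methods, is a single call in scikit-learn (toy data invented for the sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = education required, 0 = not required.
texts = ["software engineer degree", "surgeon medical degree",
         "babysitter flexible hours", "warehouse helper lifting"] * 10
labels = [1, 1, 0, 0] * 10

model = make_pipeline(CountVectorizer(), MultinomialNB())
# 5-fold cross-validation: train on 4/5 of the data, score on the held-out
# fold, rotating the fold; the spread of scores hints at model stability.
scores = cross_val_score(model, texts, labels, cv=5)
print(scores.mean(), scores.std())
```

A large standard deviation across folds is exactly the kind of warning signal a one-off train/test split would hide.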
[Moderator] Wonderful — and please, everyone, do give your feedback in the app for the talk; I think it was very good. Thank you very much for the talk. We have a somewhat unrelated question.

[Question] You showed a job opening, and in the brackets of the title there was "m/w". I see this a lot in German job listings — what does it mean?

[Answer] Yes — it is so that people don't assume a gender from the grammatical structure of the title. It means the position is open to both genders.

[Question] Thank you, it was a great presentation. About the modelling you chose — Naive Bayes over the support vector machine: did you also look at other models, such as random forests or similar ones? (Answer partly inaudible.)

[Question] I wanted to ask about stemming, especially with the standard stemmers: they come with a fixed rule set per language, so if you use one, you have to apply the same stemming at prediction time as well, right? (Answer partly inaudible.)