
Natural language processing with neural networks.


Formal Metadata

Title
Natural language processing with neural networks.
Subtitle
Solve your language processing problem with neural networks without going bankrupt.
Series Title
Number of Parts
118
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, modify, and reproduce the work or its content in unmodified or modified form for any legal and non-commercial purpose, distribute it, and make it publicly accessible, provided that you credit the author/rights holder in the manner they specify and pass on the work or this content, including in modified form, only under the terms of this license.
Identifiers
Publisher
Year of Publication
Language

Content Metadata

Subject Area
Genre
Abstract
Getting started with natural language processing and neural networks is easier nowadays thanks to the numerous talks and tutorials. The goal of this talk is to dive deeper, for those who already know the basics or want to expand their knowledge of the machine learning field. The talk will start with the common use cases that can be generalized to specific problems in the NLP world. Then I will present an overview of possible features that we can use as input to our network, and show that even simple feature engineering can change our results. Furthermore, I will compare different network architectures, starting with fully connected networks, through convolutional neural networks, to recurrent neural networks. I will consider not only the good parts but also, what is usually overlooked, the pitfalls of every solution. All of this will be done with an eye on the number of parameters, which translates into training and prediction costs and time. I will also share a number of “tricks” that enable getting the best results even out of simple architectures, as these are usually the fastest and quite often hard to beat, while at the same time being the easiest to interpret.
Transcript: English (automatically generated)
So hello everyone, and thank you for attending this talk. I know that the title may seem a little buzzworthy, but I hope I will present a more down-to-earth view. First, a few words about me: I am a backend engineer in a machine learning team, so besides the problems of a regular engineer in a distributed environment, I also work in cooperation with our data scientists. If you want to talk with me, I will be really happy to discuss machine learning engineering, but also my hobbies like 3D printing and homebrewing.

So what even is NLP, natural language processing? We may first think of processing human voice and voice recognition, but I will not touch on that today.
Instead, I will focus on processing text that we already have, textual data. Why is natural language processing hard? First, language is ambiguous: if we say "hey, I had a sandwich with bacon", it's hard to tell whether we met Kevin Bacon for lunch or we had a sandwich with pork meat. Second, text is compositional: characters compose words, words compose sentences, and then paragraphs and whole books. The problem here is that if we look at the letters that compose two words, let's say "burger" and "pizza", they share none of their characters, but they still both carry the meaning of junk food; we cannot compare them based only on their characters. So those are a few reasons why this is hard. If you want to learn more about traditional NLP approaches and get a broader overview of the domain itself, I highly recommend listening to last year's lecture, an introduction to sentiment analysis with spaCy. The slides will be shared with you later, and they contain a lot of links for further reading.

Now, what common problems do we have in NLP?
Because if somebody has already solved the issue we are having, and we can generalize our problem to theirs, we can either use the ready-made solution or at least be inspired by it. The common problems are, for example, document classification. This includes sentiment, whether, for example, the reviews of our business on a website are positive or negative, because if we have thousands or millions of them it will be hard to classify them by hand. It also includes author attribution, figuring out who wrote a document, which is a little exciting because it works not only from the words used but also from the way the sentences are constructed. And there is a very practical use case: whether the email we just received is spam or not, or whether it's important or not.

Another common problem is sequence to sequence. This includes, but is not limited to, translation, like Google Translate; summarization of text, where we have a whole article and we just want to create an abstract of it, just to know whether we would be interested in it; and also response generation, a really neat feature in Gmail: when we receive an email, we get a few possible responses that we can just tap to respond to our sender.
Another common problem is information extraction. We have the sentences "Jimmy bought Apple shares" and "Jimmy bought an apple", and we want to know whether "apple" refers to the company or to a fruit. This is really useful in search engines, or I would say in ads, because when we have an ad we want to know what it should be relevant to: should it be relevant to the fruit, or should it be relevant to iPhones?
So these are only a few of the most common ones, I would say, but you will find that there are a lot more of them. Why are neural networks good for NLP? Texts carry a lot of features, and we can extract them by hand, we can label them by hand. In sentiment analysis, for example, we can find and hand-pick the words that indicate that the sentiment is either positive or negative: if somebody mentions "terrible", they probably mean that our business is bad. The idea is that neural networks will learn those features on their own, and as practice shows, they usually do.

I will focus on, and show examples from, this quote-unquote real-life problem: the IMDb sentiment analysis. Why do I put "real" in quotes?
Because in real life we do not have such beautiful data sets. These are 25,000 highly polar movie reviews; real data will usually not be so polar and not so pure, there will usually be a lot more noise, so we will have to deal with that. But as an exercise, I think it's really good.

As the cost, I will look at the training time. Why training time and not, say, the number of parameters? Because right now, when we are paying for servers or working on our own computers, training time is the thing we are most concerned about, and when we have complex networks, the number of parameters does not translate directly into it. The downside is that in the future, when we have better ways of parallelization or better algorithms, we may find out that the networks that are really expensive to run right now will be cheaper. So, like with everything, it depends on what you choose.

Our task definition: we have a movie review, we want to put it into our neural network, and we want to decide whether the movie was good or bad. It looks simple enough, but there's a catch in the review itself.
Since neural networks are basically matrix multiplications and additions, plus activation functions, we cannot throw text directly at them; we need some numerical representation. And to obtain that numerical representation, we first need to have some features. This is an example of the text: we can see "a big disappointment", "incredibly bad", "very pretentious"; it's highly polar, like I mentioned. But to use this text as an input, we need to do something with it: we first need to translate it into features and then into a vector.
I will focus on this simple sentence, "a quick brown fox", and on what we can possibly extract from it. It looks like we only have a few words, but let's see what we can do with them. First, we can tokenize the sentence, and by tokenize I mean split it into chunks. For English, these will most often be words, so we have tokens like "a", then "quick", "brown", and "fox"; let's focus on the last one. If you work with other languages, you may find that it's not so easy, especially for German, because they glue words together. I do not know German, but I know that this is the case; you can use a library like SentencePiece from Google to try to deal with that, or try some other ones. So let's get back to the word "fox". What do we know about it?
Okay, let's focus on the word "fox". What do we know about it? If we use classical, statistical NLP models, we can extract the information that it is a noun; we could also label that by hand, but we can automate it. Also, if we use the WordNet database, we can extract that the word "fox" belongs to a synset whose wider meaning is "canine"; a canine is, let's say to simplify, a doggish animal, and biologists will probably be angry about that. Okay, what else can we know about this word? We can extract its stem, which is the core of the word, and we can extract its lemma, which is the basic form of the word; in this simple case they will both be "fox". If we have access to the whole corpus of data, we can also calculate the term frequency-inverse document frequency (TF-IDF). It basically tells us, to simplify, how important this word is in the given sentence, and it can be another feature. So now, for this specific token, we have one, two, three, four, five, six possible features.
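To make that concrete, here is a minimal sketch of extracting such token-level features in Python. It assumes spaCy with its small English model, NLTK's Porter stemmer for the stem, and scikit-learn's TF-IDF vectorizer; it is not the speaker's code, only an illustration of the features just listed.

    import spacy                          # tokenization, part-of-speech tags, lemmas
    from nltk.stem import PorterStemmer   # stems (spaCy itself does not provide stemming)
    from sklearn.feature_extraction.text import TfidfVectorizer

    nlp = spacy.load("en_core_web_sm")    # assumed model name
    doc = nlp("A quick brown fox")

    stemmer = PorterStemmer()
    for token in doc:
        # token text, part of speech, lemma, and stem as candidate features
        print(token.text, token.pos_, token.lemma_, stemmer.stem(token.text))

    # TF-IDF needs a whole corpus; a toy one stands in for the real data set here
    corpus = ["a quick brown fox", "a lazy dog", "a quick dog"]
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(corpus)
    print(dict(zip(tfidf.get_feature_names_out(), weights.toarray()[0])))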
I will focus on the word itself now, but remember that these features are there and they can prove useful. You can also create syntax parse trees with classical NLP models, and they can also boost your accuracy. When we want to represent a word, or a bunch of words, in a way that our neural network will understand, we can use, for example, bag of words; it is the simplest possible representation that I know of. First we construct a dictionary; here it is constructed from the sentence "a quick brown fox jumps over a lazy dog". For each word that occurs in our sentence we put a one, for each dictionary word that does not occur in our sentence we put a zero, and we also usually reserve one of the tokens for unknown words.
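As a minimal sketch, not the speaker's actual code, this is roughly what that bag-of-words encoding looks like in plain Python, with one slot reserved for unknown words:

    # dictionary built from the example sentence, plus a reserved <UNK> slot
    vocab = ["<UNK>", "a", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
    index = {word: i for i, word in enumerate(vocab)}

    def bag_of_words(text):
        vector = [0] * len(vocab)
        for word in text.lower().split():
            # unknown words all land in the <UNK> position
            vector[index.get(word, 0)] = 1
        return vector

    print(bag_of_words("a quick brown fox"))
    print(bag_of_words("a quick brown vixen"))   # "vixen" falls into <UNK>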
And now, since we already have a representation, we can work with a network. The most basic architecture is a fully connected neural network: we have multiple inputs, then we have a hidden layer (or not), and then we pass our values through it. The important part here is that everything is connected with everything, so we will have a lot of operations. This is an example network constructed with Keras: we have only one input, one hidden layer, and one output, so it is very simple.
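The talk shows this model only as a slide; the sketch below is a comparable Keras model, assuming a multi-hot bag-of-words input over a 1,000-word dictionary and a 64-unit hidden layer (numbers mentioned elsewhere in the talk), not the speaker's exact code.

    from tensorflow.keras import layers, models

    VOCAB_SIZE = 1000   # dictionary size used in the talk's IMDB experiments
    model = models.Sequential([
        # hidden layer over the multi-hot bag-of-words vector;
        # its weight rows act as word embeddings after training
        layers.Dense(64, activation="relu", input_shape=(VOCAB_SIZE,)),
        layers.Dense(1, activation="sigmoid"),   # positive vs. negative review
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()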
After we train that network, what happens with the first layer? It is as big as our dictionary; here it was 1,000 words on this IMDB data set, and each row will now contain a really dense representation of a given word, which is very often called an embedding. So our first layer constructs embeddings for the words in our dictionary. What can we do with that? We can use them as features in different models, but we can also visualize them. I reduced the dimensionality from 64 to 2 with t-SNE, and we get this beautiful scatter plot.
But what information can we extract from it? As was also stated in the first keynote, we can look at similarity. Here I looked at the similarity to the word "ridiculous", and the closest words to it are "waste", "boring", "worst", "worse". So our network, without even knowing the meaning of the words, learned that they basically mean the review was bad. On the other hand, if we choose the word "fantastic", the nearest neighbors are "excellent", "7" (probably as on the numeric scale from 0 to 10), "8", "amazing", and so on. And if we compare where these two clusters, I would say, are located, they occur at totally opposite points in the space: on one side we have the representation for positive sentiment, on the other for negative.

So this is a nice thing to show to your company, because it looks awesome. If you use TensorBoard for visualization, you can do the t-SNE right there, you do not have to do it by hand beforehand, and you can move around an interactive, also 3D, graph.
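For reference, a hedged sketch of the by-hand route, assuming the Keras bag-of-words model sketched earlier: take the first layer's weight matrix, whose rows correspond to dictionary words, and project it to two dimensions with scikit-learn's t-SNE.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # "model" is the bag-of-words network from the earlier sketch;
    # rows of its first Dense kernel are one 64-dimensional vector per dictionary word
    embeddings = model.layers[0].get_weights()[0]    # shape (VOCAB_SIZE, 64)

    coords = TSNE(n_components=2).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], s=2)
    plt.title("Word embeddings learned by the bag-of-words model")
    plt.show()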
Pros and cons of a fully connected network with bag of words as the representation: it's simple, so it's cheap and fast to train; one epoch took about a second or two in Colab, so you can iterate on it fast, which is a really good upside, because you can conduct experiments really quickly. We always look at the whole text, because we look at every word. It's kind of interpretable, because it's so simple that we usually can explain why a given result was chosen. But the downside is that we cannot get close to the state of the art: my best result was about 89%, and the current best result is 96%. We also do not keep the order of the words, because it's just a bag, just a set; we lose that information. So how can we fix the issue with the order of words?
Let's consider these two reviews: "I loved the movie, but the cinema was terrible" and "I loved the cinema, but the movie was terrible". If we put them into a bag, these two sentences become the same; the representations for the two will be exactly identical. If we had two sentences like this in our training set, it would only be half as bad, because we would produce, I would say, undefined results for this kind of sentence. But if we had the first one in training and then the second one in production, we would conclude that, well, it is definitely a positive review. So we have to watch out for that.
One of the ways of dealing with that is to use a sequence of one-hot vectors: instead of smashing them all into one vector, we just keep the sequence, one one-hot vector per word. In practice, if you plan to do it that way, use sparse matrices, from scikit-learn for example, because sparse representations will be much more memory-efficient here.
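A minimal sketch of that idea, assuming SciPy's sparse matrices (which is what scikit-learn uses under the hood); the tiny vocabulary here is illustrative, not anything from the talk.

    from scipy.sparse import lil_matrix

    vocab = ["<UNK>", "a", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
    index = {word: i for i, word in enumerate(vocab)}

    def one_hot_sequence(text):
        words = text.lower().split()
        # one row per word, one column per dictionary entry; almost everything is zero
        matrix = lil_matrix((len(words), len(vocab)), dtype="int8")
        for position, word in enumerate(words):
            matrix[position, index.get(word, 0)] = 1   # unknown words go to column 0
        return matrix.tocsr()

    seq = one_hot_sequence("a quick brown vixen jumps over a lazy dog")
    print(seq.shape, seq.nnz)   # (9, 9) with only 9 non-zero entries; "vixen" landed in <UNK>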
If we have another sentence, like "a quick brown vixen" (a vixen is a female fox), "vixen" is not in our dictionary, so we will need to assign the unknown token to it. But since every single unknown word will go into this one unknown bucket, it may not be so good for our performance. Instead of doing that, we can assign, for example, the synset for that word, if it is in a dictionary; you can try that. But we can also try assigning the specific part of speech that this given word represents, and this actually improved my results, especially when I had a very small dictionary of 1,000 words: it boosted accuracy, I think, from 86 to 89 percent. So it's really good if you want to work with a small dictionary. I also played a little and created a model only on the parts of speech, and I know it doesn't make sense, but it actually had better results than 50%, so it wasn't totally random; it had about 60%, I think. So I think the takeaway is that people who are angry and do not like something write in a different style than people who are happy with something.
And this is the example network: we have our input, the layers, and the output; simple enough. Another way of approaching it: since the parts of speech for unknown tokens worked, maybe we could assign them to every word, and maybe it would improve something. Then we would have an even bigger dictionary, because we would also want to include the parts of speech there. For me it didn't help at all, but remember, that was only my case on this IMDB data set; for you it may improve something.
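Here is a hedged sketch of the out-of-vocabulary trick described above, assuming spaCy provides the part-of-speech tags; the tiny vocabulary and the tag format are illustrative, not the speaker's implementation.

    import spacy

    nlp = spacy.load("en_core_web_sm")        # assumed model name
    vocab = {"a", "quick", "brown", "fox"}    # tiny stand-in for the real 1,000-word dictionary

    def to_symbols(text):
        """Replace out-of-vocabulary words with their part-of-speech tag."""
        symbols = []
        for token in nlp(text):
            word = token.text.lower()
            # known words stay as they are; unknown ones become e.g. "<NOUN>"
            # instead of all sharing a single <UNK> token
            symbols.append(word if word in vocab else f"<{token.pos_}>")
        return symbols

    print(to_symbols("a quick brown vixen"))  # ['a', 'quick', 'brown', '<NOUN>']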
Another way of representing these features: first we had this sparse representation with matrices, and here we can have a dense representation. To each word we assign a much smaller vector, not with one-hot values but with the whole range from minus one to one, usually. Also, this vector will usually have a length of one, because then we can calculate the cosine similarity more easily. Here I also created one model with embeddings for the words and another one with the parts of speech. Why 5,060 inputs? Because we have 5,000 words and about 60 possible parts of speech from spaCy, and then times 1,000, because that is how long my review could be. The problem is that I had to pad those sequences to work with this network: either extend them with a pad token, or cut them if they were too long, so we lose some information.
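A minimal Keras sketch of that setup, assuming integer-encoded, padded reviews and 64-dimensional embeddings; the sizes follow the numbers mentioned in the talk, everything else is an assumption rather than the original code.

    from tensorflow.keras import layers, models
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    VOCAB_SIZE = 5000   # words (plus roughly 60 part-of-speech symbols in the talk's variant)
    MAX_LEN = 1000      # reviews padded or cut to this many tokens

    # toy integer-encoded reviews; the real ones come from the IMDB data set
    reviews = [[12, 7, 431, 2], [5, 88, 9, 3, 17, 64]]
    x = pad_sequences(reviews, maxlen=MAX_LEN, padding="post", truncating="post")

    model = models.Sequential([
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64, input_length=MAX_LEN),
        layers.Flatten(),                       # hand the whole padded sequence to dense layers
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])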
Pros and cons of fully connected networks with a sequence: it's still simple, so cheap and fast to learn, still under two seconds per epoch, and the order of words matters now. They are still kind of interpretable, but we can't get close to the state of the art of 0.96. Also, words at a given position matter more. What do I mean by that? Because of how the neural network works, if the word "bad" occurred at the first position and then it occurred at the second or third, it will be treated, not completely, but still differently, and that may also be a problem. Negations are, because of that, hard to catch; not impossible, but hard.
If you want to learn more about the basics, I think that Andrew Ng's deep learning course is the best way to start. If we have a review "this movie was not good", we have a negation, and like I mentioned, it's hard to catch. We can use a few things to deal with that. We can use the tool called word2phrase from the word2vec repository, and we can also use something similar in gensim, but the upside of the one from word2vec is that it's written in C and it's blazing fast: on a data set of a few million sentences it finishes within seconds, while with gensim I wasn't patient enough to wait for the result. word2phrase or gensim will produce sentences where we have another word in our dictionary: instead of having "not" and "good" separately, we will have them together. It works simply by looking at how often these words occur together and how often they occur separately.
Okay, but another way of dealing with that issue is, for example, to use a convolutional neural network, a CNN. I know that they are usually used on images, and today I learned that they can also be used on audio transcriptions, or on audio in general, but we can also apply them to text. Let's think about this review, "this movie was not good". What will a CNN do? Well, we will have a sliding window of, let's say, one neuron that will first build the representation of the words inside that window.
For this example, I chose a window of two words. We will have a matrix that multiplies our representations of "this" and "movie", and we will create the representation of "this movie" together; then for "movie was" we will have a representation of those two words, and so on and so on, until we have everything. This part of the operation is called convolution. Then we do an operation called pooling, where we reduce dimensionality. Pooling does not go in a sliding window; it goes over multiple representations and reduces the dimensionality; I think that's better to show than to describe. Basically, we then have a representation of the triplets of words, and on top of that we use a standard fully connected network. You can use either max pooling, where in each dimension we pick the one maximum value, or average pooling, where in each dimension we calculate the average of the new vector. I heard that for text it's usually better to use max pooling, but I think it's always best to try both and find out what works for you; it's just another hyperparameter.
For the convolution you have the window size, how big it is, and also the stride size, how many words you jump. Here the stride size was one and the window size was two, so we move our window by one word each time; if we had a stride of size two, then we would only have representations for "this movie", then "was not", and so on. It may work with bigger strides, especially when you have long texts, so if you want to work on whole paragraphs, please consider that. And that's the simple architecture, another diagram I came up with.
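A minimal Keras sketch of such a text CNN, assuming the same integer-encoded, padded reviews as before; the filter count and the other hyperparameters are illustrative, not the talk's.

    from tensorflow.keras import layers, models

    VOCAB_SIZE, MAX_LEN = 5000, 1000
    model = models.Sequential([
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64, input_length=MAX_LEN),
        # window of 2 words and stride of 1, as in the slide's example
        layers.Conv1D(filters=32, kernel_size=2, strides=1, activation="relu"),
        layers.GlobalMaxPooling1D(),            # max pooling over the whole sequence
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])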
Pros and cons of CNNs: they parallelize nicely and have much fewer parameters than fully connected neural networks, usually; of course, it depends on how you build things. The order of words matters, and finally the position of words also matters. And if we want, we can create a network that will look at the whole sentence; it's not so easy in practice, but in theory we can. If you want to learn more about this stuff, another further reading is "Understanding Convolutional Neural Networks for NLP"; I also recommend reading that.

So now, recurrent neural networks. Remember how in CNNs we had a representation constructed from two words together; now we will first have a representation for the word "this".
Then we move to the next word: we create a representation for the word "movie", but we also take the previous word into the context. Then we create a representation for the word "was", but we also take into account the previous words, and we always use the same matrix of weights. When we get to the end, we will have a representation of the whole sentence, and this is really nice, because in theory we capture the whole sentence; we know what is in there and we can work with that. But the problem is that we can have issues with vanishing gradients, and I will talk about that in a second. We put a fully connected layer on top, and now we can create a prediction; we can also stack those layers, like in a regular network.
When we have a review like "Terrible. I loved her previous movies.", the word "terrible" that indicates the sentiment is at the beginning, and when we get to the end, its representation will have had to go through a very deep network, and we will have to deal with problems like the vanishing gradient, where it totally vanishes and we lose the meaning of the word "terrible", or the exploding gradient, because we are always multiplying by the same matrices of weights. To deal with that we can use a bidirectional RNN: first we go front to back, then we go back to front, we merge the results in some way, either concatenate or sum them or whatever, and put a fully connected network on top. It should work.
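As a minimal sketch of those two ideas, assuming Keras: a plain recurrent layer over the embedded sequence, and a bidirectional wrapper that reads the review in both directions; all sizes are illustrative.

    from tensorflow.keras import layers, models

    VOCAB_SIZE, MAX_LEN = 5000, 1000

    # plain RNN: one shared weight matrix applied at every time step
    rnn = models.Sequential([
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64, input_length=MAX_LEN),
        layers.SimpleRNN(64),                   # final state acts as the representation of the whole review
        layers.Dense(1, activation="sigmoid"),
    ])

    # bidirectional variant: front-to-back and back-to-front passes, concatenated
    birnn = models.Sequential([
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64, input_length=MAX_LEN),
        layers.Bidirectional(layers.SimpleRNN(64), merge_mode="concat"),
        layers.Dense(1, activation="sigmoid"),
    ])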
So, pros and cons: they can give better results, and we look at the whole sentence, but they are hard to train, because as you may see, the network will effectively be as deep as your sentence, or your whole review, is long, so we will have to deal with training very deep networks, and this is really slow. If you want to learn more about RNNs, I put in a link to the Stanford lecture about them. So what is another way of dealing with that forgetting? We can use LSTMs or GRUs.
Unfortunately, I won't get into the details here, because, well, the architecture of the neurons is very complex. This is an LSTM: we not only pass the representation along, we also pass, let's say, the state of the cell. And these are GRUs; they are a little simpler. The important thing is that we do not only carry the representation of what happened in the past in the vector, we also carry a state, and since it's more complex and we have more operations and gates that remember or forget things, we won't necessarily be forgetting stuff. Since it's not just the simple matrix multiplication it was in RNNs, we will retain the information about the words at the beginning. But again, as you may see, the design is pretty complex, so when we are training the network we have to do a lot of operations, a lot of backpropagation, so it will take time. But they can give the best results, and until the Transformer came along, they were most of the time the state of the art. And we can create, even in Keras, an architecture that will look at whole sentences; it's hard, but it's possible.
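A minimal sketch of swapping the plain recurrent layer for an LSTM in Keras, again with illustrative sizes; a GRU variant would simply replace layers.LSTM with layers.GRU.

    from tensorflow.keras import layers, models

    VOCAB_SIZE, MAX_LEN = 5000, 1000
    model = models.Sequential([
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64, input_length=MAX_LEN),
        # LSTM cells carry both a hidden representation and a separate cell state,
        # with gates deciding what to remember and what to forget
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])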
And here is another lecture from Stanford and a link to a blog post about understanding LSTM networks. Now, going back to the results of my experiments: they weren't really successful. The fully connected network with a bag of words achieved 0.89 accuracy, while the LSTM was really near it, at 88%, but the training time was about 60 times higher. So it's not always worth it to throw yourself at the most complex architecture at the beginning. I think it's always best to start with something simple and then iterate and compare against that, because with the simple architectures you will also gain quick inference time, which can be really useful.
I have barely scratched the surface here, and if you are interested in machine learning in the context of NLP, I highly recommend that book. I think it's not only one to read but also one to have if you want to work in this field, because I find myself often going back to re-reading specific parts. For example, if you want to know how the size of the window in word2vec interacts with the produced embeddings, you can find it there. So that will be all, and thank you.