Taking AI/ML to the next level: Text!
Formal Metadata
Title: Taking AI/ML to the next level: Text!
Series: Berlin Buzzwords 2021 (talk 22 of 69)
Number of parts: 69
Author: Nick Burch
License: CC Attribution 3.0 Unported: You may use, modify, and reproduce the work or its content, distribute it, and make it publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/67344 (DOI)
Language: English
Transcript: English (automatically generated)
00:07
Okay, welcome, everyone. So, my name is Nick Burch, I'm Director of Engineering at a software-as-a-service startup called Fleck, based in the UK, helping the logistics industry.
00:21
As I go through the talk today (it's a beginner's talk, but I have to assume that some of you are more advanced than others), I will skip over a few bits. All of the slides and all of the code are available at this GitHub link here. So, if you want to go back later, try and understand what I've done, try and recreate the code, learn a bit more, have a play around: it's all there.
00:43
So, don't panic too much when the text is a little bit small on the screen, or you're frantically trying to take notes, it's all there, you can catch up later. All right. So, our talk today, it's not going to be your typical AI talk, because, well, it's about text. So, we're going to start off looking at some of the key things you need to know when you're working with AI and text.
01:05
We'll build ourselves a simple little AI for text. We'll try and make it better with some neural networks and some word embeddings. And, importantly, at the end I have a whole bunch of resources for you. The slides are up there on that GitHub page, so later on, when you want to, you can go through those resources, read up on those things, and get to know it all a bit better.
01:25
So, what is AI and ML, and why is it a buzzword? Why is it at Berlin Buzzwords? Well, Larry Tesler's theorem: AI is whatever hasn't been done yet, which makes ML, machine learning, what we can do today.
01:41
The first big AI bubble was in the mid-1980s. Over a billion US dollars was spent on AI startups. Generally, they were called expert systems, and it was a bubble, and it popped. There were two big problems that everyone faced: there was not enough training data available to train the expert systems, and the computers were much more expensive than the experts that they were trying to replace.
02:04
Lots of money spent, it went pop. We're into a new phase now. Maybe it's a bubble, maybe it's not, who's to say? But there's certainly a lot of interest. So, why is that? Well, the first thing is that Moore's Law has come to the rescue. What used to cost you a million dollars in 1985 is now under a dollar today.
02:23
Amazon: give them a credit card, and off you go with some servers. One terabyte of memory (that's not storage, that's memory) on a single machine is about one coffee an hour, one latte an hour. If your data is even bigger, a machine with four terabytes of memory is about $25 an hour if you rent it on the spot, less if you reserve it.
02:45
And apparently, you can get a 24-terabyte-memory machine from Amazon. You will have to phone them and talk to them, and I dread to think how much it costs, but it is available. So, if your ML system needs four terabytes of memory, have a chat with your boss, get a bit of approval from accounting, and off you go.
03:03
It's no longer completely impossible, you just need to spend a bit of money. And there's an awful lot of stuff that you can do just on your laptop now. Another thing that we have that's a big advantage today is that we've got a lot more data available to train our models on. Now, you might not have Google's three and a half billion searches a day or Facebook's 65 billion messages a day,
03:22
but you certainly have a lot more than you had in the early 80s. It's no longer just a few punch cards. You know, your organization is probably generating gigabytes of data a day, maybe even more. And you can now use a lot of open source libraries and frameworks to get going on that. So that means that you're able to focus on your problem and not just on the engineering bits of the system.
03:45
Now, this does also lead to some slightly dodgy stuff. So, obligatory xkcd. And a lot of people just kind of try it a few times until it looks all right, call it their ML system, launch it, and terrible things happen.
04:01
I'll go over a few techniques later on to try and avoid you falling into that trap. OK, a little bit of audience participation. So your typical AI ML demo is images, image classification, to be specific. So it's fun. It's easy. Let's have a quick try.
04:21
So, audience participation. What is this? I'll wait a few seconds for people to type. OK, well, the easy answer is: these are dogs.
04:41
OK, image classifier. Cats or dogs? Dogs. What about this one? Dear image classifier, what is this? Is it a dog or a cat? Hopefully you'll know that this one is a koala. So here is your first challenge.
05:01
If you train your ML system to give you a very simple answer, A or B, dogs or cats, and you give it something different, your AI ML system will fail. So you need to be aware of what kind of problem you're solving. Don't just do a very simple AB classifier if you have all sorts of other things.
05:23
So let's not do an A/B image classifier, dogs versus cats. Let us instead ask: what animal is this? I'll give you a few seconds because the stream is laggy.
05:44
Hopefully you'll know this one. This one, kangaroo. What about this one, though? This one, a little bit harder for you.
06:00
I'm going to say it's a porcupine. I missed the chance. This one is a wombat. But this shows you a key problem that you're going to have if you're trying to get other people to give you training data. If your people, your humans, who are providing the training data do not know what the answer is,
06:22
they can't properly train your image classifier. So be aware of the limits of the humans providing the training data, because otherwise, if they can't tell you what it is, then your image classifier can't learn. Assuming your humans know what everything is, they can give you the right labels for the training data,
06:44
and you've got the right kind of algorithm, image classification, really cool. Off you go. The problem, though, that I have and some of you probably also have is that my data does not look like that. My data looks a lot more like this.
07:00
It is text. Some numbers, some spreadsheets, but mostly it's a lot of text. So if what you have is lots of images, I can really recommend this talk from Berlin Buzzwords from a couple of years ago. A really good example; the video is up online.
07:20
But what I have, what hopefully you will have is some text. So I have lots and lots of documents. Now, some of those are from big companies who use lots of non-standard words and terms. In many cases, they're in pretty specialist areas. So they'll take a commonly used word or term and use it in a very different way.
07:42
And they have lots of policies, they have procedures, they have training guides, all that kind of thing. So what I need to do is inexact searching over these documents. So for example, project training and help. Someone is new to the project. They can't go and search the help, the training, to find what they need because they don't know what the right term is.
08:04
They don't know the abbreviation. They don't know that what they need is the FLM training because they don't know what FLM stands for and they don't realize that someone thinks that's their job title and that's what the training is there for. So at the moment, what we have is really busy team leads keep getting messages saying, Help, I'm new, I don't know what to do, there's a load of training, I can't find the right stuff.
08:23
That's not great. We have RFP questionnaires, maybe you do too. We have lots of different potential customers who keep sending us these long questionnaires to fill in all about the system, the service that they want us to provide. But different companies use different words for the same kind of thing.
08:40
So at the moment, what happens is that our poor overworked RFP team keep rewriting lots of the same text, because they're trying to give a different answer to effectively the same question from a different starting point, and they can't go and find it in the library. We also have, in the logistics industry, a lot of temps, people trying to find the right job that they want in the job postings.
09:04
Our different customers tend to use different words to describe the same job. And then we have lots of people who have English as a second or third language, who are typing in the kind of job they want. They're not using the right words, they don't find the job they want. So at the moment, what we end up doing is training the hiring managers to put the same job title in about five or six different ways, which are the common ways that people tend to use.
09:25
But that's not really great, and it makes the job posting look terrible. What we'd really like is to be able to find nearly the right thing, even though we don't know what it is. Now, I'm not allowed to show you a lot of those documents, and even if I could, they probably wouldn't make much sense to you. But what we do have is a whole load of Berlin Buzzwords talks. We have six years' worth, to be exact.
09:46
So what we're going to be using is these Berlin buzzword talks, titles and abstracts, as the test data as we go through the talk. So they look a bit like this. So you've got a track, you've got a title, you've got abstract, you've got a speaker, and you'll also see that it's changed over the years.
10:05
This is this year's one; that one is from a couple of years ago. So we're going to need to do some pre-processing. What we're going to want to do is a mixture of clustering and recommendation, because we don't have the exact answer here.
10:20
So we're going to have to try and cheat a little bit. Now, the first challenge that we have to deal with is that AI and ML frameworks do not like Word documents, do not like spreadsheets, do not like raw HTML. So even though I've got my abstract here already available and I want to learn on it, I can't directly.
10:44
The main tool, though, that you should look at is Apache Tika. It's a framework for dealing with all sorts of documents and giving you back clean HTML. You can then split that HTML into chunks and work on it. What I did for the Berlin Buzzwords talks is I used a bit of Python with Beautiful Soup and turned them into JSON.
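(Not the speaker's actual script; a minimal sketch of that kind of pre-processing using the tika-python bindings and Beautiful Soup. The file path and the title/abstract field names are invented for illustration.)

```python
import json
from bs4 import BeautifulSoup   # pip install beautifulsoup4
from tika import parser         # pip install tika (needs Java for the Tika server)

# Tika turns almost any document format into clean XHTML.
parsed = parser.from_file("talks/2021/some-talk.docx", xmlContent=True)
soup = BeautifulSoup(parsed["content"], "html.parser")

# Pull out whatever structure your documents actually have; these
# selectors and field names are assumptions, not from the talk.
talk = {
    "title": soup.find("h1").get_text(strip=True) if soup.find("h1") else "",
    "abstract": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
}
with open("talks.json", "a", encoding="utf-8") as out:
    out.write(json.dumps(talk) + "\n")
```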
11:05
What we need to do then is get from something like this into something a bit like this. Some nice JSON, a nice bit of structure to it, some text. Unfortunately, though, that's not enough.
11:21
You can't just currently feed JSON into an AI and have magic. It would be great if we could. I suspect in a few years' time there will be things like that; Amazon SageMaker and some of the Microsoft Azure stuff are starting to get that way. But at the moment, and certainly for you learning, that's not something we can do.
11:42
What we need, and sorry for those of you who didn't like maths in school, is numbers. Generally, you need them between minus one and plus one, or zero and one. What we're going to need is one value for each feature. And by that, we mean the thing that we're going to learn on and predict on.
12:00
So we need to turn our text from this JSON that we have here into a whole bunch of ones and zeros. So how do we do that? And if this is all new to you, things like feature, label and so on, Google have put together a really good crash course, and it has some terminology in there that explains those to you.
12:21
But generally speaking, a feature is something that you feed into the model. The label is something that your experts, your humans have given you. The thing that's going to be predicted. Training is the process of feeding in those features and the labels and teaching the ML system. And inference is then when you give it something new and say, hey, what is this? What should this be?
12:45
So we need to take our text and get it into numbers. The first thing we need to do, and those of you who know Lucene or that sort of thing will understand this, is tokenization. The easiest thing that we can do is just split on whitespace and punctuation, one token per word.
13:02
Ideally, what you're going to be doing is stop words and stemming and all those kinds of things. But we'll skip over that for now; look at any Lucene talk for help on that. The first thing that we're going to do is build up a term dictionary. So if we have the text "the mouse ran up the clock", we can just assign indexes in our term dictionary and just put each word in, in turn.
13:23
So the very first word we come across, "the", gets position one, "mouse" position two, "ran" position three, "up" position four, "clock" position five, "down" position six. Come to our second sentence, "the mouse ran down": a lot of those words are in common, so we can reuse those same terms. And we end up here with a list of numbers. They're not quite zero to one, but at least we're getting indexes.
13:45
We've gone from the text, broken it up into terms with tokenization, assigned a term dictionary and got them some unique indexes. The simplest thing that we can then do is called one-hot encoding. So we say if the term exists, give it a one. If it doesn't exist, give it a zero.
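(A minimal sketch of the term dictionary and one-hot encoding just described; no stop words or stemming, just a whitespace-and-punctuation split.)

```python
import re

docs = ["The mouse ran up the clock.", "The mouse ran down."]

# Build the term dictionary: each new token gets the next free index.
term_index = {}
for doc in docs:
    for token in re.findall(r"\w+", doc.lower()):
        term_index.setdefault(token, len(term_index))

def one_hot(doc):
    """1 if the term occurs anywhere in the document, else 0."""
    tokens = set(re.findall(r"\w+", doc.lower()))
    return [1 if term in tokens else 0 for term in term_index]

print(term_index)                      # {'the': 0, 'mouse': 1, 'ran': 2, ...}
print(one_hot("The mouse ran down."))  # [1, 1, 1, 0, 0, 1]
```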
14:06
A more advanced thing you can do is counts, just taking the number of times that each word occurs. Then the next thing we really need is what's called TF-IDF. So if you have a document that contains a term a lot, so it keeps using the
14:23
word "text", for example, maybe we want to rate that document quite high for that term. But if all the documents have the same term, it's probably not very interesting: all the documents contain the word "the", so maybe we don't need to care about that. But if only a few documents contain that term, that's probably going to be a really relevant document for that term.
14:43
So we want to rate the rare terms higher, common terms lower, and within a given document we want to rate the term higher the more it's used. So if a talk abstract keeps talking about Tika, then when we're searching for something about Tika, that's going to be a relevant thing to use.
15:00
The great thing with modern-day coding is that you can do just a couple of lines of Python and get that. If this is new to you, there are a couple of good talks from past Berlin Buzzwords about some of the other ways of doing it, some more advanced things. But generally, TF-IDF is good enough. So what I did is I took the Berlin Buzzwords talks, stuffed them into a TF-IDF, and then took the top terms out.
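(A sketch of that, assuming `abstracts` is a list of the talk abstracts loaded from the JSON built earlier; the exact vectoriser options are guesses, not the speaker's.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(abstracts)   # documents x terms, sparse

# Rank terms by their total TF-IDF weight across all documents.
scores = matrix.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
for term, score in sorted(zip(terms, scores), key=lambda ts: -ts[1])[:20]:
    print(f"{term:20s} {score:.2f}")
```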
15:28
And if you look at those, hopefully they're the kind of top terms that you'd expect from Berlin Buzzwords for the last six years. So we can get a sense of what's going on. We're not seeing "the", "a", those kinds of really common things in there.
15:45
We're tending to see the things about our talks, about our sessions, which is good. So the first thing that we're going to try and do is build a classification model based on the talks. We're going to use the TF-IDF to take the text and turn that into a matrix of numbers.
16:05
And we're going to classify each of the talks, and then we're going to say, here's some text, find me the talk that looks most like that and find me the other things like that. Now, the great news is that it's only a few lines of code to do that using a library called Scikit-Learn, which I'd really recommend if you're getting started.
16:25
It's very simple. It's not necessarily the fastest, but it is pretty fast, and it makes life pretty easy. So I am going to risk something here: I'm going to go over here and show you some code.
16:41
This is just importing it all, loading up the talks. What I'm going to try and do here is I'm going to build up a model, do a query and try and see what are the best talks for,
17:02
based on a simple classifier for these terms here. Hopefully, you can see those. So we're just going to pray to the demo gods and hit play. And you'll see here that we've got 60 lines of Python.
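(The real 60-line file is in the GitHub repo; this is just a minimal sketch of its core idea, with invented variable names: vectorise the abstracts with TF-IDF, then return the most similar talk for each query.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# `talks` is assumed to be the list of {"title": ..., "abstract": ...}
# dicts produced by the pre-processing step earlier.
titles = [t["title"] for t in talks]
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([t["abstract"] for t in talks])

def best_talk(query):
    # Project the query into the same TF-IDF space, take the closest talk.
    sims = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    return titles[sims.argmax()]

for query in ["Apache Tika", "ngram", "NLP", "Apache Spark"]:
    print(query, "->", best_talk(query))
```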
17:22
A lot of it is comments and white space. So it's run, built up a model, and it said: what is the best talk for Apache Tika? "What's new in Apache Tika?", the talk by Tim yesterday. I think that's probably a pretty good one. N-gram: I'm not sure that the opening session was necessarily the best one for n-gram.
17:43
So maybe we've maybe we failed a bit there. NLP, GainSpade, some NLP. Spark, low latency web crawling with Apache Storm. That seems like a good one. Apache Spark, if only it worked.
18:01
Yeah, that all seems pretty good. So how's it gone? Not bad, I'd say, just with a few lines of Python and indexing. So if you want to play around with this more, all the code is there in the GitHub repo; have a look yourself. This here is the Microsoft Azure Machine Learning framework, which is pretty good for getting started.
18:22
I'll cover that later. So, back to the slides. The next approach we can try is clustering. So what we're going to try and do is group all the similar talks together. Now, we don't know what the right groups are.
18:41
Maybe if we sat there with one of the experts, they might be able to group them together. But what we're going to try doing is just use something called k-means and try and get that. The thing to know when you're doing clustering is that the more clusters you have, the better the fit will be, but the less useful they'll be.
19:01
So if we have one cluster, it's going to be the maximum error, but everything's in together, not very useful. If we have as many clusters as we have talks, then every cluster is perfect, but again, it's not very useful. So what we're going to have to do is run this a whole bunch of times, try and find the best number of clusters, and then go from that.
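(A minimal sketch of that search over cluster counts, reusing the TF-IDF `matrix` from before; k-means' `inertia_` is its total within-cluster error.)

```python
from sklearn.cluster import KMeans

errors = {}
for k in range(2, 61, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(matrix)
    errors[k] = km.inertia_   # always falls as k grows; look for the elbow
    print(k, round(km.inertia_, 2))
```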
19:23
It's only a few lines again. I'll skip over that because I'm running short on time. But the next thing to remember is that you're going to need to do analysis yourself. So here's a graph where we compare the cluster size against the accuracy, and we see that it doesn't go linearly. And so we, as data scientists, are going to need to look in here and say, what is going to be the best tradeoff for me?
19:44
What's going to be the best error, the best size? And then we're going to have to take a guess. You'll also end up doing analysis like this. t-SNE is a pretty common thing. So I built a cluster earlier, ran some analysis, and then I reduced it down.
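(A sketch of that kind of reduction: squash the sparse TF-IDF down to 50 dimensions first, then let t-SNE take it down to three for plotting. The dimension counts match what's described; the rest is assumption.)

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Sparse TF-IDF -> 50 dense dimensions -> 3 plottable dimensions.
reduced = TruncatedSVD(n_components=50, random_state=42).fit_transform(matrix)
points = TSNE(n_components=3, random_state=42).fit_transform(reduced)
# `points` is (n_talks, 3): colour each point by its k-means cluster and
# eyeball whether the clusters come out as separate blobs.
```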
20:03
And we can look at this and say, did everything come together? I'm going to say maybe, but it's not amazing, to be honest. But that was 50 dimensions down to three. So maybe that's not the end of the world. But you are going to have to spend a lot of time as a data scientist looking at this kind of thing and saying, well, if I tune that parameter, does it get better?
20:22
Does it get worse? What are we going to need to do? I'm going to skip over these because we're running a bit short on time. Another thing to consider is adding the year to our scoring. So can we prefer newer talks to older ones?
20:40
Things move on in time. Maybe a talk on Apache Tika from this year is more relevant than a talk from five years ago. Now, we can't easily add it as a feature because we're not going to know that at query time. Most people are not going to say, can you show me the talks on Apache Tika from this year? They're just going to say, help, I want to know about Apache Tika. So we're not going to have that year available.
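(One option, sketched here with made-up weights: keep the year out of the features, but apply a recency boost to the similarity score at query time.)

```python
def score(similarity, talk_year, current_year=2021, half_life=4.0):
    # A talk loses half of its recency boost every `half_life` years.
    recency = 0.5 ** ((current_year - talk_year) / half_life)
    return similarity * (0.7 + 0.3 * recency)   # blend weights are guesses
```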
21:02
We can add it at scoring time, but we need to be careful to make sure that we don't push it into a different cluster. We're going to want to fit it into the model. We need to take care with this sort of thing. We also need to be aware that this data isn't long term static. So let's say I sit down with an expert and I go through and say for each of these possible query terms, what is the best talk?
21:25
Maybe my data scientist, maybe my expert will give me that information for 2015. Then I keep adding new talks in as Berlin Buzzwords goes on, but I don't retrain. And what will end up happening is that we'll just keep recommending the same talks even though new ones have come in.
21:41
So we need to take care that, if we have some known good answers, we don't just bias things to that handful of pre-trained answers. Next up: embeddings and some feature extraction. TF-IDF is a really good way to get started, but it's probably not the state of the art.
22:04
The TF-IDF is going to end up with 20, 30, 40,000 different terms in it. A lot of the advanced techniques we want to use really prefer 15 to 50 terms. And the TF-IDF is very sparse. Some techniques are perfectly fine with that.
22:21
Neural networks don't work so well with it. If you take that TF-IDF with the 40,000 terms in it, maybe you are going to need that 24-terabyte machine rather than just your laptop. So what we're going to want to do for a lot of advanced techniques is use embeddings to get that down. TF-IDF also has no semantic information.
22:42
So if I take these two phrases: "My Kindle is easy to use, I do not need help." This is a user that's pretty positive. If I rearrange those words: "I do need help, my Kindle is not easy to use." That is a completely different meaning. If someone's saying the first one, thumbs up, great. If they're saying the second one, we're going to need to get customer services involved.
23:03
TF-IDF loses that; some of the other embeddings do keep that in. Embedding approaches to be aware of: Word2Vec and GloVe, some of the simpler ones; ELMo and BERT; and GPT-2. No time at all to go into those, sorry.
23:22
But if you're interested, go and Google them later; I've got some references. Word2Vec was originally developed by Google. It's based on a two-layer neural network, and it gives you a vector space that includes the semantic information. If we train it on enough text, we can say "man is to boy what woman is to...", and it can predict "girl".
23:43
So as we see in these lovely examples that I've stolen from Google, it's able to make predictions. It's able to see what a country is to its capital. It's able to deal with verb tenses, and it's even able to deal with irregular verb tenses: "walking is to walked what swimming is to...", and it can predict "swam", even though that's an irregular verb.
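(Not from the talk: a sketch of those analogies using Gensim and its pretrained Google News Word2Vec vectors, which are a large download on first use.)

```python
import gensim.downloader   # pip install gensim

w2v = gensim.downloader.load("word2vec-google-news-300")

# man : boy :: woman : ?   ->   vector(boy) - vector(man) + vector(woman)
print(w2v.most_similar(positive=["boy", "woman"], negative=["man"], topn=3))

# walking : walked :: swimming : ?
print(w2v.most_similar(positive=["walked", "swimming"], negative=["walking"], topn=3))
```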
24:04
So it's really powerful for working on real text. We'll return to that one in a few minutes. A few other words to be aware of: hyperparameters are basically anything that you, the data scientist, change when you're trying to do the training.
24:25
So in k-means, that's the number of clusters we use. Often it's the number of steps, the number of layers, the number of things. If you pick the wrong parameters, your model might wander off and locally tune to something that's not ideal. A lot of your time as a data scientist is actually going to be tuning these things called hyperparameters.
24:44
Should I have a k-means cluster count of 50 or 51 or 52? The other thing that you're going to spend a lot of time doing is tuning your input data, cleaning up and dealing with all the crap that comes from real data. You need to be aware of the errors.
25:01
For example, if you're trying to detect cancer, is it better to give someone the all-clear when they actually have cancer, a false negative? Or is it better to send them for treatment they don't need, a false positive? I'm not a medical ethicist, I'm not sure about this one. But be aware of the impact of the different kinds of errors that go on in your system. You need to be aware of biases in your input data.
25:21
So if you have biases in your input data, you will have a bias in your model, which means that you will make unfair judgments. And those biases can sneak through. So you might say, I don't want my machine learning system to be biased between men and women, so what I'm going to do is hide the gender field from it.
25:42
But I leave in the name and it can still learn the biases from the name. I might say I'm worried about race. I'm going to exclude the race column when I'm training my system. But what I am going to leave in is the postal code and zip code. And it turns out that in many countries there are effectively ghettos. People from certain races live in the same place.
26:03
If you leave in the postal code or zip code, the bias that is present in your society then seeps through and you end up with a racist AI. So be aware of those data biases. Be aware of the biases in your society. Try and correct for them. Don't just assume that if I feed in ones and zeros, it's all safe.
26:24
OK, the examples I've given before were with scikit-learn, which is very easy to get started with. What we can also use is Apache MXNet. This is a really powerful framework, really fast. What we're going to need to use for a lot of this is the GluonNLP library.
26:44
If we want to load in GloVe, which is an equivalent of Word2Vec, the code required is just this on the screen here. It will take about five minutes the first time you run it because it's going to go and download a load of the Internet. In this case, it's going to be a lot of Wikipedia.
27:01
But we can take this bit of text here, run it, and say: what are similar things to "search"? And based on the word embeddings, it comes back and says "searching", "searches", "information". That's probably good. What are similar tokens to "Linux"? "Unix", "open-source", "kernel". That's all pretty good. So it's managed to learn that based on the Wikipedia text and come back with the answer.
27:26
We can also say "Berlin is to Germany as Paris is to...", and it was able to predict "France". So you can do that inference. Have a look at the code for it; a rough sketch of the same queries follows.
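(Not the exact code from the screen: a sketch of the same queries with GluonNLP's pretrained GloVe vectors; the particular vector source is an assumption.)

```python
import mxnet as mx
import gluonnlp as nlp   # pip install mxnet gluonnlp

# Downloads the pretrained vectors the first time it runs.
glove = nlp.embedding.create("glove", source="glove.6B.100d")

def closest(vec, k=5):
    # Cosine similarity of `vec` against every vector in the vocabulary.
    emb = glove.idx_to_vec
    sims = mx.nd.dot(emb, vec) / (mx.nd.norm(emb, axis=1) * vec.norm() + 1e-10)
    best = sims.argsort(is_ascend=False)[:k].asnumpy().astype(int)
    return [glove.idx_to_token[i] for i in best]

print(closest(glove["search"]))   # similar tokens to "search"
# berlin : germany :: paris : ?  (the top hits may include the inputs)
print(closest(glove["germany"] - glove["berlin"] + glove["paris"]))
```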
27:43
Other things you need to be aware of: how are you going to tell whether your model or pipeline got better or worse? How are you going to measure it? How are you going to productionize it? You don't want to take something like this, say it's already good, and then halve the size of your Wikipedia and find it starts producing gibberish. So how are you going to test that? How are you going to monitor that? And there was a talk earlier about Apache Tika.
28:01
I'd have a look at the Apache Tika eval approach there. The question it tries to answer is: if you have two terabytes of regression corpus, how do you tell if a library upgrade made things better or worse? You can't go and check all those millions of files; you have to use heuristics. Think about data engineering; there are lots of talks on that. Don't just say "it works on my laptop, ship it". Make sure that you have a way to reprocess and retest.
28:26
Final question, since this is about text: did Elastic and Lucene do better? Out of the box, with some limited setup, mostly no. But as soon as we tuned Elasticsearch, it got pretty good. If we have loads and loads of data, too much to fit into memory, Lucene, Solr, Elastic will almost certainly win.
28:48
They have some really smart techniques in there that let us do low memory stuff that let us stream through, whereas almost all the AI stuff needs to all be in memory. If we'd had a much bigger set of text going in, maybe we could have done it better with the ML.
29:06
The area where the ML really wins out is really specific terms. So if we can train the ML on our really weird words, the ML can win over something more generic because it can pick up the meaning we give to FLM and know what that means.
29:23
The big thing I would say, though, is that it was really invaluable for us as a team to put this together, so that we could go off and learn how to do AI and ML, and to do it using the kinds of problems that we actually have to solve. So I'd say, if you have lots of text, do the same thing yourself.
29:41
Go away, follow through the tutorials, work through the examples I've given, go off, learn it for yourself, try it for yourself, see if it works. In the end, you might say, actually, we'll stick with Elastic. But maybe you'll find that AI and ML is better, but you'll certainly learn something along the way. Now, we are out of time. So I'm going to have to say, if you want to know more, go off, have a look, read through the text, go through the slides.
30:06
I have a whole bunch of resources in the slides for you to go off and learn. If you have any questions, you're going to need to accompany me into the breakout space because Josh is about to kick me out. I'll be delighted to answer any questions that you have in the breakout space.
30:22
And I thank you for joining me on this whistle-stop tour. Perfect transition, Nick. Thank you very much.