Text categorization with Apache Lucene
Formal metadata

Title: Text categorization with Apache Lucene
Series: Berlin Buzzwords 2021 (talk 29 of 69)
License: CC Attribution 3.0 Unported: You may use and modify the work or its content for any legal purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you name the author/rights holder in the manner specified by them.
Identifiers: 10.5446/67337 (DOI)
Language: English
Transcript: English (automatically generated)
00:08
So here we go again, we will talk about text categorization and we will use AI, oh sorry, Apache Lucene. So actually I don't have slides for this, we will move to the demo and I also
00:25
will browse some websites that are needed for what I want to explain and for credits. So the objective here is to take some random text and to categorize it and we will use Apache Lucene
00:42
for this. We will not even use linear regression, no AI, just pure Apache Lucene. Just to give some credits, this talk was first presented by my colleagues at a meetup that we held here in Paris, and it was in French and on Solr. This one is in English and on Elasticsearch, but just to say
01:07
that it's actually Apache Lucene that does the job, so you can do it with Solr or Elasticsearch, all of those work. Credits also for this really great data set, which contains
01:22
2225 documents from the BBC News website in 2004 and 2005, and they are manually categorized into these categories. This data set is actually used in some scientific publications
01:41
and we will index it with Lucene and use it for text categorization. So here it is, we have the 2,224 documents. There's one missing, I don't know why, but it will still work, and here's the definition of the index that we created with this.
02:03
Pretty straightforward, I mean nothing fancy. You have the category which is a keyword and you have the text. And here's the tricky part, this one does the magic. So we will talk about this later, but just keep in mind this line. Okay, so again some verifications here, we have the number of
02:25
documents all there, we also have them grouped by categories, the five categories that we mentioned, so yeah, they are pretty much equally distributed across the categories, and here's how a document looks. You have the category and then the text, which is the BBC article from 2004-2005.
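The index definition itself isn't reproduced in the transcript. Here is a minimal sketch of what it could look like, in Kibana Dev Tools console syntax, assuming a hypothetical index name bbc-news and assuming that the "magic" line the speaker points at is the term_vector setting on the text field (he confirms later that term vectors are what make this work):

```
PUT /bbc-news
{
  "mappings": {
    "properties": {
      "category": { "type": "keyword" },
      "text": {
        "type": "text",
        // store per-document term vectors alongside the inverted index
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
```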
02:48
Now, how will we categorize text? We'll actually use a query like this, which is the more like this query, and here we will plug in the text that we want to categorize,
03:04
and we will also do an aggregation. So of course it's sorted by the score, but we will also do an aggregation that will display the most likely categories, ordered by a
03:21
pipeline aggregation on average score. Okay, so this is another trick that does the magic, so nothing fancy, but yeah, this is useful for what comes next.
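The exact request isn't shown in the transcript either. A minimal sketch of this kind of more_like_this query plus category aggregation, reusing the hypothetical bbc-news index; ordering the terms buckets by a scripted average of _score is one way to get "most likely categories ordered by average score", and the min_term_freq/min_doc_freq values are assumptions. The speaker mentions a pipeline aggregation, so the real query may use something like bucket_sort instead:

```
GET /bbc-news/_search
{
  "query": {
    "more_like_this": {
      "fields": ["text"],
      "like": "…paste the text to categorize here…",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  },
  "aggs": {
    "likely_categories": {
      "terms": {
        "field": "category",
        "order": { "avg_score": "desc" }
      },
      "aggs": {
        "avg_score": {
          "avg": { "script": { "source": "_score" } }
        }
      }
    }
  }
}
```

The top hits give the closest BBC articles, and the aggregation buckets give the candidate categories ranked by how well their documents match the pasted text.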
03:42
So here we go, let's take a random article from the Berlin Buzzwords website. So actually this one is my colleague who shares a success story, so thank you Berlin Buzzwords for promoting diversity. So we just copied and pasted this article into the placeholder here, and here we go, let's see how it goes. So we see that we have some category tech,
04:02
some category tech again, some politics here, yeah, you might understand. So it's actually Berlin Buzzwords, so it talks about tech, but you also have politics because of the feminism and all that. So in the aggregations we have tech in the first position and then we
04:20
have some sport and entertainment, I don't know why, but yeah, it's definitely an article about tech. Okay, this is great, let's go further. I took another article, and so you don't think these articles aren't fresh, this other one is from Reuters, from yesterday, and it talks about the jet subsidy pact and about China. So I imagine this would be
04:44
something like business and politics, let's see how the software works. Yeah, we have business here, we have business again, business, politics and politics, and in the aggregations we have business and next we have politics. So it's an article about business, pretty neat.
05:05
Some other examples here, this one is from sports, yeah, you can imagine that, so this is from France 24 and it talks about Euro 2020. No suspense, I think it will be categorized as sports,
05:23
so here it is, sports, sports, sports, and in the aggregations, sport comes first. There's also something about tech and entertainment, pretty neat. Yeah, we also took some other examples, I will go faster through those. This one is taken from the Washington Post, just to
05:45
have diverse sources, so this one is actually from the books section, which is under arts and entertainment. We took an article from there, let's see how it goes, to test another category,
06:02
so it's entertainment. So yeah, it works pretty well, just pure Apache Lucene. So how does it work, actually? Well, it uses something that's called term vectors, and before we go into that we just
06:21
need to know what happens with a document that's indexed into a Lucene index. So sorry for this, this one is in French, it's taken from a blog that's on the Elasticsearch website. So the documents that we send to Apache Lucene are actually transformed and they get into the index
06:43
in a different form. In this case we have an n-gram transformation, but we also have the term vectors in here, so the index will contain something that's called term vectors, which you can see here.
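That transformation is the analysis chain, and it can be inspected directly with the _analyze API. A minimal sketch, assuming the standard analyzer (the talk doesn't show which analysis chain is actually configured):

```
POST /_analyze
{
  "analyzer": "standard",
  "text": "Berlin Buzzwords promotes diversity in tech"
}
```

The response lists the tokens that end up in the inverted index and, when term vectors are enabled on the field, in the per-document term vectors.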
07:02
So here is, for example, a document that I have in the index, and these are the term vectors. So this is the word, okay, so you have plenty of words here. Yeah, let's take something more meaningful. So a lot of words again, yeah, like add, adjust, and also the properties that they have here, these are the term vectors. And when you come
07:25
with a text, this one is a text, it will actually get translated into the same structure. So you have the words here, so this is a text coming from Berlin Buzzwords, and it is also translated into this vector space model. So the term vectors exist in Elasticsearch,
07:47
as I said, it's actually Lucene that does the trick, so it's also available in Solr, just to give credit for that.
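Those term vectors can be retrieved with the _termvectors API, both for a document already in the index and for an arbitrary piece of text, which is what the speaker is showing on screen. A minimal sketch, reusing the hypothetical bbc-news index and a made-up document id:

```
# term vectors of a document already in the index
GET /bbc-news/_termvectors/1
{
  "fields": ["text"],
  "term_statistics": true
}

# term vectors computed on the fly for an arbitrary ("artificial") document
GET /bbc-news/_termvectors
{
  "doc": { "text": "…the text you want to categorize…" },
  "fields": ["text"],
  "term_statistics": true
}
```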
08:03
And it kind of looks like this. I actually have a slide that I will show now, and I would also like to thank my colleague Vincent Bosc for this. He thought you would understand it better with this slide. So actually the vector space here is very simplified, it's just two dimensions, but
08:23
actually it has a lot of dimensions. And you have the words here, and then you have the documents that are kind of spread across this vector space. And when you come with a new document, you see here the question mark, the system will actually figure out which are
08:40
the documents that are the closest to this one. And this is how it gets categorized, simply with this vector space model.
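For intuition only: in this simplified picture, "closest" can be read as the documents whose term vectors are most similar to the vector of the incoming text, for example by cosine similarity. Lucene's actual ranking uses its own scoring function (BM25 by default in recent versions), so this is an illustration rather than the exact formula used by the more like this query:

$$\operatorname{sim}(q, d) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}$$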
09:05
Okay, just to finish the presentation, I would also like to thank my other colleague, Vincent Brian, who also has a talk that we did at the Elastic Community conference about the percolator, and it actually uses the same technique in a different context. So if you want to dig in further, you can go and see this presentation; it's still a lightning talk, it takes six minutes. And what do we have here? Yeah, we also have
09:23
a blog post about the same subject if you want to go further. And last but not least, just to show you that there's no trick here, we are live today. Let's see something, because the articles that I've shown you date from yesterday. This one is really live, because we have a football
09:45
game that's going on, and this is the live reporting from the BBC, to give them credit again for the data set. And yeah, you have the score here. I think this is live. Yeah. Okay, so it goes live. Let's take the summary here. I don't know, can I? Yeah. Okay, so I will just
10:05
copy and paste the summary. And hope for the best. I hope I don't have a demo effect here. Let's see. What does this text talk about? Sports? Okay, so no demo effect, we have
10:23
sports and some tech there. So this is it: pure Apache Lucene that is used to categorize any random text that you throw at it. Just 2200 documents from the BBC. And that's it.
10:41
Thank you, Lucene. I just had one question for you. We're a little short on time now heading into our final lightning talk. But my one question would be, yeah, further to what Cedric said, like, what's the scenario where this should perhaps not be used, where term vectors are not best applied, or where they produce the least effective results? Yeah, so great question. Thank you for this question. Of course,
11:08
there are some drawbacks and there are some things that are not working. I didn't show them because, yeah, with this example it works pretty well.
11:20
When the text is really short, it's kind of tricky to figure out what it is. So this is actually suited for unstructured text like articles coming from the media, or Word documents, or stuff like that. And yeah, this is mostly suited for
11:45
rather large pieces of text. But yeah, thank you.