Text categorization with Apache Lucene
Formal metadata

Title: Text categorization with Apache Lucene
Series: Berlin Buzzwords 2021 (talk 29 of 69)
License: CC Attribution 3.0 Unported: You may use and modify the work or its content for any legal purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you name the author/rights holder in the manner specified by them.
Identifiers: 10.5446/67337 (DOI)
Language: English
Transcript: English (automatically generated)
00:08
So here we go again, we will talk about text categorization and we will use AI, oh sorry, Apache Lucene. So actually I don't have slides for this, we will move to the demo and I also
00:25
will browse some websites that are needed for what I want to explain and for credits. So the objective here is to take some random text and to categorize it and we will use Apache Lucene
00:42
for this. We will not even use linear regression, no AI, just pure Apache Lucene. Just to give some credits, this talk was first presented by my colleagues at a meetup that we held here in Paris, and it was in French and on Solr. This one is in English and on Elasticsearch, but just to say
01:07
that it's actually Apache Lucene that does the job, so you can do it with Solr or Elasticsearch, all of those work. Credits also for this really great data set, which contains
01:22
2225 documents from the BBC News website in 2004 and 2005, and they are manually categorized into these categories. This data set is actually used in some scientific publications
01:41
and we will index it with Lucene and use it for text categorization. So here it is, we have the 2,224 documents. There's one missing, I don't know why, but it will still work, and here's the definition of the index that we created with this.
02:03
Pretty straightforward, I mean nothing fancy. You have the category which is a keyword and you have the text. And here's the tricky part, this one does the magic. So we will talk about this later, but just keep in mind this line. Okay, so again some verifications here, we have the number of
02:25
documents all there, we also have them grouped by categories, the five categories that we mentioned, so yeah, they are pretty much equally distributed across the categories, and here's how a document looks. You have the category and then the text, which is the BBC article from 2004-2005.
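The index definition itself isn't reproduced in the transcript. Here is a minimal sketch of what it could look like, in Kibana Dev Tools console syntax, assuming a hypothetical index name bbc-news and assuming that the "magic" line the speaker points at is the term_vector setting on the text field (he confirms later that term vectors are what make this work):

```
PUT /bbc-news
{
  "mappings": {
    "properties": {
      "category": { "type": "keyword" },
      "text": {
        "type": "text",
        // store per-document term vectors alongside the inverted index
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
```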
02:48
Now, how will we categorize text? We'll actually use a query like this, which is the more like this query, and here we will plug in the text that we want to categorize,
03:04
and we will also do an aggregation. So of course it's sorted by the score, but we will also do an aggregation that will display the most likely categories, ordered by a
03:21
pipeline aggregation on average score. Okay, so this is another trick that does the magic, so nothing fancy, but yeah, this is useful for what comes next.
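The exact request isn't shown in the transcript either. A minimal sketch of this kind of more_like_this query plus category aggregation, reusing the hypothetical bbc-news index; ordering the terms buckets by a scripted average of _score is one way to get "most likely categories ordered by average score", and the min_term_freq/min_doc_freq values are assumptions. The speaker mentions a pipeline aggregation, so the real query may use something like bucket_sort instead:

```
GET /bbc-news/_search
{
  "query": {
    "more_like_this": {
      "fields": ["text"],
      "like": "…paste the text to categorize here…",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  },
  "aggs": {
    "likely_categories": {
      "terms": {
        "field": "category",
        "order": { "avg_score": "desc" }
      },
      "aggs": {
        "avg_score": {
          "avg": { "script": { "source": "_score" } }
        }
      }
    }
  }
}
```

The top hits give the closest BBC articles, and the aggregation buckets give the candidate categories ranked by how well their documents match the pasted text.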
03:42
So here we go, let's take a random article from the Berlin Buzzwords website. So actually this one is my colleague who shares a success story, so thank you Berlin Buzzwords for promoting diversity. So we just copied and pasted this article into the placeholder here, and here we go, let's see how it goes. So we see that we have some category tech,
04:02
some category tech again, some politics here, yeah, you might understand. So it's actually Berlin Buzzwords, so it talks about tech, but you also have politics because of the feminism and all that. So in the aggregations we have tech in the first position and then we
04:20
have some sport and entertainment, I don't know why, but yeah, it's definitely an article about tech. Okay, this is great, let's go further. I took another article, and so you don't think these articles aren't fresh, this other one is from Reuters, from yesterday, and it talks about the jet subsidy pact and about China. So I imagine this would be
04:44
something like business and politics, let's see how the software works. Yeah, we have business here, we have business again, business, politics and politics, and in the aggregations we have business and next we have politics. So it's an article about business, pretty neat.
05:05
Some other examples here, this one is from sports, yeah, you can imagine that, so this is from France 24 and it talks about Euro 2020. No suspense, I think it will be categorized as sports,
05:23
so here it is, sports, sports, sports, and in the aggregations, sport comes first. There's also something about tech and entertainment, pretty neat. Yeah, we also took some other examples, I will go faster through those. This one is taken from the Washington Post, just to
05:45
have diverse sources, so this one is actually from the books section, which is under arts and entertainment. We took an article from there, let's see how it goes, to test another category,
06:02
so it's entertainment. So yeah, it works pretty well, just pure Apache Lucene. So how does it work, actually? Well, it uses something that's called term vectors, and before we go into that we just
06:21
need to know what happens with a document that's indexed into a Lucene index. So sorry for this, this one is in French, it's taken from a blog that's on the Elasticsearch website. So the documents that we send to Apache Lucene are actually transformed and they get into the index
06:43
in a different form. In this case we have an n-gram transformation, but we also have the term vectors in here, so the index will contain something that's called term vectors, which you can see here.
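That transformation is the analysis chain, and it can be inspected directly with the _analyze API. A minimal sketch, assuming the standard analyzer (the talk doesn't show which analysis chain is actually configured):

```
POST /_analyze
{
  "analyzer": "standard",
  "text": "Berlin Buzzwords promotes diversity in tech"
}
```

The response lists the tokens that end up in the inverted index and, when term vectors are enabled on the field, in the per-document term vectors.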
07:02
So here is, for example, a document that I have in the index, and these are the term vectors. So this is the word, okay, so you have plenty of words here. Yeah, let's take something more meaningful. So a lot of words again, yeah, like add, adjust, and also the properties that they have here, these are the term vectors. And when you come
07:25
with a text, this one is a text, it will actually get translated into the same structure. So you have the words here, so this is a text coming from Berlin Buzzwords, and it is also translated into this vector space model. So the term vectors exist in Elasticsearch,
07:47
as I said, it's actually Lucene that does the trick, so it's also available in Solr, just to give credit for that.
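Those term vectors can be retrieved with the _termvectors API, both for a document already in the index and for an arbitrary piece of text, which is what the speaker is showing on screen. A minimal sketch, reusing the hypothetical bbc-news index and a made-up document id:

```
# term vectors of a document already in the index
GET /bbc-news/_termvectors/1
{
  "fields": ["text"],
  "term_statistics": true
}

# term vectors computed on the fly for an arbitrary ("artificial") document
GET /bbc-news/_termvectors
{
  "doc": { "text": "…the text you want to categorize…" },
  "fields": ["text"],
  "term_statistics": true
}
```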
08:03
And it kind of looks like this. I actually have a slide that I will show now, and I would also like to thank my colleague Vincent Bosc for this. He thought you would understand it better with this slide. So actually the vector space here is very simplified, it's just two dimensions, but
08:23
actually it has a lot of dimensions. And you have the words here, and then you have the documents that are kind of spread across this vector space. And when you come with a new document, you see here the question mark, the system will actually figure out which are
08:40
the documents that are the closest to this one. And this is how it gets categorized, simply with this vector space model.
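For intuition only: in this simplified picture, "closest" can be read as the documents whose term vectors are most similar to the vector of the incoming text, for example by cosine similarity. Lucene's actual ranking uses its own scoring function (BM25 by default in recent versions), so this is an illustration rather than the exact formula used by the more like this query:

$$\operatorname{sim}(q, d) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}$$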
09:05
Okay, just to finish the presentation, I would also like to thank my other colleague, Vincent Brian, who also has a talk that we did at the Elastic Community conference about the percolator, and it actually uses the same technique in a different context. So if you want to dig in further, you can go and see this presentation; it's still a lightning talk, it takes six minutes. And what do we have here? Yeah, we also have
09:23
a blog post about the same subject if you want to go further. And last but not least, just to show you that there's no trick here, we are live today. Let's see something, because the articles that I've shown you date from yesterday. This one is really live, because we have a football
09:45
game that's going on, and this is the live reporting from the BBC, to give them credit again for the data set. And yeah, you have the score here. I think this is live. Yeah. Okay, so it goes live. Let's take the summary here. I don't know, can I? Yeah. Okay, so I will just
10:05
copy and paste the summary. And hope for the best. I hope I don't have a demo effect here. Let's see. What does this text talk about? Sports? Okay, so no demo effect, we have
10:23
sports and some tech there. So this is it: pure Apache Lucene that is used to categorize any random text that you throw at it. Just 2200 documents from the BBC. And that's it.
10:41
Thank you, Lucene. I just had one question for you. We're a little short on time now heading into our final lightning talk. But my one question would be, yeah, further to what Cedric said, like, what's the scenario where this should perhaps not be used, where term vectors are not best applied, or where they produce the least effective results? Yeah, so great question. Thank you for this question. Of course,
11:08
there are some drawbacks and there are some things that are not working. I didn't show them because, yeah, with this example it works pretty well.
11:20
When the text is really short, it's kind of tricky to figure out what it is. So this is actually suited for unstructured text like articles coming from the media, or Word documents, or stuff like that. And yeah, this is mostly suited for
11:45
rather large pieces of text. But yeah, thank you.