Text categorization with Apache Lucene

Plain Schwarz

Precup, Lucian

Formal Metadata

Title

Text categorization with Apache Lucene

Title of Series

Berlin Buzzwords 2021

Number of Parts

Author

Precup, Lucian

Contributors

N. N. (Moderation)

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/67337 (DOI)

Publisher

Plain Schwarz

Release Date

2021

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Like many text mining applications, automatic text categorization is usually implemented with some flavor of machine learning algorithm, trained on an appropriate training set to build a model. This model contains statistical data derived from the training set texts, which later allows the system to match an input document with the corresponding category. But wait… statistical data on text documents? Doesn’t that remind you of our dear improved inverted index at the core of Apache Lucene? Maybe we could consider the index containing the training set documents as our trained model? And maybe a simple query against this index could give us the category (or categories) most likely to apply to a given document? In this talk, we will demonstrate this approach and show that it can perform well, and possibly "for free", since we surely all have Lucene-based tools in our application portfolios!
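
The idea in the abstract can be illustrated with a small sketch. It is not taken from the talk and does not use Lucene's API: `TinyIndex`, `tokenize`, the sample documents, and the simplified TF-IDF-style score are all assumptions made for illustration. The training documents are stored in a toy inverted index together with their category labels, and an input document is categorized by querying that index and letting the top-scoring training documents vote.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


class TinyIndex:
    """Toy inverted index standing in for a Lucene index (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.categories = []               # doc_id -> category label

    def add(self, text, category):
        """Index one training document together with its known category."""
        doc_id = len(self.categories)
        self.categories.append(category)
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, text, k=3):
        """Return the k best (doc_id, score) pairs for the query text.

        The score is a simplified TF-IDF sum, an assumption of this sketch
        and not the similarity that Lucene actually applies.
        """
        n_docs = len(self.categories)
        scores = defaultdict(float)
        for term in set(tokenize(text)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += math.sqrt(tf) * idf
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]

    def classify(self, text, k=3):
        """Let the top-k hits vote for their category, weighted by score."""
        votes = defaultdict(float)
        for doc_id, score in self.search(text, k):
            votes[self.categories[doc_id]] += score
        return max(votes, key=votes.get) if votes else None


# Hypothetical training set: the "model" is nothing more than the index.
index = TinyIndex()
index.add("the team won the football match", "sports")
index.add("the striker scored a late goal", "sports")
index.add("the parliament passed the new budget law", "politics")
index.add("the minister announced an election", "politics")

print(index.classify("a goal in the football match"))
```

Note that there is no separate training step here: adding a document to the index is the training, which is the sense in which the approach may come "for free" when such an index already exists.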