
Text categorization with Apache Lucene

Formal Metadata

Title
Text categorization with Apache Lucene
Title of Series
Number of Parts
69
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Like many text mining applications, automatic text categorization is usually implemented with some flavor of Machine Learning algorithm, trained on an appropriate training set to build a model. This model contains statistical data derived from the training set texts, which later allows the system to match an input document with the corresponding category. But wait… statistical data on text documents? Doesn't that remind you of our dear improved inverted index at the core of Apache Lucene? Maybe we could treat the index containing the training set documents as our trained model? And maybe a simple query against this index could give us the category (or categories) most likely to apply to a given document? In this talk, we will demonstrate this approach and show that it can perform well, and possibly "for free", as we surely all have Lucene-based tools in our application portfolios!
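To make the idea concrete, here is a minimal sketch of "the index is the model", written in plain Python rather than against Lucene itself. Everything in it is an assumption made for illustration: the `TinyIndex` class, its `add`, `search` and `classify` methods, the TF-IDF-like scoring, and the four training sentences are all hypothetical and are not taken from the talk or from Lucene's API. It indexes labelled training texts in an inverted index, runs the input document as a query, and lets the top-scoring hits vote for a category.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyIndex:
    """Toy inverted index (hypothetical, not Lucene): term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.categories = []  # position = doc_id, value = category label

    def add(self, text, category):
        """Index one training document together with its category."""
        doc_id = len(self.categories)
        self.categories.append(category)
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, text, k=3):
        """Use the input text as a query; return the k best (doc_id, score) pairs."""
        n_docs = len(self.categories)
        scores = Counter()
        for term in set(tokenize(text)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Simple TF-IDF-like weight, chosen for illustration only.
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += math.sqrt(tf) * idf
        return scores.most_common(k)

    def classify(self, text, k=3):
        """Let the top-k hits vote for their category, weighted by score."""
        votes = Counter()
        for doc_id, score in self.search(text, k):
            votes[self.categories[doc_id]] += score
        return votes.most_common(1)[0][0] if votes else None


# Made-up training set: building the index is the whole "training" step.
index = TinyIndex()
index.add("the team scored a late goal to win the match", "sports")
index.add("the striker missed a penalty in the final match", "sports")
index.add("the central bank raised interest rates again", "finance")
index.add("shares fell as the bank reported weak earnings", "finance")

print(index.classify("the bank cut interest rates"))
print(index.classify("a late goal in the match"))
```

In this sketch, training costs nothing beyond indexing, and classification is a single query followed by a vote over the hits, which is essentially a k-nearest-neighbour classifier on top of a search index. A real Lucene-based setup would swap the toy scoring for Lucene's own analyzers and similarity.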