Word2Vec model to generate synonyms on the fly in Apache Lucene

Plain Schwarz

Antuzi, Daniele; Petreti, Ilaria

Formal Metadata

Title

Series Title

Berlin Buzzwords 2022

Number of Parts

Author

Antuzi, Daniele

Petreti, Ilaria

Contributors

N. N. (Moderation)

License

CC Attribution 3.0 Unported:
You are free to use, adapt, copy, distribute, and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided that you credit the author/rights holder in the manner they have specified.

Identifiers

10.5446/67187 (DOI)

Publisher

Plain Schwarz

Year of Publication

2022

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

If you want to expand your queries or documents with synonyms in Apache Lucene, you need a predefined file listing the terms that share the same meaning. It is not always easy to find a list of basic synonyms for a language and, even if you find one, it does not necessarily match your domain. In operating system articles, for example, the term "daemon" is not a synonym of "devil"; it is much closer to the term "process". Word2Vec is a two-layer neural network that takes a text as input and outputs a vector representation for each word in the dictionary. Words with similar meanings are mapped to vectors that lie close to each other. This talk explores our contribution to Apache Lucene that integrates this technique with the text analysis pipeline. We will show how you can automatically generate synonyms on the fly from an Apache Lucene index and how you can use this new feature along with Apache Solr, with practical examples!
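The idea that similar words have nearby vectors can be illustrated with a minimal Python sketch. It is not the Lucene contribution described in the talk: the vectors in `TOY_VECTORS` are invented for illustration, and the names `nearest_synonyms`, `max_synonyms`, and `min_similarity` are hypothetical. It only assumes that closeness is measured with cosine similarity, a common choice for word embeddings.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only. A trained
# Word2Vec model would learn much larger vectors from a domain corpus.
TOY_VECTORS = {
    "daemon": [0.9, 0.1, 0.0],
    "process": [0.8, 0.2, 0.1],
    "service": [0.7, 0.3, 0.0],
    "devil": [0.0, 0.1, 0.9],
    "banana": [0.1, 0.9, 0.1],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_synonyms(term, vectors, max_synonyms=2, min_similarity=0.8):
    """Return up to max_synonyms (word, score) pairs, most similar first.

    Words scoring below min_similarity are discarded, so unrelated terms
    never become synonyms. Both thresholds are hypothetical defaults.
    """
    if term not in vectors:
        return []  # unknown terms get no synonyms
    query = vectors[term]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    accepted = [(word, score) for word, score in scored if score >= min_similarity]
    accepted.sort(key=lambda pair: pair[1], reverse=True)
    return accepted[:max_synonyms]


if __name__ == "__main__":
    for word, score in nearest_synonyms("daemon", TOY_VECTORS):
        print(f"{word}: {score:.3f}")
```

With these toy vectors, "daemon" is expanded with "process" and "service", while "devil" falls below the similarity threshold, which mirrors the domain-specific behaviour the abstract describes.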