Hybrid Search with Apache Solr Reciprocal Rank Fusion

Plain Schwarz

Benedetti, Alessandro

Formal Metadata

Title

Series title

Berlin Buzzwords 2024

Number of parts

Author

Benedetti, Alessandro

Contributors

N. N. (Moderator)

License

CC Attribution 3.0 Unported:
You may use and modify the work or content for any legal purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.

Identifiers

10.5446/70204 (DOI)

Publisher

Plain Schwarz

Year of publication

2024

Language

English

Content Metadata

Subject area

Computer Science

Genre

Conference/Talk

Abstract

Vector-based search has gained considerable popularity in the last few years: Large Language Models fine-tuned for sentence similarity have proved quite effective at encoding text into vectors and representing some of the semantics of sentences in numerical form. These vectors can be used to run a K-nearest neighbour search and look for documents/paragraphs close to the query in an n-dimensional vector space, effectively mimicking a similarity search in the semantic space (Apache Solr KNN Query Parser).

Although exciting, vector-based search still has some limitations:
- it's very difficult to explain (e.g. why is document A returned, and why at position K?)
- it doesn't care about exact keyword matching (and users still rely on keyword searches a lot)

Hybrid search comes to the rescue, combining lexical (traditional keyword-based) search with neural (vector-based) search. So, what does it mean to combine these two worlds? It starts with the retrieval of two sets of candidates:
- one set of results coming from lexical matches with the query keywords
- one set of results coming from the K-Nearest Neighbours search with the query vector

The result sets are merged and a single ranked list of documents is returned to the user. Reciprocal Rank Fusion (RRF) is one of the most popular algorithms for such a task. This talk introduces the foundational algorithms involved in RRF and walks you through the work done to implement them in Apache Solr, with a focus on the difficulties of the process, the distributed support (SolrCloud), the main components affected and the limitations faced.

The audience is expected to learn more about this interesting approach, its challenges and how the contribution process works for an Open Source search project as complex as Apache Solr.
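
The merge step described in the abstract can be illustrated with a minimal sketch of Reciprocal Rank Fusion in its commonly cited form, where each document receives the sum of 1 / (k + rank) over the result lists in which it appears. This is not the Apache Solr implementation discussed in the talk: the function name `reciprocal_rank_fusion`, the constant k = 60, the tie-breaking rule and the sample document ids are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of document ids into one ranked list using RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; ties are broken by document id
    # so the output is deterministic (an illustrative choice).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical candidate sets: one from lexical matching, one from KNN search.
lexical_hits = ["doc1", "doc2", "doc3"]
vector_hits = ["doc3", "doc1", "doc4"]

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.5f}")
```

Because RRF uses only rank positions, the lexical and vector scores never need to be normalised against each other; documents that appear in both lists (here `doc1` and `doc3`) accumulate two contributions and rise above documents found by only one retriever.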