Pre-trained transformers have revolutionized search. However, off-the-shelf transformers of a fixed model size perform poorly on out-of-domain data, and while larger models generalize better, strict latency and cost requirements limit the size of production models. In this talk, we will demonstrate how small transformer models can be fine-tuned on specific domains, even in the absence of labelled data, using the technique of synthetic query generation. As part of this work, we release a fine-tuned 1.5B-parameter query generation model that, given a document, generates multiple questions the document answers. These synthetic query-document pairs are then used to fine-tune a small retrieval model. We combine the fine-tuned model with OpenSearch lexical search tools and benchmark the combination, demonstrating a state-of-the-art zero-shot nDCG@10 boost of 14.30% over BM25 across a benchmark of 10 public test datasets. We share lessons learned from training and using large language models for query generation, and discuss open questions around representation anisotropy, keyword filtering, and the index sizes of dense models. Attendees will leave with an understanding of the process used to fine-tune small transformer models and combine them with lexical search, along with step-by-step guidance for improving search accuracy using open-source tools.
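To make the query-generation step concrete, here is a minimal sketch using the Hugging Face transformers library with a public doc2query-style checkpoint. The checkpoint name and the helper function are illustrative stand-ins, not the 1.5B-parameter model released with this talk; any seq2seq query generator with the same interface would slot in the same way.

```python
# Minimal sketch of synthetic query generation, assuming a doc2query-style
# seq2seq model. The checkpoint below is a public stand-in, not the talk's
# released 1.5B-parameter model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "doc2query/msmarco-t5-base-v1"  # illustrative stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def generate_queries(document: str, num_queries: int = 5) -> list[str]:
    """Generate synthetic queries that the given document answers."""
    inputs = tokenizer(
        document, truncation=True, max_length=512, return_tensors="pt"
    )
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,          # sampling yields diverse queries
        top_k=25,
        num_return_sequences=num_queries,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

doc = "OpenSearch supports both lexical (BM25) and dense vector retrieval."
for query in generate_queries(doc):
    # Each (query, doc) pair becomes a training example for the retriever.
    print(query)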
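```

Each generated (query, document) pair can then serve as a positive example when fine-tuning the small retrieval model, and at query time the retriever's dense scores can be combined with OpenSearch's BM25 scores, for example via score blending, to produce the hybrid results benchmarked in the talk.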