We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.

Supercharging your transformers with synthetic query generation and lexical search

Formal Metadata

Supercharging your transformers with synthetic query generation and lexical search
Title of Series
Number of Parts
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date2023

Content Metadata

Subject Area
Pre-trained transformers have revolutionized search. However, off-the-shelf transformers, at a fixed model size, perform poorly on out-of-domain data. Larger models have better generalization capabilities but strict latency and cost requirements limit the size of production models. In this talk, we will demonstrate how small transformer models can be fine-tuned on specific domains, even in the absence of labelled data, using the technique of synthetic query generation. Our process involves releasing a fine-tuned 1.5B parameter query generation model that, given a document, generates multiple questions that are answered by the document. These query-document combinations are then used to train a fine-tuned model. We combine the fine-tuned model with OpenSearch lexical search tools and benchmark them. Using these tools, we demonstrate a state-of-the-art, zero-shot nDCG@10 boost of 14.30% over BM25 on a benchmark of 10 public test datasets. We elaborate upon lessons learned from training and using large language models for query generation. We also discuss some open questions around representation anisotropy, keyword filtering and index sizes of dense models. Ultimately, audiences will take away from the presentation an understanding of the processes used to fine-tune small transformer models and combine them with lexical search, along with step-by-step guidance with which to pursue their own improvements in search accuracy using open-source tools.