
How to train your general purpose document retriever model

Formal Metadata

Title: How to train your general purpose document retriever model
Number of Parts: 60
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date: 2023
Language: English

Content Metadata

Abstract
Large language models augment traditional information retrieval (IR) approaches with both high-quality language parsing and knowledge external to the corpus. However, training a state-of-the-art general-purpose model for document retrieval is challenging. This talk is motivated by our experience training a high-quality retriever model for use alone or together with BM25 to improve relevance out of the box in Elasticsearch.

We chose to focus on the learned sparse model (LSM) architecture. LSMs for IR were recently popularised by SPLADE [1] and have several properties that make them attractive for our purpose. They enable retrieval via inverted indices, for which Elasticsearch has a high-quality implementation in Lucene. They provide tuneable parameters that allow one to trade off accuracy against index size and query latency. They enable word-level highlighting to explain matches. And they perform well in zero-shot settings.

In this talk we survey LSMs and discuss how they fit into the IR landscape. We describe some of the challenges of training language models effectively, and briefly survey techniques that have previously been found to improve performance both in and out of domain, including downstream-task-aware pre-training and knowledge distillation. Finally, we give an overview of the key ingredients of our full training pipeline and the useful lessons we learned along the way. Our goal was to consistently improve on BM25 relevance in a zero-shot setting. In particular, we set out to beat BM25 across the suite of diverse IR tasks gathered in the BEIR benchmark [2] without using any in-domain supervision. We survey other published results on this benchmark and discuss how we compare.
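
To make the LSM idea concrete, the sketch below shows how a SPLADE-style model maps text to a sparse bag of weighted vocabulary terms that can be stored in an ordinary inverted index. This is a minimal illustration, not the speakers' training pipeline: the Hugging Face checkpoint name and the top_k cut-off are assumptions chosen for the example, and the activation follows the log-saturated ReLU described in the SPLADE paper [1].

```python
# A minimal sketch, assuming the Hugging Face `transformers` library and a
# publicly released SPLADE checkpoint; the model name and top_k value are
# illustrative, not the configuration discussed in the talk.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def sparse_terms(text, top_k=20):
    """Map text to its strongest (vocabulary term, weight) pairs.

    The weights form a sparse vector over the vocabulary, so documents can be
    indexed and queried with an inverted index such as Lucene.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                    # (1, seq_len, vocab)
    # SPLADE-style activation: log-saturated ReLU of the MLM logits,
    # max-pooled over token positions, with padding masked out.
    weights = torch.log1p(torch.relu(logits))
    weights = weights * inputs["attention_mask"].unsqueeze(-1)
    doc_vector = weights.max(dim=1).values.squeeze(0)      # (vocab,)
    scores, ids = doc_vector.topk(top_k)
    return [(tokenizer.convert_ids_to_tokens(int(i)), round(float(s), 2))
            for i, s in zip(ids, scores)]

print(sparse_terms("How do I train a general purpose document retriever?"))
```

Retrieval then reduces to scoring the overlap of weighted query and document terms, which is also why accuracy can be traded off against index size and query latency simply by keeping fewer, or more, of the expanded terms.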