Searching large data sets in (near) constant time

Cite

Related Material

Plain Schwarz

Bøgh Köster, Torsten Berger, Dennis

Formal Metadata

Title

Searching large data sets in (near) constant time

Title of Series

Berlin Buzzwords 2023

Number of Parts

Author

Bøgh Köster, Torsten

Berger, Dennis

Contributors

N. N. (Moderation)

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/66650 (DOI)

Publisher

Plain Schwarz

Release Date

2023

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

In low latency search environments, queries producing large result sets are a real pain. A proper ranking of large result sets burns a lot cpu. Those queries have the potential to slow down or even brick your cluster. On the customer side it is questionable whether it makes sense to return millions of documents as the customer has to filter them afterwards anyway. Those large result sets caused us heavy headache as they significantly reduced the available compute head room on the nodes of our Solr cluster. They even bricked the whole cluster when hitting the cluster in high volume. In this project report we'll guide you through the steps (and math) how we: - constructed index based random experiments, - estimate the rough query hit count of a query by extrapolating bucket search results, - collect and apply static first phase ranking information, - use the information collected to filter the result set to the most relevant documents to return no more than a given number of documents, - extrapolate hit and facet counts to mimic the original search result and - handle document collapsing and facetting. In this talk we'll guide you through the software architectural aspects as well as the math applied. Although applied on a Solr search system, this concept can be applied on other search engines as well.