Accelerating TopK Queries

Cite

Related Material

Plain Schwarz

Waack, Juliane

Formal Metadata

Title

Accelerating TopK Queries

Title of Series

Berlin Buzzwords 2024

Number of Parts

Author

Waack, Juliane

Contributors

N. N. (Moderation)

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/70221 (DOI)

Publisher

Plain Schwarz

Release Date

2024

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

This talk is about a recently added optimization to the Snowflake Cloud Database for speeding up TopK queries. In it the audience will learn about how we use runtime information to skip scanning data partitions that will not contribute to the result of the query. TopK Queries are SQL queries that contain both an order by and a limit clause. This is a very common query pattern. Examples are queries looking for the last k events (order by time descending), the k cheapest products (order by price ascending) or simply any limit query that is expected to deterministically return the same k rows (order by any unique identifier). In contrast to limit queries without an order by, the query engine cannot simply return the first k rows it has scanned. Instead it has to take all data into consideration to find the k smallest or largest rows. We can, however, come up with an upper / lower bound for values that can make it into the result after having seen at least k rows. This boundary can then be used to filter out rows that will not be a part of the result. Our optimization takes this a step further and additionally uses the boundary for pruning entire data partitions. This means that we compare it with metadata for these partitions and based on this comparison may decide to skip downloading and scanning a partition entirely if none of its contained rows can become a part of the result.