We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Accelerating TopK Queries

Formal Metadata

Title
Accelerating TopK Queries
Title of Series
Number of Parts
64
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
This talk is about a recently added optimization to the Snowflake Cloud Database for speeding up TopK queries. In it the audience will learn about how we use runtime information to skip scanning data partitions that will not contribute to the result of the query. TopK Queries are SQL queries that contain both an order by and a limit clause. This is a very common query pattern. Examples are queries looking for the last k events (order by time descending), the k cheapest products (order by price ascending) or simply any limit query that is expected to deterministically return the same k rows (order by any unique identifier). In contrast to limit queries without an order by, the query engine cannot simply return the first k rows it has scanned. Instead it has to take all data into consideration to find the k smallest or largest rows. We can, however, come up with an upper / lower bound for values that can make it into the result after having seen at least k rows. This boundary can then be used to filter out rows that will not be a part of the result. Our optimization takes this a step further and additionally uses the boundary for pruning entire data partitions. This means that we compare it with metadata for these partitions and based on this comparison may decide to skip downloading and scanning a partition entirely if none of its contained rows can become a part of the result.