Approximate Distinct Counts for Billions of Datasets

ACM SIGMOD

Ting, Daniel

Formal Metadata

Title

Approximate Distinct Counts for Billions of Datasets

Title of Series

SIGMOD 2019

Number of Parts

155

Author

Ting, Daniel

License

CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/42956 (DOI)

Publisher

ACM SIGMOD

Release Date

2019

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Cardinality estimation plays an important role in processing big data. We consider the challenging problem of computing millions or more distinct count aggregations in a single pass while allowing these aggregations to be further combined into coarser aggregations. Such aggregations arise naturally in many applications, including networking, databases, and real-time business reporting. We demonstrate that existing approaches to this problem are inherently flawed, exhibiting bias that can be arbitrarily large, and we propose new methods that have theoretical guarantees of correctness and tight, practical error estimates. This is achieved by carefully combining CountMin and HyperLogLog sketches with a theoretical analysis based on statistical estimation techniques. These methods also advance cardinality estimation for individual multisets, as they provide a provably consistent estimator and tight confidence intervals with exactly the correct asymptotic coverage.
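To make the idea of combining CountMin and HyperLogLog sketches concrete, the following is a minimal Python sketch of one simple way to do it: a CountMin-style table whose cells are small HyperLogLog sketches, queried by taking the minimum estimate across rows. This is an illustrative baseline only and is not the estimator proposed in the talk; the abstract does not spell out the construction, so the class names (`HyperLogLog`, `CountMinHLL`), the parameters (`depth`, `width`, `p`), the hash function, and the min-of-rows query rule are all assumptions made for this example.

```python
import hashlib
import math


def _hash64(data: str, seed: int) -> int:
    """Deterministic 64-bit hash of a string under a given seed (assumed helper)."""
    digest = hashlib.blake2b(
        data.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class HyperLogLog:
    """Basic HyperLogLog with 2**p registers and a small-range correction."""

    def __init__(self, p: int = 8):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = _hash64(item, seed=0)
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        """Union with another sketch of the same size (register-wise max)."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # standard constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)        # linear counting for small sets
        return raw


class CountMinHLL:
    """CountMin-style table of HyperLogLog cells (illustrative baseline)."""

    def __init__(self, depth: int = 3, width: int = 64, p: int = 8):
        self.depth, self.width = depth, width
        self.cells = [[HyperLogLog(p) for _ in range(width)] for _ in range(depth)]

    def _cols(self, key: str):
        # One column per row, chosen by a row-specific hash of the key.
        return [_hash64(key, seed=row + 1) % self.width for row in range(self.depth)]

    def add(self, key: str, item: str) -> None:
        """Record that `item` was seen in the dataset identified by `key`."""
        for row, col in enumerate(self._cols(key)):
            self.cells[row][col].add(item)

    def estimate_min(self, key: str) -> float:
        """Naive query: smallest per-row estimate among the key's cells."""
        return min(
            self.cells[row][col].estimate() for row, col in enumerate(self._cols(key))
        )


if __name__ == "__main__":
    sketch = CountMinHLL()
    for i in range(1000):
        sketch.add("page:home", f"user-{i}")
    print(round(sketch.estimate_min("page:home")))
```

Because many keys share each cell, every cell summarizes a superset of the queried key's items, so a min-of-rows query of this kind can be inflated by colliding keys. That is one intuition for why combining the two sketches calls for the careful estimation the abstract describes.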