Don’t forget to sketch! Running with large datasets
Formal Metadata
Title 
Don’t forget to sketch! Running with large datasets

Title of Series  
Number of Parts 
34

Author 

License 
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt, and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers 

Publisher 

Release Date 
2016

Language 
English

Content Metadata
Subject Area  
Abstract 
Large datasets got you down? Have no fear! Make them small! Sketches are probabilistic data structures: they store a rough outline of a dataset in way less space than the dataset itself takes up. We'll sketch out three sketches to determine if an item is missing from your dataset (Bloom Filters!), count how many of an item are in your dataset (Count-Min Sketches!), and count how many distinct items are in your dataset (HyperLogLogs!). In the spirit of the sketch, this talk will be hand-drawn (!!!) and leave some details to the imagination!

Transcript
Alright, so I'm Adam. I'm the co-founder of a New York City-based startup called B12 that's focused on the future of creative and analytical work, and today I'm going to draw for you a few sketches of sketches, which has absolutely nothing to do with my day job. In short, a sketch is a data structure that summarizes some dataset, and it does this by being a little bit less accurate than perfect, in exchange for taking up a lot less space. You've already worked with sketches that are perfectly accurate: if you've ever counted something, or summed a dataset, then you've worked with a sketch. In the space of just a single integer you can summarize how many items are in a dataset, up to datasets of size, you know, whatever, 4.2 billion. But there are other questions you might have about your dataset, like how many items of a particular type are in the dataset, or is a particular item in the dataset, that you can't answer completely accurately without storing the entire dataset, and if a dataset is really large, that might be a prohibitive problem. So today we're going to talk about two sketches, one called the Bloom filter and the other called the count-min sketch, that help answer these questions approximately.

Let's start with Bloom filters. A Bloom filter answers the question of whether a particular item x can be found in my dataset D. If you've ever played around with something like Google's Chrome, then you've interacted with a Bloom filter. You want to go to some URL, say badurl.com, and you want your browser to stop you from going there if it's in some list of malicious URLs. Luckily, Google, in its data centers, has a few gigabytes' worth of bad URLs in some dataset, but that's way too large a dataset to distribute to every Chrome installation. So Google has packaged those up into a Bloom filter that takes up only a few tens of megabytes, and that's something you can actually distribute to every browser. Let's see how it works.
At its core, a Bloom filter is a bit set: an array of bits that all start off with the value zero, plus a collection of hash functions. In this case we have three hash functions, and they map the key we're looking to insert or look up onto this bit set. Let's see it in action. I want to insert badurl.com into my Bloom filter's bit set, so I hash badurl.com against the three hash functions. Those three hash functions map me onto three locations in the bit set, and at those locations I flip the bits from 0 to 1. Now let's say I want to add anotherbadurl.com into the bit set. I hash anotherbadurl.com against the three hash functions, find myself at three new locations in the bit set, and flip those bits to one. You'll notice here that one of the hash functions has mapped anotherbadurl.com onto the same location as badurl.com; anything that was previously set to one just remains a one.

Now let's look up a key in this bit set. Let's say we're going to goodurl.com. We hash goodurl.com against the three hash functions and find ourselves at three places in the bit set. We know that if goodurl.com had previously been inserted into the malicious URL list, then all three locations would hold ones, but in fact one of them is a zero, so we have one hundred percent certainty that goodurl.com is not in the malicious list of URLs. It doesn't always work out so well, though. Let's say we're going to niceurl.com, which I can assure you is just a nice URL; it's not in the malicious list of URLs. Unfortunately, when we hash it three ways, we end up at three locations that overlap with other URLs that are in the malicious set, so we see three ones, and we're led to believe that niceurl.com is actually a malicious URL. This is called a false positive. We can summarize Bloom filters as guaranteeing that an item is not present in the dataset if we see any zeros when we look it up.
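Those mechanics can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the k hash functions are simulated by salting SHA-256 with an index, and the bit-array size comes from the standard sizing formulas (m = -n ln p / (ln 2)^2 bits, k = (m/n) ln 2 hash functions) for a target false-positive rate p over n expected items.

```python
import hashlib
import math

class BloomFilter:
    """A bit array plus k hash functions; each key maps to k positions."""

    def __init__(self, n_items, fp_rate):
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes.
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)  # every bit starts at 0

    def _positions(self, key):
        # Simulate k independent hash functions by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)  # flip the bit to 1

    def __contains__(self, key):
        # Any zero bit -> definitely absent; all ones -> probably present.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(key))

bf = BloomFilter(n_items=1000, fp_rate=1e-4)
bf.add("badurl.com")
bf.add("anotherbadurl.com")
print("badurl.com" in bf)   # True: inserted keys are always found
print("goodurl.com" in bf)  # almost certainly False: no false negatives,
                            # and only ~1-in-10,000 odds of a false positive

# Sizing sanity check: 10 million URLs at a 1-in-10,000 false-positive
# rate needs roughly 20 bits per item, about 24 MB in total.
print(BloomFilter(10_000_000, 1e-4).m / 8 / 1e6)  # ~24 (megabytes)
```

The URL names and the salted-SHA-256 scheme are assumptions for the demo; production Bloom filters typically use faster non-cryptographic hashes like MurmurHash.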
But Bloom filters also have this really small chance of telling us that an item is present in the dataset when it's actually not, and the question is how small a chance we have of hitting a false positive. Luckily, there's some math that can help us with this. There's a formula that maps the number of bits in the bit set, basically the size we've allocated to it, onto two variables: one is the number of items we'd like to insert into the bit set (the more items, the higher the likelihood of false positives), and the other is the false-positive rate itself. I'm not expecting you to parse this formula on the spot, so I've made a handy-dandy chart for you. It shows that you can achieve really low false-positive rates, something like one in every 10,000 lookups resulting in a false positive, in exchange for just 20 bits per item that you're inserting. To make that concrete, let's imagine we want a false-positive rate of 1 in every 10,000 lookups, so we've allocated 20 bits per item we're going to insert, and let's imagine this is our malicious URL list: 10 million URLs, each taking up on average 30 characters. If you multiply that out, the size of the strings alone in this malicious URL set is greater than a gigabyte, so it's prohibitive to send that to every browser. But the Bloom filter, at 20 bits per item, only takes up about 24 megabytes, and so it's really easy to transfer to everyone.

Let's jump into the next sketch, called the count-min sketch. This sketch helps us approximate how many of a particular item appear in our dataset. To motivate this, let's imagine we've crawled the web, we have the entire web corpus at our disposal, and we want to know approximately how many times each word appears on the web. Everyone in this room is cool; we all know the most popular word on the web is going to be "cats". But there are other words
that we might want to approximate: there's cats and bats and flats and mats and rats, and we want to know approximately how many times each one appears. The problem is that there are a lot of unique terms, or words, on the web, so we're going to have to approximate this. Let's see how that works. We again have a collection of hash functions, just like in our Bloom filter, but unlike the Bloom filter, we're not mapping those hash functions onto a single array of bits. Instead, each hash function maps us onto its own row of counters: in this case we have three hash functions, so we'll have three rows, and each hash function maps onto its own row. Let's insert the word "cats" into this count-min sketch. We encounter the word "cats" and want to increment its count by one, so we hash "cats" against the first hash function, the second hash function, and the third hash function, find the locations these hash functions map us onto in their individual rows, and increment the values that were previously there; in the first row it was a 12, so we increment it to be 13.

Now let's use this count-min sketch to approximate how many times the word "flats" appears. We hash "flats" against the three hash functions, get mapped onto three different locations, one per row, and end up with three counters: a five, a seven, and a twelve. We want the approximate count of the word "flats", so what does each of these counters represent? It represents the number of times we've incremented it for the word "flats", as well as for any other word that has unfortunately been overlapped with it by that row's hash function. So each of these numbers is actually an overestimate of the number of times the word "flats" appears in our dataset, and we take the min of them to get the most accurate approximation, in this case 5. That's where the name count-min sketch comes from.
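The same idea fits in a short Python sketch. Again this is an illustration, not the talk's code: the width and depth are arbitrary demo values, and the per-row hash functions are simulated by salting SHA-256 with the row index.

```python
import hashlib

class CountMinSketch:
    """depth rows of width counters; one hash function per row."""

    def __init__(self, width=1000, depth=3):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _position(self, row, key):
        # Each row gets its own hash function via a row-indexed salt.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._position(row, key)] += count

    def estimate(self, key):
        # Every row's counter is an overestimate (collisions only add to
        # it), so the minimum across rows is the tightest answer.
        return min(self.rows[row][self._position(row, key)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["cats"] * 5 + ["bats"] * 2 + ["flats"]:
    cms.add(word)
print(cms.estimate("cats"))   # 5, barring hash collisions
print(cms.estimate("flats"))  # 1, barring hash collisions
```

Note the one-sided error: an estimate can never be below the true count, which is why taking the min across rows is safe.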
So in summary, a count-min sketch is going to provide us with an overestimate of the count of items in a dataset, and that overestimate tends to be more meaningful for the frequent items, the heavy hitters, in our dataset. The intuition here is that if you're a really popular item, you're going to drown out all the other counts that land on the same locations as you, and if you're a really infrequent item, you're going to get drowned out by the heavy hitters.

Some final thoughts; we've reached the end of the talk, unfortunately. The first is that today we talked about two sketches, a Bloom filter and a count-min sketch, but there are lots of other fun data structures that help us approximate various properties of our datasets. The one thing they all have in common is that they have these wonderful names, like the HyperLogLog and the t-digest. You should use sketches whenever you have a really large dataset that you want to summarize in a small amount of space, or whenever you have an unbounded stream of data: let's say you have network packets coming in out to infinity, and you'd like to summarize them in a bounded amount of memory. With that, I want to invite you to embrace randomness. In computer science and in software engineering, we're trained to think about how a system will either work in a hundred percent correct fashion or how it'll break down, but in practice no dataset is a hundred percent accurate, and so there are entire fields, like randomized algorithms and probabilistic data structures, that help us embrace randomness to give us really accurate approximations of what it is we're looking for. So with that, sketch away. Thanks!