
When Probably is Good Enough


Formal Metadata

Title: When Probably is Good Enough
Number of Parts: 60
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Examining the probabilistic data structures that come built into Redis Stack will allow us to fully understand how, why and when they work best. We'll examine each of: count min sketch, top k, and bloom and cuckoo filters. Each of these has a distinct structure that we'll start with so we can see how they work. We'll then look at why each one is probabilistic and what the consequences are for that. Then we'll look at use cases for each to see when they would best be used in the wild. We'll wrap up with a demonstration of the space saving capabilities, for example the size difference between a bloom filter and a set with the same items added to each.
Transcript: English (auto-generated)
Hello, I guess technically it is now good afternoon. My name is Savannah Norem. I am a developer advocate at Redis, and I'm talking to you guys today about probabilistic data structures.
So to get started, how many of you all have heard of Redis before I just said it? How many of you have used Redis? How many of you have used Redis aside from for a cache? Okay, cool. So Redis has a lot of things inside of it aside from being a great caching tool.
So we're going to start with a little bit of what is Redis. Most of you are familiar enough, but there are a few things that I want to let you know about that you may not know about if you've only used it as a cache. I'm going to tell you a little bit about why you should care about probabilistic data structures.
What is a probabilistic data structure? Which ones are we talking about for each of those? How do they work? Why would I use it? Where are they currently used? A real live demo that I hope doesn't break this time. And then any questions, but also feel free to stop me throughout. Especially like if a slide is confusing, that way I don't have to go back to find the slide.
Feel free to raise a hand and say, hey, that doesn't make sense to me. So yeah. Redis is actually short for Remote Dictionary Server. The main thing that you may not know about, if you've used Redis primarily as a cache, is this thing called Redis Stack.
Redis Stack is the piece of open source Redis that has these probabilistic data structures that I'll be talking about. In Redis Stack you also get support for JSON: you get full-text search over those JSON documents, along with atomic updates of JSON. There's a few other things in there,
but JSON is usually the big one for a lot of people: being able to use Redis with JSON natively. If you've used it, you are probably aware that it is NoSQL and it's just a good old key-value store. So why should you care? Big data takes up big space, and a lot of it doesn't need to be granular or accurate. If I told you the top
80 movies from 1960 and I got the order of 76 and 80 mixed up, you're probably not going to be too mad at me. It doesn't really matter that much. When it gets to be, you know, a lot more data than 80 movies, you can get a little bit inaccurate with a lot of data and still get the points you need across.
So, for example, take 1.1 billion URLs. A hash table would need about 25 gigabytes, plus a remote server lookup every single time a user enters a URL. As opposed to a bloom filter, where we get about 1.13 gigabytes with a 1% false positive rate, which means it can be stored much closer to the user, in something like a browser cache. So there's only a remote lookup when that bloom filter actually has a positive result. Sneak preview of where these are used in the wild: Google Chrome used to store one of these in your browser cache.
That way, if you enter a URL, it has a very quick turnaround time for whether or not it's malicious, assuming you don't hit the bloom filter and then go actually do the remote lookup. But with that 1% false positive rate, that doesn't happen all that often unless the URL is actually malicious.
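The pattern being described is roughly: answer "definitely not" locally for free, and only pay for the remote lookup when the filter says "probably". A minimal sketch of that pattern in Python (the function names are illustrative, not Chrome's actual API):

```python
def check_url(url, filter_might_contain, remote_lookup):
    """filter_might_contain: local bloom filter check; fast, but may give false positives.
    remote_lookup: authoritative server-side check; slow, so only used when needed."""
    if not filter_might_contain(url):
        return False               # "definitely not malicious": no network round trip needed
    return remote_lookup(url)      # "probably": confirm, which only happens for ~1% of safe URLs
```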
What's a probabilistic data structure? To cover deterministic data structures first: those are probably the ones that you're very familiar with. Lists, sets, tuples, all of these things. When you put something in one, you know exactly what's in it. You know exactly how to get stuff out of it. None of that's changing.
So here's some examples. These are some of my colleagues. They're also developer advocates. All that data is exactly what it says it is. There is no question about whether or not Savannah is in that list. So with deterministic data structures,
we add things, remove things, assuming that the data structure is mutable, and we check for membership. So what would make a list probabilistic? Any guesses? I threw out kind of a hint earlier with hash tables. Hashes! So here's the briefest explanation of a hash, which I'm guessing most of y'all are familiar with:
any function that can be used to map data of arbitrary size to fixed size values. So this whole document can be represented by the number one. This is a terrible hash function. Use more than one number. Use a longer number.
If you think of something like SHA-256, that's 256 bits' worth. Okay, thank you. So, the longer the better: fewer collisions, which is what we'll get to in just a second. So right now, if I asked what I have in this set of documents,
I have 1, 3, and 5. If I have a new document that also hashes to the value 3, well, I think I already have that document. That document's not a new document as far as this probabilistic set is concerned. So a probabilistic data structure is a way to sacrifice a bit of accuracy for space and speed.
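As a toy illustration of that trade-off, here is a deliberately bad one-byte hash used as a membership test; two different documents can map to the same value, which is exactly the "I think I already have that document" false positive (a sketch only, not any of the real structures):

```python
import hashlib

def tiny_hash(doc: str) -> int:
    # Keep only the first byte of SHA-256: a terrible hash, so collisions are likely.
    return hashlib.sha256(doc.encode()).digest()[0]

seen = {tiny_hash(doc) for doc in ("report.pdf contents", "notes.txt contents", "draft.md contents")}

new_doc = "a brand new document"
if tiny_hash(new_doc) in seen:
    print("probably already have it")   # could be a false positive
else:
    print("definitely new")             # this direction is never wrong
```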
Which ones are we going to talk about? The bloom filter, where x is probably present, but y is definitely not. A count min sketch, where we'll have: x has been seen no more than y times.
And a top k, where k number of things probably have the highest scores, plus x has a score, probably, of y. So how do they work? If we look at a bloom filter, this is essentially the way that a bloom filter is stored in Redis.
It's actually really just a string of zeros and ones, but as far as an easy visualization, we have a 20-bit long bloom filter, where bits 0, 4, 6, 7, etc. have been set to 1. Those bits have been marked. Something has set those bits to 1 by being added to the bloom filter.
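A minimal pure-Python sketch of that structure, an m-bit array plus k hash functions; this mirrors the 20-bit example on the slide rather than the actual Redis implementation:

```python
import hashlib

class TinyBloom:
    """Minimal bloom filter sketch: m bits, k hash functions derived from SHA-256."""
    def __init__(self, m: int = 20, k: int = 3):
        self.m, self.k, self.bits = m, k, [0] * m

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        # Use k slices of the digest as k "different" hash functions.
        return [int.from_bytes(digest[4 * i: 4 * i + 4], "big") % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1            # mark the bit; bits only ever get set, never unset

    def might_contain(self, item: str) -> bool:
        # All k bits set means "probably"; any unset bit means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))
```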
So we have our fake documents. We use three different hash functions. I think I have a slide in here that says something about the math dragons that you can get into if you want. There's a very formulaic way to say if you want a bloom filter to take up this much space, and you want it to have this accuracy,
here's how many hash functions you should use. Or you can sort of play around with those variables in any way you want. You can say I want to use three hash functions and have this much accuracy. How long does my bloom filter need to be? There's a formula for that somewhere. I think I have the website linked in here where that comes from.
So we have our bloom filter, and we want to check is this document in it? Well, we look at bits 0, 9, and 11. Those are all set. So we say probably. This has probably already been added to our bloom filter. And then we look at a document that has not been added to our bloom filter, but those three hash values are 4, 9, and 7.
Those three hash values are already set. So there's no way to say whether or not this document has actually been added to the bloom filter, or if it just probably has. And that's why you get into that math of how many hash functions do I need?
How long does my bloom filter need to be? What's the accuracy I can get with all of those things? So the highlights. It tests membership. Has X already been added? We have a probably yes, or definitely not. Because if one of these bits was not set, we can absolutely say with certainty that that has not been added.
Knowing an approximate number of things you'll be adding helps optimize that size from the start. There is that link for that math dragon formula. I think if you just Google like bloom filter how to calculate hash functions, you'll find that website.
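For reference, the usual formulas behind that calculator are m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions, for n expected items and a target false positive rate p. A quick sketch:

```python
import math

def bloom_size(n: int, p: float):
    """Optimal bloom filter dimensions for n expected items at false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))   # number of bits
    k = round((m / n) * math.log(2))                        # number of hash functions
    return m, k

m, k = bloom_size(1_000_000, 0.01)
print(m, k)   # about 9.6 million bits (roughly 1.2 megabytes) and 7 hash functions
```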
It is one of the quickest and smallest ways to check: have we seen X before? So, a note on the approximate number of things in the Redis Stack-specific implementation of a bloom filter. In that specific implementation, once you hit a certain threshold of bits that have been set,
Redis will create a secondary layer to that bloom filter, which does increase lookup time. It increases the space you need, all of those things. So if you can avoid that, and if you can say like I'm going to add about a million things, that's really helpful to know from the start to help optimize that length of your bloom filter.
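Against Redis Stack, that pre-sizing happens with BF.RESERVE. A hedged sketch using raw commands through redis-py (the key name and the numbers are just examples):

```python
import redis

r = redis.Redis()
# Reserve the filter up front: ~1% false positives, ~1,000,000 expected items.
# This lets RedisBloom pick the bit count and hash count, and avoids growing sub-filters later.
r.execute_command("BF.RESERVE", "urls", 0.01, 1_000_000)
r.execute_command("BF.ADD", "urls", "https://example.com")
print(r.execute_command("BF.EXISTS", "urls", "https://example.com"))       # 1: "probably added"
print(r.execute_command("BF.EXISTS", "urls", "https://not-added.example")) # most likely 0: "definitely not"
```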
So, a count min sketch. This is really pretty; I like these colors. But you'll notice that the bottom row doesn't have any of the blue squares, and that's because the blue square actually ends up in the same box as all of these green circles. When we hash whatever this blue square represents, it hashes to the same spot as our green dots. Each of these rows is a different hash function that we're using. So in the first row, the blue square and the red heart hash to the same spot, as well as those three bright pink trapezoids and the green things. We don't really know the difference between the green circles and the pink trapezoids; all we know is that there are five things in that box. So if you look here, we have some IP addresses, which, hint, might be one of the ways that these get used, and I'll bring that up later.
So if we look for something like how many times we've added this specific IP address: you'll note that I've switched over from the different shapes, because what the count min sketch actually stores is just these numbers. J rows means J hash functions. We modulo by the number of columns, and we find (these are somewhat arbitrary) that we end up looking at the numbers five, three, six, four, and five. And then we take the minimum of those. That's why it's called a count min sketch. There's no way to know that three is the correct number, but we do know that it's no more than three. You can't have added anything to that box and have it not be counted. So when we're talking about a count min sketch, we know that three is the maximum number of times that we've seen that thing. It gets very confusing, taking the minimum to get a maximum. But you can't have hashed the same thing, put it in the same box, and then have it not be counted. You can have hashed different things, added them to the same box, and have them counted together. So you can over count,
which is why, when you then look at all of these different numbers, you take the minimum and say: this is the best guess I have. So, for example, here's another one where potentially we say that we've added this random, arbitrary IP address three times, when really we only added it once. But in my lovely representation of a count min sketch, there is no box that has that blue square by itself. That blue square has had a collision with everything. So that's just the best answer we have.
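Here is a minimal count min sketch in plain Python: d rows of w counters, one hash function per row, and a query that takes the minimum across rows, which is exactly why it can only ever over count. The tiny dimensions are for readability only:

```python
import hashlib

class TinyCountMin:
    """Minimal count min sketch: d rows x w columns of counters."""
    def __init__(self, w: int = 10, d: int = 4):
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]

    def _col(self, item: str, row: int) -> int:
        # One hash function per row, derived by salting SHA-256 with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.w

    def incr(self, item: str, by: int = 1):
        for row in range(self.d):
            self.rows[row][self._col(item, row)] += by   # collisions add into the same counter

    def query(self, item: str) -> int:
        # Every counter holds the true count plus any collisions,
        # so the minimum across rows is the tightest upper bound we have.
        return min(self.rows[row][self._col(item, row)] for row in range(self.d))
```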
So the highlights. We've got what's the upper limit of the number of times that X has been seen. Everything gets added together in whatever bucket it hits. The defaults may or may not be good for your data set. That's another one of those. How many hash functions do you want? What's the length of your column?
How much space can you actually devote to this thing? In general, you'll find that the more space you want to use, the more accurate you can be. But then you start to defeat the point of a probabilistic data structure. So it's a little bit of push and pull of what you want to optimize for and what you care about losing.
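In Redis Stack those knobs get fixed when the sketch is created, either from error bounds or from explicit dimensions. A hedged example with redis-py (the key name "ips" is illustrative):

```python
import redis

r = redis.Redis()
# Size by error bounds: estimates off by at most 0.1% of all increments,
# with at most a 0.2% chance of exceeding that bound...
r.execute_command("CMS.INITBYPROB", "ips", 0.001, 0.002)
# ...or size by explicit width x depth instead:
# r.execute_command("CMS.INITBYDIM", "ips", 2000, 5)
r.execute_command("CMS.INCRBY", "ips", "203.0.113.7", 1)
print(r.execute_command("CMS.QUERY", "ips", "203.0.113.7"))  # [1]; never less than the true count
```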
One of the other highlights for this, which leads into the next one, is that for a count min sketch there is no easy "what's the top thing?" There's no way to very quickly say which thing I've added the most to this data set. With a top K, however, we'll see these same IP addresses. They're arbitrary, but they're labeled.
And here you'll see that we start keeping a fingerprint. So for each of these things, we have IP address A and we have 2. But we've added A three times. Sorry, let's look at B instead. In that very top row, you'll see the fingerprint B
and it says 2. But if you look at my list, I've added it three times. So what's up with that? The other part of a top K is the "top" piece. The top K is actually sort of two separate data structures that work together very cohesively: the other half of the top K is a min heap. So whenever you're adding things to the top K, it checks what value that fingerprint has, and then checks whether or not that means the min heap needs to be updated.
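In Redis Stack both halves sit behind one key. A hedged sketch with redis-py (the key name and sizing are examples; width 8, depth 7, and decay 0.9 here match what I believe are the defaults):

```python
import redis

r = redis.Redis()
# Keep the top 3 items; width/depth/decay shape the heavy-keeper table underneath.
r.execute_command("TOPK.RESERVE", "top-ips", 3, 8, 7, 0.9)
for ip in ("198.51.100.1", "198.51.100.1", "198.51.100.1", "203.0.113.9"):
    r.execute_command("TOPK.ADD", "top-ips", ip)   # returns any item that got pushed out of the top k
print(r.execute_command("TOPK.LIST", "top-ips"))                   # the current top-k items
print(r.execute_command("TOPK.COUNT", "top-ips", "198.51.100.1"))  # estimated count; never an overestimate
```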
So we'll go back to our lovely bright blue squares and bright green circles. But this time when we add a pink trapezoid, we remove a circle. And so on and so forth. And I really wish I had fixed this animation to go a little bit faster.
But you can see how at the end here, there are no collisions, but some of these counts have been reduced. And where we started with a bright blue square, we now have the two pink trapezoids and no remnants of the bright blue square.
So the top K, if you want to get really into the math, it uses an algorithm called the heavy keeper algorithm. It's designed to sort of optimize for large flows and sort of minimize small things. So it's very good at keeping things that you see a lot and those counts will generally be more accurate.
As opposed to things that you see once or twice that probably end up getting washed away when there's collisions. So very similar principle here. We're starting with the same data set. We're starting with the same colors. But as we go through, we start eliminating things.
So you'll notice here, because there are three and three, it doesn't really matter what the order of those things was added in. They'll actually end up all canceling out. We'll end up with that spot in our pseudo hash map, pseudo hash table being empty. That'll be a zero.
So we end up with significantly fewer things in this table, which is why, when I talked about over-counting versus under-counting very briefly at the beginning, a count min sketch might over count and a top K will under count.
So if we add a lightning bolt, a new IP address, you can see that that one goes away, that triangle goes away, and now our lightning bolt is part of our top three. Because when we update that min heap, it's a last one added wins. So now that our current maximum count for the purple triangle is also one,
when we add one with the lightning bolt, that triangle gets kicked out of our min heap, and now that lightning bolt is part of our min heap for our top three. So that's a little bit of a gotcha potentially that some people miss, is that actual implementation of the min heap, and how it can end up sort of a
little bit wrong, but that's why it's probabilistic. So the highlights of the top K, it never overestimates. A tie in the bottom of the min heap is last added wins. Large flow data can be hard to eliminate. Small flow data can be hard to get established.
Same as the count min sketch, defaults may or may not be great for your data set, and you always know the top K things. I forgot to delete that bullet point, but that's okay. As far as some other ones that are worth mentioning go, cuckoo filters. They are named after cuckoo birds.
They actually end up kicking things out of their spots, recursively, up to (I believe, in the Redis implementation) 200 things. So if you have a very large cuckoo filter and you add one thing, it can cause a cascading effect of 200 things being kicked out of their spots before it eventually says,
I'm not adding that, do something else. It also stores a fingerprint like the top K does, which is why you can delete it. So when I say bloom with delete, the cuckoo filter's main purpose is, has this thing been added or not? But you can delete things from it because that fingerprint gets kept.
A HyperLogLog is an approximate count of things added. Under the hood it's also just a small array, with a slightly different implementation than the bloom filter, but the main thing is: if you want to count a really big number of things, and it's okay if it's a little inaccurate, you can use a HyperLogLog.
So this one is sometimes used for things like IP address hits. You know, if you are Google and you want to know how many people are coming to your website, it's okay if it's a little inaccurate, if it's a really big number. Google probably does not care about the space that that needs. They probably keep a very accurate count. That's just an example.
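HyperLogLog lives in core Redis (no Stack modules needed) under the PF* commands. A quick sketch of the unique-visitor idea:

```python
import redis

r = redis.Redis()
# Approximate unique-visitor count in at most ~12 KB per key, no matter how many IPs show up.
for ip in ("203.0.113.1", "203.0.113.2", "203.0.113.1"):
    r.pfadd("visitors", ip)          # re-adding a duplicate doesn't change the estimate
print(r.pfcount("visitors"))         # ~2; the standard error is about 0.81%
```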
Recently added in Redis Stack is t-digest, which I didn't cover in depth because I don't understand in depth how it works. I'm still working with one of the core engineers on that one, but it does probabilistic quantiles and percentiles. It lets you ask: hey, what do you think the top 25% of my data is? What do you think the 50% mark is?
And that's also a little bit probabilistic. So where these things are actually used, like I mentioned as a little bit of a teaser, Google Chrome previously used one of these stored in the cache for malicious websites.
Adobe uses a bloom filter to deduplicate push messages and err on the side of not sending a message if it may be a duplicate. And that link, I have a little few citations at the end. But that one is actually because Adobe uses Redis's implementation of a bloom filter, and they've talked about it.
And so their idea was if everyone in here is subscribed to the same sports game, we all want push notifications for, let's see, I actually went to the Potsdam Royals American football game. So if we all wanted to know when the Potsdam Royals football team scores, if one of those messages might be a duplicate,
Adobe will say: don't send it. With that roughly one-person error rate, they would rather one person in this room not get that notification than have someone get it twice. They've decided, with their user research, that duplicate messages are more annoying than not getting a message.
Akamai uses a bloom filter to prevent one-hit wonders from being cached. They found that roughly 75% of the websites they were seeing were only accessed once. And if you're storing every single website that I go to on my phone,
I only go to a lot of them once. They don't need to be cached. I'm going to close that tab before I ever go back to that window. So previously they were caching things very often, very frequently, and finding out that they were never being touched again. So there's no reason to cache that data. So they started using a bloom filter to say, have you been to this website before?
Okay, now that it's actually something you've done more than once, now let's cache it. A couple of other ones that are a little less concrete and a little more hypothetical. View counts for videos: the difference between 1.1 million and, say, 1,100,002 views is negligible. It doesn't matter. Maybe for monetary things it does, which is why these are not concrete examples; they're hypotheticals. I don't know that YouTube or Netflix does these things. But for me personally, the difference between those two numbers doesn't matter.
So if I was really trying to save on space, that's somewhere I could potentially say: instead of storing an actual integer, I can store a HyperLogLog, which is usually very, very small. IP address tracking, to see what the top hitters are: that's why I used IP addresses for the count min sketch and the top K.
Unique IP addresses that come to a website are a great use for a HyperLogLog, because you're just counting unique things. With the t-digest, you can do point, range, and inner product queries. You can get quantiles and heavy hitters. So you can find things that are added a lot, and you can find,
you know, all of those inner product queries that are math that I don't know. But it's a thing you can do. So I have a demo. Does anyone have questions before I back out of my slides and mess up this aspect ratio? Okay.
So, goodness, I meant to make this one bigger too. Very basically, I have a bloom filter, a count min sketch, a top K, and then the sort of deterministic counterparts to those: a set and a sorted set. A set will tell us, very deterministically, whether or not something has been added, and the sorted set will keep an accurate count of how many times we've added a thing. So I reserve, or create, those probabilistic data structures up front, because knowing those sizes, and knowing how many things I'm going to add, makes a big difference in their final size. This set and the sorted set, you don't have to initialize; you can just start adding things to them. But when you think about the diagram I used for the count min sketch and the top K,
the number of hashes being used and the number of columns you have need to be initialized at some point. You can't just start throwing things in there and then change how many hash functions you use. So, what I did was take the text of Pride and Prejudice, partially because I love the book, partially because I knew that no words that I really didn't want to pop up on screen were going to pop up on screen.
I tried to strip some of those weird characters out, and then we add them: I add each word to the top K, and I increment the count min sketch by one. This is a little bit of what I do as my job, writing this code to show the differences between these things.
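A rough sketch of what that ingestion loop might look like with redis-py; the key names, the file name, and the tokenizing are illustrative rather than the exact demo code:

```python
import re
import redis

r = redis.Redis()
r.execute_command("TOPK.RESERVE", "topk", 50)               # keep the top 50 words
r.execute_command("CMS.INITBYPROB", "cms", 0.001, 0.002)    # count min sketch sized by error bounds

with open("pride_and_prejudice.txt") as f:
    words = re.findall(r"[a-z']+", f.read().lower())        # strip out the "weird characters"

for word in words:
    r.execute_command("TOPK.ADD", "topk", word)             # top K: you just add
    r.execute_command("CMS.INCRBY", "cms", word, 1)         # count min sketch: you increment
    r.zincrby("words", 1, word)                             # sorted set: the exact, deterministic count
```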
So you'll notice that to a top K you just add something, but the count min sketch you actually increment. The sorted set you also increment. And then there's users.txt: I got this lovely text file from GitHub that has 813,695 usernames in it. They're not real usernames; they're things like SM Matthew, and somewhere up on this list is Matthew SM. It's a lot of, you know, variations of the same names that someone very kindly made so that I could do something like this. But the important thing to note is that there are 813,695 usernames in this text file. So if I pull up my handy-dandy notes, we'll first see
the top five things in the sorted set with their scores. ZREVRANGE will give us the reverse range, zero through four will give us the top five, and WITHSCORES says I want to see the scores that are associated with those words. They're very boring words: "to", "the", "of", and "her". Which, you know, it's a book; that's sort of expected. And then if I list the top K (I actually stored the top 50 words, but we can just look at the top five), in the correct order we have "to", "the", "of", "and", and "her".
So if you'll remember, "to" is technically in the book, via the sorted set, 4078 times. But if I do a little TOPK.COUNT on my top K for the word "to": 4073. It's a little bit off, but it's pretty close. And I can do the same for "the": 4051 and 4051. So we're not even always inaccurate with these; the counts can be exactly right as well, it's just not guaranteed.
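Continuing that sketch, the comparison queries would look roughly like this (same illustrative key names as above):

```python
import redis

r = redis.Redis()
print(r.zrevrange("words", 0, 4, withscores=True))      # exact top five words with their counts
print(r.execute_command("TOPK.LIST", "topk"))           # the top K's view of the most common words
print(r.zscore("words", "to"))                          # exact count for one word, e.g. 4078.0
print(r.execute_command("TOPK.COUNT", "topk", "to"))    # top K estimate; at or below the exact count
print(r.execute_command("CMS.QUERY", "cms", "to"))      # count min estimate; at or above the exact count
```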
For the count min sketch as well: 4084. So we're over counting now with the count min sketch, but we're still very close to that ballpark of 4078. But if we look at the memory usage for the sorted set (and this is in bytes, by the way) as opposed to the memory usage for the top K, we see an order of magnitude difference. The count min sketch is on the same order of magnitude as the top K. So in bytes these numbers aren't huge; Pride and Prejudice isn't a lot of text in the grand scheme of big data.
But if you're getting an order of magnitude difference here, with very little loss of accuracy, then the more you scale these things, and the more accurately you can say "this is how many things I plan to add, give me a count min sketch that can take that many things with reasonable accuracy", the more important that difference in size becomes. So then, on that note, with the bloom filter info (the key is called "bloom"), we can see that the number of items inserted was
813,243. Which means that, of those 813,695 usernames, we lost about 450 of them, which out of roughly 800,000 is not bad. And you can see up here where I said that I was going to add about 850,000 things. So even though that 850,000 number is higher, there are still some collisions; some of those usernames did not get added. An example use for the usernames in particular is a very large online-gaming-type platform
where you want to check very quickly whether or not a username has been taken. If I tell you, hey, sorry, that username's already taken, you've got to pick another one, you're probably not going to be that mad about it, even if it's not actually taken. It's sort of one of those things where you have to identify where you can lose a little bit of accuracy without
sacrificing your needs, and using that to your advantage. So, I had one other command on here: the memory usage. Yeah, so the memory usage for the set, which is where I actually meant to add a little thing in here. The bloom filter add command will return whether or not the item was actually added. I meant to add a check in here to say: if it was not added, tell me what it was, that way I could show you guys some of those things that are in fact in the set but are not in the bloom filter. But I think it's pretty self-explanatory how that happens. So, that memory usage for the set, again in bytes, and I actually have a little bytes-to-megabytes conversion:
46 megabytes for that set, as opposed to, if I can highlight the right thing, 1.17 megabytes for the bloom filter, with those roughly 450 usernames lost. Which, I guess I could work that out: about 450 out of roughly 800,000, so about 0.06% of these usernames were lost, and we have a massive space saving. So yeah, that is my talk and my demo. Does anybody have questions? Does anybody want to know what the 430th most common word in Pride and Prejudice is?
Okay, well, you can find me later and I'll tell you. Well, then I guess I will let you guys go early for lunch. Let's wait a few seconds until the live stream has caught up, and let's see if there are questions in the live chat.
In the meantime: I submitted this talk because this conference is called Berlin Buzzwords, and I was like, I can throw in a lot of buzzwords. This is big data, probabilistic data structures. Some people have commented on how, with large language models, these will become increasingly interesting.
Yeah. So I noticed in the diagram that you had, with all of the shapes, the different colored shapes: if you had a new one that came in and canceled out another one, is that the same principle going on? There was a case where there were three different kinds of objects in there and they all canceled out at once. Is that a quick thing to explain?
So it's not actually quite all at once; those things get canceled out as they get added. That representation could have been better. So here, if I go View, if I do View, Slideshow, is it going to take me here to the beginning? Perfect. Okay, so
These things, like these two green circles canceling out the two trapezoids, are just a quicker way than adding a single green circle and then adding a single pink trapezoid at some other point. But I did maintain the order, so the order that things get added in is important, and that is reflected in this diagram. So because there are three green circles, we're going to be left with one. It doesn't really matter which green circle cancels out which pink trapezoid, or vice versa. Same for those two, and those two. In this one where it's all three, if we actually look in order, it's one green circle, then one purple triangle. So technically, realistically, at one point this spot is already empty: it's gone from one green circle, had something added that doesn't match the fingerprint, to being empty again.
And then when I add in another green circle, so here I'll add two green circles at a time, but then I'll add a bright blue square. So one of those green circles will go away. That count will be reduced to green circle one instead of green circle two.
And then when I add that final triangle, it'll get reduced down to zero. So for all of those, the order does matter. If you look at, I believe, yeah, this one, you can see that the order we went in was
purple triangle and then pink trapezoid. Those get canceled out, and then again, pink... oh man, I should have made these a little more indicative of which IP address they represented. But we end up with the purple triangle at the end just by virtue of how those things actually get added, and in what order they get added. Because you can see this heart gets added here, and it's going to cancel out the other shape, and then, because the very last one that gets added is the purple triangle, that's what's left over. So it really happens like that: the number for that specific fingerprint goes up and down, and up and down, until either a new fingerprint gets solidified as the only thing that still has a count, or it goes to zero. That's a very good note; I will note that that was not the clearest way to talk about that.
Yeah. Thanks for coming.