
Correlating messy data with "correlate"


Formal Metadata

Title
Correlating messy data with "correlate"
Number of Parts
112
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Data correlation! What could be more computer science-y! Ever needed to find matching items between two sets of data? Maybe even messy real-world data, with inexact string matches? Come find out how the novel scoring algorithm and clever heuristics at the heart of **correlate** solve this problem with ease!
Transcript: English (auto-generated)
So, you may be wondering what Correlate is. Technically, it's a library that I wrote. It's solving a problem that I had, and let me go through what that problem is. The first thing: I don't like the words corpora or corpus, so I use the word dataset
instead; I mean basically the same thing. My description of Correlate is: Correlate matches up values between two datasets which conceptually represent the same data. So imagine that you have two datasets that have some values inside, and there really should be matches between them, like this is really the same piece of data as this,
but it's in a different format. How do you find matches between those two? You could do it by hand, the idea with Correlate is to automate that process. I realize this is still a little bit abstract, so I think it's best to go through a concrete example. I like to listen to old time radio shows from the 40s and 50s. This is a list of MP3 files for a detective show called Boston Blackie.
And I downloaded this off of archive.org, and I really hate these file names. These are terrible file names. You can see what's in it. It's Boston Blackie, and then the date, and then the episode number, and then the name of the episode with like no spaces and uppercase characters and all this sort of thing.
And they kind of made up their own titles too, I don't know what's going on here. Happily, there's very clean data for this sort of thing. This is an episode log of Boston Blackie, which has the same episodes with the broadcast dates and a much cleaner title. So what I want to do is I want to take this data and use it to rename the MP3 files and make them nice clean file names that I'm going to enjoy.
And Correlate is just the tool to do that. So I gave a lightning talk about this exact thing at PyCon this year, and I knocked together a quick test case, and I found some nice examples in here. So this is an example of matching up an MP3 file to an episode, I created my own episode object off of the episode log, and the only thing these two have in common is the date.
But Correlate said, this looks like a good match, and it's right, those are the matching episodes. On the other hand, this one, the dates don't match up at all, and the titles aren't really all that similar, but the words John and Davis both appeared in both places, and so Correlate said, I think these are the matching episodes, and again it was correct.
So Correlate really kind of does a little bit of magic, figuring out what these matches are. The important thing to keep in mind though is that Correlate is not an algorithm. I would describe Correlate as a heuristic. Correlate is not provably correct, it's not like merge sort or quicksort where you run it, and you can prove that it's providing the correct output when you're done.
Correlate is very much a guess, and I've spent a lot of time just sort of thinking about the problem and trying to come up with ways to improve the quality of its guesses. At the end of the day, it is kind of a guess; the quality of its output is going to depend on the quality of the input that you put into it.
But I can't prove that it's correct, it just seems to work okay. So I'm going to go over, we're going to start with the basic API. When I started out on this, I was like, okay, how do I even represent this so that I can write a library to solve this problem? So obviously you import the Correlate library and you create a Correlator object.
The Correlator object has a bunch of stuff hanging off of it. Datasets A and B, those are the two datasets that you're going to stuff your data into. I also put them in a list in case it's convenient for you to iterate over them; I never use it. Print datasets is just for debugging, it just dumps the contents of your datasets out to standard out.
Sometimes I discover that I've been inputting my data wrong, like I've been breaking up the strings into individual characters, it's weird, Correlate still does a good job even in that case, but occasionally I will discover I was doing it wrong and print data sets shows you what your data looks like from Correlate's perspective. And finally the Correlate method itself does the actual correlation and we're going to look at that more in a sec.
So the conceptual model for Correlate is that you establish your values, and these are the items in the two datasets. Correlate really doesn't examine the values at all; all it needs to know is whether two values are the same value or not, so it's doing equality testing but nothing else. Values don't even have to be hashable. But what you do then is you establish keys that
map to those values, and this is metadata that you have culled out of the values. So for example, you know, you parse the file name and you pull out all the individual strings, I think it's best to lowercase them, I'm also parsing out the date and I'm setting the date in the same format on both sides, and Correlate is going to examine all those keys
and find keys that the two values have in common and say, that looks like it might be a good match. So in this example we have "unlucky", "at", "cards" and the date 1945-10-04 in common between the two things. So Correlate is going to consider those keys in common to figure out whether or not this is a good match. So I'm just going to show you an example of calling the API here, this isn't really
very useful, but I just want to show you the basic API. I'm going to create two values, they're just strings, again they could be any Python object, but it doesn't really care what the object is, I'm just using strings just so it's nice to present when I print it out at the end, and then you call set
on the dataset you want to stuff the values into, dataset A, dataset B, you set the key equal to the value. You can set a key multiple times to a value, you can set a key to multiple values, you can do whatever you like, and as a matter of fact, setting keys more than once is very interesting to Correlate, we'll look at that in a sec.
In case you're just iterating over an iterable like I am here, there's a set keys method which just takes the iterable and sets every key in it to the value, just to save you a little time. And then when you call Correlate you get a result object. The result object, the most important thing of course is the list of matches, these
are sorted by confidence level, and then unused A and unused B are values from the two datasets that Correlate couldn't find a good match for, so it's just showing you it's like, yeah, I couldn't find a good match for this guy. Statistics is a dictionary mapping strings to numbers, basically times, the idea is, if you're wondering where Correlate is spending all of its time, this is a little
bit of internal logging data that will give you an idea of what's taking so long. And finally, the match object itself, that's the thing that we have a list of in the matches list up at top, that contains the value from dataset A that we said matches this value from dataset B, and then the score, which is a sort of numeric confidence
level in how good a match this is. Now just to round the bases, I'm going to iterate over the matches. There's only one match, and it's this "unlucky at cards" pair, and it computed a score of about 3.6. Where did this score come from, how did Correlate compute it? We'll talk about that in a sec.
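To make that concrete, here's a minimal sketch, in Python, of the workflow just described. The attribute and parameter spellings (dataset_a, set_keys, value_a, score, and so on) are my reading of the API as presented in the talk, so treat them as assumptions rather than gospel.

```python
import correlate

# Two values that conceptually represent the same episode.
mp3_name = "BostonBlackie45-10-04UnluckyAtCards.mp3"
episode = "Unlucky at Cards (broadcast 1945-10-04)"

c = correlate.Correlator()

# Keys are metadata culled out of the values: lowercased words, a parsed date, etc.
c.dataset_a.set_keys(["unlucky", "at", "cards", "1945-10-04"], mp3_name)
c.dataset_b.set_keys(["unlucky", "at", "cards", "1945-10-04"], episode)

result = c.correlate()
for match in result.matches:                 # sorted by confidence, best first
    print(match.score, match.value_a, match.value_b)
print(result.unused_a, result.unused_b)      # values correlate couldn't match
```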
But let's start with the basic algorithm, how Correlate does its work. You feed in all of your data and you call Correlate. Correlate more or less iterates over every value in dataset A, compares it to every value in dataset B, looks at all the keys they have in common, and computes a score, and then it takes that score,
that match object, and it adds it to a list. Then it sorts that list by score, then it starts at the top, and it does this sort of greedy algorithm, where it says okay, looking at this match, have I used value A or value B yet? And of course the very first one, it hasn't used anything yet, so it says oh, that's a good match, okay. So it stores that in a list it's going to return to you, and then it writes down
oh, I've seen value A and I've seen value B now. And then it goes on to the next one, it says have I seen value A yet, oh I have, then this match is toast, so I don't use it, so it just goes on to the next one. So it's finding all of the matches that contain values that haven't been matched yet, and adding those to the list over there, and that's the list that it returns to you.
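As a rough illustration of that basic loop, here's a simplified re-implementation of the idea; it is not correlate's actual code, and it assumes hashable values and a score() helper purely for brevity.

```python
def naive_correlate(dataset_a, dataset_b, score):
    """Score every pair of values, sort, then greedily keep matches whose
    values haven't been used yet.  An illustration, not correlate's code."""
    candidates = []
    for value_a, keys_a in dataset_a.items():
        for value_b, keys_b in dataset_b.items():
            s = score(keys_a, keys_b)        # based on the keys they share
            candidates.append((s, value_a, value_b))
    candidates.sort(key=lambda c: c[0], reverse=True)

    matches, seen_a, seen_b = [], set(), set()
    for s, value_a, value_b in candidates:
        if value_a in seen_a or value_b in seen_b:
            continue                         # this match is "toast"
        matches.append((s, value_a, value_b))
        seen_a.add(value_a)
        seen_b.add(value_b)
    return matches
```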
There are flaws in all of this, so we're going to spackle over a bunch of these flaws with some additional technology we're getting to later in the talk. But at the heart of correlate is the scoring algorithm. I had to figure out how to figure out what a good score was for a particular match,
and I'm very lucky in that my intuition was right, the first thing I tried worked really well, and then I tried some other things that didn't work as well, and I went back to the first one. I turned out to be right all along, so this is, let me say one thing here though, this is dealing specifically with what I call exact keys. So in correlate you have all sorts of keys, you can use almost anything as a key,
you can use strings, integers, floats, datetime objects, custom objects if you want to. Those are all valid keys. There's a special kind of key in correlate called a fuzzy key used for fuzzy comparisons, and just as a piece of terminology, any key that is not a fuzzy key I call an exact key.
So we're going to look at the scoring algorithm for exact keys now, and we're going to get the fuzzy keys later. So what I tried to do with my scoring algorithm, I wanted to score, if two values have a key in common, I wanted to score more highly if the key was rare than if the key was common.
So I had to figure out how to represent that. Ultimately what I did was I said, okay, let's count the number of times that this key has been used in dataset A, and count the number of times that key is used in dataset B, and multiply those numbers together, and that becomes our divisor. So as an example, in my Boston Blackie dataset, the word "the" is used 75 times in the MP3 file names, and 136 times in dataset B, the episode list. So we multiply those together, and that's our divisor, and that turns into a number that is about one ten-thousandth of a point, which is almost no signal.
The fact that two values have the word "the" in common doesn't tell us a lot, and the score now reflects that. On the other hand, the word "jewel" doesn't come up very often: it's used three times in the MP3 file names, and only once in the episode log. So if two values have the word "jewel" in common, that scores a third of a point, which is a lot more. That's a lot more signal, that's a lot more evidence that these two are a good match. And then if two values have the word "Atkins" in common, which is a last name, there's only one appearance in each dataset. If two values have that key in common, it's very likely that that's a good match, and so that gets a huge score boost.
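In other words, the contribution of one shared exact key is one point divided by the product of its use counts, which you can sanity-check with the numbers from the talk:

```python
def exact_key_score(count_in_a, count_in_b):
    # One shared exact key scores 1 point divided by (uses in A) * (uses in B).
    return 1 / (count_in_a * count_in_b)

print(exact_key_score(75, 136))   # "the"    -> ~0.0001, almost no signal
print(exact_key_score(3, 1))      # "jewel"  -> ~0.33, decent signal
print(exact_key_score(1, 1))      # "Atkins" -> 1.0, strong signal
```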
So, to show you the example, the Atkins Jewel Thief, or Robert Atkins Jewel Thief, that's a good match, because it has the word Atkins in common and the word jewel in common. And of course the date matches as well, the broadcast date. In this case, the word jewel didn't help, because we spelled it differently.
It's the Winthrop Jewel Robberies versus the Winthrop Jewelry Company Thefts. So comparing exact keys kind of failed us here, because we used jewelry in one place and jewel in the other. This is the sort of thing that Fuzzy Keys is good at. We're going to talk about that. So already, with just that basic approach, that basic scoring algorithm,
and that basic greedy algorithm, correlate is working pretty well. But I thought about the problem a lot. This is actually my early COVID coronavirus project, so I had a lot of time to myself, and I would just sit on my mountaintop contemplating this problem. And I thought of a problem. So, let's consider this scenario.
We have a value in dataset A, and we map one key to it, "breakin". For you younger folks, Breakin' was the name of a movie in the 80s about breakdancing. Value B1 in dataset B, we also map the key "breakin" to it. But in dataset A, we also map "breakin" to Electric Boogaloo, which is the sequel to Breakin'.
Now what's the problem here? The problem is that these are both going to score the same, because all we're doing is we're looking at the keys that match. We don't consider anything else when we're writing the scoring algorithm so far. So the idea was, okay, how do we prefer the top one over the bottom one? Because clearly, if we're looking at it, we say breakin,
this top one is a much better choice than the bottom one. How do I highlight that? And what I resolved to do was, I count the number of keys that matched and divide it by the total number of keys mapped to that value, on each side, then I multiply those two ratios together, and that becomes a bonus that I add to the score after I'm done adding up all the scores for the keys.
So in this example, the top one is getting basically a score of one, a bonus of one point, and the bottom one gets a bonus of a quarter of a point, because it only has one out of four keys matched on value A2. So that inflates the score of the top match, and we prefer that one, and again, we get the correct answer.
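A tiny sketch of that bonus, using the made-up Breakin' example: the fraction of keys matched on each value, multiplied together.

```python
def key_ratio_bonus(matched, total_keys_on_a_value, total_keys_on_b_value):
    # Fraction of keys matched on each value, multiplied together.
    return (matched / total_keys_on_a_value) * (matched / total_keys_on_b_value)

print(key_ratio_bonus(1, 1, 1))   # "Breakin'" vs "Breakin'"          -> bonus 1.0
print(key_ratio_bonus(1, 4, 1))   # "Electric Boogaloo" vs "Breakin'" -> bonus 0.25
```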
So crisis averted. Now the thing again with correlate is that it is a guess. It is not an algorithm. And again, it's dealing with messy data, and you kind of have to work with it a little bit. So there are some things that you can do to increase your odds of getting good matches out of it. The first and probably most important one is setting a minimum score.
Let's consider you have two data sets, and they are a perfect match for each other. Every single datum in every data set has a corresponding value on the other side. In that case, correlate's already going to do a good job. If you have one is a subset of the other one, where again, every single item in the smaller one
has an exact match in the larger one, correlate's going to do a great job. But what if you have some values in one or both data sets that just don't have a corresponding match on the other side? Well, correlate, bless its little heart, it really wants to do well by you, and it's going to try and find matches even when those matches cannot exist.
So it's going to find matches that are arguably wrong, they're just bad matches. What can you do about it? Well, here's my observation about that. This is a graph of the scores of all the matches in my Boston Blackie data set, and you'll notice there's a huge drop off at the end. In point of fact, I mean, it looks like there's a drop off after about a score of 4.0,
but it's not until we get to the very bitter end, a score of below one where the matches actually get terrible. And what tends to happen is there's an inflection point. You start at the beginning of the list, and they're all great matches, and they're good matches, and then at a certain point you hit something and suddenly none of the matches are good, they're all junk.
And there's an inflection point in the score there. So in this case, it's happened somewhere between 1.0 and 0.25. So all we need to do is pick a point there, like one half, and tell correlate, you know what, you see any match that has a score lower than this, just ignore it, just pretend it doesn't exist. Correlate won't keep those matches, and you won't have this junk at the end of your match list.
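In code, that looks something like the snippet below; I'm assuming the keyword argument is spelled minimum_score, so check the documentation for the exact name.

```python
import correlate

c = correlate.Correlator()
# ... fill c.dataset_a and c.dataset_b as in the earlier sketch ...

# Any candidate match scoring below the threshold is simply discarded,
# which trims the junk tail off the match list.
result = c.correlate(minimum_score=0.5)
```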
Another way you can improve your matches is just by weighting your keys a little bit. This is just a little bit of extra data. You can say, if you match with this key, give it a little bit of extra score. I just multiply that in when I do the basic scoring. So if I set the weight equal to two on this key
and two on this key, then the total possible match is going to be four rather than one. And you just pass that in. There's a weight parameter to set and set keys where you can just set what the weight is for that mapping. Now I want to talk about rounds for a second. This is where redundant keys come into play. Originally, I thought that if you mapped the same key
to a value multiple times, I didn't think that was interesting. And again, this is one of those things where I just sat on my mountaintop contemplating, and I realized actually it's very interesting, and I'm going to show you why. So here are two values. This is a different imaginary data set. This is just for examples. These are movie titles from the 70s and 80s.
The Day the Clown Cried and The Day of the Dolphin. In both cases, we have mapped the word the twice to the value. What does Correlate do with that? So internally, when Correlate is doing its comparison work, it splits everything out into what it calls rounds. A round is a set of all the unique keys that are mapped to a value.
And then subsequent rounds represent duplicates of those keys, so they tend to get smaller very quickly. If you map the word the to a value five times, then there would be five rounds, and the word the would appear in all of them. So here, round one for the value on the data set A,
round one is "the", "day", "clown", "cried", and round two just contains the word "the". And on the other side, round one is "the", "day", "of", "dolphin", and round two just contains the word "the". So what's interesting is that, conceptually, the word "the" the second time is a different key than the word "the" the first time. And "the" is very common in your datasets.
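Here's a hypothetical illustration of how keys might split into rounds; this is not the library's internals, just the concept: round N holds every key mapped to the value at least N times.

```python
from collections import Counter

def split_into_rounds(keys):
    """Round N holds every key mapped to the value at least N times."""
    counts = Counter(keys)
    rounds, n = [], 1
    while True:
        round_n = {key for key, count in counts.items() if count >= n}
        if not round_n:
            return rounds
        rounds.append(round_n)
        n += 1

print(split_into_rounds(["the", "day", "the", "clown", "cried"]))
# -> [{'the', 'day', 'clown', 'cried'}, {'the'}]
```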
Again, I'm making up numbers here, but imagine that "the" appeared for the first time 85 times in dataset A, and 76 times in dataset B. Well, again, that puts us at about one ten-thousandth of a point. But "the" for the second time is a lot rarer, and if that only appears a handful of times,
suddenly we're getting a much higher score from that. So my goal here in telling you this is: if you use Correlate and you have redundant keys, absolutely pass them in. Another thing you can do to make your scores better is use what I'm calling ranking information. And this is just the ordering of the data inside of the datasets. So if one or both of your datasets are unordered,
there's really no sensible ordering to them, you can't use ranking. But if both of your data sets are really in order, where this value should come before this value over here, and this value should come before this value over there, then it's more likely that your matches are going to be sort of local like that, then they are going to reach all the way across.
So the way that we work with ranking in Correlate, there's a special method on a data set where you can say, you can specify extra metadata about a value, and here there's only one parameter which is ranking, you just pass in a number. Correlate is automatically going to figure out the range of it, and it's going to figure out,
it has two strategies on how to score ranking, it's going to use the one that's more successful. You just pass in ranking. I think you also have to enable it on Correlate, you have to set either a ranking bonus or a ranking factor, I don't remember. But it's time to talk about fuzzy keys. So fuzzy string comparison is a way of examining two keys
and saying, these look kind of similar. So it's not exact, these are the same string, but rather, these are pretty similar strings. There's a very popular library for this called FuzzyWuzzy; I prefer one called RapidFuzz, which has the same API but is MIT licensed. So here we have two titles,
the top one is from, again, the MP3s, and the bottom one is from the episode log, but they're for the same episode, the Winthrop Jewel Robberies versus the Winthrop Jewelry Company Thefts. And so RapidFuzz is comparing those two and expressing how similar they are as a percentage.
So this is the number 69, this is about a 70% likelihood that they're similar strings. So it's just doing it lexicographically though, it's just doing it by examining the letters. So this is saying jewel and jewelry are very similar words, but it doesn't understand, for instance,
that robberies and thefts are basically the same concept. Now, fuzzy keys are very expensive in terms of CPU time; they slow down correlate a great deal, but sometimes you just have to use fuzzy keys, so fair enough. In order to use fuzzy keys, you create your own subclass of a base class that I established for you called FuzzyKey,
and you have to implement a compare function, and compare is going to return a number from zero to one. And that's it. I only compare objects of the exact same type to each other for fuzzy keys, for speed reasons.
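A sketch of what such a fuzzy key might look like, using RapidFuzz for the comparison; the FuzzyKey base class and compare() method follow the talk's description, so the exact spellings are assumptions.

```python
import correlate
from rapidfuzz import fuzz

class FuzzyTitle(correlate.FuzzyKey):
    def __init__(self, title):
        self.title = title.lower()

    def compare(self, other):
        # rapidfuzz's ratio() returns 0..100, so scale it down to 0..1
        return fuzz.ratio(self.title, other.title) / 100

c = correlate.Correlator()
c.dataset_a.set(FuzzyTitle("The Winthrop Jewel Robberies"), "the MP3 file")
c.dataset_b.set(FuzzyTitle("The Winthrop Jewelry Company Thefts"), "the episode")
```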
The hardest thing was figuring out the scoring algorithm for fuzzy keys. This is something that took me months; I tried a bunch of different things, and nothing felt right. Eventually I realized that I had this successful scoring algorithm for exact keys, and I should make the fuzzy key scoring algorithm look the same, just more complicated because it has to handle fuzzy keys. So I'm going to walk you through how we get from the exact one to the fuzzy one.
The first observation is that you can multiply or divide by one as much as you want, and it doesn't change anything, so I'm gonna just add and multiply by one here. And then one is really the score of the exact key. Exact keys are binary, either they have a score of zero, not a match, or one, perfect match. So really we can replace ones with score
everywhere that it's used. But the final step is that the number of uses of the key in A versus the number of uses of the key in B isn't actually what we're measuring, we're measuring the score. So now I have to add up the score of how much this key cumulatively scored
when it was comparing to stuff in dataset B, and I have to add up how much this key was scored when talking to stuff in dataset A, and I divide the score by that. So now the score is, I take the score, and I multiply it by the ratio of that score versus all scores using that key in dataset A, multiplied by the ratio of that score
over all scores in dataset B. That seems to work, finally. So I haven't touched it since. And every so often I go, oh, I don't know if this is right... oh, I think it's right... oh, I don't know if it's right... oh, I think it's right.
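As I read that description, the fuzzy scoring formula ends up looking roughly like this (a sketch of my interpretation, not the library's code). Note that for an exact key, where each individual score is 1 and the cumulative scores are just the use counts, it collapses back to the earlier one-over-the-product-of-counts formula.

```python
def fuzzy_key_score(score, cumulative_score_a, cumulative_score_b):
    # Take this comparison's fuzzy score and multiply it by its share of all
    # the score this key accumulated on each side (my reading of the talk).
    return score * (score / cumulative_score_a) * (score / cumulative_score_b)

# With an exact key, score is 1 and the cumulative scores are just use counts,
# so this collapses to the earlier 1 / (count_in_a * count_in_b):
print(fuzzy_key_score(1, 75, 136))      # "the" -> ~0.0001
print(fuzzy_key_score(0.69, 2.0, 1.5))  # a fuzzy title match, made-up cumulative scores
```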
The final little bit of technology I'm going to talk to you about is the match boiler and the grouper. This is solving, again, a specific problem. Let's say that we have not-so-good data and we have a whole run of matches that have the exact same score. Which one of them should we pick? If we just pick the first one, then whichever one happens to be first wins, and that's not necessarily the best one. How do we decide which one is the best one? So what I wound up doing here was,
I said, okay, let's run an experiment where we pick each of the ones with the duplicate score and then recursively examine the rest of the match list and compute the cumulative score for all the rest of it. And then after we've tried all the experiments, we keep the one with the best score. So that works, but it's expensive
because if we have eight values here, we remove one and now we have seven values in a run, and we want to recursively do a match boiler on that. So this becomes like an n-to-the-nth-power problem. So there's now a pre-processing step called the grouper, which observes that sometimes these things sort of group together naturally
and sometimes they are off by themselves. So for example, again, this is all made up examples, but these first three values, they don't have anything in common with the, excuse me, these first three matches don't have anything in common with the other matches in the list. So if I select the first one, which uses value A and value G, well, that doesn't affect anything down the line.
So it's safe to go ahead and pick that off. So all the ones that group together and are by themselves, I just commit those to the match list immediately. And now I only have to do the match boiler experiments with the bottom five values because every value in that bottom five, excuse me, every match in that bottom five has at least one value in common with at least one other entry in that list.
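Here's an illustration of that grouping idea with made-up matches: matches that share a value on either side end up in the same group, and groups of one can be committed immediately.

```python
def group_matches(matches):
    """Group equal-scoring matches that share a value on either side.
    matches: list of (value_a, value_b) pairs.  An illustration only."""
    groups = []
    for value_a, value_b in matches:
        merged = [(value_a, value_b)]
        remaining = []
        for group in groups:
            if any(a == value_a or b == value_b for a, b in group):
                merged.extend(group)      # shares a value: fold into one group
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return groups

runs = group_matches([("A", "G"), ("B", "H"), ("C", "I"),
                      ("D", "J"), ("D", "K"), ("E", "J")])
# The first three matches each stand alone and can be committed immediately;
# the last three share values and need the match-boiler experiments.
print(runs)
```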
So that's what groups them together. And there's nothing for it but to run the experiments. The bad news there is, again, we're talking about n-to-the-n time. So when I wrote my example using Boston Blackie, I knocked it together while hanging out at PyCon, in order to do the lightning talk that night.
I wrote a quick example using Boston Blackie and I ran the script, and it correlated for 20 minutes. And I said, okay, something's going on here. This isn't working. And so I killed it. I stared at it a little and I said, oh, I should add the dates. I added the dates and now it's correlating in a tenth of a second. So it just needed more data to work with. But I went back and I said, okay,
how long is that gonna take if I don't have the dates? So I started that thing again before I left for EuroPython. I actually started it on July 6th at 1:30 a.m., and it's still running a week later. It's been like over 11,000 minutes of CPU time and it shows no signs of stopping.
So my point in telling you this is, if correlate seems to be taking a while, maybe you just need to go back and look at your data and see if you can give it more to work with. The more you give correlate to work with, the better job it can do. That's really everything. I wrote a lot of documentation for correlate. It's up on GitHub and you can of course install it with pip.
Thanks for your time. Thanks very much for that interesting talk. Do we have some questions about correlate from the audience? Somebody's brave enough. By all means.
So you've had to do a lot of hard work of thinking about this problem. Isn't there some random forest machine learning algorithm, meta algorithm where you just feed in some examples of your matches and how you'd like them to correlate and you build some model and now it can correlate many things?
That sounds like the future. I would have to say, first of all, I don't know anything about machine learning. But second of all, fundamentally you need to establish a way for the machine learning algorithm to understand the data. So you would have to establish things like the scoring algorithm. Once you did that, maybe you could teach it to say, okay, I prefer these higher scores.
Really, I'm not sure what it could do that was smarter. The only thing I can think of here is, I'm thinking I may add some sort of maximum limit on how many experiments we can run in the match boiler, just to cut it down: okay, past that, pick something at random. But I don't know how AI would make it better.
On the other hand, given all the keynotes recently, I'm worried about how adding AI to correlate may make it murder me in my sleep. Thanks. Second question, yes? Sure, please. Thank you very much for the talk. That was really lovely. I was thinking about the information retrieval domain
while listening to the talk. Have you had some inspiration there, or was it just completely random things coming to my mind when I heard this? Because when you create a set of keywords, basically, for an object, this object is kind of a document
in the terminology of information retrieval, and algorithms and measures like TF-IDF, for example, should work there. Oh, I don't know anything about that, really; again, I'm off on my mountaintop just thinking about my problems for myself. I've used this for matching up podcasts
with data scraped off of web pages. I've used this for, again, it's usually like MP3 files scraped off of web pages. I haven't used it for anything like that. I don't really think about document retrieval technology or document databases, anything like that. I'm sorry. Okay, thank you. Sure, sure. Okay, so that concludes the questions.
Thank you very much again for the talk and let's have another round of applause for Larry. Thank you.