JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
Formal Metadata

Number of Parts | 155
License | CC Attribution 3.0 Germany: You may use, modify, and reproduce the work or its content in unaltered or modified form for any legal purpose, distribute it, and make it publicly accessible, provided you credit the author/rights holder in the manner specified by them.
Identifiers | 10.5446/42894 (DOI)
SIGMOD 2019 | 61 / 155
Transcript: English (auto-generated)
00:02
The data scientist has a table of companies, and he wants to find out more information about those companies. So what we can do is translate this question into a query for a table that could be joined with this table. But the existing data lakes, such as the open data portals
00:21
that we have, only support keyword search or category browsing. So they don't really support this query. So in this talk, I'm going to discuss a solution to this problem. So let's first define what joinable table search is. So the user would start with an input table.
00:42
We call it the query table, and the user specifies a column that we call the query column. So from a data lake of many, many tables, we want to find tables that could be joined with my query table on the query column. So how do we measure the relevance of the tables
01:03
that we get from the data lake? Or in other words, how do we define joinability in this case? So we define joinability as the number of distinct values in the query column that are covered by the join, or the intersection size.
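As a minimal sketch of this definition (the column values here are invented for illustration), joinability is just the size of the distinct-value intersection of the two columns:

```python
# Joinability as defined in the talk: the number of distinct values in the
# query column covered by the join, i.e. the intersection size of the two
# columns' distinct-value sets. Example values are invented.

def joinability(query_col, candidate_col):
    """Overlap set similarity: |distinct(Q) & distinct(X)|."""
    return len(set(query_col) & set(candidate_col))

query_column = ["Acme", "Globex", "Initech", "Acme"]   # duplicates ignored
candidate_column = ["Globex", "Initech", "Umbrella"]

print(joinability(query_column, candidate_column))  # 2
```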
01:21
So this definition makes our problem an instance of the overlap set similarity search problem. So how do we build a system for joinable table search? We first preprocess the tables to extract sets from the columns. And then we can index those sets in a search index.
01:44
So when the user inputs a table, we use the same process to extract the query set and search the search index, which will return a list of column IDs and table IDs that will be used to render the user interface.
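The pipeline just described can be sketched end to end; the table names, column names, and the linear-scan ranking step here are all invented stand-ins (a real system would search an index, as discussed next):

```python
# Preprocess: extract a distinct-value set from every column of every table.
def extract_sets(tables):
    """Map each (table_id, column_id) to that column's set of distinct values."""
    out = {}
    for table_id, columns in tables.items():
        for col_id, values in columns.items():
            out[(table_id, col_id)] = set(values)
    return out

tables = {
    "t1": {"company": ["Acme", "Globex"]},
    "t2": {"name": ["Globex", "Initech"]},
}
indexed = extract_sets(tables)

# Query: extract the query set the same way, then rank columns by overlap.
query_set = {"Globex", "Initech"}
best = max(indexed, key=lambda key: len(indexed[key] & query_set))
print(best)  # ('t2', 'name') -- the (table ID, column ID) to render for joining
```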
02:03
So we have developed two search indexes, and JOSIE is the one we're going to talk about today. Our other index, LSH Ensemble, was published in VLDB 2016. So just a quick overview of what LSH Ensemble was: it's a threshold-based approximate search index.
02:22
And right now it's part of the datasketch Python library. It has been used by many people, including at Google, MIT, and Stanford. It has over 100 dependents and almost 900 stars on GitHub.
02:40
So some might wonder why we developed two algorithms for the same problem. But to be fair, they are different problems. LSH Ensemble solves the problem of approximate search with a threshold. JOSIE, on the other hand, solves the top-k problem and is an exact algorithm.
03:03
So before I go on to talk about the algorithm, I want to give an overview of the inverted index, because it's used by many exact algorithms, including JOSIE. You can visualize an inverted index as a giant matrix.
03:23
The rows are the sets, and the columns are data values. So in this diagram, the query set is at the top, and a solid circle indicates that the set contains the corresponding data value. So when you take a row, you effectively read a set.
03:44
When you take a column, you get the sets that contain the corresponding data value. So we also call a column a posting list. So given this, what is the challenge when we try to do overlap set similarity search
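A toy version of this matrix view (set IDs and data values are invented): reading a "row" returns a set, and reading a "column", i.e. a posting list, returns every set containing that value:

```python
from collections import defaultdict

# Build a toy inverted index over a few sets.
sets = {
    "X1": ["a", "b"],
    "X2": ["b", "c"],
    "X3": ["a", "c", "d"],
}

inverted_index = defaultdict(list)
for set_id in sorted(sets):          # deterministic posting-list order
    for value in sets[set_id]:
        inverted_index[value].append(set_id)

print(sets["X1"])              # reading a row:    ['a', 'b']
print(inverted_index["a"])     # reading a column: ['X1', 'X3']
```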
04:01
in data lakes? So previous benchmarks for set similarity search utilized data mostly from natural language texts, semi-structured data, and keywords. For data lakes, on the other hand, we observe very different characteristics.
04:21
They have very large sets, and orders of magnitude more unique data values. So the costs of reading posting lists and sets are much bigger; they cannot be neglected. An efficient search algorithm should consider those costs. Let's take a baseline attempt.
04:44
Let's say we want to find the top one. The first baseline, which we call posting list union, is a very classic algorithm. All it does is read all the posting lists. So in this case, the query is at the top, and the red circles indicate the overlapping data values
05:03
between the sets and the query. So the posting list union algorithm in this case reads seven data values from the matrix. The second baseline is what we call prefix filtering. The prefix filtering algorithm works differently: it starts by reading posting lists one by one
05:22
and reads the sets as it observes them in the posting lists. It will keep reading sets until it finds no better sets, that is, no set that has a higher overlap than the current top-one set.
05:40
So it will stop when it sees X4, because in this case X4 overlaps the query completely. So in this case, the prefix filtering algorithm reads 18 data values from the matrix. So the posting list union algorithm wins in this example. But let's just change to a new data set
06:03
in the second example. So in this second example, the posting list union algorithm actually reads more data values than the prefix filtering algorithm. And all I did was change the sets in the matrix. So why is there such a difference?
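The posting list union baseline can be sketched as a counting pass over all of the query's posting lists (the index contents here are invented); its cost is the total length of those lists, regardless of how the sets are laid out:

```python
from collections import Counter

# Posting list union: read every posting list for the query's values and
# count, per candidate set, how many query values it shares. The largest
# counts are the exact top-k overlaps.
inverted_index = {
    "a": ["X1", "X3"],
    "b": ["X1", "X2", "X3"],
    "c": ["X3"],
}

def top_k_union(query_values, index, k=1):
    counts = Counter()
    for value in query_values:
        for set_id in index.get(value, []):
            counts[set_id] += 1
    return counts.most_common(k)

print(top_k_union({"a", "b", "c"}, inverted_index, k=1))  # [('X3', 3)]
```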
06:20
This is because neither of these algorithms actually considers the cost of reading the sets and the posting lists. So they are data-independent algorithms. So I'm going to introduce JOSIE, which is a data-dependent algorithm, and which considers read costs in order to minimize them.
06:45
So let's start with the intuition behind JOSIE. How do we reduce read costs? We can reduce the cost of reading posting lists by reading candidate sets. So I'm going to give you a walkthrough
07:01
of this example. So let's say we have read posting list V2 and candidate X4, and X4 has an overlap of four with the query. So we know the maximum overlap of any unseen candidate in posting lists V4, V5, and V7 is going to be three,
07:24
because there are only three posting lists left, which is less than four. So what that says is that we know for sure that X4 is the top one, because none of the candidates in the unseen posting lists is going to beat X4.
07:44
Therefore, we can eliminate V2, V4, V5, and V7 without reading them. Now, it also works the other way around. By reading posting lists, we can also reduce the cost of reading candidate sets.
08:03
So let's say we have read two posting lists, and the current candidate sets are X1, X2, X3, and X4. Assume the current best one has an overlap of four. Because the inverted index stores the sizes of all sets, we also know that the candidate sets X1, X2, and X3
08:26
have only one more possible overlap after V4. And within the posting lists that we have already read, they only have one overlap with my query as well. So in total, these three sets
08:42
can have at most two overlaps, which is less than the current best one, which is four. So we can eliminate those sets without reading them. So what we see is that there is a kind of trade-off. If we read posting lists, we effectively reduce the cost of reading candidate sets.
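This elimination argument can be sketched directly (the numbers mirror the walkthrough but are otherwise invented): a candidate survives only if its upper bound, the overlap seen so far plus the most it could still gain, beats the current best exact overlap:

```python
best_exact_overlap = 4   # e.g. X4, which has been read completely

# Per candidate: (overlap seen in the posting lists read so far,
#                 maximum additional overlap from the unread posting lists)
candidates = {
    "X1": (1, 1),
    "X2": (1, 1),
    "X3": (1, 1),
}

survivors = {
    set_id
    for set_id, (seen, remaining) in candidates.items()
    if seen + remaining > best_exact_overlap   # upper bound must beat top-1
}
print(survivors)  # set() -- all three candidates pruned without being read
```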
09:02
The other way around: if we read candidate sets, we reduce the cost of reading posting lists. So that inspired the creation of JOSIE. Okay, so I'm going to give you a very fast, high-level overview of how this algorithm works.
09:23
So we measure work as the total cost of reading the remaining posting lists and candidate sets. And we measure progress as the reduction in the work that we have to do. So reading posting lists reduces the work
09:41
of reading candidate sets, and reading candidate sets improves the power of pruning, which I describe in more detail in the paper, and which in turn reduces the work of reading posting lists. So progress always comes with a price. We measure price as the cost
10:01
of reading either posting lists or candidate sets. So this is just the pseudocode of JOSIE. As long as there is work left to be done, we first estimate how much progress each direction will give us, and then we choose the direction, either reading posting lists or candidate sets, that maximizes the difference between progress and price.
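The greedy step of that loop can be paraphrased as follows; this is an invented illustration rather than the actual JOSIE implementation, and the progress and price numbers are made up:

```python
def choose_action(actions):
    """Pick the action with the largest net benefit (progress - price)."""
    return max(actions, key=lambda name: actions[name][0] - actions[name][1])

# One decision step: here, reading the next posting list promises more net
# benefit than reading the next candidate set, so the algorithm reads it.
step = {
    "read_posting_list": (10, 3),   # estimated progress 10, price 3 -> net 7
    "read_candidate_set": (8, 6),   # estimated progress 8, price 6 -> net 2
}
print(choose_action(step))  # read_posting_list
```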
10:24
So we always take the one that will pay the least price but make the most progress. So the magic behind this algorithm, the magic behind the estimation, is a cost model. The cost model is built
10:44
on the foundation of estimating set intersections. So to be able to estimate the intersection size between sets, we assume that the overlapping values between the query and the candidate sets
11:01
are uniformly distributed. So what does that mean? We can quickly go through this formula. Let's say we have found four overlapping values between Q and X from the posting lists. So by reading all the posting lists we have, we see that Q and X overlap in four of the posting lists.
11:20
We also know that there are four posting lists between the first and the last appearance of set X. So that's the distance. And we use the uniform distribution assumption to multiply the fraction by the whole of Q to get an estimate of the total intersection size.
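A hedged paraphrase of this estimate, with invented numbers: if a fraction of the posting lists spanning the candidate's first and last appearance overlap the query, assume the same fraction holds across all |Q| values:

```python
# Uniformity-based intersection estimate, paraphrasing the talk's formula.
overlap_seen = 3    # overlapping values between Q and X found so far
distance = 4        # posting lists between X's first and last appearance
query_size = 8      # |Q|: number of distinct values in the query set

# Under the uniform-distribution assumption, the observed overlap fraction
# (3 out of 4) extends to the whole query.
estimated_overlap = overlap_seen / distance * query_size
print(estimated_overlap)  # 6.0
```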
11:42
It is as simple as that. In fact, when we developed the algorithm, we also tried more complicated estimation techniques, including a hypergeometric distribution assumption and also global frequencies or co-occurrence frequencies. But we found that the uniform assumption
12:03
is actually accurate enough and is also really fast compared to the other techniques. So in the end, this one gave the best outcome. Okay, so that brings us to the experiments. We used two benchmarks
12:20
with very different data characteristics. The first one is open data: we extracted sets from over 200K open data tables. The second is from WDC Web Tables. So in the first experiment results, using open data, we observed that JOSIE actually outperformed both baselines.
12:42
So the blue line is the posting list union and the orange line is the prefix filter. So it outperformed both baselines by a factor of two, doing pretty well when the k is 10. And we also measured the performance
13:00
for small queries and large queries. The left part is for small queries and the right part is for large queries. JOSIE also outperformed both baselines in this case. And we also tried different k values, and the result is the same: it's still doing better than either baseline.
13:23
JOSIE also maintains its performance when we change the benchmark to web tables. So the difference between the open data benchmark and web tables is that the posting lists in open data are typically shorter.
13:40
But the sets in the open data benchmark are much, much bigger. So what happens is that because posting list union reads only posting lists, it is more performant than prefix filter, which mostly reads sets, when it comes to open data. But in web tables, the posting lists are very long.
14:01
So it's very expensive to read posting lists. But on the other hand, the sets are small. So posting list union actually spends more time than prefix filter. But JOSIE outperformed both, because it always takes the best of the two alternatives.
14:22
So in terms of future work, there are a lot of possibilities to improve the existing joinable table search system. The first possibility is to consider multiset instead of set semantics, so that we can use the number of joinable rows as the relevance measure.
14:42
We can also use entities to represent data values, so that we can match values with different representations. We haven't figured out a way to handle numerical data. We can also consider using embedding vectors to represent natural language text, so that we can match values that have similar semantics.
15:02
In terms of improving the search index, we can build a plugin or extension to existing DBMSs. We can also make the system more distributed to improve performance. More interestingly, for potential applications of this kind of system, in addition to traditional machine learning
15:21
applications, we can apply this to help citizen data scientists. We have so much open data out there with no unified search engine, so we can use this system to help build a search engine for open data. We can also use this technique for data marketplaces,
15:43
where you can help users find tables that actually join with their tables, so they may want to buy those tables. So that's the end of my talk. Any questions? And the code is available, by the way.
16:01
So we have a tutorial at VLDB this year on data lake management. And we're going to talk about a lot more than just data discovery, including extraction, ingestion, cleaning, discovery, and versioning.