A Deep Dive into Tantivy
Formal Metadata
Number of parts | 561
License | CC Attribution 2.0 Belgium: You may use, modify, and reproduce the work or its content for any legal purpose, and distribute and make it publicly available in modified or unmodified form, provided you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/44121 (DOI)
FOSDEM 2019, talk 207 / 561
Transcript: English (automatically generated)
00:05
…take it away.
01:00
Hey, I'm glad you're here. I assume everybody in the room is familiar with Lucene, and came here to hear a lot about the Lucene ecosystem. Within this ecosystem, Tantivy is the project closest to Lucene.
01:21
So it's very, very close in concept to Lucene, by which I mean that if you want to build a search engine using Tantivy, you write a program that uses Tantivy as a library, and you use all of the building blocks in Tantivy to actually assemble the search engine. It's not a turnkey solution in which you run
01:42
a server and everything is ready. It's written in Rust. I don't know if everyone is familiar with Rust; can you raise your hand if you are? Oh, wow. Rust is getting popular, I guess. So for those who are not familiar with it, Rust is a programming language that is still kind of young,
02:03
so I need to introduce it a little. To give you an idea of what Rust is (Rust people will hate me for this), you can consider it a safe C++.
02:23
So the code that you write is basically safe, and the performance is very similar: it's very difficult to tell Rust programs apart from C++ code if you just look at the generated code. You will get the exact same performance.
02:41
So you do get memory safety, you get a great ecosystem, you get great tooling, and the syntax is nice. It's simple to learn. An OK definition, I think. And because Rust compiles to native code,
03:04
Tantivy is very portable: you can compile it for macOS and Windows, of course, but also for platforms that are not x86, so you can actually run Tantivy on ARM,
03:21
like we did before. I even experimented with WebAssembly, but it's not something I will talk about too much, because there is no real WebAssembly support in Tantivy right now. At one point, maybe, Tantivy will run there too.
03:44
It's still a small project. It's very important for me to keep the code base quite small: there are only 30,000 lines of code, even if you include all of the different crates that I had to write for Tantivy.
04:02
Because, as I told you, it's a young project, and if you'd like to bet on a project like that, a small code base is the best guarantee. And also, ideally, I would love, in the future, for companies that are considering making their own search engine from scratch,
04:20
if for one reason or another they cannot use Lucene for that project, rather than starting entirely from scratch, to at least consider using Tantivy. And for the things that Tantivy is not good at, the small code base makes it possible for people to actually embrace the code, own it, and work with it.
04:47
It's fast, and I put a link to the benchmark here. I don't think we have time to go through the benchmark, but it's an interesting one: it breaks down the performance of Tantivy query by query,
05:02
so you can look at many different kinds of queries. Usually I use it to compare different versions of Tantivy against the current one. But what's really interesting is that you can actually identify: oh, for this query, Lucene is faster, what's going on there? So that's a great way for me to explore
05:20
the optimizations that are in Lucene and not necessarily in Tantivy, and to know what the next step for Tantivy should be. It works the other way around, too: for some queries, Tantivy is abnormally faster than Lucene. Maybe that's a good thing, that there is some differentiation between Tantivy and Lucene,
05:43
and some places where I suspect Tantivy is doing something more interesting than Lucene. So I'm not going to talk too much about Rust, actually.
06:00
From now on, we're going to build a search engine together. And as a matter of fact, Tantivy is totally inspired by Lucene, so everything I describe from now on is also valid for Lucene, which shares the same design. So hopefully, at the end of this talk, you will understand Lucene maybe a little bit better as well.
06:26
So yeah, let's build a search engine together. Most basically, our objective will be: we want a search engine that (I'm trying really hard not to use the word "scale" here, because we are still talking about a search engine that
06:42
sits on one machine) can search and index terabytes of data, a lot of data, on a machine that only has gigabytes of RAM. So we want to make sure that all of the stuff that we are doing can take on that scale
07:01
without any problem. And Tantivy can do that. I actually tested it. Right now we are at version 0.8, and for this test I used version 0.6.
07:22
The data set I used is Common Crawl, which is publicly available. It's kind of a snapshot of a crawl of the web, and it's a pretty large one, a very decent one: it's around 3 billion pages.
07:40
So I took this snapshot. To be accurate, I downloaded half of it. I downloaded half of it using my own personal internet connection; I think my internet provider hates me for it. It took around one month, full time.
08:02
I'm in Japan, so we have a particularly good kind of connection, which made it work at that scale. And I wanted to test if I could index it and see what happened, because it's a very good test, especially at home. I bought a house in Japan,
08:23
and the electrical network is not really great for my usage. I'm French, so I drink coffee, and I have toast in the morning, and when I switch both on at the same time, I get a blackout at home. It's a great crash-resilience test for a search engine.
08:41
It made me suffer, but yeah, I did that. And I put a small Python library on top of it. Basically, the goal of the Python library is that you give it a query
09:04
containing a placeholder. So for instance, you could say "French people are …", and then you ask for all of the adjectives that follow this phrase. And so the two libraries together, Tantivy plus this small NLP library,
09:22
made it possible to look for the most common adjectives associated with "French people" across the internet. I think it's pretty accurate.
09:46
OK, so let's move on. So we want to be able to answer queries over our documents. (Can you check if you're muted, please?)
10:07
All right.
10:29
OK, so usually that means that you're going to want to build an inverted index. And an inverted index is simply something
10:40
that associates to each term a sorted list of document IDs. And this simple data structure is sufficient to actually compute any kind of Boolean query. So I'm going to explain rapidly how the intersection works.
11:01
So let's say we are trying to find all of the documents matching a query made of two terms ANDed together. We are going to have these two posting lists. We take the first one. The first element of the first one is 2. And we are going to seek into the second posting list for document 2.
11:21
By seeking, what I mean is that we are going to do exactly as if we were advancing, scanning through the second posting list, and we are going to stop at the moment where we find a document that is greater than or equal to the one we are looking for. If it's equal, then that means that 2 is in the intersection.
11:41
If it's greater, then it's not. So here, it's not: we found document 5. And we proceed symmetrically: we seek for 5 in the first posting list. Here we go, 5 was found, so it's going to be in our output. We can advance the first posting list.
12:02
And now we are seeking for 7 in the second list, and so on and so forth, until we reach the end of one of the posting lists. So the only takeaway here is that if you want fast intersection, you need a fast way
12:21
to seek through a posting list. For union, we could use a similar strategy: basically a merge (everybody has a different name for this algorithm). We could merge those two posting lists and get the union, but that's actually inefficient. So I'm using the same trick as Lucene does.
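The leapfrog intersection walked through above can be sketched in plain Rust. This is a simplified illustration over in-memory slices: Tantivy's real posting lists are block-compressed, and seeking uses skip data rather than the linear scan shown here.

```rust
/// Advance `i` until `list[i] >= target`; returns the new position.
/// A real implementation would use skip data; a linear scan keeps the sketch simple.
fn seek(list: &[u32], mut i: usize, target: u32) -> usize {
    while i < list.len() && list[i] < target {
        i += 1;
    }
    i
}

/// Intersect two sorted posting lists by leapfrogging between them.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            // The document appears in both lists: it is in the intersection.
            out.push(a[i]);
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i = seek(a, i, b[j]);
        } else {
            j = seek(b, j, a[i]);
        }
    }
    out
}

fn main() {
    // The walkthrough above: doc 2 misses, 5 matches, then we keep leapfrogging.
    let a = [2, 5, 7, 12];
    let b = [5, 6, 7, 11];
    assert_eq!(intersect(&a, &b), vec![5, 7]);
    println!("{:?}", intersect(&a, &b));
}
```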
12:42
The idea is that you prepare in advance some kind of bit set. Just for the sake of these slides, the bit set here is of length eight, but in Tantivy we're using 4,096 bits, so it's much larger.
13:00
And what we do is: we take the first posting list, and we are going to gather all of the docs that are between 2 and 2 plus this horizon (in this case 8, but in Tantivy 4,096),
13:22
and we set the corresponding bits in the bit set. So in this bit set, the first bit means 2, the second bit means 3, and so on and so forth. We do that for the first posting list, and then the same on the second posting list. Here is the bit set we get. And now we can flush it:
13:41
we transform this bit set back into the documents it represents. This can be done quite fast, because your CPU usually has an instruction to pop the lowest significant bit out of a 64-bit word. So you just pop it, and so on and so forth.
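The windowed bit-set union can be sketched the same way. In this toy version the bit set is a single 64-bit word rather than the 4,096 bits used in Tantivy, and `trailing_zeros` plays the role of the "pop the lowest significant bit" CPU instruction.

```rust
// Window width: Tantivy uses 4,096 bits; one u64 word keeps the sketch short.
const HORIZON: u32 = 64;

/// OR the doc IDs of each posting list that fall in [base, base + HORIZON)
/// into a bit set, then flush the set back into sorted doc IDs by repeatedly
/// popping the lowest set bit.
fn union_window(lists: &[&[u32]], base: u32) -> Vec<u32> {
    let mut bits: u64 = 0;
    for list in lists {
        for &doc in list.iter() {
            if doc >= base && doc < base + HORIZON {
                bits |= 1u64 << (doc - base);
            }
        }
    }
    let mut out = Vec::new();
    while bits != 0 {
        let lsb = bits.trailing_zeros(); // index of the lowest set bit
        out.push(base + lsb);
        bits &= bits - 1; // pop the lowest set bit
    }
    out
}

fn main() {
    let a = [2, 3, 7];
    let b = [3, 5, 9];
    // The union comes out sorted and deduplicated.
    assert_eq!(union_window(&[&a[..], &b[..]], 0), vec![2, 3, 5, 7, 9]);
}
```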
14:05
All right. So that's how an inverted index is used to compute Boolean queries. Now the question is: how do we represent that on disk? Tantivy relies on mmap for all of its IO. When you start Tantivy, the only thing
14:22
it does is basically mmap all of the files of the index. So it goes really, really fast, and it's ready to go. We don't have to load any data structure into anonymous memory or keep some kind of hash map in anonymous memory: everything goes through mmap.
14:40
So startup time is very, very nice. But that means that we need the data on disk to be usable as is. (For Java people, anonymous memory is more or less like the heap.) So the first data structure that we are going to have
15:02
in our index is the term dictionary. Our term dictionary will be broken down into two steps. One step associates each term, a sequence of bytes, to some kind of term ID. And then we have another data structure that
15:22
associates the term ID to some kind of term-info structure, which basically holds pointers into the files, to the beginning of the posting list. I'm not going to talk about the second data structure, because it's boring. But let's talk about the first one. So how do we go from a term, a sequence of bytes, to a term
15:44
ID? If you are actually trying to build your own search engine, you have two broad families of solutions for this. One is hash-based, in general. Maybe you will go fancy and use perfect hashing or something like that.
16:01
It's a very nice solution in the sense that you get very fast lookups. And especially if you don't have much RAM and everything is sitting on your hard disk, you will probably require very little IO: your hash map can send you directly to the right place on your disk.
16:21
The other solution is tree-based: a trie. This will require slightly more CPU, and you will have a lot more random IO.
16:41
That kind of depends on the layout of your data, but you tend to jump from one node of your trie to another, so it might mean a lot of seeks if your data is on disk. But it has a lot of benefits. One benefit is that you can iterate
17:03
through a range of keys. And naturally, you will probably use term ordinals. By term ordinals, I mean: if "arabica" is the first word in your dictionary
17:20
when sorted lexicographically, then it will get term ID 0; the next word will get the next term ordinal, and so on and so forth. So term ordinals guarantee that term IDs are sorted
17:40
exactly the same way as your terms. That's a very nice property. But more importantly, you can do an intersection of your trie, or your trie-like structure, with a DFA. So, are you guys familiar with what a DFA is?
18:03
OK, I'll still explain it. (I actually made a mistake in the DFA here.) DFA stands for deterministic finite automaton, and it's one way to implement regular expressions. You can transform any regular expression
18:21
into an automaton like this one. And the way it works is: once you have it in this shape, matching a string against the automaton means that you are going to consume every character in your string.
18:41
You start in the start state over there (I was hoping I could point at stuff, but never mind), the white state over there. Let's say that you are trying to match "carousel". Your first character is "c". You look at the outbound arrows emitted by the state you are in.
19:02
One is labeled with a "c". So that means that after consuming the "c", I am in that state, the yellow one. Then "carousel"'s second letter is an "a": I follow this arrow, and I am in the blue state. "r", and I end up in this state.
19:20
And "ouse", the letters up to the last "l", follow the arrow with the star over there. Then the last "l" brings me to the end state. So it's really nice, because I'm advancing one character at a time.
19:41
I just have to look at the end state after consuming my string to say whether I matched or not. I marked the states that match with a double circle here. So, our trie looks like this, and it's actually possible to match this DFA
20:03
over the trie very efficiently. I'm going to show how it's done. Here, if we consume the "c", we end up in the yellow state, just like for "carousel". If I consume the "a", I end up in the blue state. If I get a "b", then I end up in this gray state,
20:22
which happens to be a sink. A sink is a dead end: we will never match if we reach this state, so we don't need to look at what happens with the following characters. And that's great, because it means that we don't care about this whole subtree anymore. For the purpose of fitting onto the slide,
20:43
this trie is very simple, but you can imagine that maybe there is a gigantic subtree below this node, and we just cut it off. So that's much, much faster. And now we want to match "m". We don't
21:00
have to recompute the state reached after "a": we can just look at the state recorded for "a" before. So it's blue. And we see that we are in the sink state again, and so on and so forth. So we are just painting colors over our trie to find out which terms in our dictionary
21:22
are matching our deterministic finite automaton. That means that I can iterate through all of the terms that match a regular expression. But also, I can get very rapidly all of the terms that are at a Levenshtein distance, an edit distance, of one or two. So that means that if you ask me: please stream me
21:42
all of the terms that are one typo away from what I typed, I can do that very, very efficiently. So Tantivy is using a tree-based solution. It's actually using a finite state transducer, like Lucene does.
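To make the pruning concrete, here is a toy trie-times-DFA intersection. The types and names here are invented for illustration; this is not the finite state transducer implementation Tantivy actually uses. The DFA encodes the regex `ca.*`, and reaching the sink state cuts off whole subtrees of the trie without visiting them.

```rust
use std::collections::BTreeMap;

const SINK: usize = usize::MAX; // dead-end state

/// A tiny table-driven DFA over bytes (toy representation).
struct Dfa {
    start: usize,
    accept: Vec<bool>,
    next: Vec<BTreeMap<u8, usize>>, // a missing edge leads to SINK
}

impl Dfa {
    fn step(&self, state: usize, byte: u8) -> usize {
        if state == SINK {
            return SINK;
        }
        *self.next[state].get(&byte).unwrap_or(&SINK)
    }
}

#[derive(Default)]
struct Trie {
    children: BTreeMap<u8, Trie>, // sorted, so iteration yields lexicographic order
    is_term: bool,
}

impl Trie {
    fn insert(&mut self, term: &str) {
        let mut node = self;
        for b in term.bytes() {
            node = node.children.entry(b).or_default();
        }
        node.is_term = true;
    }
}

/// Walk the trie and the DFA in lockstep. Reaching the sink prunes the whole
/// subtree: we never look at any term below a dead-end prefix.
fn collect(trie: &Trie, dfa: &Dfa, state: usize, prefix: &mut Vec<u8>, out: &mut Vec<String>) {
    if state == SINK {
        return; // dead end: skip this entire subtree
    }
    if trie.is_term && dfa.accept[state] {
        out.push(String::from_utf8(prefix.clone()).unwrap());
    }
    for (&b, child) in &trie.children {
        prefix.push(b);
        collect(child, dfa, dfa.step(state, b), prefix, out);
        prefix.pop();
    }
}

/// All `terms` matching the regex `ca.*`, found by trie x DFA intersection.
fn terms_starting_with_ca(terms: &[&str]) -> Vec<String> {
    // DFA for `ca.*`: 0 --c--> 1 --a--> 2, and state 2 loops on any byte.
    let mut next = vec![BTreeMap::new(), BTreeMap::new(), BTreeMap::new()];
    next[0].insert(b'c', 1);
    next[1].insert(b'a', 2);
    for b in 0u8..=255 {
        next[2].insert(b, 2);
    }
    let dfa = Dfa { start: 0, accept: vec![false, false, true], next };

    let mut trie = Trie::default();
    for t in terms {
        trie.insert(t);
    }
    let mut out = Vec::new();
    collect(&trie, &dfa, dfa.start, &mut Vec::new(), &mut out);
    out
}

fn main() {
    let matches = terms_starting_with_ca(&["cab", "carousel", "cat", "dog"]);
    assert_eq!(matches, vec!["cab", "carousel", "cat"]);
    println!("{:?}", matches);
}
```

The "dog" subtree is pruned at its very first byte, which is the whole point of intersecting the automaton with the dictionary instead of testing every term.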
22:01
And I didn't have to code anything: the Rust ecosystem is nice enough that somebody had already written a very nice implementation of a finite state transducer. I will not explain how it works, because I didn't code it. The thing that you might want to know about it is that it's pretty much like a trie,
22:24
except that nodes can share suffixes; a trie will never have an edge that goes like this. Because it can share suffixes, you end up with something that is slightly more compact than a trie.
22:41
And that's always a very nice feature when you're paying for space with your RAM. Now let's talk a little bit about how we are going to encode posting lists. Posting lists are a lot of integers, and they are going to take up a lot of the size of your index.
23:00
And you're going to want to compress them. Integer compression is a field that has been well studied; there are a lot of solutions for it. I put a chart over there. But basically, my point here is that you have a trade-off between something
23:21
that compresses a lot and something that is very fast. It's always like that, right? Another thing that I need to point out is that basically all of the best algorithms over there
23:42
are using SIMD instructions. SIMD instructions are instructions on your CPU that make it possible to process four or eight integers at a time. That's something that Tantivy uses a lot.
24:04
Tantivy is actually using this one over there. It's a compression scheme that has been designed by Daniel Lemire. I recommend reading his blog; it's always very interesting if you're interested in things
24:23
related to search or data structures. I actually used to depend on his library, in C, and I removed it because I prefer to be entirely in Rust: I re-implemented it entirely in Rust. The gist of it is that, like in many of those schemes,
24:45
your posting lists are increasing, so it starts by doing something called delta encoding. Instead of compressing your integers directly, what you do is take the difference between consecutive integers, and you get something that is much smaller.
25:01
You take a pack of those. In my case, blocks are 128 integers long. And out of those 128 deltas, you look at the one
25:21
that is the largest. I think in this example it should be this one, 16. And you notice that it can be represented using five bits. So what you do then is represent all of your deltas over five bits, and you just concatenate these representations.
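Delta encoding followed by bit packing can be sketched in scalar Rust as follows. This is an illustration of the idea only: the real codec works on fixed blocks of 128 integers and uses SIMD, and the helper names here are made up for the example.

```rust
/// Delta-encode a strictly increasing posting list, then bit-pack the deltas
/// using the minimal number of bits that fits the largest delta.
/// Returns (bits per delta, packed bytes, number of values).
fn compress(docs: &[u32]) -> (u32, Vec<u8>, usize) {
    let mut deltas = Vec::with_capacity(docs.len());
    let mut prev = 0u32;
    for &d in docs {
        deltas.push(d - prev);
        prev = d;
    }
    // Number of bits needed for the largest delta (at least 1).
    let num_bits = deltas
        .iter()
        .map(|d| 32 - d.leading_zeros())
        .max()
        .unwrap_or(1)
        .max(1);
    // Concatenate each delta over exactly `num_bits` bits.
    let mut bytes = Vec::new();
    let (mut acc, mut acc_len) = (0u64, 0u32);
    for &d in &deltas {
        acc |= (d as u64) << acc_len;
        acc_len += num_bits;
        while acc_len >= 8 {
            bytes.push((acc & 0xff) as u8);
            acc >>= 8;
            acc_len -= 8;
        }
    }
    if acc_len > 0 {
        bytes.push((acc & 0xff) as u8);
    }
    (num_bits, bytes, deltas.len())
}

/// Unpack `len` deltas of `num_bits` bits each and undo the delta encoding.
fn decompress(num_bits: u32, bytes: &[u8], len: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(len);
    let (mut acc, mut acc_len) = (0u64, 0u32);
    let mut iter = bytes.iter();
    let mask = (1u64 << num_bits) - 1;
    let mut prev = 0u32;
    for _ in 0..len {
        while acc_len < num_bits {
            acc |= (*iter.next().unwrap() as u64) << acc_len;
            acc_len += 8;
        }
        let delta = (acc & mask) as u32;
        acc >>= num_bits;
        acc_len -= num_bits;
        prev += delta; // undo delta encoding: running sum restores doc IDs
        out.push(prev);
    }
    out
}

fn main() {
    let docs = [2, 5, 7, 12, 100];
    let (num_bits, bytes, len) = compress(&docs);
    assert_eq!(decompress(num_bits, &bytes, len), docs);
    println!("{} bits per delta, {} bytes total", num_bits, bytes.len());
}
```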
25:41
So that's called bit packing. What I described is the scalar version of bit packing. What I actually do is use SIMD instructions for that, and decode four integers at a time. The algorithm really looks like the trick
26:04
we used at school to avoid writing lines when we were punished by our teacher: it uses exactly the same algorithm as the scalar solution, you just use SIMD instructions in place
26:23
of the scalar instructions. It's very, very simple. And there is an interesting improvement compared to the methods that existed before Daniel Lemire's paper, which is that the algorithm also takes advantage
26:41
of the fact that, as we are decoding those integers, at one point we have a register holding the deltas we are decoding, and we can undo the delta encoding using SIMD instructions as well. So this is also done with SIMD instructions, at the point where the values are already
27:01
in a register. So it goes really, really fast. When I say fast, to give you an idea, I'm talking about four billion integers per second. So we are already flirting with the bandwidth of your RAM, basically. I told you that Rust was nice and able to compile very efficient code,
27:23
very close to C++. So I wrote that in Rust, and this is the generated assembly. I'm not going to tell you that I understand what's in there. But the important point here is that all of the long-looking instructions over there, every single one of them, work on four integers at a time.
27:44
You will only see scalar instructions maybe here, here, here, and here. So the assembly code that is generated is really as good as possible. It's very close to, probably exactly the same as, what we had in C++.
28:05
So I think we talked a little this morning about BM25 and TF-IDF; those are the default scoring functions of Lucene. Tantivy comes with BM25 as well.
28:24
This requires access to the term frequency as we search: when we match a document, we need to know the number of times the term appears in the document. And it's nice to have good locality for that. So we actually interleave blocks of doc IDs
28:43
with blocks of term frequencies, so that the frequencies are likely to be in cache at the moment we read them, as we are doing the matching, as we are running the query.
29:02
So we have one block of doc IDs, then one block of term frequencies, and so on and so forth. Each block represents 128 documents. Unfortunately, a posting list is not always exactly a multiple of 128 documents, so the last block uses a different compression scheme,
29:22
which is not that interesting. We also need to store the information of how many bits are used to encode each block. And we would also like some way to avoid decompressing a block
29:41
when we are running an intersection, for instance: maybe the document is not in this block, and in that case we would like to avoid decompressing it entirely. So for that, we have another structure on top that allows us to skip precisely, and to decompress only the blocks that might be interesting.
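The skip structure can be illustrated with a minimal sketch: record the last doc ID of each block, and a seek only has to find the first block whose last doc ID reaches the target. The block length and type names are invented for readability (Tantivy's blocks hold 128 documents).

```rust
const BLOCK_LEN: usize = 4; // Tantivy uses 128; 4 keeps the example readable.

/// Per-block skip info: the last doc ID of each block lets us find the only
/// block that could contain a target doc, without decompressing anything else.
struct SkipInfo {
    last_doc_per_block: Vec<u32>,
}

impl SkipInfo {
    fn build(docs: &[u32]) -> SkipInfo {
        let last_doc_per_block = docs
            .chunks(BLOCK_LEN)
            .map(|block| *block.last().unwrap())
            .collect();
        SkipInfo { last_doc_per_block }
    }

    /// Index of the first block whose last doc ID is >= target. All blocks
    /// before it are skipped: they never get decompressed.
    fn seek_block(&self, target: u32) -> Option<usize> {
        self.last_doc_per_block.iter().position(|&last| last >= target)
    }
}

fn main() {
    let docs = [2, 5, 7, 12, 15, 21, 40, 43, 90];
    let skip = SkipInfo::build(&docs);
    // Doc 21 can only live in block 1 (docs 15..=43); block 0 is skipped entirely.
    assert_eq!(skip.seek_block(21), Some(1));
    assert_eq!(skip.seek_block(91), None); // past the end of the posting list
}
```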
30:06
We're going to talk about something else now. I said at the beginning of the talk that we wanted to be able to index an arbitrarily large amount of data; I used the word terabytes.
30:23
And that's a big problem, because if we have a limited amount of RAM, how are we going to build these nice data structures on disk? We have another problem too: people might create an index and then want to add new documents. We want something that is dynamic;
30:41
we want to be able to add documents to an existing index. That's another problem. It happens that the solution is the same for both of these problems. So, just to refine the idea of why it's difficult to add new documents to an index like that:
31:01
the data structures that I described to you are extremely compact on disk. Everything is super compressed; everything is laid out on disk one thing after the other. If it were a bookshelf, it would look like this. And if you're in front of a bookshelf like that, and somebody gives you a book and tells you
31:22
to put it in the right place, you're going to suffer: you're going to have to move everything around. It's a nightmare. If you want to update your bookshelf all of the time, you want something that looks like this instead. So generally speaking, there is a trade-off between being dynamic and being compact.
31:44
So there is a nice trick that exists in many databases, called the logarithmic method. The idea is: we're going to keep compact bookshelves like that, but we are going to have many of them.
32:00
The way it works is that the user will tell Tantivy, or will tell Lucene: hey, I will give you a budget of, let's say, 300 megabytes. And these 300 megabytes will be used for the dynamic bookshelf. So there is going to be a big bookshelf,
32:20
and people will add books into it until the bookshelf is full. Once the bookshelf is full, we are going to transform it into a very compact bookshelf. I'm going to stop with the image here. Basically, we serialize our dynamic data structures
32:41
into the static data structures that I described before. And this piece of very compact index is called a segment. That's a word that you have probably heard around Lucene; that's a segment. And both in Lucene and in Tantivy,
33:03
there is a strong architectural choice: a segment is basically an independent index. You could literally pick a segment, copy it into another index, and it would run just the same, as long as the schemas are the same.
33:20
That's a very interesting property, actually. Now, if we do that, we are going to end up with a lot of compact segments, and maybe that's not optimal. If we have one terabyte of data, that would be a lot of small segments. And when we do a search, we are going to have to do as many lookups in the term dictionary,
33:41
and that's very inefficient, right? So in the background, we also have a process that merges those segments together. And as a user, you can actually define your own strategy to merge these segments. But it comes with a default strategy
34:03
called the log merge policy, and the heuristic behind it is simply: we try to merge, by default, eight segments that have about the same size. And yeah.
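The logarithmic method can be modeled with a toy simulation. The names are invented, and the merge heuristic is a much-simplified version of the "merge eight segments of about the same size" policy described above, using the log2 of a segment's size as its tier.

```rust
/// Segments in the same size tier (log2 of size) get merged once
/// MERGE_FACTOR of them exist, mirroring the default of eight.
const MERGE_FACTOR: usize = 8;

/// Toy model: documents accumulate in a RAM buffer; when the budget is
/// reached, the buffer is flushed as a new segment, and a background-style
/// merge collapses same-tier segments. (Simplified sketch, not Tantivy's
/// actual LogMergePolicy.)
struct Writer {
    budget: usize,        // how many docs fit in the dynamic structure
    buffered: usize,      // docs currently buffered in RAM
    segments: Vec<usize>, // sizes (in docs) of the serialized segments
}

impl Writer {
    fn new(budget: usize) -> Writer {
        Writer { budget, buffered: 0, segments: Vec::new() }
    }

    fn add_document(&mut self) {
        self.buffered += 1;
        if self.buffered == self.budget {
            // Serialize the dynamic structure into a compact segment.
            self.segments.push(self.buffered);
            self.buffered = 0;
            self.maybe_merge();
        }
    }

    /// Merge any group of MERGE_FACTOR segments in the same size tier.
    fn maybe_merge(&mut self) {
        loop {
            let tiers: Vec<u32> = self.segments.iter().map(|s| s.ilog2()).collect();
            let mut merged = false;
            for &tier in &tiers {
                let idx: Vec<usize> =
                    (0..self.segments.len()).filter(|&i| tiers[i] == tier).collect();
                if idx.len() >= MERGE_FACTOR {
                    let total: usize = idx.iter().map(|&i| self.segments[i]).sum();
                    for &i in idx.iter().rev() {
                        self.segments.remove(i);
                    }
                    self.segments.push(total);
                    merged = true;
                    break;
                }
            }
            if !merged {
                return;
            }
        }
    }
}

fn main() {
    let mut w = Writer::new(100);
    for _ in 0..6_400 {
        w.add_document(); // 64 flushes of 100 docs each
    }
    // 64 equal-sized segments collapse logarithmically: 64 -> 8 -> 1.
    assert_eq!(w.segments, vec![6_400]);
}
```

The point of the tiers is that any segment is merged only O(log n) times over its lifetime, which keeps the total rewrite cost logarithmic.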
34:21
So this technique is also really nice, because that means that we can do multi-threaded indexing very, very easily. So we can't really ask you how much data you give as a budget and also how many indexing threads you want to use. And then when you open the document, what you are doing
34:41
is that you are opening the document to a document queue. And in the background, you don't control it, but your given number of indexing threads that are consuming this document queue and that are populating these dynamic data structures that I was talking about.
35:01
And once they reach their capacity, then they're going to serialize the segment. So the serialized segment is a very compact data structure that we use in the end. And some merging thread will merge those very transparently. Yes.
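The pipeline just described can be sketched with a shared document queue and worker threads that flush an in-memory buffer into a "segment" once it reaches capacity. This is a stdlib-only illustration of the idea, not Tantivy's real IndexWriter; all names and the capacity constant are made up:

```rust
// Sketch of the indexing pipeline: documents go into a shared queue;
// indexing threads drain it into in-memory buffers and "serialize" a
// segment once the buffer reaches capacity. (Illustrative only;
// Tantivy's IndexWriter hides all of these details.)

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

const SEGMENT_CAPACITY: usize = 100; // docs per in-memory segment

fn run_indexing(num_docs: usize, num_threads: usize) -> Vec<Vec<String>> {
    let (tx, rx) = mpsc::channel::<String>();
    // Share the single consumer among workers. Holding the lock during
    // recv() serializes queue access, which is fine for a sketch.
    let rx = Arc::new(Mutex::new(rx));
    let serialized = Arc::new(Mutex::new(Vec::new()));

    let workers: Vec<_> = (0..num_threads)
        .map(|_| {
            let rx = Arc::clone(&rx);
            let serialized = Arc::clone(&serialized);
            thread::spawn(move || {
                let mut buffer: Vec<String> = Vec::new(); // the "dynamic data structure"
                loop {
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(doc) => {
                            buffer.push(doc);
                            if buffer.len() == SEGMENT_CAPACITY {
                                // Capacity reached: serialize a compact segment.
                                serialized.lock().unwrap().push(std::mem::take(&mut buffer));
                            }
                        }
                        Err(_) => break, // queue closed: flush the tail and exit
                    }
                }
                if !buffer.is_empty() {
                    serialized.lock().unwrap().push(buffer);
                }
            })
        })
        .collect();

    for i in 0..num_docs {
        tx.send(format!("doc-{}", i)).unwrap();
    }
    drop(tx); // no more documents: workers drain the queue and stop
    for w in workers {
        w.join().unwrap();
    }
    Arc::try_unwrap(serialized).unwrap().into_inner().unwrap()
}

fn main() {
    let segments = run_indexing(1000, 4);
    let total: usize = segments.iter().map(|s| s.len()).sum();
    println!("{} segments, {} docs total", segments.len(), total);
}
```

No document is lost and no segment exceeds its capacity; a background merge process (not shown) would then fold the small segments together.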
35:21
Thank you. So there is a bunch of pros and cons to this. One problem that is very evident when you use Tantivy, and I think to some extent
35:40
is still true with Lucene, is that when you add a document, it's not searchable right away. So it's quite puzzling for people who are used to SQL databases. Because we decided to have our segments be independent indexes, they do not share any dictionary.
36:01
And sometimes that makes specific uses of search difficult, because you are not able to merge the results from the different segments very easily. Yes. There is not one dictionary.
36:20
I didn't talk about deletes and updates, but those are a real nightmare. It's very complicated. Yes. That's pretty much all I have on this slide. And then the pros: the indexing throughput is actually excellent if you can batch stuff.
36:44
If you don't have to commit all of the time, you can re-index a gigantic amount of data. So on my laptop, indexing Wikipedia takes two or three minutes, and that's the whole of English Wikipedia. That's the speed we are talking about.
37:03
Wikipedia is like a tiny benchmark in search. It's not a big data set. Yes. So segments are independent indexes.
37:22
So you could decide to actually do your indexing on Hadoop and just copy the segment files into place, and everything will work just the same. Or you could decide to do a distributed search by sending the segments out and dispatching them across machines. It's really just copying files. It's very transparent.
37:43
And because we write files and then we never touch them again, we just write very large files, and then they are read-only. There is no problem of locks or anything. Readers just read the file, and nobody is touching them.
38:01
So it simplifies a lot of my work. I have a bunch of slides there. Let's keep them so that we have a bit of time to ask questions. Do you guys have questions? Yeah.
38:20
I'll try to say it out loud. Lucene is a little bit of a corner case for Java. It's basically a library which is difficult for Java to handle, because there's a lot of low-level stuff going on in there.
38:41
So what about your project and Rust? Do they fit together, or did you find trouble using Rust as well? So yeah, I wanted to see the little card. So the question is: is Rust and building a search engine
39:05
a good fit, I guess, especially in contrast with Lucene and Java? So I would say yes. So the main benefit that I get from Rust,
39:22
I guess, compared to Java, is first that I can access SIMD instructions. Second, I have a lot of control: I know when I write code whether I'm going to get static dispatch or dynamic dispatch. That's something that is extremely hard in Java. You never know what happens.
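The distinction can be sketched in a few lines of Rust, with hypothetical names: generics give static dispatch (the compiler monomorphizes a copy per concrete type and can inline the calls), while `dyn Trait` goes through a vtable at runtime:

```rust
// Sketch contrasting the two dispatch modes mentioned above.
// Names (Scorer, ConstScorer) are made up for illustration.

trait Scorer {
    fn score(&self, doc: u32) -> f32;
}

struct ConstScorer(f32);

impl Scorer for ConstScorer {
    fn score(&self, _doc: u32) -> f32 {
        self.0
    }
}

// Static dispatch: a specialized copy is compiled per concrete `S`,
// so `score` can be inlined into the hot loop.
fn sum_scores_static<S: Scorer>(scorer: &S, docs: &[u32]) -> f32 {
    docs.iter().map(|&d| scorer.score(d)).sum()
}

// Dynamic dispatch: one compiled copy; every call goes through a vtable.
fn sum_scores_dyn(scorer: &dyn Scorer, docs: &[u32]) -> f32 {
    docs.iter().map(|&d| scorer.score(d)).sum()
}

fn main() {
    let scorer = ConstScorer(1.5);
    let docs = [1, 2, 3, 4];
    println!("{}", sum_scores_static(&scorer, &docs));
    println!("{}", sum_scores_dyn(&scorer, &docs));
}
```

Both functions compute the same result; the difference is that in Rust you choose which machine code you get at the call site, while the JVM's JIT makes that decision for you.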
39:43
Also, all of the things like mmap being a nightmare in Java are something that I don't have any problem with. I don't know if everyone knows what I'm talking about, but basically, when you're working with off-heap memory in Java, like if you are mmapping a huge amount of data
40:04
and you have very little data that is on your heap, your data will be unmapped only at the moment when the object holding the mmap is garbage collected. And garbage collection happens only if your heap is actually kind of full.
40:26
I think that's a big problem with Java. So some JVMs give you an unofficial Java API to unmap stuff. Some people just decide to use JNI code to do the mmap.
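In Rust, by contrast, unmapping is deterministic: it happens in `Drop`, exactly when the owning value goes out of scope, not when a garbage collector eventually runs. Here is a stdlib-only sketch of that RAII pattern with a fake mapping type; a real wrapper, such as the memmap2 crate, releases the mapping the same way via `Drop`:

```rust
// RAII sketch: the resource is released exactly when its owner goes out
// of scope, not when a garbage collector eventually notices.
// `FakeMmap` is a made-up stand-in for a real memory mapping.

use std::sync::atomic::{AtomicBool, Ordering};

static UNMAPPED: AtomicBool = AtomicBool::new(false);

struct FakeMmap {
    // stand-in for a pointer + length into a mapped region
    len: usize,
}

impl Drop for FakeMmap {
    fn drop(&mut self) {
        // a real wrapper would call munmap(ptr, self.len) here
        UNMAPPED.store(true, Ordering::SeqCst);
        println!("unmapped {} bytes", self.len);
    }
}

fn main() {
    {
        let _map = FakeMmap { len: 4096 };
        assert!(!UNMAPPED.load(Ordering::SeqCst)); // still mapped
    } // <- Drop runs here, deterministically
    assert!(UNMAPPED.load(Ordering::SeqCst)); // already unmapped
}
```

This is the same ownership mechanism C++ destructors provide, which is why the off-heap problem that plagues the JVM simply does not arise.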
40:41
Well, I never have that kind of problem with Rust, just as I wouldn't in C or C++. When I started this project, I didn't know Rust at all. And I started Tantivy to actually learn Rust, which is a bit stupid and crazy.
41:03
So I already knew C++, and I had done a lot of Java as well. And after around two weeks, you don't have to believe me, but I was more productive in Rust. And I felt safer writing code in Rust than I was in both Java and C++.
41:24
Yeah? So tantivy is an English word. I often do that, actually. I like choosing English words that nobody knows
41:43
as project names. It's actually not bad at all for SEO. And yeah, people learn a new word, so everybody's happy, right? So tantivy means "at full gallop". Hence the horse. And I drew the horse myself.
42:00
I think that's one of the best achievements in the project.