Elasticsearch from the bottom up
Formal metadata

Title: Elasticsearch from the bottom up
Part: 85 of 119
License: CC Attribution 3.0 Unported: You may use the work or its content for any legal purpose, change it, and copy, distribute, and make it publicly available in unchanged or changed form, as long as you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/19937 (DOI)
Production place: Berlin
Transcript: English (automatically generated)
00:15
Thank you. So who here is using Elasticsearch already?
00:21
Awesome. So Elasticsearch is becoming quite popular these days. Whether it's for backing your app's search or your site search, or having your applications' and servers' logs all in one central place, Elasticsearch is gaining lots of mind share.
00:42
However, as a search engine, it's quite different from more traditional data stores. So this talk is about how a search engine works, and how a distributed one like Elasticsearch works in particular.
01:04
My name is Alex, I work for Found. We do hosted Elasticsearch as a service. My background from the university is within search, that's mostly what I've been doing ever since. And through Found, I've been in contact with hundreds of developers and have an impression
01:26
of what kinds of challenges they face when they go beyond the basic usage of Elasticsearch. So this is about the sort of background theory I have had great experience sharing with other developers.
01:46
So the kinds of questions you'll hopefully be better able to deduce the answer to are things like why isn't my search returning what I expect even if I search for exactly the same text as in my document?
02:03
Or how can it make sense that deleting documents doesn't immediately shrink the index but adding documents can cause it to be smaller? And why does Elasticsearch use so much memory?
02:20
So before I get into the good stuff, I just want to set some context around what we're going to talk about. This is sort of like an agenda in reverse. I'm going to first go in and then back out later on. So when you work with Elasticsearch, you have a cluster of nodes.
02:43
And within the cluster, you have lots of Elasticsearch indexes that can span multiple nodes through shards. And a shard is essentially a Lucene index. Lucene is the full text search library Elasticsearch is built on.
03:05
Elasticsearch makes Lucene's awesomeness available in a distributed setting. So this talk is also a lot about how Lucene works. And lots of Elasticsearch documentation sort of assumes some familiarity with Lucene as well.
03:24
So within a Lucene index, you have segments, which are sort of like mini-indexes. And within the segments, we have certain data structures like an inverted index, stored
03:45
fields, document values, and so on. And this is where we'll start. So the inverted index is the key data structure to understand when you work with search.
04:01
It consists of two parts: the sorted dictionary, which contains the index terms, and, for every term, a postings list, which is the list of documents containing the term. So when you do a search, you first operate on the sorted dictionary and then process
04:27
the postings. So if you have this quite simple document, you can index it by first lowercasing the text, removing some punctuation, and splitting, or tokenizing, on whitespace.
04:51
So when you want to search for the theory, for example, you first find the terms in the dictionary and then intersect or union the postings, depending on what kind of search
05:04
you want. So this is quite a basic example, but the principle is the same for all kinds of searches. First you operate on the dictionary to find candidate terms, and then operate on the postings.
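A minimal sketch of this structure in Python (a toy illustration, not Lucene's actual implementation): lowercase, strip punctuation, tokenize on whitespace, then map every term to the set of documents containing it.

    import re
    from collections import defaultdict

    docs = {
        1: "Winter is coming.",
        2: "Ours is the fury.",
        3: "The choice is yours.",
    }

    # Build the inverted index: term -> posting list of document ids.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.sub(r"[^\w\s]", "", text.lower()).split():
            index[token].add(doc_id)

    # Searching: look terms up in the dictionary, then intersect (AND)
    # or union (OR) the posting lists.
    print(sorted(index["the"] & index["fury"]))  # AND -> [2]
    print(sorted(index["the"] | index["is"]))    # OR  -> [1, 2, 3]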
05:23
So the terms you generate that end up in your index structure decide how you can search. Therefore, how you analyze and process the text is key when you work with search. You really need to understand the text processing that's happening.
05:48
So for example, if you want to do a prefix search, like in this case finding everything starting with a "c", or, in a more realistic case, things like autocompletion, you can
06:02
easily do so by doing a binary search in the dictionary. But if you want to, for example, find every term containing the substring "hour", you have to essentially go through every term in the index. And this is quite expensive and doesn't scale.
06:22
But it's what happens if you, for example, wrap wildcards around your search. So the right approach in this case would be to generate the proper terms. There are lots of different things you can do. What you have is the inverted index, so you want to transform the search problem until
06:45
it looks like a problem where you have to find some prefix. So if you want to search for suffixes, you can index the reversed text and search for the reversed suffix. For things like geolocations, Lucene will convert the data into a geohash, where
07:06
a longer prefix means more precision. And something similar is done for numerical data, because just indexing the strings one through three doesn't really allow for good numerical range searches.
07:23
So even things that don't appear to be about string prefix lookups get converted to it. So this ranges from the rather simple to the mind-bogglingly complex, which we won't really get into.
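To make the suffix example concrete, here is a toy sketch (my own illustration, not Lucene's code) of turning a suffix search into a prefix scan over a dictionary of reversed terms:

    import bisect

    def suffix_search(suffix, reversed_dictionary):
        # Reverse the suffix so it becomes a prefix of the reversed terms.
        prefix = suffix[::-1]
        i = bisect.bisect_left(reversed_dictionary, prefix)  # binary search
        matches = []
        while i < len(reversed_dictionary) and reversed_dictionary[i].startswith(prefix):
            matches.append(reversed_dictionary[i][::-1])
            i += 1
        return matches

    dictionary = ["devour", "hour", "sour", "theory"]
    reversed_dictionary = sorted(t[::-1] for t in dictionary)
    print(suffix_search("our", reversed_dictionary))  # ['hour', 'sour', 'devour']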
07:41
But it's an interesting story about how some really bright people figured out that they could use what are called Levenshtein automata to sort of go through and find misspellings in a really efficient way. And they found a Python library that they used to generate some Java code.
08:02
They didn't know exactly what was going on, but the tests proved it worked, and the benchmark said it was like 100 times faster. By now it's cleaned up, but it's just an example of the really hardcore things Lucene will do to make things insanely fast.
08:23
So when you work with search, text processing is really important. The inverted index is not very useful, however, when you want to look up the value given a document, like what's the title for document number two.
08:43
So to do that, there are other data structures, like stored fields, which is essentially a simple key-value store holding a data blob that you want to retrieve when you want to render the search results. By default, Elasticsearch will store the entire JSON source using this.
09:07
But even this kind of structure isn't very helpful when you need to read millions of values for a field, such as when you sort or facet or aggregate, because you would
09:20
be reading lots of data that you don't really need. So there's another structure called document values, which is sort of like a columnar store. It's highly optimized for storing values of the same type. So this is quite useful when you want to aggregate or sort on millions of values.
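As a sketch of how you would opt in (assuming the official Python client, elasticsearch-py, and hypothetical index/field names; in later Elasticsearch versions doc values became the default):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    # Store the author field as on-disk document values, so sorting and
    # aggregating on it does not have to fill the in-memory field cache.
    es.indices.put_mapping(index="articles", doc_type="article", body={
        "article": {
            "properties": {
                "author": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": True,
                }
            }
        }
    })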
09:47
If you don't specify that you want these document values, Elasticsearch will use what's called the field cache, which means that it'll load all the values for the field
10:01
in the entire index into memory. It'll be quite fast to use, but it'll use tons of memory. So these data structures, the inverted index, stored fields, document values and certain caches are chunked up into what's called segments.
10:26
So when Lucene searches across an index, it searches all the segments and merges the results. There are a few properties of segments that are quite important.
10:43
First, they are immutable, so they never change. So this means, for example, when you delete a document, there's a bitmap that marks the document as deleted, and Lucene will filter it out for every subsequent search.
11:06
But the segment itself doesn't change. So an update, for example, is essentially a delete followed by a re-index. So keep that in mind, for example, if you store things like rapidly updated counters
11:20
in your index. On the upside, however, Lucene can use all the tricks in the book to compress things. Lucene is really great at compressing data. And as it turns out, segments are a great scope for caches, and we'll get back to why.
11:49
So these segments get created in one of two ways. First, as you index new documents, Elasticsearch will buffer these documents, and then every
12:01
refresh interval, which defaults to every second, it will write a new segment, and the documents will become available for search. This, of course, means that over time, you'll get lots of segments.
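A sketch of tuning this trade-off (assuming the Python client and a hypothetical index name): a longer refresh interval means documents take longer to become searchable, but you produce fewer, larger segments and get better indexing throughput.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.put_settings(index="logs-2014.07.25", body={
        "index": {"refresh_interval": "30s"}  # the default is "1s"
    })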
12:23
So every now and then, Elasticsearch will merge them together. And during this process, deleted documents are finally completely removed. So that's why adding documents can cause the index to be smaller. It can trigger a merge, which causes more compaction.
12:46
So say you have these two segments that get merged, they'll then be completely replaced by the new segment. And we'll get back to it a bit later, but this new segment will, of course, have cold
13:03
caches. But the majority of the data is in the older, untouched segments at this point, which have warm caches. And this is key for Elasticsearch's real-time capabilities. As new data comes in, the amount of cache invalidation it has to do is quite limited.
13:31
So all this happens within a single Lucene index, which is a shard in the Elasticsearch
13:41
index, which is allocated across nodes in your cluster. So when you search these shards, it's pretty much the same as searching segments. You search them all and then merge things together.
14:02
But at this point, the searching can happen across different nodes. And as you merge data here, you need to transfer things across the network. One key thing to notice is that searching one Elasticsearch
14:27
index with two shards is essentially the same as searching two Elasticsearch indexes with one shard each. In both cases, you are searching across two shards, that is, two Lucene indexes.
14:41
So sharding and partitioning into different indexes are two different yet similar approaches to slicing up your data to prepare for handling massive amounts of data. You can easily fill a talk about different approaches to this, but one approach is so
15:03
common, it's worth mentioning. When you have log-like data with a timestamp, it's often a good idea to partition it into one index per day, for example. This will massively reduce the search space when you only need to search today's data,
15:24
for example, or last week's. And when you need to delete older data, you can simply delete the entire index. You don't have to have the documents marked as deleted and then eventually removed
15:42
later on. And also, the indexing performance on today's data isn't affected by the fact that you have old data in other indexes.
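A sketch of this pattern (assuming the Python client; the index names are hypothetical):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Index today's events into today's index.
    es.index(index="logs-2014.07.25", doc_type="event",
             body={"message": "hello", "@timestamp": "2014-07-25T12:00:00"})

    # Searching last week only touches last week's indexes, not all data.
    es.search(index="logs-2014.07.*", body={"query": {"match_all": {}}})

    # Retention is a whole-index delete, not millions of tombstones.
    es.indices.delete(index="logs-2014.06.25")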
16:04
So here we have multiple Elasticsearch indexes with two shards each. Shards are used to evenly distribute data across one index, in this case because you have too much data for one single node to cope with. So when you plan how you're going to scale, it's important to remember that you cannot
16:23
split a shard. You can easily add more nodes and move shards around, but you cannot turn one shard into two. While this might be possible in the future, the reason is that by the time you realize
16:44
you need more shards, you probably have a high enough load that adding the extra load of redistributing everything would be problematic. So it's important to plan ahead. So lots of people try to avoid the problem by saying, okay, I'm just going to make a thousand
17:04
shards and forget about the problem. But then you have lots of duplicated internal data structures like the dictionary. And there's also overhead to searching multiple shards. So you want to have a balance between having enough and having too few.
17:27
So these shards get allocated to nodes in your cluster. You can associate any attribute with the nodes, like "this node is running in data center A, in a certain rack" or "this is quite a powerful machine."
17:45
So you can do things like make sure there's a replica in every zone, or make sure this popular index is hosted on the more powerful machines.
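A sketch of zone-aware allocation (assuming nodes are started with a custom attribute, for example node.zone: zone_a in elasticsearch.yml, and the Python client):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    # Tell the cluster to spread a shard's copies across different zones,
    # so a replica survives the loss of any single zone.
    es.cluster.put_settings(body={
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": "zone"
        }
    })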
18:07
The cluster also has what's called the cluster state, which is replicated to all the nodes. It has things like the mappings, which are sort of like the schema that tells how a certain field has its text processed, for example. And it has the entire shard routing table.
18:20
So any node in the cluster knows how to route any search request. So at this point, we're essentially back on top, abstraction-wise. So we'll try to piece things together by looking at how a real search request is processed.
18:42
So say you have this search with a query. The query is of type filtered. It has a simple term filter and a match query across multiple fields.
19:05
We also have an aggregation on authors. We want the top 10 authors as well as the top 10 hits. And I also specify shard size, which is something I'll get back to.
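A hedged reconstruction of that request as it might look with the Python client (the index and field names are my assumptions, not from the slide):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.search(index="articles", body={
        "query": {
            "filtered": {                       # filtered query (Elasticsearch 1.x)
                "filter": {"term": {"tags": "python"}},
                "query": {
                    "multi_match": {            # match across multiple fields
                        "query": "holy grail",
                        "fields": ["title", "body"],
                    }
                },
            }
        },
        "aggs": {
            "top_authors": {
                "terms": {"field": "author", "size": 10, "shard_size": 100}
            }
        },
        "size": 10,                             # the top 10 hits
    })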
19:21
So this search request can be sent to any node in your cluster. That node becomes the coordinator for that search request. It'll decide which shards to route the request to based on what indexes you have specified
19:41
to search across and which replicas are available and so on. So it sends the request to the relevant shards. But before the search can actually be executed on the shard, there's a certain amount of rewriting that needs to happen.
20:04
Elasticsearch's query DSL is sometimes criticized for being quite verbose and deeply nested. I actually think it's quite awesome for precisely the same reasons. Its deeply nested structure makes it a lot easier to work with in code.
20:24
You don't have to compile this huge search string. And there's also quite a close match between how Elasticsearch defines its filters and queries and how the Lucene operators they end up being converted to work.
20:47
So your knowledge of Elasticsearch or Lucene will sort of go both ways. One exception to this rule, however, is the match family of queries.
21:00
And the match query is something you're going to become quite familiar with because it's the kind of query that will look up in the mapping and see how the text is processed. And as we remember, how text gets processed is really important when you deal with search.
21:20
And quite a common source of hair-pulling when you work with Elasticsearch is having incompatible text processing when you index and when you search. So when you do not get the results you expect, the text processing should be your first suspect.
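A quick way to check that suspect (the index and field names here are hypothetical): the _analyze API shows exactly which terms a piece of text is turned into for a given field.

    import requests

    # Ask Elasticsearch to analyze the text the way the `title` field does.
    r = requests.get(
        "http://localhost:9200/articles/_analyze",
        params={"field": "title", "text": "The Holy Grail"},
    )
    print(r.json())  # e.g. tokens ["the", "holy", "grail"], or fewer with stopwords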
21:44
But the match query does not exist in Lucene. So it's an Elasticsearch abstraction that makes different things quite a lot nicer than having to do it all yourself. What it would actually look like when converted to Lucene is something like this.
22:04
The match is actually converted to a Bool query that puts together the different fields. And the text, "holy grail" in this case, has been processed: it has been lowercased and so on.
22:21
If you were to configure your match query differently, say by specifying fuzziness, this would be rewritten to something with FuzzyQuery in the bottom. So at this point you have a Lucene query that can be run. It will be run on all the segments.
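To make the fuzziness example concrete, such a match query might look like this (a sketch; the field name is an assumption):

    body = {
        "query": {
            "match": {
                "title": {
                    "query": "holy garil",   # misspelled on purpose
                    "fuzziness": "AUTO",     # rewritten to Lucene FuzzyQuery clauses
                }
            }
        }
    }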
22:43
And at this point it matters what has happened before. Often you need to use the same filter, or the same fields you aggregate or sort on, across multiple requests. And Elasticsearch will cache these, as we remember, per segment.
23:06
So assuming these two red segments here are newly created because of new documents or a merge, it will have cold caches and the filters and fields will need to be reprocessed.
23:22
But the majority of the data is in the segments with warm caches. This is sort of the source for Elasticsearch's mind-boggling performance. When the filter and the fields are already in the cache, using them is really fast.
23:48
So filters are pretty much the same from search to search, and they can be cached as a really compact bitmap.
24:01
Queries, on the other hand, are scored: it's not just whether the document matches, but how well the document matches. So queries are not cached. If you need to do the same query over and over again, you should probably cache it in your application layer.
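A sketch of that rule of thumb (hypothetical field names): put exact yes/no criteria in the filter, which is cacheable, and keep the scoring work in the query.

    body = {
        "query": {
            "filtered": {
                # Cached per segment as a compact bitmap; no scoring.
                "filter": {"term": {"status": "published"}},
                # Scored per document; never cached by Elasticsearch.
                "query": {"match": {"title": "holy grail"}},
            }
        }
    }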
24:22
So knowing this, you should prefer to use filters when you can and use queries only when you need scoring. So this is run on all the segments within the Lucene index,
24:45
which is a shard in the Elasticsearch index. And the results get sent back to the search coordinator, and the amount of data transferred here can matter a lot.
25:06
By default, Elasticsearch will just ask for the IDs of the documents for the top hits, because it doesn't really need all the documents' sources; it just needs them for the top 10 results.
25:24
But this is quite different when you do aggregations. It's quite possible that an author that should be in the top 10, the global top 10, is in the 11th position of one of the shards. That's why we specified a shard size of 100
25:43
to make it less likely that that happens. Of course, it's still possible, so we always need to weigh and balance the amount of data you transfer against the precision you need. And this is inherent in any distributed aggregation.
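A toy illustration of the problem (plain Python, nothing Elasticsearch-specific): if every shard only reports its local top 2, a globally frequent term can be lost before the coordinator merges.

    from collections import Counter

    shard_1 = Counter({"ann": 50, "bob": 40, "eve": 39})
    shard_2 = Counter({"cat": 50, "dan": 40, "eve": 39})

    # Each shard only sends its local top 2 to the coordinator...
    merged = Counter()
    for shard in (shard_1, shard_2):
        merged.update(dict(shard.most_common(2)))

    # ...so eve (78 occurrences overall, the true winner) is never seen.
    print(merged.most_common(2))  # [('ann', 50), ('cat', 50)]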
26:08
So the coordinator has all the data, it merges it together, asks the shards again, "hey, can you please give me the source for these documents," and sends it back to you as the user.
26:27
So at this point, we have looked at the inverted index and seen how the index terms you generate largely dictate how you can search,
26:41
and that the text processing that generates these terms is quite important. We have looked at how a search happens per segment and how a segment has several data structures, some used when you search,
27:01
some used when you aggregate and so on. We've discussed the consequences of the segments being immutable, and that this can affect indexing performance. For example, when you need real-time search or when you need great indexing throughput,
27:22
you may want to, for example, adjust the refresh interval so you don't constantly merge new segments. We've seen how a shard is essentially the same as a separate Lucene index and that the Elasticsearch index
27:41
is generally just an abstraction on top of Lucene indexes. And you can combine them either as shards in one Elasticsearch index or across multiple indexes. And all of this, of course, spans the nodes in your cluster:
28:03
it's a distributed search engine, you can easily add nodes, but you need to also be aware of the kind of data being transferred between the nodes as you search. So this was intended to be an introduction
28:23
to different things I hope you want to learn more about. The talk is based on an article of the same name and you can find it in Foundation, which is our article collection about Elasticsearch.
28:40
We try to keep them as helpful as possible for anyone using Elasticsearch. It's not just for found customers. There's also an Elasticsearch meetup later today. It's here around six, I think.
29:00
So if you want to learn more about Elasticsearch, I hope to see you there. And if you have some questions, now's your time. Thank you.
29:27
No questions. I have a question about replication. So, for example, if I have some important documents that I would like to search even
29:41
if one of the nodes or several nodes go down, what's the recommended way to do it in Elasticsearch? You want to... You have documents indexed in replicas and a node goes down? Well, so I'm adding a new document to the index. What's the recommended way to add it
30:01
in such a way that one node failure doesn't take down the document? Okay, so... This talk wasn't that much about Elasticsearch in production. I used to do another talk about it. There's lots of different things to keep in mind when you run a cluster of any distributed system.
30:24
You want to have a majority of nodes available, for example, to avoid things like split-brains. You want to have replicas available in different... For example, if you're on Amazon, in different availability zones, to make sure that you always have a replica available when errors happen.
30:43
And in a distributed system, failure is guaranteed to happen. So, in any production configuration, you should have multiple nodes running on infrastructure that's not have any common failure points.
31:01
In bigger production clusters, you should make sure you have dedicated master nodes, for example. And you need to have at least three to have a majority in the event of a failure. A quite common setup is to have two nodes in your cluster,
31:21
with one replica each. But when there's a network partition between them, you cannot have a majority when you have just a single node and your cluster is composed of two.
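A sketch of the dedicated-master advice (assuming three master-eligible nodes and the Python client): requiring a majority of two before a master can be elected prevents split-brain after a partition.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.cluster.put_settings(body={
        "persistent": {
            # With 3 master-eligible nodes, a majority is 2.
            "discovery.zen.minimum_master_nodes": 2
        }
    })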
31:40
So there are lots of different aspects, and I'm happy to talk more about Elasticsearch in production afterwards. So, just come find me. Alright, thanks. Thanks a lot for the fascinating talk. What would you say about the code base of Elasticsearch? Is it worth reading through? How is the quality?
32:00
Would one actually learn something by looking through it? Yeah, it's quite a complicated system. I think the code quality is generally quite high compared to other search systems I've read. Lucene has really, really high quality code.
32:22
It's a bit higher than Elasticsearch's, I'd say, but Elasticsearch is still quite good. You can see the effects of the fact that you now have tons of new developers, which is good, but it's a code base I'd recommend looking at.
32:41
It's pure Java, right? Yeah. Okay, thanks a lot. One over there. Thanks. So, if I recall right, Lucene has this scoring formula to rank documents; how is this working between shards?
33:05
How is the ranking working between shards, when you have documents and term frequencies and inverse document frequencies, just like you showed? Yeah, I'll try to find the relevant slide here.
33:23
I still have a few minutes. So, when Lucene is scoring documents, it takes into account things like the frequency of the term.
33:41
For example, words like "the" and "in" don't add much value, but rarer words are considered to be more relevant. And so it tries to find the rare words in your query
34:03
and prioritize them, while sort of not caring too much about the really common words. But of course, these frequencies can be different across the different shards. So it's possible to tell Elasticsearch to,
34:23
before the search itself happens, have all the shards report the true frequencies so you can get more accurate scoring. But when it comes to actually ranking and scoring, I would pay just as much attention to things like function score,
34:44
where you can boost based on, for example, filters. You can say prefer new documents or prefer documents within a certain section of your content and so on. So, do not just judge relevancy out of the default relevancy that Lucene gives you,
35:04
but also look at all the tools Elasticsearch has to tweak your scoring.
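Both ideas in one hedged sketch (assuming the Python client and hypothetical field names): dfs_query_then_fetch gathers the global term frequencies before scoring, and function_score boosts newer documents.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.search(index="articles", search_type="dfs_query_then_fetch", body={
        "query": {
            "function_score": {
                "query": {"match": {"title": "holy grail"}},
                "functions": [
                    # Decay the score of older documents.
                    {"gauss": {"published": {"origin": "now", "scale": "30d"}}}
                ],
            }
        }
    })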
35:25
Do we have any more questions? Is it measurable, to compare it, if you just have a single Lucene index and you put all the documents in the single Lucene index, and then you have the same index across shards? Do you get the same results? Or is it different because of the statistics between them?
35:41
So, if your data can fit in a single shard and you don't need to scale it, for example, you should prefer to have a single shard. Storing it in two shards will be more than twice as expensive. So, usually you want to prefer having fewer shards.
36:00
When you search multiple shards, these frequencies can differ between the shards, so you can get different results. So, indexing everything into just a single shard can yield different results from having it in two. Usually, the differences shouldn't be huge.
36:21
And again, you probably want to also look a lot into function scoring, for example. Thank you. Any more questions? People are hungry. Thank you very much, Alex. Please give a round of applause.