
Elasticsearch from the bottom up

Transcript
Thank you. So, who here is using Elasticsearch already? Elasticsearch is becoming quite popular these days, whether it's for search in your apps, for web search, or for having your application and service logs all in one central place. Elasticsearch is gaining a lot of mindshare. However, as a search engine it's quite different from more traditional data stores, so this talk is about how a search engine works, and how a distributed one like Elasticsearch works in particular.

My name is Alex. I founded Found; we do hosted Elasticsearch as a service. My background from university is within search, and that's mostly what I've been doing ever since. Through Found I've been in contact with hundreds of developers, so I have an impression of what kinds of challenges they face when they go beyond the basic usage of Elasticsearch. What follows is the sort of background theory I have had great experience sharing with other developers.

These are the kinds of questions you'll hopefully be better able to deduce the answer to: why is my search not returning what I expect, even if I search for exactly the same text as in my document? Why does it make sense that deleting documents doesn't immediately shrink the index, while adding documents can cause it to become smaller? And why does Elasticsearch use so much memory?
Before I get into the good stuff, I just want to set some context around what we're going to talk about. This is sort of an agenda in reverse: we'll start at the bottom and back out again later on. When you work with Elasticsearch you have a cluster of nodes, and within the cluster you have Elasticsearch indexes that can span multiple nodes through shards. A shard is essentially a Lucene index. Lucene is the full-text search library that Elasticsearch is built on; Elasticsearch makes Lucene's awesomeness available in a distributed setting. So this talk is also very much about Lucene, and a lot of the Elasticsearch documentation assumes some familiarity with it. Within a Lucene index you have segments, which are sort of like mini-indexes, and within the segments live certain data structures: an inverted index, stored fields, document values, and so on.
The inverted index is the key data structure to understand when you work with search. It consists of two parts: the sorted dictionary, which contains the index terms, and, for every term, a postings list, which is the list of documents containing that term. When you do a search, you first operate on the sorted dictionary and then process the postings.

Given some quite simple documents, you make them indexable by first lowercasing the text, removing some punctuation, and splitting (tokenizing) on whitespace. When you then want to search for "the fury", for example, you first find the terms in the dictionary and then intersect or union their postings, depending on the kind of search. This is quite a basic example, but the principle is the same for all kinds of searches: first operate on the dictionary to find candidate terms, then operate on the postings. The terms you generate, the ones that end up in your index structures, decide how you can search. Therefore, how you analyze and process the text is key when you work with search; you really need to understand the text processing that's happening.

For example, if you want to do a prefix search, like finding everything starting with "c" (in a more realistic case, something like autocompletion), you can easily do so with a binary search in the dictionary. But if you want to find every term containing the substring "ours", you essentially have to go through every term in the index, which is quite expensive and doesn't scale. Yet that is exactly what happens if you, for example, wrap wildcards around your search term. The right approach is to generate the proper terms in the first place, and there are lots of different things you can do. The key idea is this: given that what you have is an inverted index, you want to transform your search problem until it looks like a string prefix problem.
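As a minimal sketch of these ideas, here is an inverted index in plain Python, built from illustrative documents in the spirit of the talk's examples, with prefix search done as a binary search on the sorted dictionary:

```python
import re
from bisect import bisect_left

docs = {
    1: "Winter is coming.",
    2: "Ours is the fury.",
    3: "The choice is yours.",
}

# Build the inverted index: lowercase, tokenize, map term -> postings list.
index = {}
for doc_id, text in docs.items():
    for term in re.findall(r"\w+", text.lower()):
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)

dictionary = sorted(index)  # the sorted term dictionary

def prefix_search(prefix):
    """Binary-search the dictionary, then union the postings of matching terms."""
    pos = bisect_left(dictionary, prefix)
    hits = set()
    for term in dictionary[pos:]:
        if not term.startswith(prefix):
            break
        hits.update(index[term])
    return hits

print(prefix_search("c"))  # {1, 3}: "coming" and "choice"
```

A leading wildcard defeats this: with no usable prefix, every term in the dictionary has to be scanned.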
If you want to search for suffixes, you can index the reversed text and search for the reversed term. For things like geolocations, Lucene will convert the data into a geohash, where a longer shared prefix means higher precision. Something similar is done for numerical data, because the strings "1", "2", "3" sorted lexicographically don't really help you with numerical range searches. So even things that don't appear to be about string prefix lookups get converted into them.

These techniques range from rather simple to mind-bogglingly complex. We won't really get into the complex ones, but there's an interesting story about how some really bright people realized they could use what are called Levenshtein automata to walk the dictionary and find misspellings really efficiently. They found a Python library that they used to generate some Java code; they didn't know exactly what was going on inside it, but the tests proved it worked, and benchmarks said it was something like a hundred times faster. By now it's been cleaned up, but it's a good example of the really hard things Lucene will do to make things insanely fast.
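Continuing the sketch above (it reuses the `index` built there), the suffix trick takes only a few lines: index every term reversed, and a suffix query becomes a prefix query on the reversed dictionary:

```python
from bisect import bisect_left

# Reuses `index` from the previous sketch.
reversed_index = {term[::-1]: postings for term, postings in index.items()}
reversed_dictionary = sorted(reversed_index)

def suffix_search(suffix):
    target = suffix[::-1]  # "ours" -> "sruo"
    pos = bisect_left(reversed_dictionary, target)
    hits = set()
    for term in reversed_dictionary[pos:]:
        if not term.startswith(target):
            break
        hits.update(reversed_index[term])
    return hits

print(suffix_search("ours"))  # {2, 3}: "ours" and "yours"
```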
So when you work with search, text processing is really important. The inverted index is not very useful, however, when you want to look up values given a document: what's the title of document number 2, say? For that there are other data structures, like stored fields, which is essentially a simple key-value store for data you want to retrieve when you render your search results. By default, Elasticsearch stores the entire JSON source in a stored field.

Even this kind of structure isn't very helpful when you need to read millions of values for a single field, such as when you facet or aggregate, because you would be reading lots of data you don't really need. So there's another structure called document values, which is sort of like a columnar store. It's highly optimized for storing many values of the same type, which makes it quite useful when you want to aggregate over millions of values. If you don't specify that you want document values, Elasticsearch will use what's called the field cache instead, which means it loads all the values for that field in the entire index into memory. That will be quite fast to use, but it will use tons of memory.
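As a hedged sketch of opting a field into document values, assuming a 1.x-era cluster, the official elasticsearch-py client, and made-up index and field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical mapping: the author field is kept as a single raw term and
# stored as doc values on disk, so aggregations need not fill the heap-based
# field cache.
es.indices.create(index="books", body={
    "mappings": {
        "book": {
            "properties": {
                "author": {
                    "type": "string",
                    "index": "not_analyzed",  # one term per value, unanalyzed
                    "doc_values": True,       # columnar, disk-backed storage
                }
            }
        }
    }
})
```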
These data structures (the inverted index, stored fields, document values, and certain caches) are chunked up into what's called segments. When Lucene searches across an index, it searches all the segments and merges the results. A few properties of segments are quite important. First, they are immutable: they never change. This means, for example, that when you delete a document, there's a bitmap that marks it as deleted, and Lucene just filters it out for every subsequent search; the segment itself doesn't change. An update, consequently, is essentially a delete followed by a re-index. Keep that in mind if you store things like rapidly updated counters in your index. The upside, however, is that Lucene can use all the tricks in the book to compress the data; Lucene is really great at compression. And as it turns out, segments are a great scope for caches, which we'll get back to. So how do segments get created? In one of two ways.
First, as you index new documents, Elasticsearch buffers them, and every refresh interval, which defaults to one second, it writes out a new segment, at which point the documents become available for search. This of course means that over time you get lots of small segments, so every now and then Elasticsearch merges some of them together, and it's during this merge process that deleted documents are finally, completely removed. That's why adding documents can cause the index to become smaller: it can trigger a merge, which causes more compaction. Say these two segments get merged: they are completely replaced by a new segment. We'll come back to this a bit later, but the new segment will of course have cold caches, while the majority of the data stays in the older, untouched segments, which have warm caches. This is key to Elasticsearch's near-real-time capabilities: as new data comes in, the amount of cache invalidation it has to do is quite limited.

All of this happens within a single Lucene index, which is a shard in an Elasticsearch index, which in turn is allocated across the nodes in your cluster. Searching across shards is pretty much the same as searching across segments: you search them all and merge the results, except that now the search can happen across different nodes, so as you merge data you may need to transfer things across the network. One key thing to note: searching one Elasticsearch index with two shards is essentially the same as searching two Elasticsearch indexes with one shard each. In both cases you're searching across two shards, that is, two Lucene indexes. So shard routing and partitioning into different indexes are two different yet similar approaches to slicing up your data in preparation for handling massive amounts of it.

You could easily fill a talk with different approaches to this, but one approach is so common it's worth mentioning. When you have a lot of time-based data, like logs, it's often a good idea to partition it into one index per day, for example. This massively reduces the search space when you only need to search today's data, or the last week's. When you need to expire old data, you simply delete an entire index; you don't have to delete individual documents, which would merely be marked as deleted and only eventually removed. And the indexing performance of today's index isn't affected by all the old data sitting in the other indexes. So in this picture we have multiple Elasticsearch indexes, each with two shards: the indexes partition the data by time, and the shards evenly distribute each index's data across nodes, for when a single node can't cope with it all.
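Here is a sketch of that index-per-day pattern using the elasticsearch-py client; the index names, the 30-second refresh interval, and the 90-day retention are illustrative choices, not values from the talk:

```python
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch()
today = date.today()

# Write to today's index; a relaxed refresh interval trades freshness
# for fewer, larger segments.
es.indices.create(
    index="logs-%s" % today.isoformat(),
    body={"settings": {"refresh_interval": "30s"}},
    ignore=400,  # index may already exist
)

# Searching only the last week's indexes massively reduces the search space.
last_week = ",".join(
    "logs-%s" % (today - timedelta(days=n)).isoformat() for n in range(7)
)
es.search(index=last_week, ignore_unavailable=True,
          body={"query": {"match": {"message": "timeout"}}})

# Retention: dropping a whole index is cheap, unlike deleting documents.
es.indices.delete(
    index="logs-%s" % (today - timedelta(days=90)).isoformat(),
    ignore=404,
)
```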
When you plan how you're going to scale, it's important to remember that you cannot split a shard. You can easily add more nodes and move shards around, but you cannot divide one shard into two. While this might be possible in the future, the reason is that by the time you realize you need more shards, you probably have a high enough load that the extra load of redistributing everything would be problematic. So it's important to plan ahead. Lots of people try to dodge the problem by just making, say, a thousand shards and forgetting about it, but then you have lots of duplicated internal data structures, like the term dictionary, and there's overhead to searching many shards too. So you want a balance: enough shards to grow into, but not so many that the overhead hurts.
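Since the shard count is fixed at index creation, planning ahead simply means choosing it up front. A minimal sketch, with an invented index name and an arbitrary choice of two shards:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# number_of_shards cannot be changed later without reindexing, so it is
# set when the index is created; two is just an example.
es.indices.create(index="products", body={
    "settings": {"number_of_shards": 2, "number_of_replicas": 1}
})
```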
These shards get allocated to the nodes in your cluster. You can associate attributes with the nodes, like "this node runs in data center A" or "this is a quite powerful machine", and then you can do things like make sure there's a replica in every zone, or make sure a popular index is hosted on the more powerful machines.
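A sketch of what this can look like as settings, assuming the nodes were started with custom attributes (for example node.zone and node.box_type in elasticsearch.yml); the attribute values and index name are invented:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Spread replicas across zones: a shard and its replica will not both be
# allocated to nodes sharing the same value of the "zone" attribute.
es.cluster.put_settings(body={
    "persistent": {"cluster.routing.allocation.awareness.attributes": "zone"}
})

# Pin a popular index to the more powerful machines.
es.indices.put_settings(index="popular-index", body={
    "index.routing.allocation.include.box_type": "big"
})
```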
The cluster also has what's called the cluster state, which is replicated to all the nodes. It holds things like the mappings, which are sort of like a schema saying how a certain field's text is to be processed, and the entire shard routing table, so any node in the cluster knows how to route any search.

At this point we're essentially back on top of the abstraction layers, so let's piece things together by looking at how a real search request is processed. We have a search with a query; the query is of type filtered, with a simple terms filter and a match query across multiple fields. We also have an aggregation on authors, where we want the top 10 authors as well as the top 10 hits, and I've also specified a shard_size, which is something we'll get back to.
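A reconstruction of the kind of request just described, a filtered query with a terms filter and a multi-field match plus a top-10 authors aggregation with an oversized shard_size; the index and field names are assumptions, and the shape follows the 1.x-era query DSL:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
response = es.search(index="books", body={
    "query": {
        "filtered": {
            "filter": {"terms": {"tags": ["fantasy"]}},
            "query": {
                "multi_match": {
                    "query": "holy grail",
                    "fields": ["title", "body"],
                }
            },
        }
    },
    "aggs": {
        "authors": {
            "terms": {
                "field": "author",
                "size": 10,         # the global top 10 we want back...
                "shard_size": 100,  # ...but ask each shard for its top 100
            }
        }
    },
    "size": 10,
})
```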
This search request can be sent to any node in your cluster. That node becomes the coordinator for the request: it decides which shards to route it to, based on which indexes you've specified to search across and which replicas are available, and it sends the request to the relevant shards.

But before the search can actually be executed on the shards, a certain amount of rewriting needs to happen. Elasticsearch's query DSL is sometimes criticized for being quite verbose and deeply nested. I actually think it's quite awesome, for precisely the same reason: the nested structure makes it a lot easier to work with in code, since you don't have to compile some huge search string, and there's also quite a close match between how Elasticsearch defines a filter or query and how the Lucene operators they get converted to actually work. So your knowledge of Elasticsearch and of Lucene will sort of grow together.

One exception to this rule is the match family of queries. The match query is something you're going to become quite familiar with, because it's the kind of query that looks up in the mapping how the text is processed; and we remember that how text gets processed is really important when dealing with search. Quite a common source of problems when you work with Elasticsearch is having incompatible text processing at index time and at search time, so when you don't get the results you expect, the text processing should be your first suspect. The match query does not exist in Lucene: it's an Elasticsearch abstraction that makes many different things quite a lot nicer than doing them yourself. What it actually looks like when converted to Lucene is a bool query that ORs together the different fields, where the query text, "holy grail" in this case, has been processed: it has been lowercased and so on. If you configure your match query differently, say with a different operator, it is rewritten accordingly, for example into a bool query with required (+) clauses.
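One way to peek at this rewriting yourself is the validate API, which can return the query as rewritten for Lucene; a sketch using the assumed index and field from above, with the printed output only indicative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.indices.validate_query(
    index="books",
    body={"query": {"match": {"title": "Holy Grail"}}},
    explain=True,
)
# Shows the analyzed, rewritten Lucene query,
# e.g. something like: title:holy title:grail
print(result["explanations"][0]["explanation"])
```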
At this point we have a Lucene query that can be run. It runs on every segment, and now it matters what has happened before. Often you'll use the same filters and aggregate on the same fields across many requests, and Elasticsearch caches these, per segment, as we remember. Assuming the two segments shown in red here are newly created, because of new documents or a merge, they will have cold caches, and the filters and fields must be processed for them. But the majority of the data lives in the older segments with warm caches, and this is sort of the source of Elasticsearch's mind-boggling performance: when the filters and the fields are already in the cache, using them is really fast.
Filters are pretty much the same as queries, except that they can be cached as really compact bitmaps, because a filter only says whether a document matches. Queries are scored: it's not just whether a document matches, but how well it matches, to what degree. So queries are not cached; if you need to run the same query over and over again, you should probably cache it in your application layer. Knowing this, you should prefer filters when you can, and use queries only when you need scoring.
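A small sketch contrasting the two, with an invented field name: the same condition expressed as a filter, cacheable and unscored, versus as a query, scored on every request (1.x-era DSL):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Preferred: a filter, which Elasticsearch can cache as a compact bitmap.
filtered = {"query": {"filtered": {
    "filter": {"term": {"status": "published"}}
}}}

# Scored alternative: only worth it when relevance actually matters here.
scored = {"query": {"term": {"status": "published"}}}

es.search(index="books", body=filtered)
es.search(index="books", body=scored)
```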
The query runs on all the segments within each Lucene index, that is, each shard in the Elasticsearch index, and the results get sent back to the coordinating node to be merged. When searching across shards, the amount of data transferred can matter a lot. By default, Elasticsearch will only ask the shards for the IDs of their top hits, because it doesn't need all the document sources; it only needs them for the overall top 10 results. But this is quite different when you aggregate: it's quite possible that an author who belongs in the global top 10 sits in 11th position on one of the shards. That's why we specified a shard_size of 100, to make that less likely; of course, it's still possible. You always need to weigh the amount of data transferred against the precision you need, and this trade-off is inherent in any distributed aggregation. Once the coordinator has all the data, it merges it together, asks the shards again, "for these document IDs, please give me the sources", and sends the result back to the user.
So, to recap what we've been through. We looked at the inverted index and saw how the index terms you generate largely dictate how you can search, and therefore how important the text processing that generates those terms is. We looked at how searching happens per segment, and how a segment holds several data structures, some used when you search, some used when you aggregate, and so on. We discussed the consequences of segments being immutable, and how that can affect your indexing performance: for example, when you need near-real-time search or heavy indexing throughput, you may want to adjust the refresh interval or how Lucene merges segments. We've seen how a shard is essentially the same as a separate Lucene index, and that the Elasticsearch index is generally just an abstraction on top of multiple Lucene indexes; you can combine them either as shards within one index or as a search across multiple indexes. And at that point you're going across the nodes in the cluster: it's a distributed search engine, and you can easily add nodes, but you need to be aware of the kind of data being transferred between the nodes when you search.
This was intended as an introduction to things I hope you'll want to learn more about. The talk is based on an article of the same name, which you can find in Foundation, our article collection about Elasticsearch; we try to keep the articles as helpful as possible for anyone using Elasticsearch, not just Found customers. There's also an Elasticsearch meetup later today, here at around six, I think, so if you want to learn more about Elasticsearch, I hope to see you there. And if there are questions, we have some time now.
Q: I have a question about replication. If I have some important documents that I would like to remain searchable even if one of the nodes goes down, what's the recommended way to do that in Elasticsearch? When I add a new document, how do I add it in such a way that a single node failing doesn't lose it?

A: This talk wasn't much about running Elasticsearch in production; I've done another talk about that. There are lots of things to keep in mind when you run a cluster of any distributed system. You want a majority of nodes to be available, for example, to avoid things like split-brain. You want replicas available in different places, for example in different availability zones, to make sure you always have a replica available when failures happen, and in a distributed system failure is guaranteed to happen. So in any production configuration you should have multiple nodes, running on infrastructure without any common failure points. In a production cluster you should have dedicated master-eligible nodes, for example, and you need at least three of them to have a majority in the event of a failure. A quite common setup is to have just two nodes in a cluster, with one replica each, but when there is a network partition between them, neither side can have a majority, because the cluster is composed of two nodes. There are lots of different aspects to this, and I'm happy to talk more about Elasticsearch in production afterwards.
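As a minimal sketch of that advice, with invented host and index names: a replica so each shard survives a node failure, and a quorum setting for the three master-eligible nodes to avoid split-brain (the YAML line lives in elasticsearch.yml, so it's shown as a comment):

```python
from elasticsearch import Elasticsearch

# Point the client at several nodes so requests survive one of them dying.
es = Elasticsearch(["es-node-1:9200", "es-node-2:9200", "es-node-3:9200"])

# One replica keeps a full copy of every shard on another node.
es.indices.create(index="important-docs", body={
    "settings": {"number_of_shards": 1, "number_of_replicas": 1}
})

# In elasticsearch.yml on each of the three master-eligible nodes:
#   discovery.zen.minimum_master_nodes: 2   # majority of 3, avoids split-brain
```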
Q: Thanks for the fascinating talk. What would you say about the code quality of Elasticsearch? Is it worth reading through; would one actually learn something by looking through it?

A: It's quite a complicated system, and I think the code quality is generally quite high compared to other search servers. Lucene has really, really high-quality code, a bit higher than Elasticsearch, I'd say. You can see the effect of having tons of new developers, which is good, but it's a huge code base, so I'd recommend looking at specific parts rather than diving into all of it.

Q: If I recall correctly, Lucene has a standard procedure for ranking documents. How does this work between shards, given that things like document counts and term frequencies are per shard?

A: When Lucene scores documents, it takes into account things like the frequency of a term. Words like "in" and "at" don't add much value, while rare words are considered more relevant, so it tries to find the rare words in your query and prioritize them, while not caring too much about the really common ones. Of course, these frequencies can differ across shards, so it's possible to tell Elasticsearch to do an extra round trip before the search itself happens, asking all the shards to report their true frequencies, so you get more accurate scoring. But when it comes to actually ranking and scoring, I would pay just as much attention to things like the function_score query, where you can boost based on, for example, filters: you can say "prefer new documents", or "prefer documents within a certain section of your content". So don't just rely on the default relevancy model Lucene uses; also look at all the tools Elasticsearch has for tweaking your scoring.
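A sketch combining the two ideas just mentioned, with invented field names and 1.x-era syntax: dfs_query_then_fetch for globally accurate term frequencies, and function_score boosting on a filter and on recency:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
body = {
    "query": {
        "function_score": {
            "query": {"match": {"body": "holy grail"}},
            "functions": [
                # Boost documents from a particular section...
                {"filter": {"term": {"section": "news"}}, "boost_factor": 2},
                # ...and prefer newer documents (score decays with age).
                {"gauss": {"published": {"scale": "30d"}}},
            ],
        }
    }
}

# The extra round trip gathers global term frequencies before scoring.
es.search(index="books", body=body, search_type="dfs_query_then_fetch")
```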
Do we have any more questions?

Q: Is it measurable? If you put all the documents into a single Lucene index, versus spreading the same documents across shards, do you get the same results, or do they differ because the statistics differ between shards?

A: If your data fits in a single shard and you don't need to scale out, you should prefer a single shard; storing it in two shards will be more than twice as expensive, so usually you want fewer shards. And when you search multiple shards, the frequencies can differ between them, so you can get different results: indexing everything into a single shard can yield a different ranking than having it in two. Usually it shouldn't be a huge difference, and again, you'll probably also want to look a lot into function_score.

Thank you.

Metadata

Formal Metadata

Title Elasticsearch from the bottom up
Series Title EuroPython 2014
Part 85
Number of Parts 120
Author Brasetvik, Alex
License CC Attribution 3.0 Unported:
You may use, change and reproduce the work or content in changed or unchanged form for any legal purpose, and distribute and make it publicly available, provided you name the author/rights holder in the manner specified by them.
DOI 10.5446/19937
Publisher EuroPython
Publication Year 2014
Language English
Production Place Berlin

Content Metadata

Subject Area Computer Science
Abstract Alex Brasetvik - Elasticsearch from the bottom up

This talk will teach you about Elasticsearch and Lucene's architecture. The key data structure in search is the powerful inverted index, which is actually simple to understand. We start there, then ascend through abstraction layers to get an overview of how a distributed search cluster processes searches and changes.

-----

## Who I am and motivation
I work with hosted Elasticsearch and have interacted with lots of developers. We see what many struggle with. Some relevant theory helps a lot. What follows has already led to many "Aha!" moments and to developers piecing things together themselves.

## The inverted index
The most important index structure is actually very simple. It is essentially a sorted dictionary of terms, with a list of postings per term. We show three simple sample documents and the resulting inverted index.

## The index term
The index term is the "unit of search", and the terms we make decide how we can search. With the inverted index and its sorted dictionary, we can quickly search for terms given their prefix.

## Importance of text analysis
Thus, we need to transform our search problems into string prefix problems. This is done with text analysis, which is the process of making index terms. It is highly important when implementing search.

## Building indexes
The way indexes are built must balance how compact an index is, how easily we can search in it, how fast we can index documents - and the time it takes for changes to be visible. Lucene, and thus Elasticsearch, builds them in segments.

## Index segments
A Lucene index consists of index segments, i.e. immutable mini-indexes. A search on an index is done by doing the search on all segments and merging the results. Segments are immutable: this enables important compression techniques. Deletes are not immediate, just a marker. Segments are occasionally merged to larger segments; then documents are finally deleted. New segments are made by buffering changes in memory and writing them out when flushing happens. Flushes are largely caused by refreshing every second, due to real-time needs.

## Caches
Caches like filter and field caches are managed per segment. They are essential for performance. Immutable segments make for simple reasoning about caches. New segments cause only partial cache invalidation.

## Elasticsearch indexes
Much like a Lucene index is made up of many segments, an Elasticsearch index is made up of many Lucene indexes. Two Elasticsearch indexes with 1 shard each are essentially the same as one Elasticsearch index with 2 shards: search all shards and merge. Much like segments, but this time possibly across machines. Shard/index routing enables various partitioning strategies. Simpler than it sounds, so one important example: essential for time-based data, like logs, where you can efficiently skip searching entire indexes - and roll out old data by deleting the entire index.

## Common pitfalls
We must design our indexing for how we search - not the searches for how things are indexed. Be careful with wildcards and regexes. Since segments are immutable, deleting documents is expensive while deleting an entire index is cheap. Updating documents is essentially a delete and re-index; heavy updating might cause problems. Have enough memory, and then some: Elasticsearch is very reliant on its caches.

## Summary
We've seen how index structures are used, and why proper text processing is essential for performant searches. Also, you now know what index segments are, and how they affect both indexing and searching strategies.

## Questions
Keywords EuroPython Conference
EP 2014
EuroPython 2014
