Beyond the basics with Elasticsearch
Formal Metadata

Title: Beyond the basics with Elasticsearch
Part Number: 65
Number of Parts: 173
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/20127 (DOI)
Production Place: Bilbao, Euskadi, Spain
Transcript: English (auto-generated)
00:06
So hello everyone and good morning. I'm here to talk about what's beyond the basics with Elasticsearch. I work for Elastic, the company behind it, so we've seen a lot of use cases, and some of them actually surprised us and
00:23
definitely surprised many people who are familiar with Elasticsearch as sort of the full-text search solution. But before we get beyond the basics, we first need to know what the basics are. So, super quickly, this is where we come from: we are a search product. It's an open source
00:41
search product, and search is not a new thing. It's been around for a long while, and the basic theory, the really down-to-earth basics, haven't changed that much since those times. We still use the same data
01:00
structures. We still use the same data structure that you find at the end of any book: the index, specifically the inverted index, which looks something like this. It looks the same in a book as it does in a computer. It is a list of words that actually exist somewhere in our data set. Notice that they're
01:22
sorted, and for each of these words we have, again sorted, the list of documents or files (or pages, when it's a book) where these words actually exist. And we have some additional information stored there too. For example, how many files
01:44
actually contain the word Python, how many times is it present in file one, at what positions, and so on. That information, those statistics, will be very important for us as we go on through the talk. So this is
02:04
the data structure that we use. So how does search work then? Well, it's super simple. If we're looking for Python and Django, it's the same search that you would do if you were looking for those things in a book. You locate the entry for Django and the entry for Python. You can do that
02:22
efficiently both as a computer and as a person because, again, it's sorted. Then you just walk the lists, and if you find a file or a document that is present in both lists, that's your result. Naturally, if you want to do an OR search instead of AND, you just take everything from both lists.
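A rough sketch of that idea in plain Python; this is a toy illustration of the concept, not how Lucene actually stores or walks its postings:

```python
# Toy inverted index: each term maps to a sorted list of document ids.
index = {
    "django": [1, 3, 7],
    "python": [1, 2, 3, 8],
    "snake":  [2, 8],
}

def search_and(*terms):
    """AND search: documents present in every term's posting list."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def search_or(*terms):
    """OR search: documents present in any term's posting list."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))

print(search_and("python", "django"))  # [1, 3]
print(search_or("python", "django"))   # [1, 2, 3, 7, 8]
```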
02:45
But that's not enough, because this only tells you what matches; it doesn't give you the most important thing for us, which is how well it matches. What is the difference between
03:01
the Django book that talks specifically about Python and Django, and the biography of Django Reinhardt that mentions in one passage that he had an encounter with a python, the snake? Obviously there is a big difference between those two books. And the difference is in relevancy. It is a
03:24
numerical value, a score, essentially telling you how well a given document matches a given query. A lot of research has gone into how best to calculate that score, and again, it hasn't changed that much since the beginning.
03:45
At the core of it there is still the TF-IDF formula. Those are fancy words, fancy abbreviations: term frequency and inverse document frequency. It essentially represents how rare the word we are looking for is, and how many times
04:05
we have found it in the document. So if you find the word "the" in a document, that doesn't really mean much; every document in the world, if we're talking English, will have the word "the" in it.
04:24
That's not good information, and the IDF, the inverse document frequency, is the part that tells you this is not a specific word, it's in almost every document. If you however find the word "framework" or
04:42
something like that, that is fairly specific; that's the IDF part. The TF part is just how many times you found it there: if it's only mentioned once in a book, that doesn't mean much, but if it's there a hundred times, that probably means more. And we can keep building on top of that.
05:03
Lucene, for example, adds another factor to it, which is a normalization for the length of the field. That's essentially the equivalent of saying: yes, there is a fish somewhere in the ocean; probably true, not really that relevant or surprising. But if you have a bucket
05:24
of water and you say there is a fish in it, that is much more actionable information. So that's the second part of it, the normalization for the field length: if you find something in a super long field, okay; if you find it in a much shorter field, for example the title, that probably means much more.
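In rough terms, the weight of a term t in a document d combines exactly those ingredients (a simplified sketch of the classic weighting, not the exact formula Lucene uses, which adds further boosts and factors):

\[
\mathrm{weight}(t, d) \;\approx\; \underbrace{\mathrm{tf}(t, d)}_{\text{occurrences of } t \text{ in } d} \times \underbrace{\log \frac{N}{\mathrm{df}(t)}}_{\text{IDF: rarity of } t} \times \underbrace{\frac{1}{\sqrt{\text{field length}}}}_{\text{length norm}}
\]

where N is the total number of documents and df(t) is how many of them contain t.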
05:43
So already we have a formula baked into Lucene and into Elasticsearch that does very well for text and for search. But sometimes even that is not enough: for example, you're not dealing with text but with numerical information, or you have some
06:04
additional information that Elasticsearch is not aware of. For example, you have the quality of a document, you have some user-contributed value, or somebody even paid you to promote this piece of content, or
06:22
you want to penalize or favor things based on a distance, let's say from a geolocation, or a distance from some numerical range. So how do you do that? We have a few ways of expressing that, and the
06:41
best way to show it is with an example. This is a standard query for Elasticsearch, and it's using the function_score query type. The function_score query takes a regular query: normally we're looking for a hotel, a hotel called the Grand Hotel. So
07:02
far so good. Then we want that hotel to have a balcony; we want a balcony in our room. But we don't want to just filter down to the hotels that have balconies, because then we would be robbing ourselves of the opportunity to
07:21
discover something else. If a hotel has a balcony, we want to favor it: we will just add two to the score, so all the hotels with balconies will be towards the top. Then we want the hotel to be in
07:41
central London, within one kilometer of the center. If it's within one kilometer, it's a perfect match; the further away from that it gets, the more the score decreases. It will still match, but the score will be smaller.
08:00
Again, that means the hotel that perfectly matches our criteria will be at the top, but if we have a super good match outside that radius, it will still show up. Then we also have the popularity, how happy people have been with the hotel, and we want to take that into account. So we have a
08:23
special thing called field_value_factor, which is essentially just telling Elasticsearch that there is a numerical value in the document that reflects its quality, so fold it into the score. And finally we add some random numbers. This is actually taken from a real-
08:42
life example, because people use this to mix things up a little bit, to give users a chance to discover something new, something they wouldn't otherwise see. So all of these things together will make sure that you find your perfect hotel. We're not limiting your choices: even though you say
09:02
you want a balcony, we will still show you the hotel that is almost perfect for you except for the balcony part. We're also not just sorting by popularity, where something that's really not that good a match but is really popular would be at the top. We're just taking all these factors and
09:22
combining them together. So this is one of the main things we can do with the score and how we can use it in a more advanced way: take all the factors that go into the perfect result and combine them.
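A sketch of such a query as it might look from Python; the index and field names here are made up for illustration, and the call assumes the elasticsearch-py client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

query = {
    "query": {
        "function_score": {
            # the regular full-text part: we want the Grand Hotel
            "query": {"match": {"name": "Grand Hotel"}},
            "functions": [
                # favour (don't require) hotels with a balcony
                {"filter": {"term": {"features": "balcony"}}, "weight": 2},
                # perfect score within ~1 km of central London, decaying beyond
                {"gauss": {"location": {"origin": "51.5072,-0.1276",
                                        "offset": "1km", "scale": "2km"}}},
                # fold a stored popularity/rating value into the score
                {"field_value_factor": {"field": "rating", "missing": 1}},
                # a pinch of randomness so users discover new hotels
                {"random_score": {}},
            ],
            "score_mode": "sum",   # combine all the factors together
        }
    }
}

hits = es.search(index="hotels", body=query)
```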
09:41
You're not limited to picking one factor and sorting by it; you can combine them all together, and then it's just a matter of figuring out what these numbers are supposed to be and what will actually give your application the best results. Some people actually use machine learning
10:02
techniques to figure out the best ones; they have a training set and everything. It's not that hard, because you only have a limited number of options and typically those are just numerical values, so if you know what a good match looks like, you can actually train
10:21
the perfect query for you. So this is if you're doing search, when you already know what you're looking for. But sometimes it's the other way around: sometimes you don't have the
10:43
document, but you have the query, and you want to find the documents. Imagine that you want to do something like alerting or classification. For example, you're indexing documents, you're indexing stock prices, and you want to be alerted whenever a stock price rises above a
11:03
certain value. Sure, you could keep running a query in a continuous loop and see if there is something new, but what we can do instead, with the percolator feature of Elasticsearch, is to actually index that query into Elasticsearch, and then we just show
11:23
it a document and it will tell us all the queries that matched. That is very powerful, especially because it can use all the features of Elasticsearch.
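A sketch of what that looks like with the percolator API of that era, reusing the es client from the earlier sketch; the index, field and id names are invented, and newer Elasticsearch versions expose the same idea through a percolator field type and a percolate query instead:

```python
# Register a stored query: "alert me when a price goes above 100".
es.index(index="stocks", doc_type=".percolator", id="acme-above-100",
         body={"query": {"range": {"price": {"gt": 100}}}})

# Later, show the percolator a new document and ask which queries match.
result = es.percolate(index="stocks", doc_type="quote",
                      body={"doc": {"symbol": "ACME", "price": 105.4}})
# result["matches"] lists the registered queries this document triggered.
```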
11:40
That's the alerting use case, sort of a stored-search functionality: if you supply your users with search functionality and you want them to be able to store a search and then be alerted whenever there is a new piece of content that matches it, with the percolator you get that essentially for free. You just index their query, and whenever there is a new piece of content you run it
12:01
by the percolator, and it will tell you: hey, you should probably send an email to that user that was here the other day, he was really interested in this. That's the stored search. You can also use it to do a live search. If you've ever been on a website, you did some
12:22
searching, you were looking through the results, and suddenly there was a pop-up saying there are five new documents matching your query since you started looking: again, easy. Once you execute a query, you also store it as a percolator, and then whenever there is a new piece of content during that time, you can just push it to the browser and say: hey,
12:44
there are new, more recent results. So again, something that's otherwise fairly hard to do, or would require some busy loop, you can do this way. But we'll go a little bit further than that and look at the classification use case, which is essentially using
13:05
percolation to enrich the data in your documents. Imagine that you're trying to index events, and all you have as far as location goes is a pair of coordinates, and you want to find the address. This is something that's easy
13:25
to do the other way around: if you have the address and you want to find all the events in that location, you just do a geo-shape filter, looking for something that falls within this shape, within the shape of the
13:40
city of Warsaw, and that's a super simple search. So with the percolator we can turn it into a super simple reverse search. Let's say we get our hands on a data set with all the cities in Europe, or in the world; it's not that many. We index the cities into an index, so we don't
14:00
have to construct the polygon every single time; we store it in an index called shapes, under the type city. Then we create a query for each city and register it under a name, and when a document comes along and its coordinates, the field location, fall within that shape, we will know that it
14:26
is actually happening in Warsaw, Poland. So something that is super simple to do one way but difficult to do the other, we can do with percolation, essentially just using brute force, but in a smart way, and
14:43
outsourcing the brute force to Elasticsearch; we can do it very effectively and in a distributed fashion. So that's geo classification.
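A sketch of that setup with the same era of API and the same es client as above; the polygon coordinates are placeholders, the names are invented, and the geo_shape query simply references the pre-indexed city shape:

```python
# Index the city polygon once, so we never have to rebuild it per query.
es.index(index="shapes", doc_type="city", id="warsaw", body={
    "shape": {
        "type": "polygon",
        # placeholder coordinates, not the real outline of Warsaw
        "coordinates": [[[20.8, 52.1], [21.3, 52.1], [21.3, 52.4],
                         [20.8, 52.4], [20.8, 52.1]]],
    }
})

# Register one percolator query per city: "does the event's location
# fall inside this city's shape?"
es.index(index="events", doc_type=".percolator", id="city-warsaw", body={
    "query": {
        "geo_shape": {
            "location": {
                "indexed_shape": {"index": "shapes", "type": "city",
                                  "id": "warsaw", "path": "shape"}
            }
        }
    }
})
```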
15:08
The other thing that's easy to search for but not that easy to do the other way around is usually language classification. Any language has a few words that are super specific to that language; they don't exist in any other. These are some of the examples (this is essentially just a test of how many Polish people there are in the audience), and the assumption here is
15:25
that if we look for these specific words and we find at least four, because four is always a good number (42 would be too high), then the assumption is that this is actually a document that contains Polish. Sure, it's a
15:48
simplification, it's a heuristic, but it actually works fairly well; it just depends on the quality of your words. Mine are super good, for Polish that is.
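One such registered query might look like this; the marker words below are placeholders rather than the ones from the slide, the extra language field is just metadata stored alongside the query, and the es client is the one from the earlier sketches:

```python
# "If at least 4 of these marker words appear, classify the text as Polish."
es.index(index="events", doc_type=".percolator", id="lang-polish", body={
    "query": {
        "match": {
            "description": {
                "query": "gdzie jest czy bardzo prosze dziekuje",
                "minimum_should_match": 4,
            }
        }
    },
    "language": "pl",   # metadata we can filter or aggregate on later
})
```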
16:00
So again, if you have a set of words like this for each language, you can build up a collection of queries like this, and then when a document comes along with a description of an event and a geolocation, you can immediately get back the classifiers: you get back
16:23
the location in an actually human-readable format, that it's Warsaw (note that the coordinates shown are, by the way, not Warsaw, but whatever), and you also get the language back, that it's in Polish. You can use similar
16:41
classifiers to determine the topic: if you have within the keywords something like programming and Python and Django, it's a fairly accurate assessment to say that the conference is probably something about Python. So this is how we can use percolation to enrich our data and to
17:05
determine something that would otherwise be hard to do. Another use case: imagine that you have a blog, a CMS, and you have a category defined as a search. That's super easy to do one way, but if you have a blog post
17:23
and you want to see which categories this blog post falls into, that's the harder part. Again, with percolation, something like this is super easy to do: you can actually tag the blog posts with their categories as they
17:40
come in. And you can obviously do a little bit more with the percolator: you can attach metadata to the percolators, and you can filter on that metadata, you can aggregate on it. So as the response you will not only get the percolators that matched, but also, let's say, their distribution across categories. You can even use them to highlight something: you can search
18:05
for some words in your documents, then highlight the fragments that actually contain them and store those separately in the document for easy presentation, and so on. You can get the top 10 hottest categories for this
18:22
piece of content, or something like that. But so far we're working with individual documents; we can also look at more documents at the same time. This is the traditional search interface: you're searching for something and you get back the top 10 links.
18:44
What we also have here is something called faceted search. The search part is really good when you know what you're looking for; this part shows you what is actually in your data, so you can immediately see the distribution. You can see that if you're looking for something related to
19:03
Django, most results are in Python and some in JavaScript, so it allows you to discover the data. Some people have taken it even further, and we have allowed that with aggregations, with multi-dimensional aggregations, so you can aggregate over multiple dimensions at the same time.
19:24
But that is still boring; that is still just counting things, and that's not really interesting, any database can do that. What we need is to use the data that we have, the statistics. So to do that, let's look at how we
19:40
would do recommendations using Elasticsearch. This is our data set: we have a document per user, and for each user we have a list of artists, of musicians, that they like. And we want to do recommendations: assuming that I like these things, what should I listen to next? So we
20:04
have, in this case, two users; they have one artist in common, and there are three other artists. The naive way to do it is to just aggregate, just ask for the most common thing they have in common: give me all the users that like the same things that I do, and then give me the
20:26
most popular artists in that group, without the ones that I already know. That way I will get the most popular artists, but not necessarily the relevant ones. It's like asking you: what is the most common website that you
20:42
go to? Probably Google. Not interesting, because everybody goes to Google. But if I ask the people in this room, and think about what is more specific for this group compared to asking somewhere on the street, it will be
21:01
something like GitHub. You probably all go to GitHub; nobody in the outside world goes there, nobody even knows it exists. That is relevant; that would be a good recommendation. And we can do that with Elasticsearch: we have all the information, we have the statistics about how rare a word is and what
21:23
its distribution is across the general population. So what we ask for is simple: we ask for the significant terms. It will use the score and compare this group against the background.
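A sketch of that aggregation, with index and field names assumed and the es client reused from above:

```python
# "Users who like what I like": the query scopes the aggregation, and
# significant_terms compares artist counts in that group to the background.
body = {
    "query": {"terms": {"artists": ["pink floyd", "led zeppelin"]}},
    "aggs": {
        "recommendations": {
            "significant_terms": {"field": "artists", "size": 10}
        }
    },
    "size": 0,
}
resp = es.search(index="users", body=body)
```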
21:46
The results will then look something like this. The important part is that I would expect all those dots to be on the diagonal line, because that's what would happen if I had a random sample; the more a dot moves away from that central line, the more specific it is. And that is
22:03
how we can do relevant recommendations, because we can see that this dot here is obviously much more common in this group than in the general population, so it has moved greatly. And because we have all the information,
22:21
because we've analyzed the data, because we are the search people, we understand the text, we understand the frequencies, and we can use them, we can actually produce something like that. There are obviously some caveats: for example, if I like a very popular band, like One Direction, then it will skew my
22:43
results, because everybody likes One Direction, right? So I need a way to combat this, because otherwise I would just get completely irrelevant recommendations. And again, we are the search guys, we understand data, we
23:02
understand documents, so we can find and sample just the users that are most similar to me, and we have all the tools already at our disposal. Remember TF-IDF, normalization and everything: TF, the people who like more of the things that I
23:25
like, the better they match me; IDF, the people who share the rarer things that I like, put them towards the top. And then just take 500 of those
23:40
best results and only drive the recommendations based on that group. It will make it both faster and more relevant: it will allow you to discard all the irrelevant connections that you might find and only focus on the meaningful connections, on the things that are relevant for your group, in this
24:03
case the group of people who like the same things that you like. That will provide you with a relevant recommendation.
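Sketched as a sampler aggregation wrapped around the significant terms, so only the best-matching users feed the recommendation; the names and the shard_size of 500 are assumptions:

```python
body = {
    # users scored by how much their taste overlaps with mine
    "query": {"terms": {"artists": ["pink floyd", "led zeppelin"]}},
    "aggs": {
        "people_like_me": {
            "sampler": {"shard_size": 500},   # keep only the top matches
            "aggs": {
                "recommendations": {
                    "significant_terms": {"field": "artists"}
                }
            }
        }
    },
    "size": 0,
}
```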
24:21
So just by applying the concepts that we have learned from search to other things like aggregations, we can get much more out of them. Another example would be Wikipedia articles, where the labels and links are the words: if you apply the same concept, you get a meaningful connection between different concepts. If you try to do it
24:43
based on popularity, things would always be linked through something like: yes, that person and that person, yeah, they're both people. Okay, not exciting. But if you apply this principle, you get something more out of it. So if you combine aggregation and relevancy, all the statistics that we can do, that is
25:04
actually how we as humans look at the world. If I ask you what is the most common website that you go to, you'll probably not say Google, because you know that's not interesting. We as humans have been trained from the very beginning to recognize patterns and to spot anomalies at the very same
25:24
time, and this concept can be used for other things as well. For example, if you use the same principle, the significant terms aggregation, per time period, so you split your data into time periods and you ask what
25:41
is significant for that period, what do you call that feature? Well, it's a very common feature that we now see: it's "what's trending". That's just it, because it's more specific. It's not necessarily more popular than in any other period, but it is more specific for this one time period, for the current
26:03
time period, let's say compared to yesterday, compared to the general background.
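A sketch of that per-period version; the field names and the daily interval are assumptions:

```python
# "What's trending": significant terms computed inside each time bucket,
# so each day surfaces what is unusually frequent for that day.
body = {
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "published", "interval": "day"},
            "aggs": {
                "trending": {"significant_terms": {"field": "keywords"}}
            }
        }
    },
    "size": 0,
}
```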
26:24
Once you're doing these aggregations, there is one single caveat that can happen: you can have too many options, too many buckets, too many things to calculate. Imagine that you're looking for combinations of actors that star together very often: I'm looking for the top ten actors, and then for each of those I'm looking for the set of top ten
26:44
actors that act with them, that they appear together with. If I just ran this, what would happen in the background is that I would essentially get a matrix of all the actors by all the actors, and it would be huge. It
27:03
wouldn't fit into memory; it would probably blow up my cluster. Actually, Elasticsearch would probably refuse to run this query, because it would say: hey, I would need too much memory, this is just not going to fly. So what you can do is say: just do it breadth-first, first get the list of the
27:23
top ten actors and greatly limit the matrix that you will need to calculate, and then go ahead. It will be a little slower, it will have to run through the data essentially twice, but it will actually finish, and it will still finish in quite a reasonable time.
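That hint is a single setting on the outer aggregation; a sketch, with the field name assumed:

```python
# Collect the top ten actors first (breadth-first), and only then build the
# co-actor buckets for those ten, instead of for every actor in the data.
body = {
    "aggs": {
        "top_actors": {
            "terms": {"field": "actors", "size": 10,
                      "collect_mode": "breadth_first"},
            "aggs": {
                "co_actors": {"terms": {"field": "actors", "size": 10}}
            }
        }
    },
    "size": 0,
}
```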
27:46
So that's how to fight the common caveat that people run into when they start exploring the aggregations, especially the multi-dimensional ones. So, just to wrap things up, because we are
28:01
approaching the end and questions: the lesson here is that information is power. We have a lot of information about your data, we have all the statistics, all the distributions of the individual words, and if you understand this and if you can map your data to this problem, you
28:25
can get a lot more out of Elasticsearch than just finding a good hotel in London or the conference events in Warsaw. So that's it for me,
28:41
and if you have any questions, I'm here to answer them. Any questions? That's a
29:25
long question, thanks. A more specific question: in the example shown, where you search for the 500 people most like you, can you instead take the
29:45
people that are, say, 90% like me, instead of a fixed number? Of course; you can do that with a simple query, because aggregations are always run on the results of a query. So we can very
30:03
easily do that. Remember the example that I gave with the language classification, where I was looking for at least four words: I could do the same here, I could say give me only the users that have at least 70% or 90%, or
30:20
nine. Yeah, I can use both relative and absolute numbers of the same artists that I like, and use those as the basis for the aggregation. So yes, absolutely, and it would actually be much simpler: you wouldn't even need the sampler aggregation. Thanks. Any other questions? Is anyone still awake? Okay,
30:47
so, a question: going once, going twice, sold. Are there any
31:11
performance implications of running, say, hundreds of percolators? Of course, but it can scale way beyond hundreds. I've seen people doing millions of
31:24
percolations and it still works; it scales very well with the distributed nature of Elasticsearch. Essentially the only resource that percolation consumes is CPU, so add more CPU, either to a box or by adding more boxes, and it will scale fairly linearly. And the more boxes and more CPU you
31:48
have, the faster it will get. You don't need anything else: you don't need much memory, you don't need faster disks, you only need the CPU, so it's very easy and fairly cheap to scale. To give you an idea, I think that if you
32:04
want to run hundreds of thousands or millions of percolations, you will need something like five reasonable boxes, and you will get responses within milliseconds. So it actually does scale very well. Another
32:23
question: could you give us some examples from the customers you had? You mentioned that you had cases that were really surprising for you, that you didn't expect. Sorry, could you give us some examples of
32:42
the use cases from the customers that you mentioned, the ones you didn't expect? So, some of what we didn't expect: there was the percolator example, there are people running big clusters of Elasticsearch and they don't store any data in them, like they have a cluster of 15 or 20
33:03
machines without storing any data. That is a weird experience for what is essentially a data store, so that's definitely one of them. And we also always run into these issues where we have a feature, we recommend people use it, and then people listen to our advice and we find out that
33:24
we might have underestimated the people in the wild. For example, we introduced the idea of index aliases: you can have an alias for an index, essentially like a symlink or something, so you can sort of decouple the design of your indices from what the
33:44
application sees. So you could have an alias per user, but all the users can live together in one big index, and the alias will just point to that index and a filter. And that works very well, until we
34:01
encountered a user that had millions of users, and suddenly we had millions of aliases, and we didn't think that would ever happen. So, as with anything else in computer engineering: assumptions. We encountered something like that, we had to go back and fix it and rework
34:24
the aliases. So these are the two most notable examples where we got really surprised by how our users used our product in ways we really didn't foresee. And it's good, because we always learn something new and it allows us to reorient ourselves better to what the users actually need.
34:46
Okay, any last questions? Hello, I have a question regarding reverse queries for
35:10
language classification. Elasticsearch supports n-gram indices, so could you use those for classification of languages? So, n-grams have the problem that they have a very wide spread, so they
35:28
might give you some correlation with the language, but they will definitely not be precise. Just to explain: an n-gram is essentially what you get if you split a word into all the tuples of letters. For example, with "thanks" I would
35:45
have "tha", "han", "ank" and so on, and then I would essentially query for these triplets. It will obviously have a correlation, but it will by no means be decisive enough, especially for something like language
36:04
classification, where you're really interested in the probability. N-grams are very good as an addition to something else, because, by their nature, they always match something; that's why you typically don't want to use them alone, but
36:26
they're fine if you have some other methods, like exact matching and then regular fuzzy matching and everything, and you just throw n-grams into the mix to sort of boost the signal if they match,
36:41
and sort of to catch some things if nothing else matches. So I definitely wouldn't use n-grams for language classification; I typically only use them in combination with other query types and other analysis processes. Does that make sense? Okay. So I think that we're running out of time, so
37:05
thank you very much, and if you have more questions, I'll be outside.