Beyond the basics with Elasticsearch

Video in TIB AV-Portal: Beyond the basics with Elasticsearch

Formal Metadata

Beyond the basics with Elasticsearch
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Subject Area
Elasticsearch has many use cases, some of them fairly obvious and widely used, like plain searching through documents or analytics. In this talk I would like to go through some of the more advanced scenarios we have seen in the wild. Some examples of what we will cover: Trend detection - how you can use the aggregation framework to go beyond simple "counting" and make use of the full-text properties of Elasticsearch. Percolator - percolator is reversed search and many people use it as such to drive alerts or "stored search" functionality for their website, let's look at how we can use it to detect languages, geo locations or drive live search. If we end up with some time to spare we can explore some other ideas about how we can utilize the features of a search engine to drive non-trivial data analysis.
Thank you. The talk is Beyond the basics with Elasticsearch, and it is essentially about the use cases where you can use Elasticsearch but where it might not be immediately obvious that that is something you can do. We'll go through several of those and see how and why it is suited for that particular scenario. But before we go beyond the basics, we need to talk about what we are going beyond. So what is the base functionality of Elasticsearch, and how come we can do all these other things? Where is it coming from? Well, it's all coming from search. Search, especially full-text search, is the primary function of Elasticsearch, and search is not a new problem. It's been
around for a while, and it hasn't actually changed much. The first index over a book, over some text, was created in 1230, and we still use the same data structures to this day. Of course there have been plenty of improvements, but the underlying structure, the inverted index, remains the same. It is the index that you're familiar with if you've ever read any book, which I sincerely hope you have, and this is how it looks. You have the list of interesting words, and then for each of these words you have a list: in the case of a book it would be a list of pages, for us it is a list of documents that actually contain this word. Notice several things. First of all, the words are sorted. Of course that makes sense, because you need to be able to find the word you're looking for so you can go to the page that actually contains it. The pages, the documents, are sorted as well. None of this is accidental; it is very important for us, and we'll see how. Also, when we're talking about search using a computer, there are other things involved in this data structure, notably some statistics: for example, how many times this word is contained in this document, or what the length of the list is. Things that will be very important later on.

So when we have a data structure like this, how does search work? It's super simple. If we're looking for a document that mentions both Python and Django, we look up these two words and we get back the lists, and now we just walk the lists and merge them together: whenever we find a document that is present in both lists, that is our result. If we wanted to do something like a phrase search, where we're looking only for documents where Python is immediately followed by the word Django, all we have to do is add another piece of information into the inverted index: we just add offsets, the position of this word in the document. Then, when we're going through the merging
process, we just say we care not only that the document is in both lists, but also that the offsets immediately follow each other: Python would be at position n, Django would be at position n plus 1. So you can see that doing a phrase search is not much more expensive than doing a regular search; you're just adding one more numerical comparison, so it is very efficient.

What else you can sort of imagine here is that I can get the list of documents from anywhere; it doesn't have to come from the same index. So I could have multiple indices. I can have an index on every single field in my document and I can use them all. If I have one condition on the title, one condition on the category and one on the body, I would just consult those three inverted indices to get these posting lists, as they're called, and merge them together. So we don't have the limitation of many other data stores that limit the number of indices you can use per query or per collection. That is also something we benefit greatly from with this data structure.

And finally, the last thing that you do when you do this merging: when you find your match, you quantify how good a match it is. That is the primary difference between a search engine and a database: we not only tell you which documents match your query, but also how well each one matches. Is it a good match or just so-so? And we know that because we have the information about the statistics. So this
is called relevancy: we tell you how relevant the document is to your query. So how is relevancy, well,
computed? At the base of it there are two numbers: TF and IDF. TF is term frequency. It is just the number of occurrences of that word in the given document, or in the given field. So if I'm looking for Django in a document, how many times does this document contain the word Django? Is it there only once? Is it there three times? Obviously, the higher the number, the better the relevance. IDF is inverse document frequency, which is just a fancy way of saying how common or rare this word is in your entire dataset. Is this a term that is contained in every single document that you have, or is it a term that is only present in 1 percent of your documents? We can get this information right away from the inverted index, because it is essentially the length of the list attached to the word compared to the number of documents that we have overall. Fairly easy to calculate, and there's the exact formula, if you're so inclined. This number has the opposite effect: the more common the word is, the less relevant the document is for the result, because if we find a word that is in every single document, who cares? It's in every single document; of course we're going to find it. That doesn't mean anything. So this is sort of the base formula for anything that has to do with relevancy, and it works very well for text.

Lucene, the library that does the indexing and the heavy lifting for Elasticsearch, adds some stuff on top of it. You can see the exact formula, and you can see in the middle that there's the TF-IDF. What it adds on top is that it takes into account, for example, the length of the field, because finding the word Django in the title versus in the body also gives us different information, right? If we have a short field and we still find it there, that counts for more than it would if we have the full text of a book and we find it there as
well. Those are different types of information. So it improves on the basic TF-IDF formula, but it is still only relying on the statistics that it has learned about your dataset, and sometimes you want to go a little further. Sometimes you have other information about your data. Let's say you have a Q&A website where people ask questions and give answers, and let's call it, I don't know, Stack Overflow. You have the users rate the questions and the answers: this is a good question, this is a good answer. That is information that you want to take into account, but you don't want to sort by it, because that would completely destroy your relevance. If you had one very high quality question that has many, many words, it would always be on top no matter what people searched for. That's not what you want. You want to take the relevancy and you want to take the popularity and combine those numbers together to get something that you can then sort by.

Another use case would be: I'm looking for a hotel, or I'm looking for a conference, for an event, and I want it to be in this location. I'm not strictly limited to that location, but I would prefer it to be there, or around this time. So again we can take the numerical indicator, the distance from our ideal, and feed that into the relevance calculation. So this is the theory behind it, and this is the practice; this is how it actually looks. This is a very simple query in Elasticsearch. We're looking for a hotel, the Grand Hotel, and we have several other criteria. We would prefer to have a balcony. We're not limiting ourselves to just hotels with balconies, but if it does have one, we will add to the score, so we bump it up. Also we want it to be in central London: we want it to be within one kilometer of the center of London. Well, not the center, Greenwich, as you
can tell by the 0 in the coordinates. We want to be there, but we don't want to limit ourselves to it; some hotels might be so good that they would get to the top even without fulfilling this criterion. So we say that we want to use the Gauss function to calculate the score.
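As a rough sketch of what such a query might look like, the Python dict below combines a full-text match with a balcony bonus and a Gauss decay on distance. The field names (`name`, `features`, `location`), the weight and the coordinates are assumptions for illustration, not taken from the slide. The helper `gauss_decay` mirrors the decay curve documented for Elasticsearch's `gauss` function, with distances given in kilometres.

```python
import math

# Hypothetical field names; the Greenwich coordinates are approximate.
hotel_query = {
    "query": {
        "function_score": {
            "query": {"match": {"name": "grand hotel"}},
            "functions": [
                # A balcony is preferred, not required: matching hotels get a bonus.
                {"filter": {"term": {"features": "balcony"}}, "weight": 2},
                # The score decays smoothly with distance from the preferred spot.
                {"gauss": {"location": {"origin": {"lat": 51.48, "lon": 0.0},
                                        "scale": "1km"}}},
            ],
            "score_mode": "sum",
        }
    }
}


def gauss_decay(distance_km, scale_km=1.0, decay=0.5, offset_km=0.0):
    """Multiplier in (0, 1]: 1.0 at the origin, `decay` at `scale_km` away."""
    sigma_sq = -scale_km ** 2 / (2.0 * math.log(decay))
    d = max(0.0, distance_km - offset_km)
    return math.exp(-d ** 2 / (2.0 * sigma_sq))
```

With these defaults a hotel 1 km from the origin keeps half of the distance score and one 2 km away keeps about 6 percent, so distant hotels are pushed down rather than excluded.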
It's one of the shapes on the image on the right that determines how fast the score drops once you get outside of your ideal. We also want to take into account the popularity of the hotel, and then we add some random numbers. Random numbers are always good; they make everything so much better. No, in this case the random numbers are there to shuffle the results around a little bit so that people have a chance to discover new things. We actually took this from a customer, from a real example of how they're doing it, and they do have this random score, because otherwise they would have some hotels, some results, that would never be seen because they would always be just behind the fold. Also, people would perceive the results as stale; but if you shuffle them around a little bit, they will always find something new, they'll always be excited, and hopefully come back to your website.

So this is one of the ways we use the statistics, the TF-IDF and all the things that we know about your data. This is the most straightforward way: we use it to calculate relevancy and we allow you to hook into that process yourself. If you're so inclined, you can just remove all of this and say, hey, instead of all these different criteria I just want to use a script, and give it an expression in your favorite programming language, whatever that is, even Python, and do all these calculations yourself. These are essentially just prebuilt scripts that we have built so that you can use them and you don't have to expose the scripting functionality, because obviously that can cause some issues. So this was the first use case: how to get more out of the relevancy that we already have. Another interesting use case we
have revolves around reverse search or, as we call it, the percolator. It is exactly what it sounds like: it is search in reverse. Normally you index your documents and then you run your queries. With the percolator, you index your queries and then you run your documents.
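The reversal can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not the Elasticsearch percolator API: the class name, the query ids and the rule that a query matches when all of its words occur in the document are assumptions made for the example.

```python
import re


class ToyPercolator:
    """Stores queries and matches incoming documents against them."""

    def __init__(self):
        self._queries = {}  # query id -> set of required terms

    def register(self, query_id, text):
        # "Index" the query: here it matches when all of its terms are present.
        self._queries[query_id] = set(re.findall(r"\w+", text.lower()))

    def percolate(self, document):
        # "Run" the document: return the ids of every stored query it satisfies.
        words = set(re.findall(r"\w+", document.lower()))
        return sorted(qid for qid, terms in self._queries.items() if terms <= words)


percolator = ToyPercolator()
percolator.register("alert-1", "red bike")
percolator.register("alert-2", "leather sofa")
matches = percolator.percolate("Red mountain bike, barely used")
```

Here `matches` contains only `"alert-1"`, because the new document satisfies that stored query and not the other one.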
What this is useful for is, for example, if you have something like stored search functionality on your website. You have classifieds or something like that, and you allow the users to search, and nothing shows up. But the user wants to say: hey, I'm interested in this search, save that, and whenever there is a new item, a new document, that matches that search, just send me an e-mail and I'll come back. Normally that's a fairly hard problem. With the percolator it comes out of the box: you index the query that they're running, including all the bells and whistles that Elasticsearch allows you to do, and then when a new document comes in you just ask for it to be percolated, and you will get back all the queries people have registered that it matches. Some people even use it to power something like live search. If you've ever been on a website where you're searching and suddenly there's a pop-up saying that in the time you have been looking at these results there are 10 new ones, that again can be powered by the percolator. When you do a search, you register the percolation at the same time, and every single new document that comes in gets percolated as well, so you will know which browser you need to push this document into, who is actually looking at the results right now.

So those are the fairly obvious use cases. I'll talk about my favorite one for the percolator, and that is classification, because there is a bunch of stuff that is super easy to search for but not that easy to do the other way around. For example, location. It is very easy to construct a query that will look for events in Austin. You just have to have the shape of Austin somewhere. You can pass it into the query, but that's not optimal, so typically you have an index somewhere in Elasticsearch; in my case I have an index called shapes where I have cities, and I also have Austin there. So I say: hey, I
am interested in anything that falls within the city of Austin, and that is a very simple query to write. But what if I want to do the opposite? I have a geo point, I have a set of coordinates, and I don't know where they are: what city, what state? That's not that trivial, unless you have something like the percolator, where you have a bunch of queries indexed in your Elasticsearch and then you just show it the document with the geo point, and it will tell you: yes, these queries matched. The one representing North America, the one representing the United States, the one representing Texas, the one representing Austin, maybe even the one representing a city block or something. So you can really pinpoint the exact address at that point; it's just a matter of how good your data is and how much CPU you want to burn on that. It gives you this non-obvious reconstruction of the data.

Now the interesting one is language classification. This operates on the assumption that every language (and I've chosen Polish, because English would be way too, well, weird) has a few words that don't exist in any other language. So the assumption here is that I can write a query that looks for these words. In this case I give a list of words and I'm saying: if at least 4 of those are in the document, I consider it a match. So in this case I'm looking for documents that are written in Polish, because no other language in the world would ever write a word like this, so it's a fairly good indicator. So again, a very easy query to write, and once we have the query we can reverse the process and ask which queries matched. So if I now have a document for an event, with some Polish description, because otherwise the demo wouldn't work, well, this is what I get.
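To make the classification idea concrete, here is a toy sketch in plain Python rather than real percolator calls. The marker words, the rough bounding boxes and the query ids are assumptions chosen for illustration; a real setup would register Elasticsearch queries (for example a geo shape filter, or a `match` query with `minimum_should_match`) instead of Python functions.

```python
import re

# Illustrative Polish marker words; a real list would be chosen more carefully.
POLISH_MARKERS = {"się", "że", "który", "został", "oraz", "także", "między", "przez"}


def looks_polish(doc, minimum_should_match=4):
    # Match when at least `minimum_should_match` marker words occur in the text.
    words = set(re.findall(r"\w+", doc["description"].lower()))
    return len(words & POLISH_MARKERS) >= minimum_should_match


def in_box(lat_min, lat_max, lon_min, lon_max):
    # Stand-in for a geo shape query: a plain bounding-box check.
    def check(doc):
        lat, lon = doc["location"]
        return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
    return check


# Registered "queries": id -> predicate. The boxes are rough approximations.
REGISTERED = {
    "lang:pl": looks_polish,
    "city:austin": in_box(30.10, 30.52, -97.94, -97.56),
    "country:us": in_box(24.5, 49.4, -125.0, -66.9),
}


def classify(doc):
    """Return the ids of all registered queries that match the document."""
    return sorted(qid for qid, matches in REGISTERED.items() if matches(doc))


event = {
    "description": "Wydarzenie, które odbędzie się w Austin, zostało "
                   "zorganizowane przez fanów oraz także sponsorów",
    "location": (30.27, -97.74),
}
labels = classify(event)
```

The returned ids can then be stored on the document before it is indexed, so that later lookups by city or language are plain exact matches.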
This is what I get back: I get back the identifiers of all the queries that actually matched, so I know that this is an event that is in the city of Austin, that it is in Polish, and which topic it deals with. Please don't try to look up the coordinates; they're nowhere near Austin. So this is how you can use the percolator for classification. You typically do it like this: you have your document, you're about to index it, so you run the percolation, you get all the classifiers, all the topics, the language and the geolocation, you add them to the document, and then you index it. Then when you're looking for something in the city of Austin, it's a super simple exact lookup, which will obviously be much faster than running the geo lookup every single time.

And you can take this a little bit further. You can attach metadata to the percolation queries: for example, who requested this percolation, is it a user that paid me or one that didn't, and other criteria like that, so you don't have to run all the percolations every single time; you can use this metadata to filter them, et cetera. You can also use it for highlighting. If you want to highlight some passages, that can be a fairly hard problem, but it's a problem that search engines are really good at. So you can ask Elasticsearch to highlight some passages for you and then index the already highlighted text. So this has been our journey into the depths of the percolator, and then there's one last
big thing that I want to talk about, and that's aggregations. Many different databases have aggregations; how come a search engine has them too? Well, it all started with something like this. This is an interface you might be familiar with, and it's called faceted search or faceted navigation. You type something in, you get the results, you get the 10 blue links, but you also get, on the left side, an overview of what actually matched your query. In this case I'm looking for Django, so I can immediately see that Django is mostly connected with Python, and I can see how many repositories and how many users actually matched my query. That is the huge difference between facets and search. Search is great when you know what you're looking for, if you know how to spell Django or how it sounds or something like that. Facets are great for exploration, because you don't need to know; you look and you see. It is one thing to do it with code; the other is to do it with hotels or books. If you've ever shopped on a website like Amazon, you see the categories, you can see the brands, you can see the price distribution, and you don't have to read all the results to get that information.

So we have taken it one step further with Elasticsearch, and we power some analytics based on this stuff and we visualize it, because humans are essentially pattern-recognizing machines. You're very good at recognizing patterns. You can probably spot several weird things about this picture, like the gap in the timeline or the fact that the last two pie charts are completely different, and you can see that immediately. If you wanted a computer to see that, you would have to tell it what to look for, or have something very, very sophisticated, but a human can spot this immediately. That is why facets and aggregations have become so important and why we continue to develop them. But this is very basic stuff; I don't want to talk about it. This is just counting stuff; every database can do that. We can do better than that: we're a search engine, we
understand your data and we can use it. So let's see how we would actually use Elasticsearch and aggregations to do something like recommendations. Let's say that I have a music website and different
users, and they like different artists. So I have a document per user, and there's a list in that document, in the JSON, that has all the artists that the user likes. One sort of naive way to do recommendations is just to ask for an aggregation: look for the people who like the same things I do, and then give me the top 10 most popular artists in that group that I have not been exposed to. That's easy to run but not that useful, because popular doesn't mean relevant. Just because everybody listens to One Direction doesn't mean it's relevant to my group. So what can we do instead? We need to identify the artists that are more relevant to my group, to the group of people that like the same things that I do, compared to the background. The query looks remarkably different: we replace the word terms with the words significant terms, and that's all there is. Now we're essentially telling Elasticsearch: hey, we have this group, we have defined it based on the results of the search, and now give me the stuff that is more relevant for this group compared to all the others. We can do this; we are the search guys, we understand the data, we have all the statistics, we have all the numbers.

You can even see the graphical representation of what's happening on the right side. Normally, when you make a random selection, you would expect the same distribution of people liking something in that group as in the general populace, so you would expect all the values to be laid out on the diagonal. What using significant terms did is select all the values that are pretty much on the vertical, which means they are much more liked in my group than they are in the general populace, which is exactly what I was asking for. I was asking for a recommendation based on the people who are similar to me: give me what I'm more likely to
like, using just the data, using just the dumb statistics about the distribution of the individual values throughout the dataset. No learning was involved. This is actually a fairly simple aggregation that you can run; you can see the code is not that extensive, it's not that involved. But this still has some problems. This is an aggregation we've had for a while, but we've noticed that there are some cases where it doesn't work as well as we would like it to, and one
problem is the terms that everybody likes or the terms that everybody hates. One Direction is my go-to example of this because, you know, everybody likes them, especially around DjangoCon. So what do I do there? How do I make sure that I don't suffer from the bias that every single document I have actually likes this? Well, a lot of it is already filtered out by the significant terms, but I can actually do much better. I can also ask for a sample of the documents, so I will do this analysis not on all the users that have something in common with me, but only on those that are the most relevant. So I've included relevancy at least twice in the query that I'm running. First of all, I'm looking for users that are most similar to me, and similar means they have the highest relevancy to my query. Because, let's repeat, relevancy is based on TF, IDF, norms, etc. So what does this translate to? Term frequency translates to people who have more things in common with me. IDF, inverse document frequency, means that I prefer the people who share the rare choices that I have: I prefer the people who like the same rare things as me. I will ignore the One Directions, because that doesn't tell me anything, but I will actually hang on to the weird groups that nobody else in the world knows about. And then there's the stuff that Lucene adds on top of it, where it takes into account the length of the field: I prefer people with a short list, which means people who like pretty much exactly what I like, not people who like everything in the world, because that would not be that relevant. So I can directly use the tools that we've built in the beginning of the talk for text matching; I can use the same numbers, the same formulas, to get the people who are most similar to me. And then I say: take like the top 500 of those, on
every shard, because Elasticsearch is distributed, so everything happens on a shard level, and then run the significant terms: tell me what is specific for that group. So we've refined the selection: not anybody who has anything to do with me, but the most relevant people, the most similar people, and then give me what is specific for that group. This is the sampler aggregation. It is currently in the newest release of Elasticsearch, which is being readied for release. These two together allow me to do the recommendations and have them be more relevant and also faster, because we're actually looking at a subset of the data. We just used all the information that we have about the data to identify the most relevant part of it, so we don't lose any precision; on the contrary.

To generalize what we've done here: in a connected graph, where we have people and artists and they like each other, we have identified the connections that are meaningful, and not the ones that are most popular or most common. We've managed to, hopefully, circumvent the super nodes, the super connected nodes that are the hard part of any graph. And we can use this and go further with it; we can actually use this in a graph algorithm. So imagine that you have an algorithm that's used to calculate the shortest path. You want to go from point A to point B or, in our case, let's model it on Wikipedia, you would go from page A to page B, and you want to see what the connections are, how you get from one point to another, while still only taking into account the relevant connections. Because if you just use a naive graph algorithm, you will fall victim to the super nodes: you will see that you have had a concert in the USA, and so has pretty much every other band, so there's an immediate connection there, and that is not relevant at all. I don't mean to insult your country, but that's just not a relevant connection. When you use this approach to identify only those connections that are relevant, that are not just an accident based on the statistic that everybody has that connection, then you can get much more interesting information out of this. So you can actually use this to find meaningful connections.
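The difference between popular and significant can be sketched on a toy dataset. The users and most of the artist names below are made up, and the score, the foreground share minus the background share multiplied by their ratio, is an assumption modelled on the JLH heuristic that Elasticsearch documents for significant terms. It is not a reproduction of the actual implementation.

```python
from collections import Counter

# Made-up data: user id -> set of liked artists.
LIKES = {
    "u1": {"niche band", "obscure duo", "one direction"},
    "u2": {"niche band", "obscure duo", "one direction"},
    "u3": {"niche band", "one direction"},
    "u4": {"one direction"},
    "u5": {"one direction"},
    "u6": {"one direction", "pop star"},
    "u7": {"one direction", "pop star"},
    "u8": {"pop star"},
}


def foreground(likes, mine):
    # The group defined by the search: users sharing at least one artist with me.
    return {user: artists for user, artists in likes.items() if artists & mine}


def popular(likes, mine):
    # Plain terms aggregation: rank by raw count inside the group.
    group = foreground(likes, mine)
    counts = Counter(a for artists in group.values() for a in artists - mine)
    return [artist for artist, _ in counts.most_common()]


def significant(likes, mine):
    # Rank by how over-represented an artist is in the group versus everyone.
    group = foreground(likes, mine)
    fg_counts = Counter(a for artists in group.values() for a in artists - mine)
    bg_counts = Counter(a for artists in likes.values() for a in artists)
    scores = {}
    for artist, count in fg_counts.items():
        fg_share = count / len(group)
        bg_share = bg_counts[artist] / len(likes)
        if fg_share > bg_share:
            scores[artist] = (fg_share - bg_share) * (fg_share / bg_share)
    return sorted(scores, key=scores.get, reverse=True)


my_likes = {"niche band"}
top_popular = popular(LIKES, my_likes)[0]
top_significant = significant(LIKES, my_likes)[0]
```

On this data the most popular artist in the group is the one almost everybody likes, while the most significant one is the artist liked by two thirds of the group but only a quarter of all users.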
After all, aggregation and relevancy: that is how we look at the world. I said earlier that we are pattern-recognizing machines. We look at things and we immediately make assumptions, like: hey, this room is not as full as it was in the last talk. Yeah, that makes sense, Elasticsearch is not as popular, I get it. And I can see that immediately, because I have the context, because I can see; I don't have to count the chairs to know that. So that is the aggregation part. And the relevance is this: if I ask you what is the most popular website, the website that you visit most often, many people, and I have actually asked at conferences, say something like GitHub or Stack Overflow or something, and that is actually not true. They probably spend more time on Facebook or Google or something like that, and it's not that they're ashamed, even though they might be too; they immediately recognize that that is not a relevant answer. That's not interesting. Everybody goes to Google; that's not interesting information whatsoever. I know that, you know that; let's get to the interesting part, what is more specific to our group than to the general public. If I asked on the street what GitHub is, hopefully I would not get punched, but I would definitely not get the correct answer. So we do the exact same thing; that is what is special about humans compared to computers.

So I have a question for you. If I do an aggregation per time period and then I ask for significant terms on the tags or anything, what do I get back? What do we call that? Anybody willing to guess? We get the trending information: what's trending. Not what is the most popular for that given time period, but what is more specific for the time period compared to any others. So if we only filter the last 5 minutes and we ask for significant terms, we get what is currently trending, what people
talking about more often right now than in general, than at any other time. So something that might appear to be a sophisticated algorithm we can replicate with two lines, with a single query to Elasticsearch. This is a nice shortcut if you just want to throw something out there and you don't want to spend billions trying to come up with your own algorithm; this is exactly what it is for.

There is only one other pitfall that I'll warn you about if you
try to play with this. The way aggregations work in Elasticsearch is that we take all the possible combinations that might come up and we create a bucket for each one as a placeholder, and that can blow up very fast. For example, if you have a dataset from IMDb and you're looking for the actors who acted together most often, and you just run this naively, it will work, and the query is super simple, but the query will also blow up your memory like crazy, because it will essentially compute the Cartesian product of actors versus actors. So it will be essentially a huge table, a huge data structure that we then need to fill up.

What we can do, however, is limit one of the dimensions before we get into it. By default we will try to do everything in one pass over the data, because it's the most effective way and because we are distributed, so we need that in order to function. But in this case, if you know that this will blow up your memory, either because you've tried or because you can tell, what you can do is say: do it another way, do a breadth-first search. First identify the top 10 actors and then only find the co-actors of those 10. So simply identify the dimension that you can limit most effectively and do that, and then you'll be fine. Then you can actually ask for all the information in the world, and we will give you a tiny, tiny, tiny little sliver of it. So remember: information is
power. We have the information, we have information about your data, and we can use it. That is the one leg up that we as a search engine have over the more traditional data stores, which are limited to the boring kind of stuff: filtering, counting. That can be useful, but it's not so exciting. If you want exciting, use Elasticsearch. Now, if you have any questions... Thank you.

So we have 5 minutes for questions. Any questions?

Thank you for the great presentation. I have a question about language detection: have you tried to use language detection with Elasticsearch in a [unclear] environment?

I don't know. What I typically recommend when people deal with multiple languages is to just use everything at the same time. The problem with language detection typically is that even if you can detect the language of the source (there you have enough information, so you can identify the language of the document), it's very hard to identify the language of the query, because if somebody just types in two words, it's very hard to say what language they are. So what I recommend to people is: if you know that you're going to be dealing with these five languages, analyze everything five times. Analyze it as English, as Czech, as German, as Japanese, and then do the same for the query: when a query comes in, query all of these fields at once. Elasticsearch allows you to do that: you can specify "I have this one field, but I will analyze it multiple times", and then "I have this query and I want to run it against all these differently analyzed fields". It's what I call the shotgun approach: just throw everything in there and see what sticks. Because of how relevancy works and how the relevancies from different queries are combined, without trying so hard to actually think about the problem you will actually get the most relevant results. Does that make sense?

I may have missed it, but is the vector space model still the common way of combining information from different query terms, or is it something more
sophisticated?

It's still the same thing. I talked about TF-IDF and everything, but that's only to give weight to the individual parts of the vector. Overall we're still essentially talking about a vector distance. By default, the formula I showed is pretty much the cosine similarity from the literature; there are some modifications and stuff, but it's still based on that.

Thank you very much. I think we have run out of time.
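
To make the last answer concrete, here is a minimal sketch of the textbook vector space model: each document and the query become a vector of TF-IDF weights, and relevance is the cosine of the angle between them. This is an illustration only, not the scoring formula Elasticsearch or Lucene actually uses (the talk notes that the real one has modifications). The toy corpus, the helper names `tfidf_vector` and `cosine`, and the plain `log(N / df)` weighting are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Build a sparse TF-IDF vector (term -> weight) for one token list.

    Uses raw term frequency times log(N / df); terms that never occur
    in the corpus are skipped to avoid dividing by zero.
    """
    counts = Counter(tokens)
    return {
        term: tf * math.log(num_docs / doc_freq[term])
        for term, tf in counts.items()
        if doc_freq.get(term)
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: three already-tokenized documents.
docs = [
    ["elasticsearch", "trending", "tags"],
    ["elasticsearch", "percolator"],
    ["elasticsearch", "language", "detection"],
]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))
num_docs = len(docs)

doc_vectors = [tfidf_vector(doc, doc_freq, num_docs) for doc in docs]
query_vector = tfidf_vector(["trending", "tags"], doc_freq, num_docs)

# Score every document against the query; a higher cosine means more relevant.
scores = [cosine(query_vector, vec) for vec in doc_vectors]
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

Note how the term that appears in every document ("elasticsearch") gets a weight of zero, which mirrors the point made earlier in the talk: a term everybody uses carries no interesting information.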