
From icontains to search

Transcript
So, what does it actually mean, going from icontains to search? In this talk I would like to tackle the distinction between filtering and searching, and what search actually is. It's going to be a more theoretical talk about what's actually behind search and how it works, and hopefully we'll be able to draw implications for the actual performance and the use cases. For the more concrete stuff, tomorrow I'll have a talk on how to integrate Python and Django with Elasticsearch; this is mostly theory that can be applied to any and all search engines.

So, I said theory, and any good theory, and sometimes even the bad ones, starts with some definitions. So let's define what search is for us. For me, and for the purposes of this talk, search is any interface to data.
Anywhere you have a dataset, you need to be able to explore it and find what's in there. There is a difference between exploring and finding; we'll deal with that a little bit later, but in short: when you know what you're looking for, that's finding; when you don't know what you're looking for but want to see what's there, that's exploration, and that is often the more interesting part.

So if I asked you to implement a search, how would you do it? How do people do it? The typical first response is: I'll just go through it. If I'm looking for a book in my library that mentions Django, I'll pull the books out one after another, read them, and set aside all those that actually mention Django. If I have files, I'll do the same thing using grep, so it will be much faster, but still not ideal. And many people do exactly this with a database: I'll just run icontains queries, and that's essentially the same thing. What this means is that I'm going through all my books, or all the rows in my database, going through all the text, looking for "django" in there. You can see that this doesn't really scale: indices in the database don't help us, and grep will still have to read almost every single line in the file. That's not entirely true, because grep is amazing, but it will still have to go through the files sequentially. And I don't even have to tell you how much work it would be to read all the books in your library, which I hope is large, because you people should read. So clearly, this is not the way to go.
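The naive approach described above can be sketched in a few lines of plain Python (the documents here are made up for illustration): every query scans every document, which is exactly why it doesn't scale.

```python
# A sketch of the naive "icontains" approach: scan every document and
# check for a substring. It works, but every query reads the full dataset.
documents = {
    1: "Django is a high-level Python web framework.",
    2: "Flask is a lightweight web framework.",
    3: "Django encourages rapid development.",
}

def naive_search(docs, needle):
    """Case-insensitive substring scan over all documents, O(total text)."""
    needle = needle.lower()
    return [doc_id for doc_id, text in docs.items() if needle in text.lower()]

print(naive_search(documents, "django"))  # [1, 3]
```

This is what a SQL LIKE '%django%' (and therefore icontains) effectively does: a sequential scan that no ordinary B-tree index can help with.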
But someone must have had this problem before. Clearly we're not the first ones, not even the first generation, to think about search. And if you think that, you're actually correct: we already
solved this. When you're reading a book, you can flip to the end, and sometimes you'll see an index: an index that points you to where the relevant terms occur in the book. This is a data
structure that first occurs in the twelve hundreds, in a small monastery in Europe: concretely, around 1230, a group of monks completed a whole concordance of the Bible, essentially taking all the words in the Bible, putting them in sorted order, and then keeping, for each of them, a reference to the chapters and passages where they actually occur. This enabled them to have much better discussions and to find things much more quickly. And this very same data structure is the same
data structure that powers 90 percent of all the searches out there. This data structure is called an inverted index. What it is is a list of terms, a list of words that actually occur in the dataset, so in this case we have words like django and flask, and note that they are in sorted order; that's important. And then for each of those rows we have a list of, in our case, files where those terms occur, and again you can see that those are in sorted order, and that will be important too.

So how do we use this data structure for search? If we want to search for python AND django, we just find the two relevant lines. Finding those two is simple, because the lines are sorted, so we can immediately go to the django line and to the python line. And now what we have is two lists of documents (posting lists is the proper term) that we just need to merge. So we're merging two sorted lists and outputting the documents that are actually present in both, which is a fairly simple operation: merge sorted lists, iterate over multiple of them at the same time, and output anything that's in both. That's the sort of task you might get during an interview; it's not that hard. It can get a little tricky when you have multiple lists, but it's really not that hard.

This data structure also lends itself very nicely to further enhancements. With it as we have it now, we can do AND and OR queries and anything like that, but if we add some more information into the data structure, we can do a lot more.
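The AND query described above can be sketched like this (the toy index and its terms are made up; real posting lists are compressed on-disk structures, but the merge is the same idea):

```python
# A sketch of an inverted index and an AND query: each term maps to a
# sorted posting list of document ids, and the query is answered by
# merging (intersecting) the relevant sorted lists.
index = {
    "django": [1, 3, 5],
    "flask":  [2, 5],
    "python": [1, 2, 3, 5],
    "web":    [1, 5],
}

def intersect(a, b):
    """Merge two sorted posting lists, keeping ids present in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(index["python"], index["django"]))  # [1, 3, 5]
```

The merge walks each list once, so it runs in time proportional to the lengths of the posting lists, not to the size of the documents.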
For example, if we add positions, so that for each document we also store the position of the word, we can see that for file 1 "python" was the fourth word, and for file 2 it was the second word. Why this is useful is because it enables us to make the condition in the merging phase a lot more sophisticated: we can do phrase searches. If we look for the phrase "python web", which means the word python immediately followed by the word web, we do the same merge process, except this time we take the position, the offset of the word, into consideration: we only consider it a match if the difference between the positions of the two words is exactly one. That means I can do phrase searches even though I'm only indexing individual words, and when doing searches I don't care how big the document is, because I'm not scanning for the words directly; it doesn't scale linearly, it scales much better.

And just by adding more information, you can do even more. You can immediately see that the offset difference doesn't have to be one: I can, for example, say python followed by web with at most 10 words in between, so I can do proximity searches. I can do a lot more, and with that in mind we will discover a little bit more today of what I can do with it. But in the end, it's always just the inverted index, which is a very nice data structure, and it's a question of what I have to put in to be able to get the right things out.

So this was a brief introduction into how an inverted index works.
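The positional variant described above can be sketched as follows (again a toy in-memory index, not a real engine's layout): each term maps to per-document position lists, and the phrase condition checks position differences during the merge.

```python
# A sketch of phrase search over positional postings: each term maps
# to {doc_id: [word positions]}; "python web" matches a document only
# if some position of "web" is within max_gap words after "python".
positional = {
    "python": {1: [3], 2: [1, 7]},
    "web":    {1: [4], 2: [5]},
}

def phrase_match(index, first, second, max_gap=1):
    """Docs where `second` occurs within `max_gap` words after `first`."""
    hits = []
    common = index[first].keys() & index[second].keys()
    for doc in sorted(common):
        firsts = index[first][doc]
        seconds = set(index[second][doc])
        if any(p + gap in seconds for p in firsts for gap in range(1, max_gap + 1)):
            hits.append(doc)
    return hits

print(phrase_match(positional, "python", "web"))             # exact phrase
print(phrase_match(positional, "python", "web", max_gap=10))  # proximity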
The important thing to remember is that you're always just merging sorted lists, which is a very, very efficient operation, once, of course, you have the inverted index on disk or in memory. So how do you get there? How do you build this data structure so that you can then search very effectively? Hopefully I've already convinced you that you can search very effectively with an inverted index, so now let's see how you actually build it.
So let's assume that we have a phrase, a sentence; let's call it a document, for our use case. And this is something you've surely heard before: "Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design." This document, this raw text, is what we have on the input, and now we need to split it into words. So how do we do this? I've already provided you with the result, and you can see that there are several surprising things in it.

For example, I've omitted some words. I don't care about words like "is" or "and"; they are so common that chances are every single document will have them, so there's no point in actually indexing them, because they will bring nothing to the table. So instead, as a form of optimization, we just throw them out; no one cares. That's one thing.

The other thing I've done here is that everything is lowercase, because the inverted index only does exact matches: if somebody searches for a lowercase "django", we still want to be able to find this document.

And then what I've also done is normalize the words. You can see that in the text I have "encourages", but what I'm actually going to index is just "courag". This is to allow for the morphology of the language, for the different shapes of the words, because "jumping" and "jumps" are the same word: when I search for "jumping", I want to find a document that just mentions "jumps" or "jumped". This is a process called stemming, where you take the word and find just the stem, just the core of the word that carries the meaning, and drop all the grammatical constructs around it.

One last thing I'll mention: you see that for "rapid" I also index "fast". So the inverted index doesn't necessarily have to contain the same words, in the same shape, or even the same number
of words, as the original document. I can just index additional words, for example "fast", so that when somebody searches for "fast development" as a phrase, they should find this document, even though the creators of Django have a very large vocabulary and are not limited to simple words like "fast". The people searching are, though, so we need to take that into account. This
process is called analysis. The example that I gave here was composed of these steps: split into tokens, i.e. words; remove the ones we don't like; lowercase everything; do the stemming; apply the synonyms. This is a very interesting and important process: it varies for different languages and it varies per use case, because in some use cases you want to do this analysis, and in some cases you just want to index the raw value as it is. And in some cases you might want to do a completely different type of analysis: for example, you might want to index every three letters as they follow each other, the so-called n-grams, to do partial matches, and stuff like that. It's all defined as part of the analysis, and it happens at index time: when a document comes in, we apply this analysis process to get the tokens that we then put into the inverted index.

So if you want to change it, you need to re-index your data. This is the one drawback of search engines: if you change any of this high-level configuration, you need to re-index your data. The other important thing to keep in mind is that this needs to be applied to the queries as well, because it's all nice and well if I lowercase everything in my document, but if I then don't lowercase the query that comes in, it will again never match. If I don't remove the stop words from the search, and I try to look for a word which I haven't indexed, it will fail. So I need to apply the same process to queries as well.

So this was analysis: the process of actually taking the text, splitting it up, and populating the inverted index. During this process we got all the tokens, and we obviously have the document ID to put into the posting lists, so now we have all we need to actually build the inverted index.
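The analysis pipeline described above can be sketched like this. The stop-word list, synonym table, and the suffix-stripping stemmer are crude stand-ins invented for illustration; real engines use proper stemmers such as Porter or Snowball.

```python
# A sketch of the analysis steps: tokenize, drop stop words, lowercase,
# stem, and expand synonyms. Remember: the same pipeline must be applied
# to queries, or lowercased/stemmed index terms will never match.
STOP_WORDS = {"a", "and", "is", "that"}
SYNONYMS = {"rapid": ["fast"]}

def crude_stem(word):
    """Very crude suffix stripping; a toy stand-in for a real stemmer."""
    for suffix in ("es", "s", "ing", "ment"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def analyze(text):
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in STOP_WORDS:
            continue
        tokens.append(crude_stem(word))
        tokens.extend(SYNONYMS.get(word, []))  # index extra words too
    return tokens

print(analyze("Django is a Python framework that encourages rapid development."))
```

Running the same analyze() over a query string guarantees that, for example, the query "Rapid" produces the tokens "rapid" and "fast" and therefore finds documents indexed either way.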
So what else is part of the search? What else is the major difference between filtering and searching? Well, the biggest one is probably relevance, because when you just do a filter, when you do LIKE queries, or icontains in our case, it will tell you which documents actually contain the word, but it won't tell you which one contains it better, because it has no concept of relevance.

There is a science behind relevance. This is the formula from Lucene, one of the most widespread search libraries out there, the library that powers both Elasticsearch and Solr and can even be used raw; there's even a Python interface to Lucene. It's a scary formula, but it's actually not that difficult to wrap your head around once you break it apart. The most relevant part is in the middle: it's the sum that takes TF and IDF into account. You'll find the TF-IDF formula in the middle of any relevancy formula. TF stands for term frequency and IDF stands for inverse document frequency. The TL;DR version: the higher the TF, which means we have found multiple occurrences of the term in the document (the word is repeated multiple times in the body or in the title), the higher the score; and the IDF means that the more common the word is across documents, the lower the score. It's a balancing act, because if I find a word five times but the word is contained in almost every other document out there, that's not really saying much; but if I find a word twice that is only contained in 5 percent of my documents, that means it's a good match. So TF-IDF is just a mathematical expression of how to put this down in a formula.
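The balancing act described above can be sketched as follows. The sqrt(tf) and log-based idf mirror the classic Lucene-style weighting, but the constants and the surrounding formula here are simplified for illustration, not Lucene's full similarity.

```python
import math

# A sketch of TF-IDF scoring: term frequency pushes the score up,
# document frequency (how common the word is) pushes it down.
N_DOCS = 100  # total documents in a hypothetical collection

def tf_idf(term_freq, doc_freq, n_docs=N_DOCS):
    """Score one term in one document.

    term_freq: occurrences of the term in this document (from the index).
    doc_freq:  number of documents containing the term, i.e. the
               length of the term's posting list.
    """
    tf = math.sqrt(term_freq)                 # diminishing returns on repeats
    idf = 1.0 + math.log(n_docs / (doc_freq + 1))
    return tf * idf

# A rare term found twice beats a ubiquitous term found five times:
print(tf_idf(term_freq=2, doc_freq=5) > tf_idf(term_freq=5, doc_freq=95))  # True
```

Both inputs come straight out of the inverted index, which is why the score can be computed during the merge with no extra lookups.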
What we also take into account is the length of the field, because the assumption is that if we find something in the title, or in a tag, or something like that, it's probably more relevant than if we just find it somewhere in the middle of the body: there's a lot of text in the body, so the chances of our word being there are higher, and we need to take that into account. So this is a very helicopter overview of how relevancy works and how we can actually compute it. We have access to the term frequency, because that's part of the inverted index, and we also have access to the document frequency, which is how many documents actually contain the term, because that is simply the length of the posting list for that term. So we don't need to do any extra lookups; we have direct access to those numbers, so we can calculate the relevancy score very quickly.

That's all you need to know to build a full-text search. You could implement this in Python, with whatever you like for your data storage, and it would work just fine. But this is not all there is to search: we've just built the part behind the search bar, the text input. There are other parts on the page that we might consider part of the search, and that are definitely part of the search experience, so let's look at them very briefly.

The first one is highlighting. When I search through a book and I find the word somewhere in there, I cannot return the whole book and just tell the user, "here, I found it somewhere in there"; that's not very useful. Instead, what we can do is return fragments: I found it in the fifth chapter, in the second paragraph, in the second sentence, and this is how the sentence looks. That's a process called highlighting, because in that fragment I can also highlight the matched word in the text.
And that's done, again, easily, just by augmenting the inverted index to also contain the byte offset of the original word. For example, for the word "fast", which is nowhere in the text, the thing that I index is "fast", and I index the offset of the original word, which is, say, 60 to 65. Then, when we have a match on the word "fast", I just reach into the text, into the document, retrieve the text around bytes 60 to 65, and put HTML emphasis tags around characters 60 to 65. Again, a very simple operation: I go somewhere, I retrieve a range precisely specified by the index, and I put some HTML markers in. That's all I need for highlighting.
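The offset trick described above can be sketched like this (the sentence, the offsets, and the "fast"-to-"rapid" synonym mapping are illustrative; here the offsets are character offsets rather than bytes):

```python
# A sketch of highlighting via stored offsets: the index remembers where
# in the original text each token came from, so we can wrap exactly that
# span in markup without re-scanning the document.
text = "Django encourages rapid development and clean, pragmatic design."

# Offsets as they might be recorded at index time (start, end); the
# synonym token "fast" points back at the original word "rapid".
offsets = {"fast": (18, 23), "rapid": (18, 23)}

def highlight(source, span):
    start, end = span
    return source[:start] + "<em>" + source[start:end] + "</em>" + source[end:]

print(highlight(text, offsets["fast"]))
# Django encourages <em>rapid</em> development and clean, pragmatic design.
```

Note that a match on the synonym "fast" still highlights the word the author actually wrote, because the offsets were recorded before the synonym was substituted.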
The last one we have time for is my favorite, and that's facets and filtering. Look at the left side on GitHub, for example, or when you're looking for something on Amazon: you have the breakdown by different things. On GitHub it's a breakdown by type (is it a repository, is it code, is it a user?) and a breakdown by languages, and for each of those you also have a little number, like how many of the repositories found for your search are in the Python language, how many in Ruby or JavaScript. This is something that's called a facet, or, lower down, an aggregation. And this is the hard part; if you remember from the beginning, this is the part of exploration. This is something that enables you to explore what's in your dataset without you having to know it beforehand: if you search for django, you immediately see that the majority of the repositories mentioning django are in Python, right there. What you can then do is click on Python and have the results filtered to only contain the Python repositories.

How this works under the covers is a feature called aggregations. This part is specific to Elasticsearch; other search engines do it slightly differently, but much of the functionality is in all of them. We have two different types of aggregations: buckets and metrics. Buckets are those that define the groups of documents; you could say it's something that you would put in a GROUP BY in SQL. For example, you can have a group by language or by type, or you can group by distance, and so on; that's how you define the buckets. And the interesting part is that you can nest those buckets: you can define a bucket per language, and for each language you
can define a bucket per month, so in one query you can actually get the distribution in time for the different languages. And then, inside those buckets, you can ask for certain metrics: the count of the documents, how many documents fell into that bucket, or the average value of a certain field, or something more sophisticated. We've already seen facet navigation as one way this can be used.
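The nested buckets plus a metric described above can be sketched in plain Python (the repository records are made up; a real engine like Elasticsearch computes this server-side in one query over the matching documents):

```python
from collections import defaultdict

# A sketch of nested bucket aggregations with a count metric:
# group documents by language, then by month, and count each bucket.
repos = [
    {"language": "Python", "month": "2014-08"},
    {"language": "Python", "month": "2014-09"},
    {"language": "Python", "month": "2014-09"},
    {"language": "Ruby",   "month": "2014-08"},
]

def nested_buckets(docs, outer, inner):
    """Count docs per (outer bucket -> inner bucket), like nested aggs."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        counts[doc[outer]][doc[inner]] += 1
    return {k: dict(v) for k, v in counts.items()}

print(nested_buckets(repos, "language", "month"))
# {'Python': {'2014-08': 1, '2014-09': 2}, 'Ruby': {'2014-08': 1}}
```

Swapping the count for an average, or the month bucket for a distance range, gives the other aggregation shapes mentioned above without changing the overall structure.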
The other way this can be used is to actually visualize it. This is a Kibana dashboard, which is just an application that sits on top of Elasticsearch, doing aggregations and visualizing them. You can see the time series there: everything is split by time, in this case I believe into five-minute intervals, and within each five minutes it is split between the four or five different types of request. And then all you need to
do is filter: if somebody clicks on something, you add a filter. A filter again uses the inverted index, but in this case there is no analysis, no relevancy, nothing; it's just an exact match, and because of that it's very fast and very cacheable.
And with that, we're a little bit out of time, I'm sorry for that. We do have time for a number of questions, though, so I'll be happy to answer any of your questions. Thank you.

Metadata

Formal Metadata

Title: From icontains to search
Series Title: DjangoCon US 2014
Part: 11
Number of Parts: 44
Author: Král, Honza
Contributors: Confreaks, LLC
License: CC Attribution - ShareAlike 4.0 International:
You may use, adapt, and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author/rights holder in the manner specified by them, and the work or content, including in adapted form, is only passed on under the terms of this license.
DOI: 10.5446/32829
Publisher: DjangoCon US
Release Year: 2014
Language: English

Content Metadata

Subject Area: Computer Science
Abstract: A good search experience for your users is about more than just a more efficient way to find models containing a certain word or phrase. In this talk we'll go through the relevant parts of search and how to best implement them (we'll use Elasticsearch for actual examples). The goal of the talk is to demystify search engines and provide the attendees with the tools required to create a great search experience for their sites. It will cover basics from information retrieval theory as well as practical examples on how to integrate search into their apps.
