Merken

# Speeding up search with locality sensitive hashing

#### Automatisierte Medienanalyse

## Diese automatischen Videoanalysen setzt das TIB|AV-Portal ein:

**Szenenerkennung**—

**Shot Boundary Detection**segmentiert das Video anhand von Bildmerkmalen. Ein daraus erzeugtes visuelles Inhaltsverzeichnis gibt einen schnellen Überblick über den Inhalt des Videos und bietet einen zielgenauen Zugriff.

**Texterkennung**–

**Intelligent Character Recognition**erfasst, indexiert und macht geschriebene Sprache (zum Beispiel Text auf Folien) durchsuchbar.

**Spracherkennung**–

**Speech to Text**notiert die gesprochene Sprache im Video in Form eines Transkripts, das durchsuchbar ist.

**Bilderkennung**–

**Visual Concept Detection**indexiert das Bewegtbild mit fachspezifischen und fächerübergreifenden visuellen Konzepten (zum Beispiel Landschaft, Fassadendetail, technische Zeichnung, Computeranimation oder Vorlesung).

**Verschlagwortung**–

**Named Entity Recognition**beschreibt die einzelnen Videosegmente mit semantisch verknüpften Sachbegriffen. Synonyme oder Unterbegriffe von eingegebenen Suchbegriffen können dadurch automatisch mitgesucht werden, was die Treffermenge erweitert.

Erkannte Entitäten

Sprachtranskript

00:00

without a a matching and I'm a data

00:09

scientist at list with a reflection company and we basically go go on the internet and get all sorts of fashion products from all sides cell is also designed for them all in 1 place so that the user can just go to our website in the following every designers and browse your favorite fashionable and biting from us so that's the principle I'm going to have to be

00:30

talking about new data such that nearest neighbor search is very simple principle basically you have a point and you want you want to find other points that are close to so the most obvious applications is obviously not so that is something we use every day you've got you got let's see a location on a map and you want to ask you will Bingo where we want to ask water the newsletters restaurants to me or what other news conference and and that's what it does it figures out where all your mother you give a delegation and that looks out on the other points on the map the restaurants they looking for and the calculates distances between where you are and the resting days and and tries to give you the closest ones so that's basically the essence of nearest neighbor search Given a point given other points that are close to it now this is the most obvious application but it's even if you're not building a mapping application which you may well not be all that we certainly aren't you still find it very useful so what do we use it for what use the full image search and used for recommendations so in how does it work the principle of Saturday's when you've got an image let's say that so we have a user and states submit an image to us dress and they want they want us to give them images of similar interests or something something that's a good substitute for the dress that it submitted to some as programmers it's sort of hard to find similar images just by looking at an image so the 1st step we want to do is to transform an image into something that you can more easily work so that 1 very naive idea would be to let's say well we've got an image which is RGB migrate during the green and blue and these are the numbers for a given image we could average the values of each column and they try to find out the images of database which have sort of similar public values and that's a very naive approach and it's definitely not going to work but it illustrates the principle that after transforming the image into a numeric representation you've got a point in this case you've got a point in a 3 D space and that's news neighbors such if you want to find similar images to to the image the user just has just given us we are and we look at other images other points which are close in that space so that is news nuisance the way actually do it is something more complicated so recently there but the most fashionable way of doing this and probably the most effective is our deep learning so this is a very simplified diagram of a convolutional neural network so basically the the black square the far left and all of this slide is that is it is the image that you started and then you basically build a machine learning model which successively detects more the more interesting features of a given image so common that they 1st year you detect just simple edges my is alignment of part of the images that goes from left to to right that sort of stuff but is progressively build better and better representations you start more modern out of about 2 minutes so maybe the 1st layer you just text edges in the 2nd layer you take shapes the squares circle and the following layers you you you text file higher-level constant is a capsule was the dog was the building of the bridge now the nice thing about it is that in the final layer so you got to pay as long list of numbers and ordered list of numbers vector which represents a point in space in a high-dimensional space and the nice thing about it is well images of cats in that space are going to be close to other images of counts images registered with close to other images of bridges so that's how you can do very good very good image search and this is indeed what we return list of process take images and process them into this point representation high-dimensional space and I want to use his name was such as to find similar images that's useful for 2 things like 1 thing is actually give us an image and we can find similar images or maybe you type in the text right and then we convert to text phrase into a point in the same high-dimensional space where the images are and suddenly you can find images which is similar to the text itself so that's really and that's something we can are there other useful applications the duplication so we've got go to images and we have no metadata associated with those images but we know that the the same products like the underlying truth is this is the same product and we can use this new neighbor search to find out that these are the same product and present them just as a single thing about 1 to 2 different things so that means and other applications recommendations so the this approach is analysis of collaborative filtering approach so basically you you have your products and heavy use new represent both products and uses a sport space so let's say we take our products and this is this is a function of points and recast them them into this high high-dimensional space so we do the same for using is users and represent response and recast these points at random into a high-dimensional space and then what do is with a given user by you pointing out space interacting with a given product we do all the points together but if a given user did not interact with a given product would push them further apart and the nice thing about it is at the end of this pushing apart and putting together process users and close to the point to the products that they would like and far apart from the things that wouldn't like so when you want to recommend things to that while we look at the point of that that represents them in a high-dimensional space and then we use news such to find find the products they would like so that's also very very so on a data scientist and innovative scientists very excited by the sort of things and we spend a lot of time thinking about them in implementing them in a sort of go away work for 6 months and you could come up with amazing solution we translate pictures of products into this high dimensional space is similar and similar products out those space and as well as a data scientist I've done this amazing thing and and now works and it's going to be great is deployed on the website and make lots of lots of money all make use of free rehab now this is where you are a data scientist ability of beautiful child which is going to be great and it'll go and that's just deployed magnets make it so how would you do it right you you have given a query point of take a user and you want to find all the newest products and that's simple right you go to a database and find all the points representing products and you compute the distance between the query point of all the points all the products in database and all that same give eclipses once simple right I mean it could consistent so yes but now the problem

07:17

is that lists we have 80 million images and we have about 9 million products so if we wanted to the simple solution ie you know calculate distances with all the points and then return the closest I users would be would be

07:31

very very broad but it's only finished take literally minutes so that would work so how do we make it worse the colleagues since the touching to so we all know about abstinence of pictures in Python so we're going to build a special constant so we going to pick a hash function which unlike normal hash functions in maps points that are close together in space for the same fashion which is very different than normal functions which are supposed to map things uniformly over the past so this is so special takes 2 points that are close together in space where not to the same hospital and we just build a hash table normal have table using them so you take a point and you take the half codes and then you put your points in the in the hash buckets corresponding to half codes and magically all the points are close together with and the same bucket and when you doing searches look up the bucket that you need and you just set within the about which is really so to do this at least and we use random projection forest and I'm going to tell you how that works and so this is our much space of points so we've got about a hundred grade point and I want to point which is the query point the point I want to find find nearest neighbors for and this is how we do it so if we didn't locality-sensitive hashing that have to calculate the distance between the blue points and every other point which takes too long we cannot do some to make it faster we draw a line around the line the beauty of which is to take line draw has to go through origin otherwise the line will do and the nice thing about it is 1 the if we look at the picture most of the points that are closer to the query points and on the same side of the line and the points that are not close not close to the query point up on the other side of the line so just by drawing a land line we managed to create 2 hot markets and we're suddenly we only have to look for half of all points to find nearest neighbor so that's already know a speedup factor 2 just the 1 on the line and you know we don't have to do anything intelligence to to draw the line is just a random 1 so that's a nice thing about and if this thing that is not enough for you you draw another round again completely random and the point that end up on the same side of the line and up in the same if this again the speed of is not enough for you to keep drawing lines and you've got few enough points in a hash buckets and that's your that's the right so we get drawn up clients to have small enough such markets and then when you need to perform a nearest neighbors query it take the blue points and calculate which have packages and up and then just computes brute-force distances between the query point and the points that in the same so that the principle and if you can think of it as building a binary tree is long so started all the points and then we have a suspect and the points on the left side of the line going to the left subtree the points on the and line going to right subtree and then we follow the right subtree in this example and do another splits into a left subtree right subtree another splits another splits and you can do so with a query point you can start the root of the tree and then follow the states onto the end up in the right half of so that's the principle of how it no let's say you in some cases but in some cases boundary right there serve the way we started with at least we thought we're going to draw a fixed number of his life and hopefully that will give us the up and hopefully will be accurate so what we do is we decided on dimensionality let's say we want to do a hundred on states and then after a 100 almost that's installed and and things that end up in the same package of the nearest neighbors and the rest discount that that works reasonably well if your points of uniformly distributed in your space right because all regions of fairly equal density wherever you draw the lines you hot markets are going to look through all roughly the same size the same number of points and these are going to be good enough but spaces where some regions and problems with some regions of the space of high density but other regions of low density where you're going to end up is with buckets some having lots and lots of points and some been completely empty neither neither of which is very good so if you have a bucket with lots and lots of points In order to get a good speed up if you have a bucket with very few points in its you know again getting results both of which are hard so the 1st point is keeps splitting and to denote a small enough so we don't take a fixed number of states you just below the binary tree and displayed to the splitting is that until you reach a stopping criterion which is this not it contains x number of points and when that happens to stop splitting and you take the tree so that the 1st point the 2nd point is use medians splits so if you just take random hyperplanes London run lines you can you can end up with highly unbalanced trees so the left subtree very short because for some reason that very few points there but myself through very very very deep because a lot of points in that part of the space that's not horrible but it's not great either because you spend more time traversing the deep part of the tree and you will be traversing the deeper part of more often because that's where more points so means that you take a random line and then you can't the median distance from the point to the line and then spectrum that distance so that it is guaranteed that half the points always go left tree and right half the points we always get the right subtree so that gives you a nice balance tree and faster traversal and the final point is to build a forest of trees so don't just build 1 tree will you build lots of trees was the reason for the well the random projections algorithms and locality-sensitive hashing in general is approximate they don't give you an exact answer the probabilistically correct but in some cases there will be wrong so if we look at this picture if you look at the query point well maybe there are some points to the left of the lines that are closer to the query point to the points in the right side of the law but if you build just 1 tree we never going to surface these points goes then in a different part of the train so that that's a mistake so this algorithm makes an and that's that's a great you won't have as good results as we can getting speed so the way we get around this problem it would be lots of training and because in each tree that the lines are chosen at random again each she will make its own air but they will not repeat each other's there so when you combine the predictions from all the trees they will end up correcting for each other and the the accuracy score the aggregate will be higher than the actresses all this in a single tree so that's why that's where the loss of the the other less property of this approach is if you build more trees you going to get more results at the expense of while having to traverse the tree so that it takes a lot but this is something you can control can pick basically the trade-off between the number of trees build it controls your performance and you can pick a point on the performance of the speed versus accuracy because and that's appropriate for applications for answering this if you want your speed Bill few trees to get go accuracy they're very quick if you want something accurate but don't care about speed up much lots of trees will be accurate maybe also fast OK so that's the that the principle of the algorithm does anybody have any questions at this point because of happy to the this yeah was hit it

15:35

works well for 2 dimensional called generalizes to higher dimensions like the view all I assume you lose hyperplanes right but how do you so we need to know the feature space is a hyperplane right so so the algorithm the actually 1 of the main reasons for the existence of the algorithm is high-dimensional space and the reason why I seem to be on the slides because to these easy to visualize high-dimensional spaces aren't that's why it's into the but basically what you do is you in high dimensions you draw a run a hyperplane which is you know whatever you dimensionality is and then you do the same calculation so it's it's exactly the same principle just high-dimensions and works very well for for very high dimensions and they like interval questions OK so that's the principle of the algorithm and of you unit in Python

16:30

system with at a Python conference itself so useful to to give an idea of Python packages so that that so python packages for doing this and 1 of them is annoying which is which is that there a very cleverly named package ANN approximate nearest neighbors very cleverly named bycatch from school to find the right it's a Python wrapper for C + + code it's that installable it's very fast another another packages forest which is psychic like those of you who of data scientists will play with ML you probably already have cited learning computers so so easy to use because only then and is also quite quite quite a few and then you've got flan which is not believe that single Sparse Code to and sort of only and and hard to deployed and the nice thing about it is you give it your data and then it takes a long time to train but it figures out the optimal structure for your problem and that gives you high performance the end and a Python wrapper for it which works explained told there some some some some some bits that we don't like as much so as the forest itself I could learn how you can read sources very readable but it's actually very slow so if you if you want to to develop for quite high performance applications maybe not the best solution found is a pain deploy its 2 plus coding have seen many also dependencies and always great so recommend you use it but for us it didn't really fulfill 2 important terms requirements 1 is it doesn't allow it wants to build a forest of trees you can't ordinal points to it which us was no no we need to add new priority index of the command so this is something we needed and secondly you cannot you cannot do this of course you have to keep all the vectors in memory all the time and so we are so like any engineer rotor and it's called up the

18:30

forests which speaks to the algorithm is available and get have and so on it's been installed as well answer so please go forth and tried out and break it all sorts of always and I'll try to fix it's

18:43

quite fast it's not as fast as a more about it so fast enough processes such as much much faster than the elicit forest which is spoken cycling allows adding new items to the index and does not require us to store all the points in memory which is which is really really really Ravenna so how do we use it we use it in conjunction with post goes so basically River like service had that has the and indexes and the other forests so we send the query point there and what it gives us back is these are the properties of the easily measured these you are going to be interested in so you get these ideas back and pushed them to post and go they post rescue idea no idea these please apply the following business rules all the filtering all you know we West expenses forth and then do final distance calculations also impose grows using C extensions so we stole the actual point quantification of the actual vectors as a race uh in PostgreSQL and we've written some parts extensions to PostgreSQL allow us to do the distance got calculations in which climate the excitement because this is

19:49

also not if you doing so there also numerical stuff you have raisin partner never arrays press and you can write custom functions to do anything you want in C. so if you really want to you can a stochastic gradient descent Machine Learning governments impose present environment posters because and I'm not sure you should do that that is definitely possible so the whole

20:11

combination the around the algorithm and implementation and process as data as a backing store and gives us a fast and reliable and services that would support in production and give us approximately 100 times speed up was 60 % precision upset at 10 means that if we get 10 nearest neighbors we get this 6 out of 10 actual nearest neighbors using the approximate approximate approach 100 times speed up so I that samples reasonably good you know scalable brute force hybrid not particularly demanding speed base and that that's why we start so so we recall that it allows us in this thing up that we gained it allows us to several some results so we don't have to be computed we can just 7 real-time which is also very nice I can't it's all built on top of real database so stuff concurrency and updating the database that's all taken care of by by smarter people so it's well

21:14

thinking and the very happy to take in question

21:19

without

21:31

yeah I know I just did something like that

21:35

for chemistry this distance and to put forest uh but I would like to ask you for an estimate of watch number of entries this is going to the heart so when it's my application going to need trees like in the future I don't know way I mean it's sort of an empirical question right so it then depends on requirements as well as the let's say the doing offline processing and you want to you you don't remind it's taking 10 seconds of 20 seconds with the data you have that kind money let's say 1 lookup takes 10 seconds but you need to do you look 100 thousand times and then maybe that's not that's stuff the point we really want to look into the solution yeah I think if it's that that that sort of if it's if it's fast enough for you for you doing you don't need to add to the full so what was your pain point in in your application but was too long for you and so let's say if you want to serve web requests by 100 ms is too long for and doing this proof also take and you up to 3 seconds and would completely destroy the database and so on so getting from the seconds 202 around 100 seconds that was the those 2 different forms who thinks question of clustering so as a algorithms like K-means and so on which could be fostered and precision have less precision this yes so the could the question is can you use clustering algorithms to achieve this the same sort of effect and yes you can and there's an approach called hierarchical clustering which again is sort of building building of let's a binary tree for example where we have all the data points and you put them into 2 clusters are and you be able to clusters and then decided to classes in the 1st level of the tree and then you go down into each cluster and recursively build small clusters and keeps pituitary but also approach until something i've investigated so I cannot say cannot give you a good on water what's the performance trade-offs can questions to thank you Madam

00:00

Web Site

Spiegelung <Mathematik>

Zellularer Automat

Mailing-Liste

Biprodukt

Matching

Quick-Sort

00:29

Programmiergerät

Punkt

Prozess <Physik>

Freeware

Selbstrepräsentation

Kartesische Koordinaten

Bridge <Kommunikationstechnik>

Kardinalzahl

Newsletter

Zählen

Ähnlichkeitsgeometrie

Raum-Zeit

Computeranimation

Eins

Spezialrechner

Metadaten

Randomisierung

Punkt

Substitution

Lineares Funktional

Datenhaltung

Gebäude <Mathematik>

Güte der Anpassung

Abfrage

Ähnlichkeitsgeometrie

Biprodukt

Rechenschieber

Rechter Winkel

URL

Computerunterstützte Übersetzung

Aggregatzustand

Sichtbarkeitsverfahren

Web Site

Wasserdampftafel

Zahlenbereich

Virtuelle Maschine

Informationsmodellierung

Migration <Informatik>

Endogene Variable

Datentyp

Biprodukt

Verband <Mathematik>

Abstand

Bildgebendes Verfahren

Kreisfläche

Vektorgraphik

Raum-Zeit

Dimensionsanalyse

Mailing-Liste

Vektorraum

Umsetzung <Informatik>

Elektronische Publikation

Quick-Sort

Mapping <Computergraphik>

Codec

Diagramm

Quadratzahl

Mereologie

Neuronales Netz

07:30

Resultante

Retrievalsprache

Einfügungsdämpfung

Merkmalsraum

Punkt

Kartesische Koordinaten

Wald <Graphentheorie>

Lokalität <Informatik>

Raum-Zeit

Computeranimation

Gradient

Nachbarschaft <Mathematik>

Netzwerktopologie

Client

Einheit <Mathematik>

Algorithmus

Prognoseverfahren

Polygonzug

Code

Existenzsatz

Randomisierung

Punkt

Wurzel <Mathematik>

Gerade

Lineares Funktional

Sichtenkonzept

Kategorie <Mathematik>

Gebäude <Mathematik>

Abfrage

Rechnen

Binärbaum

Knotenmenge

Dialekt

Teilbarkeit

Dichte <Physik>

Rechenschieber

Randwert

Funktion <Mathematik>

Rechter Winkel

Heegaard-Zerlegung

Projektive Ebene

Ordnung <Mathematik>

Tabelle <Informatik>

Aggregatzustand

Partitionsfunktion

Ausgeglichener Baum

Hash-Algorithmus

Wellenpaket

Hausdorff-Dimension

Kondition <Mathematik>

Zahlenbereich

Gebäude <Mathematik>

Unrundheit

Punktspektrum

Wrapper <Programmierung>

Heegaard-Zerlegung

Zufallszahlen

Hash-Algorithmus

Hyperebene

Abstand

Videospiel

Fehlermeldung

Wald <Graphentheorie>

Vektorgraphik

Axonometrie

Indexberechnung

Einfache Genauigkeit

Medianwert

Mapping <Computergraphik>

Mereologie

Codierung

Normalvektor

16:29

Bit

Punkt

Wald <Graphentheorie>

Indexberechnung

Einfache Genauigkeit

Kartesische Koordinaten

Computerunterstütztes Verfahren

Vektorraum

Quellcode

Schwach besetzte Matrix

Wald <Graphentheorie>

Term

Wrapper <Programmierung>

Quick-Sort

Code

Computeranimation

Netzwerktopologie

Algorithmus

Automatische Indexierung

Code

Festspeicher

Wrapper <Programmierung>

Punkt

Datenstruktur

18:42

Retrievalsprache

Punkt

Schreiben <Datenverarbeitung>

Maßerweiterung

Kombinatorische Gruppentheorie

ROM <Informatik>

Computeranimation

Gradientenverfahren

Abstand

Maßerweiterung

Lineares Funktional

Wald <Graphentheorie>

Kategorie <Mathematik>

Abfrage

Indexberechnung

Quantifizierung

Schlussregel

Vektorraum

Rechnen

Dienst <Informatik>

Array <Informatik>

Funktion <Mathematik>

Automatische Indexierung

Festspeicher

Stochastik

Dreiecksfreier Graph

Mereologie

Programmierumgebung

20:09

Resultante

Prozess <Physik>

Datenparallelität

Datenhaltung

Schaltnetz

Implementierung

Computeranimation

Datenhaltung

Dienst <Informatik>

Algorithmus

Skalierbarkeit

Forcing

Reelle Zahl

Caching

Stichprobenumfang

Hybridrechner

21:17

Prozess <Physik>

Punkt

Wasserdampftafel

Klasse <Mathematik>

Zahlenbereich

Kartesische Koordinaten

Computeranimation

Übergang

Netzwerktopologie

Benutzerbeteiligung

Bildschirmmaske

Algorithmus

Abstand

Cluster <Rechnernetz>

Soundverarbeitung

Schätzwert

Wald <Graphentheorie>

Datenhaltung

Zwei

Gebäude <Mathematik>

Binärbaum

Quick-Sort

Rechter Winkel

Beweistheorie

Server

### Metadaten

#### Formale Metadaten

Titel | Speeding up search with locality sensitive hashing |

Serientitel | EuroPython 2015 |

Teil | 59 |

Anzahl der Teile | 173 |

Autor | Kula, Maciej |

Lizenz |
CC-Namensnennung - keine kommerzielle Nutzung - Weitergabe unter gleichen Bedingungen 3.0 Unported: Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben |

DOI | 10.5446/20155 |

Herausgeber | EuroPython |

Erscheinungsjahr | 2015 |

Sprache | Englisch |

Produktionsort | Bilbao, Euskadi, Spain |

#### Inhaltliche Metadaten

Fachgebiet | Informatik |

Abstract | Maciej Kula - Speeding up search with locality sensitive hashing Locality sensitive hashing (LSH) is a technique for reducing complex data down to a simple hash code. If two hash codes are similar than the original data is similar. Typically, they are used for speeding up search and other similarity comparisons. In this presentation I will discuss two ways of implementing LSH in python; the first method is completely stateless but only works on certain forms of data; the second is stateful but does not make any assumptions about the distribution of the underlying data. I will conclude the presentation by describing how we apply LSH to search at Lyst. |

Schlagwörter |
EuroPython Conference EP 2015 EuroPython 2015 |