Query Embeddings: Web Scale Search powered by Deep Learning and Python
Formal Metadata
Title 
Query Embeddings: Web Scale Search powered by Deep Learning and Python

Title of Series  
Part Number 
45

Number of Parts 
169

Author 

License 
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt, copy, distribute and transmit the work or content, in adapted or unchanged form, for any legal and non-commercial purpose, provided the work is attributed to the author in the manner specified by the author or licensor, and the work or content, including adapted forms, is shared only under the conditions of this license.
Identifiers 

Publisher 

Release Date 
2016

Language 
English

Content Metadata
Subject Area  
Abstract 
Ankit Bahuguna. Query Embeddings: Web Scale Search powered by Deep Learning and Python. A web search engine allows a user to type a few words of a query, and it presents a list of potentially relevant results within a fraction of a second. Traditionally, keywords in the user query were fuzzy-matched in real time against the keywords within different pages of the index, with no real focus on understanding the meaning of the query. Recently, deep learning + NLP techniques have tried to represent sentences or documents as fixed-dimensional vectors in a high-dimensional space. These special vectors inherit the semantics of the document. Query embeddings is an unsupervised deep-learning-based system, built using Python, word2vec, Annoy and keyvi, which recognizes similarity between queries and their vectors for a web-scale search engine within the Cliqz browser. The goal is to describe how query embeddings contribute to our existing Python search stack at scale, and the latency issues prevailing in a real-time search system. Also included is a preview of a separate vector index for queries, utilized by the retrieval system at runtime via ANNs to get the closest queries to the user query, which is one of the many key components of our search stack. Prerequisites: basic experience in NLP, ML, deep learning, web search and vector algebra. Libraries: Annoy.

00:00
Our next speaker is Ankit Bahuguna, and he will be talking about query embeddings: web-scale search powered by deep learning and Python. Thank you. I will be talking about query embeddings, which is a system we have developed at Cliqz that uses deep learning. First, a bit about myself: I am a software engineer in research at Cliqz, with a background in computer science and data management systems, and I work on building our
00:40
search engine, which is part of our product, as is the browser. The areas that interest me are information retrieval, deep learning and natural language processing, and I have been involved in open source since 2012. So, about Cliqz: we are
00:57
based in Munich, majority owned by Hubert Burda Media, with an international team of about 90 experts from 28 different countries, and we combine the power of search and browsing, so
01:10
that we are essentially redefining the browsing experience; you can check it out at cliqz.com. Here I am talking about search, so let us start with what it looks like. Normally, when you open your web browser, you either go to a link directly or you do a search. What the Cliqz experience gives you is a browser with search built in, which is intelligent enough to directly show you results based on what you type. Say you search for something
01:40
like the weather: you get a weather card directly in the dropdown, and you can search for any place you like. Interestingly, I found out on Monday that it was 41 degrees back home. And of course, if you want to search for news, you get news results. So it is a combination of a lot of features built into a browser, with a complete search engine technology behind it; that is what Cliqz is. A
02:09
bit of history about how traditional search works. Search is a very long-standing problem, studied for a long time in information retrieval. The classic approach was to build an inverted index of the documents, match the keywords of the query against it at query time, and rank the matches; the whole process was about coming up with the best
02:34
documents for the user's query. Over time search engines evolved, and with Web 2.0 there was a lot of new media coming in and people expected more from the web. So we came up with a
02:51
search that is based on matching user queries, where we have a query index, and that index is built from query logs. If you type "facebook" or "fb", it has to take you to facebook.com. Given such an index, you can construct a much more meaningful search experience for the user, because it is enriched by how many times people actually issued a query and ended up on the same page. What we aim to do is construct alternative queries given a user query: if we find the query directly in the index, great; but if it is something different which we have not seen before, we try to construct similar queries at runtime and search for their results in the index. The query index looks something like this: you have a query, and it has URL IDs, meaning each URL is linked to some value; the URL is the actual page you go to given the query, together with frequency counts and other statistics, which allow us to make a prediction of what the right page is that the user actually intended.
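The index structure just described can be sketched in a few lines of Python; the queries, URLs and counts below are made up for illustration and are not actual Cliqz data:

```python
# Toy query index: each query maps to (url, frequency) pairs, where the
# frequency counts how often users who issued that query ended up on the URL.
query_index = {
    "facebook": [("https://facebook.com", 98210), ("https://facebook.com/login", 4410)],
    "fb": [("https://facebook.com", 51033)],
}

def best_page(query):
    """Predict the page the user most likely intended for a known query."""
    candidates = query_index.get(query)
    if not candidates:
        return None  # unseen query: fall back to constructing similar queries
    return max(candidates, key=lambda pair: pair[1])[0]
```

An unseen query returns `None`, which is exactly the case where the similar-query construction described later takes over.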
04:00
To give an overview of the search problem itself in a bit more depth, the search problem can be seen
04:06
as a two-step process: the first step is recall and the second is ranking. Given an index of billions of pages, recall means getting the best set of candidate pages for a user query: say, the ten thousand pages out of those billions which best fit the query. Then comes the ranking problem: given those ten thousand pages, what we want is the
04:38
top three results. As you may know, on any search engine result page hardly anybody looks beyond the first page, so it is very important that the top three or top five results are the best results for your query, and that is what we care about: given a user query, we try to come up with the three best results from billions of pages in the index. So that is the
05:07
background. What we do at Cliqz is use the traditional method of search, fuzzy-matching the words in the query to a document, but we also utilize something a bit deeper and different: semantic vectors, that is, distributed representations of words. What we actually try to do is represent our queries as vectors, which are
05:34
fixed-dimensional floating-point lists of numbers. Given a query, its vector should semantically capture the meaning of the query. This particular idea is called a distributed representation: words which appear in the same context share semantic meaning, and the meaning of the query is defined by this vector. These vectors are learned in an unsupervised manner, focusing
06:06
on the context of the words in the sentences or queries, and the area in which this is studied is neural probabilistic language models. The similarity between queries is measured as the cosine distance between two vectors: if two vectors are close together in the vector space, they are more similar. Hence, we get the closest queries to the user query by finding the closest vectors in the space, and this gives us a good recall candidate set
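The cosine distance used as the similarity measure here is straightforward to compute; a minimal pure-Python version:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Smaller distance means more similar queries."""
    return 1.0 - cosine_similarity(u, v)
```

In production one would vectorize this with NumPy, but the definition is the same.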
06:37
as the first set that we fetch from the index, the one which most accurately corresponds to the user query. A simple example to illustrate
06:48
this: a user types a simple query like "sims pc download", which is a game. What the system gives us is a sorted list of similar queries along with their cosine distance to the query vector the user typed. So given the query "sims pc download" we get a sorted list where the first entry is the closest, something like "sims game pc download". Never mind that the word order is a little different: the vector for "sims pc download" comes out very close to the vector for "download sims pc", because the representation is essentially a bag of words, and we want to optimize for space as well, so eventually the vectors come out nearly the same. As you move down the list the distance increases, and as we will see later, at some point you start getting far-off results, so we are usually concerned only with the queries closest to the user's query.
07:56
A bit more on how this learning process works. What we actually utilize in production is an unsupervised learning technique to learn these word representations. Effectively, given the vector representations of words, you would like the distance between two words w and w' to reflect their similarity in meaning. For example, if you take the vector for "king", subtract the vector for "man" and add the vector for "woman", you get a vector which is close to that of "queen". The algorithm that achieves this is called word2vec, and with it we learn these representations as the corresponding vectors. A bit more about
08:43
word2vec: it was introduced by Mikolov et al. in 2013. There are two different models: continuous bag-of-words (CBOW) and continuous skip-gram. These are distributed representations learned by shallow neural networks, and both models are trained using stochastic gradient descent and backpropagation.
09:04
A bit more intuition on how this works. In CBOW, on the left we have the context words, say a window of five words, and we try to predict the center word: given "the cat sat on the mat", the word "sat" has to be predicted from the other context words. The skip-gram model does the exact reverse: given the center word in the sentence and a context window, you try to predict the surrounding words. With these two models you can define a vector for each word, stored as a lookup table, and learn them using stochastic gradient descent. I will mostly skip the mathematics,
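The (center, context) training pairs that skip-gram extracts from a sentence can be sketched as follows; the sentence and window size are just the talk's running example:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as in the skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
```

CBOW would use the same pairs grouped the other way around, predicting the center word from all of its context words at once.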
09:48
but in short, what we try to optimize is this: a neural probabilistic language model maximizes how often you see a particular word given its context, against how often you see words that are not in its context. A good language model will say: given a certain sequence of words, you are likely to see this word next, and given another sequence of words, you will not
10:14
see a certain word. That is what the model learns. Here is one example of how a
10:22
traditional language model looks. For example, with "the cat sits on the mat", you try to predict which word comes after the sequence, out of the vocabulary or dictionary that you have. The only catch is that your vocabulary can be very, very large: you may be trying to predict the probability of a word out of 7 to 10 million words in your corpus, computing the probability
10:51
of one single word normalized across all of them. To avoid this, we use something called noise contrastive estimation, in which
11:01
we do not score our word against the entire vocabulary. Instead, we pick a small set of noise words, say 5 or 10. Where the true sequence is "the cat sits on the mat", you are pretty sure that "mat" is the right word, but the noise words could be anything, such as "the cat sits on the other", and so on. These words will not complete the exact sequence that you find in the corpus; you sample them at random from a uniform distribution and use these noise words as negative training examples. What
11:37
the model effectively learns is: given the sequence, which word is the right one to come next, and which words are not. If the system differentiates these over and over again, on millions of examples, trained over several iterations, you end up with a model which is able to separate the positions of the right words from the positions of the wrong words by a clear distance. Let us see how this works with an
12:09
example. Say there is a document like "the quick brown fox jumps over the lazy dog" and we have a context window of size 1. Taking the first three words, "the quick brown", with "quick" as the center word, the surrounding words are "the" and "brown". In the CBOW model we would predict "quick" from "the" and "brown"; in skip-gram, which we found works much better in production, it is the reverse: we predict the context words from the target word, so we predict the probability of "the" and of "brown" given "quick". The objective function is defined over the entire dataset, and our dataset is built from a lot of Wikipedia, a lot of query logs,
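A simplified version of the noise-contrastive objective described here, in the negative-sampling form popularized by word2vec; this is a sketch of the idea with toy vectors, not the exact production loss:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nce_loss(center, true_ctx, noise_ctxs):
    """Negative log-likelihood that pushes the true (center, context) pair's
    score toward 1 and each (center, noise) pair's score toward 0."""
    loss = -math.log(sigmoid(dot(center, true_ctx)))
    for noise in noise_ctxs:
        loss += -math.log(1.0 - sigmoid(dot(center, noise)))
    return loss
```

Training then takes the gradient of this loss with respect to the embedding vectors and applies an SGD update, exactly the ascent-on-log-likelihood step the talk describes next.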
13:05
page titles and descriptions, and a number of other textual sources that we have, from which the model learns what a probable sentence is and what a probable sequence of words is. We use SGD for this: say at training step t you have the pair "quick" and "the",
13:22
and probabilistically you also select some noise examples for the non-noise pair, say "quick" and "sheep", where "sheep" should not be part of this context. You then compute the loss for this pair of observed and noisy examples and get the objective function. What we try to do is maximize a value of this form: the probability of the true pair, "quick" in the context of "the", should be given a score of 1, and the noise pair, "quick" and "sheep", should be given a score of 0. By updating the embedding values we can maximize this objective, which is essentially a log-likelihood, and we can actually
14:12
do gradient ascent on top of it. We perform an update on the embeddings and repeat this process over and over for different examples across the entire corpus, and we end up with a lookup table of word vectors. We can choose the dimensionality of the vectors; as you will see in my slides, we use 100 dimensions to represent a word, and that works well for us. So what do these
14:35
word embeddings actually look like? What you get is
14:39
something like this. If you project these word vectors into space, you find that the vector from "man" to "woman" is roughly the same as the vector from "king" to "queen", and you find not just this male-female relation but also verb tenses, like "walking" and "walked", "swimming" and "swam". That is because you might have sentences like "the person is walking" and "the person is running", so "walks" and "runs" occur in the same kinds of context, and this is what the learner captures very nicely. We also get other relational features, like countries and capitals: Spain relates to Madrid the way Germany relates to Berlin. These are country-capital relationships. This slide is a projection down to two dimensions using t-SNE; it is a bit small, but you can see clusters at the bottom and at the top, and words with more similar meanings are actually closer in the vector space. This is a very important property, because if
15:58
you can leverage this and construct sentence or document representations the same way, you get similar documents close together in the space as well, and that is exactly what query embeddings address. The way we generate a query vector from these word vectors is as follows: for the same query
16:14
"sims pc download", we have a vector for each of the words. We do not just use these word vectors as they are; we also compute a term relevance for each word in the query. What you see is a score for each word, and this tells us that "sims" is the most important, most relevant word in the query, because it is the name of the game. Next,
16:43
we use this term relevance to calculate a weighted average of these vectors. What a weighted average means is: given the vectors of the different words and their term-relevance weights, you average the vectors weighted by relevance, and you get an averaged representation of those words. Effectively, our query vector is this weighted-average representation: given each word vector and its relevance, we get a single vector, so at the end "sims pc download" is nothing but this one 100-dimensional vector, and that is what we use as the query embedding.
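The weighted average described above can be written compactly; the 2-dimensional vectors and relevance scores below are made up purely for illustration:

```python
def query_vector(word_vectors, relevance):
    """Relevance-weighted average of the word vectors of a query."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, vec in word_vectors.items():
        w = relevance.get(word, 0.0)
        weight_sum += w
        for k in range(dim):
            total[k] += w * vec[k]
    return [x / weight_sum for x in total]

# Toy example: "sims" carries most of the relevance, so it dominates.
vectors = {"sims": [1.0, 0.0], "pc": [0.0, 1.0]}
weights = {"sims": 3.0, "pc": 1.0}
embedding = query_vector(vectors, weights)
```

In production the same operation runs over 100-dimensional word2vec vectors.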
17:27
A word about the term relevance itself: we have two different flavors of it. The usual choice would be the raw frequency of the words, but we found that it does not scale very well; you could also use something like TF-IDF over the pages. What we use instead is this: given the queries linked to a page, how many times does the term appear in those top-5 queries? Given the data that we have, that is a much better indicator: from the word statistics, this count relative to the term frequency gives us something like a term dominance score. The relative variant is a normalization of that score, and we found that normalizing the scores across all the queries of the index gives slightly better results.
18:22
All of this data is precomputed rather than calculated at query time, and it is stored in an index. For example, for each word we store features like
18:31
frequency and document frequency; you can look all of this up, and similarly for the other words. So far I have
18:41
described what we actually create: a query index. Given a document index which holds all the documents, we have all the queries, and we want to compute a vector for each. We cannot do this for all possible queries, there are just too many, so given all the pages in the index, we pick the top 5 queries which effectively represent each page, which we can get from the page models. Roughly, we come up with about 465 million queries which represent all the pages in the index, and we learn a query vector for each one of them. If you simply ran the whole system over this naively it would take ages, so the problem now is: how do we get similar queries from these 465 million? Given a user query, find the
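Selecting the top-N queries per page out of a query index might look like the following sketch; the function name and data layout are hypothetical, chosen to match the toy index shown earlier:

```python
from collections import defaultdict

def collect_candidate_queries(query_index, top_n=5):
    """Invert a query -> [(url, freq)] index and keep, for each page,
    only the top-N queries (by frequency) that lead to it."""
    page_queries = defaultdict(list)
    for query, urls in query_index.items():
        for url, freq in urls:
            page_queries[url].append((freq, query))
    candidates = set()
    for url, scored in page_queries.items():
        for _, query in sorted(scored, reverse=True)[:top_n]:
            candidates.add(query)
    return candidates
```

Applied to billions of pages with N=5, a step like this yields the set of a few hundred million representative queries that then get embedded.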
19:37
closest 50 queries out of these 465 million. How do we find them? Should we use brute force? It is too slow;
19:47
and we cannot use hashing techniques effectively either, because they are not accurate enough for our vectors: the vectors are semantic units, and a small loss in precision would lead to weird results. So what our solution required was a cosine-similarity lookup that somehow scales to 465 million queries with very low latency. The answer we came up with was approximate nearest neighbor (ANN) vector models, and they turned out to be pretty helpful for us.
20:21
The one that we use is Annoy. It comes from Spotify, it builds approximate nearest neighbor models for all the query vectors we have, and it is actually used in production at Spotify for music recommendations. We cannot train on all the queries and documents at once, because it is too memory-intensive, so we do not train them all together: we have a cluster where we shard these models along with the search index. We train shards of about 46 million queries each, with a certain number of trees (what these trees are, I will explain next). The size of each model is around 27 gigabytes, so across the ten shards what you get out of training is around 270 gigabytes, and everything is stored in RAM, because for us the most important thing is latency,
21:19
and serving from RAM happens pretty quickly; this architecture is what we actually use in production. At runtime you query all the shards simultaneously and then sort the combined results by the cosine distances you get back. Different shards will return different close queries, and eventually you want the best-representing queries overall, the ones with the lowest distances, and that works out quite nicely. Of course, sharding like this is something of a heuristic, but homogeneous shards behave well for the system and it does not noticeably decrease the recall. But first I want to explain how we actually
22:07
use Annoy and how it actually works. It is one of my favorite libraries: you can use it whenever you take a vector-based approach to recall or ranking and you need to find the nearest points to any query point in sublinear time. You cannot compare against every point one by one, that does not scale; you want a data structure that gets you those nearest queries quickly, and the best-suited data structure for that is a tree. So each of my query vectors is represented as a point in the space, and what we try to find is, given a
22:49
certain point, which points are nearest: the user query vector is like a new point landing somewhere in this space, and we want to find the nearest ones. To build the model, what Annoy does is
23:03
split the space: it picks two points at random and splits the space between them, and then it does this again and again, and you get
23:13
something like a tree, a segmentation with a certain number of points per cluster in the different parts of the tree,
23:23
You end up with a binary tree, and the nice property of this binary tree is that points which are close to each other in the vector space are more likely to be close to each other in the tree itself. So if you walk down the tree to a leaf,
23:39
that leaf will be composed of nodes that are all similar in the vector space, and this is a very important feature.
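The recursive random splitting just described can be sketched in a few lines of plain Python. This is an illustrative toy, not Annoy's actual implementation (Annoy is C++ and far more optimised), and the function and parameter names are mine:

```python
import random

def split(points, leaf_size=4):
    """Recursively partition points with random hyperplanes (Annoy-style sketch).

    Each internal node picks two random points; the hyperplane equidistant
    from them splits the space. Returns a nested (left, right) tuple tree
    whose leaves are small buckets (lists) of points.
    """
    if len(points) <= leaf_size:
        return points  # leaf bucket
    a, b = random.sample(points, 2)
    # The normal of the splitting hyperplane is the vector from a to b,
    # and the hyperplane passes through their midpoint.
    normal = [bi - ai for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    left, right = [], []
    for p in points:
        # The sign of the projection onto the normal decides the side.
        side = sum(n * (pi - mi) for n, pi, mi in zip(normal, p, midpoint))
        (left if side <= 0 else right).append(p)
    # Guard against a degenerate split leaving one side empty.
    if not left or not right:
        return points
    return (split(left, leaf_size), split(right, leaf_size))

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(32)]
tree = split(pts)
```

Each leaf holds at most `leaf_size` points, and points that land in the same leaf are likely to be close in the original space, which is exactly the property the tree search exploits.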
23:50
So how do we search for a point in the tree with these splits that we have built? Say x is our user query vector, and we try to find the k nearest vectors to this vector and the queries related to them.
24:07
What you do is search for the point by walking down one path of the tree, and you get, say, seven neighbours out of the leaf. You then use cosine similarity as the metric for how close they are: it takes values between minus one and one, and a value close to one means the vectors are very similar, so it naturally tells you how close two vectors are. But there is a problem: what if you want more than the seven neighbours that one leaf gives you? So we do not just navigate down one branch of the tree; we also navigate into the other branch, and this is managed with a proper priority queue, which tells us which parts of the tree to visit next to collect the closest vectors. You not only look at the side of the split your point falls on, but also at the slightly "wrong" side, because both of those regions of the hyperspace can be close to the user's vector.
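The cosine measure mentioned here is simple to write down; a minimal pure-Python version (numpy would normally be used for speed):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v; ranges from -1 to 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite -> -1.0
```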
25:16
But sometimes you find that, because the splits are random, you can actually lose some near neighbours: a split happened to cut right between two close points. To minimise this, you train a forest of trees, and it looks something like this: you do not build just one tree, you randomise the splits across many trees, so effectively you build the index over many random configurations at once and search all of them in real time. This gives you a pretty good representation, and when you query it you get good similarity between the vectors. So we
25:57
train a forest of trees, so that one bad split in a single tree cannot hurt us much.
26:03
One thing about Annoy, maybe it's a feature, maybe a limitation, is that it does not let you store string values; it only allows you to store integer indexes. So for a given query you assign a unique index, say 501, and under that index you store the vector; at query time Annoy gives you back the indexes of all the items close to yours. To map those indexes back to queries, at Cliqz we have a system called keyvi, which is a key-value index and is also the backbone of our entire search index. We found it much better than Redis or anything else we compared it with in terms of read speed and maintainability. We developed it in-house; it is written in C++ and has Python wrappers. So it actually stores your index-to-query
26:51
mapping. What effectively happens at runtime is: the user types a query, you generate a vector for it, you search the Annoy model for the closest queries and get their indexes, you then look those indexes up to get the actual queries, and finally you fetch the pages for all the queries closest to the user's query. This is how we improve our recall.
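Put together, the runtime path described here can be sketched as below. Everything in it is a stand-in: a brute-force scan plays the role of Annoy, plain dicts play the role of keyvi, and `embed()` is a stub for the real embedding model; the queries, ids, and URLs are invented for the example:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Vector index: integer id -> query vector (Annoy itself stores only the ids).
id_to_vector = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
# Sidecar key-value stores (keyvi plays this role in the real system).
id_to_query = {
    0: "statue of liberty height",
    1: "how tall is the statue of liberty",
    2: "python tutorial",
}
query_to_pages = {
    "statue of liberty height": ["en.wikipedia.org/wiki/Statue_of_Liberty"],
    "how tall is the statue of liberty": ["nps.gov/stli"],
    "python tutorial": ["docs.python.org/3/tutorial"],
}

def embed(query):
    # Stub: a real system would derive this vector from the query's words.
    return [1.0, 0.05]

def candidate_pages(user_query, k=2):
    qv = embed(user_query)
    # Nearest ids by cosine similarity (Annoy does this step sublinearly).
    nearest = sorted(id_to_vector,
                     key=lambda i: cosine(qv, id_to_vector[i]),
                     reverse=True)[:k]
    pages = []
    for i in nearest:
        pages.extend(query_to_pages[id_to_query[i]])
    return pages

print(candidate_pages("statue liberty height"))
```

The unseen query never has to match any indexed query literally; it only has to land near semantically similar queries in the vector space.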
27:20
We get a much richer set of candidate pages than we had before, with a much higher chance of the expected pages being among them. The reason it works is that we now go beyond synonyms and simple fuzzy matching: we are actually using vectors that have learned semantics. It goes wrong sometimes, but most of the time you will find a definite improvement, because the vectors have learned which words belong to the same
27:48
context, and that is a very important feature. Queries are now matched in real time using
27:54
cosine similarity between query vectors, on top of the classical information retrieval techniques that were already in place. Overall, the recall improvement over our previous system was around five to seven per cent, which translated into an improvement of about one per cent in the final top-three results. That gives us a clear indication that
28:19
these vectors are actually useful. The system triggers only for queries we have never seen before, and that is a very important point: if we have seen a query before and already mapped it to a certain page, we have a definite answer for it. But for queries which are new to us, which are not in the index, we have to go beyond the traditional techniques, and that is where these vectors come in.
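The triggering behaviour described above amounts to a simple routing decision. A hypothetical sketch (the names and example queries are mine, not Cliqz's):

```python
# Queries already present in the keyword index take the traditional path;
# the embedding-based lookup fires only for queries never seen before.
seen_queries = {"python tutorial", "statue of liberty height"}

def route(user_query):
    if user_query in seen_queries:
        return "keyword-index"       # known query: direct, definite answer
    return "query-embeddings"        # unseen query: fall back to vectors

print(route("python tutorial"))          # known query
print(route("statue liberty how tall"))  # never seen before
```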
28:56
Before I conclude, I wanted to show what it actually looks like. This is the browser, this is the search page, and we
28:59
have this dropdown which comes up as you type; the idea was to remove the whole intermediate step of a search engine results page, so that you can actually get directly to
29:09
the page you want. So, about the libraries:
29:12
there is Annoy, from Spotify, which is available on GitHub, and keyvi, which is open source on GitHub as well; I think you will actually find it useful, and it is a pretty actively developed project. Word2vec models can be trained
29:26
using gensim if you want to build a prototype, but I would recommend the original C tool because it is a bit more optimised, and there are certain variations of the models that have been developed since. There are other Cliqz open-source projects as well.
29:44
You can actually contribute to those as well. If you want to find the slides, they are
29:49
available online; look for the title of this talk. So before I conclude,
29:57
I just want to say that we are still working on this system, and we have come quite far but we are not done. We are looking at other deep learning approaches, like LSTMs and memory networks, but the downside of those approaches is that most of our user queries are keyword-based: you do not usually find people typing "what is the
30:18
height of the statue of liberty"; they type something like "statue of liberty height". The richer linguistic relationships might be captured by LSTMs, but they are more complicated, and the current system is flexible enough to still give pretty good results. We are also trying a new method of query-to-page similarity using document vectors, again using
30:43
a paragraph-vector model, and trying to build a solid retrieval system for those pages which have never been clicked before. We have a lot of such pages, and we are trying to find out what would be the best way to represent them, using vectors instead of the traditional approach. That is the last part, so with that,
31:07
thank you, and I will finish with a quote, given by John Rupert Firth in 1957: "You shall know a word by the company it keeps." Mikolov actually developed word2vec
31:20
using that same contextual approach to words, and it has served us well because of that. So thank you; I am happy to take questions now.

[Audience question: why did you build keyvi rather than use an existing key-value store?] One of the reasons was that we wanted a unified view. We tried a lot of these key-value stores: we tried Redis, we tried a traditional database, we tried full-text search engines. The thing is, our data is a bit different, in the sense that sometimes for a key we need the value to be a list of vectors, sometimes it is just strings, sometimes it is repeated strings where you have the same data structures again and again, so you can optimise a lot more if you write those parts yourself, and that is why we started doing it. keyvi is a much bigger project and I am not really the expert on it, but what I can say is that it has a lot of features: it compresses your keys into a finite-state automaton, a sort of shared-structure compression, and that gives you a much smaller index; it is faster to index, faster to read, and it scales. And you do not have to hold it all in memory; it is memory-mapped, so you can have far more data than fits in RAM. Our use case was that we wanted reads to be optimised, because we have almost no writes: you compile the index once, and from then on you only take the user query and read data from the index for those keywords.

[Audience question: you talked about having no writes to the database; how do you handle new data and new queries for training your embeddings? There are online-learning implementations of nearest neighbours that just update the index.] That is true. What we do is, we
have a rebuild cycle, where we recompile the whole index at regular intervals, and it is automated, so we get new queries and new query vectors with each rebuild. It is not a one-time system, but it is not immediate either: if tomorrow I want to include a set of results which are brand new, I cannot do that within the same cycle. To address that issue we have a news vertical: the news index handles the most recent content, so anything that is trending right now shows up in the news section. And since the concepts behind most queries were already available, say on Wikipedia, beforehand, we have those concepts covered from the beginning, and that is what we use. But for a genuinely new word, some "xyz" term that only comes up tomorrow, we probably will not have a vector for it; that is a very hard problem for us. Anyone else? OK, thank you again.