
From Text to Context: How We Introduced a Modern Hybrid Search


Formal Metadata

Title
From Text to Context: How We Introduced a Modern Hybrid Search
Number of Parts
131
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Customers only buy the products they are able to find. Improving the search functions on the website is crucial for user-friendliness. In our talk we present the lessons learnt from improving the search of our global online marketplace, which sells 20 million products per year. We moved from a traditional word-match based approach (BM25) to a modern hybrid solution that combines BM25 with a semantic vector model, an open-source language model that we fine-tuned to our domain. With numerous references to current literature, we will explain how we designed our new system and solved the multiple challenges we encountered on both the ML and engineering side (data pipeline encoding documents, live service encoding queries, integration with the search engine). Our system is based on OpenSearch, but the lessons can be applied to other search engines as well. In particular the presentation will cover:
- Status and Shortcomings of our old Search
- Introduction of Hybrid Search
- Our Machine Learning Solution
- Architecture and Implementation (with special consideration of latency)
- Learnings and Next Steps
Transcript: English (auto-generated)
Thank you so much. It's amazing to have you all here. Today, we're going to dive a bit into how we upgraded the search experience at GetYourGuide by building hybrid search, which combines the power of text-based search and semantic search. But before talking about that in detail, I would like to introduce myself, my wonderful colleagues, and my co-presenter, including the people who were involved in the project but are not present here. I am Dharin, and I'm presenting here with my colleague Ansgar. We have a few more people involved: Ryan, who is a data scientist, and Alex, who is an ML engineer. I work as a backend engineer, specifically on the search system at GetYourGuide, and Ansgar is a data scientist.
A quick shout-out to a few other people who were involved and supported us throughout the project, and now a bit about where we work. We work at a company called GetYourGuide, and here is some context before talking more about the problem and why the search experience is essential for our business.
At GetYourGuide, we position ourselves as a market leader in providing a marketplace for travel experiences. The products are activities, essentially the experiences that one would do while traveling. Therefore, it's critical that we provide a platform for our customers to discover the right products.
Moving on to discovering the right products, here is a quick snippet of some stats at GetYourGuide and some context on the catalog of the product offering, which makes it important to have a good search experience that allows our customers to find the right activity and the right experience. We have around 140K products in our catalog, and in fact it's growing every day. So, moving on to the crux of the presentation, this is the outline. We've already spoken about who we are; next we'll talk more about the problem, then we'll introduce the concept of hybrid search and its theoretical underpinnings. Then we'll discuss how we trained and evaluated potential solutions. After that, we'll talk about the architecture of the solution, the practical systems we have in place to run it in production.
And then we'll talk about the results, the outcomes. And finally, we'll also share about the learnings that we had along the way. And what next we can do to improve things. So, let's first look at the problem, why it was increasingly important for us to invest our time and resources into it.
So, just one glimpse, sad emoji there. Basically, someone, and this is an actual user query, tried to search for "things to do in Edinburgh with family". There are no results, which technically is not true at all. We do have those activities and products, but they didn't show up.
And, shockingly or not, the potential customer just bounces off, right? Here's another such example, a bit different: there is no location in the query this time, it's "fun activities with my kids". That makes it a bit different, and we'll talk about why. But we do also have such examples, and there are no good results here either. So, moving on from sad faces to a happy one, let me hand over to my co-presenter, Ansgar, for the theoretical underpinnings of our work. Okay, so I have to start with a smile now. The next thing I want to do is to introduce the concept of hybrid search.
Let me ask first the room, like who knows already what hybrid search is here? Okay, then I really have to introduce the concept. And is there anybody who has applied it already in a live system? Okay. Okay, cool. Yeah, that's good to know. So, I will try to explain it slowly enough.
So, for a long time, the like web searches, like on websites, they relied only on keyword search, and that means like looking at exact word matches.
And there, the basic metric that was always in use is called TF-IDF. It's term frequency inverse document frequency. Term frequency is the easiest part of it. Like here, you see in the leftmost document,
so the word pizza shows up there three times. So, the term frequency of pizza for this document equals three, and the one of tour is two. And of course, the higher the better. So, the document matches best if the score is the highest. But there are words which show up basically in every document,
like and or or. Or in our context, the word tour shows up in a lot of products. So, to also reflect that, there's this inverse document frequency. You just check in how many documents does the word show up.
So, here, pizza shows up only in one document. The inverse of one is one. So, the inverse document frequency is one. That's the highest possible. And for the tour, it shows up in three. So, inverse is one over three. So, it's worse. And yeah, in both score parts, it's best to score highest.
And they are not just combined by product, but there's also a logarithm involved. But I don't want to go into too deep details. And TF-IDF is not used as is in the search engines, but there's an improvement called BM25, best match 25.
And that also considers the length of the document. So, if the TF-IDF is the same, but the document is smaller, then it's better. It's also a more involved formula. You can look it up, but I don't want to go into details.
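To make this concrete, here is a small, self-contained Python sketch of BM25-style scoring over a toy corpus. The documents and the k1/b parameters are illustrative only; a real search engine such as OpenSearch computes this internally at index and query time.

```python
import math
from collections import Counter

k1, b = 1.2, 0.75  # commonly used default parameters

docs = [
    "pizza making class pizza pizza tour",
    "food tour new york",
    "walking tour rome",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
df = Counter(term for d in tokenized for term in set(d))  # document frequency per term

def bm25(query: str, doc_tokens: list[str]) -> float:
    """Sum the per-term BM25 contributions: IDF * length-normalized term frequency."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm_tf = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        score += idf * norm_tf
    return score

for d in tokenized:
    print(f"{bm25('pizza tour', d):.3f}  {' '.join(d)}")
```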
Now, this is how it works for one word. Then, normally, search queries, they have more than one word, and you would just sum up the BM25 scores of all the words. With this, we can maybe already understand why this query, Things to Do in Edinburgh,
didn't have any results. Like, we didn't have really documents or products which matched enough of those words in the query. In particular, like, family might not show up in the product text, but it's better to detect something like the semantic part of it.
And that's done best by an improvement, which is called vector search, which is some years old now. Instead of looking at the actual words, it converts the text into vectors with a machine learning model. That is done on the one hand for the documents, which are, in our case, the products; those are the black arrows here. On the other hand, you also need to do that live for the query that comes in, which is this "pizza new york". There are search engines optimized for this use case, which are fast at finding the closest product vectors for the query vector; Dharin will talk a bit about that. And in this case, you definitely want to show this NYC pizza walking tour. You might also want to include the New York City food tour. Of course, the real vectors in use are not two-dimensional; this is just to be able to show how it works in general. Nowadays, it goes up to 2,000 dimensions.
Maybe it's even outdated because the models and also the vectors, they are growing. OK, so these are the two components, vector search and keyword search. And as you might have guessed, hybrid search is basically to combine both approaches.
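As an illustration of the vector-search half just described, here is a minimal sketch using the sentence-transformers library and the multilingual E5 model mentioned later in the talk. The product texts are invented, and the "query:"/"passage:" prefixes are a convention of the E5 model family, not something specific to this system.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Product texts are encoded offline, ahead of query time.
products = [
    "passage: NYC pizza walking tour through Brooklyn",
    "passage: New York City food tour with local tastings",
    "passage: Edinburgh Castle guided visit",
]
product_vectors = model.encode(products, normalize_embeddings=True)

# The query is encoded live when it comes in.
query_vector = model.encode("query: pizza new york", normalize_embeddings=True)

# Cosine similarity between the query vector and every product vector.
scores = util.cos_sim(query_vector, product_vectors)[0].tolist()
for product, score in sorted(zip(products, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {product}")
```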
Like in practice, we see that none of these approaches is best for all cases. In particular, for out of domain things, where you didn't have good training data before, the vector search tends to be not so good, but keyword search always matches without training.
So how do we combine it? The idea is pretty simple. So you have your documents, you run the keyword search, which is here on top with this BM25 metric. So first, you index the documents into the search engine,
and that basically stores the frequencies. On the other hand, you pre-calculate the vectors for the documents for the vector search. And then when the query comes, you run both searches in parallel. In our case, that's actually in the same search engine.
And then in the end, you get the results of both approaches, and then it's just a question of how to combine it. So here in this picture, the intersection of both results looks quite small,
but in practice, in our case, that's actually the only thing we had to care about. Like we put the parameter so high that all interesting results are in this intersection, and then we just had to care about how to combine the outputs into a good ranking
and a good cutoff decision, like what to show and what not to show. Okay, so how do you actually combine the results? One option is to combine the scores. And there, a pretty simple approach is to just use a linear combination of all the inputs.
And note that here I put in a third score, that's a product score, and in our case, but I think in most use cases, that's also important to get a good final ranking. It reflects how good the product is overall on our website.
So not how it matches to the query, but we also want to have good products on top and not the bad ones. So this can have the downside that there are different distributions for this keyword metric, the BM25,
and for the cosine similarity for the vectors, and that the linear combination might not be really catching the best possible results, but we actually started with a simple approach. If there are big problems with the distribution, you could use another simple approach
which doesn't look at the input scores at all, it just looks at the rankings of the results. It's called reciprocal rank fusion, and here you see the formula. The important point is that only a result's rank in each of the two lists, not its score, produces the output score.
Of course, it has the downside that if there's a big gap, let's say between third and fourth positions for the vectors, like really the one is really close, the next one is really way off, it doesn't matter for this output score, so it probably is not the best one.
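Here is a small sketch of the two simple fusion approaches just described: a linear combination of the raw scores, and reciprocal rank fusion over the ranks. The weights and the RRF constant k are illustrative, not the values used in production.

```python
def linear_fusion(bm25: float, cosine: float, product_score: float,
                  w_bm25: float = 0.3, w_cos: float = 0.5, w_prod: float = 0.2) -> float:
    """Linear mix of keyword score, vector similarity, and overall product quality."""
    return w_bm25 * bm25 + w_cos * cosine + w_prod * product_score

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion: score(d) = sum over retrievers of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_a", "doc_d"]
print(sorted(rrf([keyword_ranking, vector_ranking]).items(), key=lambda x: -x[1]))
```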
The best, but also more complex, way to do it is to train another model on top which decides how to combine the scores; that's called a learning-to-rank model. Combining the scores is only one way
how you see in practice these approaches are combined. The other way is to do it in a hierarchical way, and here one thing you see very often is that you apply pure keyword search first because the search engines have been optimized for it for a long time,
so it's quite fast. And then you just take the top results of the keyword search, let's say 100, and you re-rank via vector search, and then there are even slower and even better models for maybe the top 10 results. There are so-called cross encoders where you have to put in both texts into one model
and get a better quality ranking score, but it's much slower. So that we didn't use, but the other approach would be to use vector search only for complex queries,
and that we also saw done by other teams, and you could say we also do it in a way, because in our use case, we already mentioned that the location is very important. It's probably the most important part of the search query. For that, we already had a good system in place, so if you just type in a location,
we don't even use semantic search now. So we just use the old extraction of the location and filter the results by that, and then rank it via their scores. Okay, so that's the basic concept of hybrid search
and how you could put it into practice. Let's see how we did it and how we decided which model to use and how to train and evaluate it. So the first question is,
yeah, what would we use as a training or evaluation data? And we had the problem that when we started with this project, the search wasn't really optimized. So where it was good at was locations. So you could also see that on our website, like if you looked at the search bar,
it said where do you go? So it incentivized the visitor to only type in locations, and that's why we couldn't use past data to train really for something more complex than just locations. And what we came up with is to use the data from paid search. So we do bidding on Google for keywords like family-friendly,
and then if you bid high enough and if the keyword matches the search query on Google, then Google shows our ads as a sponsored result, and then visitors come in on the corresponding landing page.
So we used the keyword that the users came with as a search query. It could be something like family-friendly activity in Berlin. And then we checked how did the products on this landing page perform, and that was then the indication of what would be a good match and a bad match.
And because on the landing page for family-friendly activity in Prague, let's say there we only show already matching products. They are in Prague, and they should be more or less family-friendly. We also had to add simple negatives in the training data,
so things which are not from Prague and things which are definitely not family-friendly. With this training data in place, how do you then evaluate the model? We split this data into training and evaluation sets, and for our ranking model, it's standard to look at NDCG, normalized discounted cumulative gain, and that's what we also did.
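As a side note, a minimal sketch of the NDCG computation mentioned here might look like the following. The relevance labels are invented; in the talk they come from how the products performed on the paid-search landing pages.

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: later positions contribute less."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances_in_model_order: list[float]) -> float:
    """Normalize the DCG of the model's ranking by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances_in_model_order, reverse=True))
    return dcg(relevances_in_model_order) / ideal if ideal > 0 else 0.0

# Relevance of products in the order our model ranked them (hypothetical labels):
print(round(ndcg([3.0, 1.0, 0.0, 2.0]), 3))  # below 1.0 because the ranking isn't ideal
```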
Then we had to train models based on this data. The obvious one is the semantic search model, which translates the query into a vector and the product text into a vector, but we also had to train a second model, and that's actually the one which combines these scores. In our case, it was only a linear formula that was trained, but still it's important to note that there are basically two models that need to be trained. This linear formula, in the end, combines the vector similarity from vector search, the BM25 from keyword search, and the product score.
That's how we decided which model to use and how we evaluated the fine-tuning, because with the semantic search model you always start with a pre-trained model and then fine-tune it to your data. But which pre-trained model to start with is the question.
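Before getting to that choice, here is a rough, hypothetical sketch of what the fine-tuning step itself can look like with the sentence-transformers library. The training pairs and the loss are illustrative and not the exact setup from the talk, which also adds explicit negatives rather than relying only on in-batch negatives.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Hypothetical (query, matching product) pairs derived from landing-page data.
train_examples = [
    InputExample(texts=["query: family-friendly activity in Berlin",
                        "passage: Berlin zoo family ticket with guided tour"]),
    InputExample(texts=["query: fun activities with my kids",
                        "passage: LEGOLAND day trip for families"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # other pairs in the batch act as negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("e5-small-finetuned")
```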
We also had to decide, and there we would give the advice to look at leaderboards, and in our case, we looked at this MTEB leaderboard for retrieval. There are different sections in there, but the retrieval part would be the interesting one for this use case.
There is also one on SBERT, but that seems to be a bit outdated, because I think what I see there is still the old models. Okay, now you know how we did the training and the evaluation. I hope you're a bit curious to see the results
that came out of the offline evaluation, and these are the results. Let's just start at the top. This intfloat e5-large-v2 model was pretty much at the top of the leaderboard at that time, and we tried it out with our data, and it performed best NDCG-wise, but it's definitely too slow to translate the query into a vector in a live setting. We also tried out the one in second position; it's just the vectors you can get out of the OpenAI API, and it's impressive that it's in second position, because it's not fine-tuned to our use case. Still quite good. The one that we chose in the end is the smaller, multilingual version of the E5 model, multilingual-e5-small, because we don't just have English searches, we have a lot of languages, so we were happy that we could introduce it for all languages right away. The rest of the rows are included here
to give some additional insights. For example, if you compare row four to row six, you see it's the same paraphrase model, and the only difference is that the higher one is fine-tuned. So it's one showcase of how fine-tuning to your own data actually helps. Row five, this all-MiniLM model, I just included because it's a pretty common default model for semantic search. So it's good,
but we found a small one that is better. And finally, if you compare the model that we chose in the end, this row three to the last one, there the only difference is that the one on the lowest position
doesn't include the BM25 from keyword search. So it's another nice proof that combining both approaches into one score actually leads to the best results.
One note, so these results on this table, they are just using English language data, but we did the same thing with multilingual. The scores are a bit different, but the main learnings, they stayed the same.
And yeah, I didn't find a space to also include this, but we also had an indication that the product score was quite important actually. So yeah, you should really include something like this. And maybe the next one I skip, it's just a technical thing for if you want to try it out,
maybe you can look at the slides. Okay, so that's how we did the training and evaluation. Let's see now how we put this live into our architecture. Dharin will show you first how our architecture looked before, and then how we had to adjust it to get to hybrid search.
Hello again. So yeah, in this section, we're going to quickly walk through the architecture details of the system we came up with during the time when we implemented hybrid search.
This is a holistic bird's-eye view of the existing system we had, which basically does traditional keyword-based search. As you can see, the query comes in from the search bar via an API. We use OpenSearch, which essentially is a fork of Elasticsearch managed and supported by AWS, and you can see that we use it as our database. We have all the products there, and in fact we support 30-plus languages, which is where the multilingual aspect of the model also comes into play. And there's an asynchronous process to provide all of this data and push it into OpenSearch; we use Kafka as well for that. So yes, before going into
what new things we added into the system, I would want to kind of give a brief glimpse of our existing tech stack that we have, which is also relevant for the specific problem at hand. So on the infrastructure side, we run our services in Kubernetes. We run daily jobs on Databricks clusters, and we schedule them via Airflow.
And lastly, we use, as I mentioned, Kafka as our go-to message bus. Everything we have is pretty much implemented in Python and Java: the search backend service that we saw earlier is in Java, and ML applications and ML services are mostly written in Python. We use OpenSearch and Postgres to process and access data in web applications. On top of OpenSearch, specifically for this particular problem, we use the k-NN plugin. It powers our approximate nearest neighbor search, specifically used for semantic search. The plugin also supports a few popular implementations such as Faiss, NMSLIB, and Lucene. And finally, I'll briefly touch upon some of the core frameworks we are using in Python: sentence-transformers, a Python module for accessing and using text embedding models (also image embedding models, but we use it for the text embedding model that we have), and the format we use for serving the model, ONNX, the Open Neural Network Exchange. We use one of the runtimes that supports the ONNX format, and we'll briefly go over what we use there. So this is the new architecture, with some bits and pieces of the existing one as well as the new things that we've added.
So firstly, again, the query text comes in from the users and we have the search API. However, the new aspect we have added is the inference service. Whenever a query comes in, we send it to the inference service, which generates the embeddings for us. We'll go into the details of the inference service and what language it uses (of course it's Python, mostly, spoilers). We take the embeddings from the inference service, and then we have a BM25 score from OpenSearch, and we combine them with the linear function that Ansgar showed earlier. And then we return the result back to the client. In the background, there's an asynchronous job that picks up the catalog of products, embeds it, and delivers it to OpenSearch via the Kafka bus. So there are two distinct processes
on the read side and on the write side. There is the pipeline for encoding that again generates the embeddings for the existing catalog of products that we have and then we push it into OpenSearch. So among the challenges we had in mind before jumping into the implementation,
we wanted our end-to-end latency to be low and it should be below 200 milliseconds at p95. And as the system is managed by a DS team and a backend team, we wanted the data scientists to be able to experiment independently, like use different models or try out different parameters
and reduce dependencies on the backend team and the overhead, so that we can move faster. So we implemented dynamically generated vector fields in OpenSearch via dynamic templates; you can see a quick small snippet of how that looks. This dynamic template allows us to create vector fields on the fly for the existing index, so in case we want to try out different models, we can do it just by defining a model ID for that field. At read time, during inference, we can select which model we want to run inference on and send that back in the response to the client.
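The slide snippet itself isn't in the transcript, but a dynamic template of this kind might look roughly like the following. The index name, field pattern, dimension, and k-NN method here are assumptions for illustration, not the exact production configuration.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="activities",
    body={
        "settings": {"index.knn": True},  # enable the k-NN plugin for this index
        "mappings": {
            "dynamic_templates": [
                {
                    "model_vectors": {
                        # any new field named like "vector_<model_id>" becomes a knn_vector
                        "match": "vector_*",
                        "mapping": {
                            "type": "knn_vector",
                            "dimension": 384,  # output size of multilingual-e5-small
                            "method": {"name": "hnsw", "space_type": "cosinesimil", "engine": "lucene"},
                        },
                    }
                }
            ],
            "properties": {"title": {"type": "text"}},
        },
    },
)
```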
The next important thing here is the hybrid query. We use the same index to store the activity data as well as the vectors, so at runtime we only do one query instead of two. Again, circling back to latency being very important for us, that's only one round trip to the database instead of two.
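A single request against that index might then look roughly like this. The field names and the plain bool/should combination are assumptions for illustration; depending on the OpenSearch version, the two parts could also be combined via script_score or a search pipeline.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query_text = "things to do in edinburgh with family"
query_vector = [0.0] * 384  # in reality this comes from the inference service

response = client.search(
    index="activities",
    body={
        "size": 50,
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": query_text}},  # BM25 keyword part
                    {"knn": {"vector_e5_small": {"vector": query_vector, "k": 100}}},  # vector part
                ]
            }
        },
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```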
For location-based queries, we do a bit of special treatment around some parameters, but more on that later. Zooming into the inference service, which translates the search queries into vectors live: it's a Python web application living in Kubernetes. On each deployment, it picks up the fresh model from MLflow,
compiles it into the Open Neural Network Exchange format, as mentioned, and starts serving. There are the usual steps, pre-processing and inference, but something else stands out here, as you can see: the location block. As I mentioned earlier, it checks whether the query contains a location or not, which is our primary entity, and selects a different set of ranking parameters based on the outcome. In terms of serving the model itself, we use CPUs; we are limited by our infrastructure there. During implementation we were also somewhat selective about the model, so that we had good enough room for optimizations on CPU, since we didn't have GPUs. We ran a series of load tests using different pod configurations and models,
with fewer and more parameters, and we found a sweet spot that we were happy with in terms of both offline metrics and end-to-end latency. As a side note, as I mentioned earlier, we ended up using the open ONNX format to serve the model: in the load-testing phase we gathered the pod metrics and observed better CPU utilization, so fewer resource constraints and better scaling opportunities. Our primary objective was access to the hardware optimizations of the ONNX Runtime from Microsoft, which we leveraged. In terms of performance, it's currently roughly 10 milliseconds at p95, which is what we wanted to target.
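A hedged sketch of what serving the query encoder via ONNX Runtime can look like, using Hugging Face Optimum to export the model: the pooling and normalization details are simplified, and this is not the actual production service code.

```python
import numpy as np
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "intfloat/multilingual-e5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ort_model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)  # export to ONNX, run on CPU

def encode_query(text: str) -> np.ndarray:
    inputs = tokenizer("query: " + text, return_tensors="pt", truncation=True)
    token_embeddings = ort_model(**inputs).last_hidden_state[0]   # (tokens, 384)
    mask = inputs["attention_mask"][0].unsqueeze(-1)
    vector = (token_embeddings * mask).sum(0) / mask.sum()        # mean pooling over tokens
    vector = vector / vector.norm()                               # normalize for cosine similarity
    return vector.detach().numpy()

print(encode_query("fun activities with my kids").shape)  # (384,)
```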
A bit more about the pipeline for encoding: all of it is scheduled by Airflow, and everything is basically in Python.
As you can see, for all the products that we have, the catalog and the text, Airflow manages independent jobs that collect and prepare the data, then the fine-tuning happens, which we touched upon briefly earlier, and the model is served via MLflow. In the asynchronous flow, as you can see, the prediction happens for the existing catalog of data, we publish that embedding data, and then it goes into Kafka, again something that we saw earlier. So all of this is managed by Airflow jobs, and MLflow allows us to serve the model.
So this, again, allows really independent deployment of things and changes in real time without having any dependencies on engineering teams. Yeah, so after all of this,
let's look quickly at our live results. We often felt like a young sorcerer, like trying out things, experimenting, but I guess that's normal, hopefully. So experiment results. Quickly, first thing was eyeballing. I'm just checking how it feels, like how does it look, even for other cases that we thought didn't work earlier.
And with a bit of manual testing, we also saw that the product score had a really high impact. Again, that's another thing Ansgar talked about: the product score is essentially our internal scoring mechanism for how the product is performing, essentially how people are experiencing the product. We also saw we needed a lower threshold for queries without a location. And we had some manual weight adjustments for the parameters and special treatment for queries without a location. These are some of the things. Then, after all of these tweaks, eyeballing and manual testing, we ran a series of A/B tests. In the first one, we found out there were some problems
with pure location queries. As I mentioned earlier, location is our primary entity and people do search for just purely location and something that we already were supporting before. So that doesn't require semantic understanding, like Berlin doesn't require any kind of context over there. So quickly, something that we filtered out
is that we didn't want hybrid search to be applied to pure location queries, which worked in our favor. Then there was a second test. We found out there were too many empty results, something that we of course didn't want for specific cases. So we added some special thresholds there, and for queries without a location as well; we made some changes in the parameters and tweaked things a bit more. And the third and final test is where we actually rolled it out, and there were some nice positive signals because of the changes we made. As you can see, there was an increase in users clicking and, essentially, in revenue, which is great, right? Yeah, and so for the examples that we tested on earlier:
as you can see on the left, before we didn't have any results for "things to do in Edinburgh with family", and now it gets some nicely fitting results. Here the vector search part actually helps, because "family friendly" is not in the product text; it's the context that gets matched now. The other example that we showed, "fun activities with my kids", now has a lot of nicely matching results. So yeah, that's what you can see as a result of the journey that we took. And of course, throughout our journey we had some learnings and some things we could have done better, which will be captured in the next few slides, and I will let Ansgar take over. Okay, so maybe now also a bit of a sad face from me, but let me start first with the things that went well.
So I think the first thing was that we had a project team. That means we put people together who are usually in different teams at GetYourGuide. For example, Dharin is normally in the search platform team and I'm in the traveler data products team, but for this project we were put into the same project team, and in this team we had all the knowledge that we needed to move to a hybrid search on the engineering and data product side, and we also had the power to decide. Both of those made sure that we really were able to make fast progress. Another good thing that we did, in my view, and that's also the general philosophy at GetYourGuide, is to iterate quickly and not try to be perfect from the beginning. For example, we started with just this linear mix on top, or we used this fake training data, but it turned out to be good enough for the first step. And then, after the first step, as we saw with those failures, we found the biggest pain point and iterated until it was good enough, and in this case that worked really well.
One thing that might be special to our case is that there was also this corresponding change in the design and user experience. We already mentioned it before. So before in our search bar, we had this text, where do you go?
So users were only typing locations, and we had to switch that to something like "Search GetYourGuide", so that users even try out the more complex queries that we can now serve. And we tested it in independent but simultaneous A/B tests to be able to look at the impact separately. As expected, but still good to have validated, the best of these two experiments is when we do both on the B side: switching to "Search GetYourGuide" and rolling out the hybrid search.
Okay, so that's all that worked well, but there's still a lot of things to improve. And yeah, the biggest thing I would say is that now if you use our search, there's often the perception that there are way too many results. So it's quite difficult for us to find the right cutoff.
And yeah, things that we would like to try out here is to really use a more complex model, this learn to rank model for this decision. Experts also told us that we could try out an elbow approach, maybe in the direction of finding
where the first bigger gap in the scores is. I'm a bit skeptical about that one, but yeah, that's something we can also try out. Of course, now that we have switched to this better search bar copy, we have our own training data from search
and it should help a lot to use that one. We also want to improve the entity or maybe even filter extraction, like what we're doing currently. What we also mentioned before is that we find the location and we use it as a filter.
There we can get more fuzzy and also typo tolerant. And also include other things like categories. That's ongoing work. And on the implementation side,
we want to have a smaller memory footprint. So we want to use something called quantization, which reduces the memory you need to store each coordinate of each vector, and it can be done in a way that harms search performance the least.
We're looking into that. And also we want to improve auto-suggest. So if you start typing, we are suggesting things. It used to be just locations, but now it already includes also products. And I think we're trying out categories.
Maybe it's already rolled out. And we also want to include popular queries and old queries by the same user. So some personalization there. Okay, so that's a look into the future. And with that, I hope you had some learnings from this.
Thank you for your attention. We have about three minutes for questions. So if you have any question,
you can just move to the microphone and the speakers will help. Hi, yeah, thanks for the talk, it was great. One detail I wondered about: you said you introduced a multilingual model, and you talked about your experience with queries in various languages and so on.
Did you need to account for not showing users answers that are not in the language that they understand? No, because so there's already a good system in place. Like, it's domain-based at GetYourGuide.
So if you switch the language, we only show output in that language. So that's not affected. But yeah, basically where it's important is to get the vector for the search query. If we had only used the English model, then we could only serve English search queries. But now we were able to also serve it
for all our domains directly. I think the quality was a bit worse in particular if it's languages where we had fewer training data. But overall it worked well.
How did you generate the scores for your different models? Like, how did you mark your own homework for the different models you were trying out? You had some scores from zero to one. Is it about these in the training overview what went well and what didn't? You had a slide where you had about
eight different variants of the models and how well they performed. And I just wonder, there's so much data, how do you know how well they performed? Okay, how do I get back there? So let's go back there. Here, this one, right? Yeah, so it's this NDCG. So we had the evaluation data,
which for each of the queries, had already a performance that we saw on the website. So from that you get an optimal ranking. And this NDCG metric then takes in the ranking that comes out of our model. And by comparing it and also taking in
these performance values, it calculates a score. That's this NDCG score. I was wondering what factors led you to decide on OpenSearch? Who's asking?
I'm over here. What was the question? Maybe it's better to repeat it; I didn't use the mic. What factors led you to decide to use OpenSearch? Because there are a lot of solutions available today. That's actually a great question, because we are considering different options right now as we speak. What factors decided it? I guess it's a tool that worked for most of our cases, 95% of the cases that we had. It worked really well. It allowed us to speed up delivery and had almost all the features that were required at the time. There are definitely things we want to do further that it doesn't allow us to do at the moment. And we do contribute a bit to OpenSearch, so there is a way for us to improve things there.
However, it's still something that we are evaluating at the moment. So maybe we can catch up later and talk about it. Could you just tell us which alternative are you considering? Sorry, can you? Which alternative are you considering?
Multiple ones. I mean, we can briefly talk about it later. Thanks. Maybe I wanted to add, maybe it became clear before. Basically, we had this open search already in place for the keyword search, and so it was for us the least switching cost to just include this vector. Thank you very much.
We will reconvene in about four minutes.