
#mices: Neural Search in Practice


Formal Metadata

Title
#mices: Neural Search in Practice
Number of Parts
48
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Over the last two years we have developed a Neural IR model for complementing the traditional search system of a large fashion e-commerce company. The ML model has been rolled out in 16 countries and is used for increasing recall of low-result queries as well as for query-dependent ranking powering 20% of traffic today. In this talk we will present the different stages of the model development, the feature representations we have chosen, how we generate massive amounts of training data, and how we manage the tradeoff between learning from big data and staying efficient. We will demonstrate the use cases and present successful results from offline and online testing.
Transcript: English (auto-generated)
Thank you for joining and listening to our talk today. Traditional search systems work well up to a certain point. What comes next is neural search. Welcome to neural search in practice.
My name is Maximilian Berg and Daniel Weinland will present the second half of the talk. First, we will talk briefly about search at Zalando. Afterwards, we will motivate and introduce neural information retrieval by showing our use cases. Then we will guide you through our journey with neural information retrieval. And finally, we will show our current work and give a future outlook.
In case you have never heard of Zalando, we are the biggest online fashion retailer in Europe. At Zalando, we search roughly 400,000 fashion articles of several thousand different brands. So retrieving the right article for a search is non-trivial. We serve 13 different languages, and on a typical Sunday night, which is the busiest time of the week,
we have on average 250 searches per second. This results in sufficient traffic and click data to be able to train a neural search. And as a spoiler: 20% of our traffic is already enriched by neural search.
So where and why do we add machine learning or AI to our system? Let's have a bird's eye view on how we do search with our traditional information retrieval system. When a query comes in, we do spell checking, NER and further refinements of the query in order to get a structured query which we can fire against Elasticsearch.
You can see the structured query here is category dress and the style is boho. What are the reasons for adding neural search here? While the traditional search pipeline gives us pretty high precision, our recall suffers, especially for complex queries.
It can furthermore only search for information which we explicitly store. Adding a lot of special use cases makes the system become more complex over time. This means it costs a lot of maintenance work to keep it up to date. So where do we apply neural information retrieval? You could imagine we replace several components in our classical information retrieval system with machine learning.
Instead, we do it end to end, which means we directly return articles for a given query without any step in between. This has proven not only to be possible, but to even work quite well. Let's come to our current main use cases.
First, there is the so-called no-hit use case, where our classical system doesn't hit any articles. The reason here is that embellished is nowhere in our article data. It is quite an unprocessed term, but anyhow, the neural information retrieval can find pretty exactly what is meant.
Furthermore, this covers a pretty wide range of articles with different properties, but still the neural information retrieval gets the right ones, or at least really good ones. The second use case is the so-called low-hit queries. There, the classical information retrieval can find some articles, but still not all we have to offer.
In this case, newborn is not written in enough article descriptions and our synonyms could not resolve the term properly. These results on the right-hand side appear since our customers still find baby articles by other means than just searching, and we can leverage this knowledge.
But how does all of this work? Well, it is as simple as this. We get as much data as we can. In our case it is click data, and if you do it, most probably it will also be click data of your users. You clean it up a little bit, then you throw it on a deep learning network and afterwards you have some great insights. So easy.
But before we deep dive into the architecture of neural search and get too concrete, I want to give a high-level overview of the most important properties. So let's start with an honest observation. It can't yet compete with our traditional search on very simple queries like a single brand. We have each individual article in our assortment tagged with the correct brand, and so retrieving a brand is super easy.
This even holds true for combined queries like Nike shoes. Anyhow, combining a lot of simple terms might become more complicated. Let's think about the example North Face anorak with plush hood in mustard brown. Here we have combined the brand, the category, the material, the color and the style.
How we solve this with neural information retrieval, you will see towards the end of the talk. When you apply deep learning, it takes some time getting the first decent results. On the contrary, with a simple NER you can directly hard-code your first results and improve over time.
While fine-tuning improves the neural search quite a lot and is really beneficial, the initial costs in getting something running are quite high. Anyhow, tuning classical information retrieval also takes a considerable amount of time. Furthermore, you need data. And the more data you got, the better it is.
Thus, bootstrapping a freshly founded e-commerce shop with a neural search might be quite complicated. While the tooling and pretrained models get better and better, it is still not as easy as deploying a classical search engine like Elasticsearch and getting first results there. Anyhow, once you have a decent amount of data, your model will just become better over time since more data accumulates, which is great.
Let's get to another great advantage. First of all, you can answer queries for vocabulary that you never hand-curated. In the fashion domain, new trends and styles are coming up quite frequently. Always re-tagging the whole assortment thoroughly is not possible.
Another example is answering queries for brands that we don't have in our assortment. Here, our neural search system allows us to show directly brands with similar properties to the user. When we increase the scope of our neural search to a new language, it can directly benefit from the already existing ones.
As an example, we don't have too much traffic in Swedish. Anyhow, the neural search system used the insights it got from German click data and fine-tuned the results for the Swedish language. Finally, users might have a quite different view on queries than experts.
This holds especially true for vague terms like vintage, boho or shiny. Having experts create data might lead to too one-dimensional results and does not leverage the variety of our assortment. So now that we have covered the basic properties, let's come to modeling neural information retrieval.
First rather general, and then we get more and more concrete. Let's start with this very short intro comparing classical information retrieval and neural information retrieval. Actually, they both have a lot in common. They both have an offline or indexing phase. There we go over all the documents and map them to representations optimized for retrieval.
In case of classical information retrieval, this is usually some form of symbolic representation. This means strings stored in an index for efficient retrieval. In case of neural information retrieval, this is usually simply a vector of real numbers. It is often called a latent representation, which in some sense means that it is no longer representing terms or strings.
Instead, some hidden dimensions, which have been discovered from data, are considered as being informative for the retrieval task. At runtime, a query is then mapped to a similar representation.
The retrieval consists of matching query and product representations and finding the most similar ones. One reason that motivates neural information retrieval is that classical search pipelines are well known to turn into search systems with complex dependencies and many handcrafted rules and exceptions designed by domain experts. Neural IR, on the other hand, tries to avoid this by relying on standardized deep learning building blocks and by relying on learning from data.
Also coming back to our use cases of finding results for no-hit and low-hit queries, we can already see another strong advantage of the neural IR approach, which is that by matching vectors in a vector space, we will always be able to find some products similar to a query.
Of course, they might not really match the query, but they will be at least the best match we can offer. This is not the case for classical information retrieval. As soon as query and product do not share any token, we cannot find the product.
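To make this difference concrete, here is a minimal, purely illustrative Python sketch (toy data and made-up vectors, not anything from the talk) contrasting symbolic token matching with latent vector matching:

```python
import numpy as np

# Symbolic retrieval: a product is only findable if it shares at least one token with the query.
def token_match(query_tokens, product_tokens):
    return len(set(query_tokens) & set(product_tokens)) > 0

# Latent retrieval: every product gets a score, so some "best" match always exists.
def dense_scores(query_vec, product_vecs):
    return product_vecs @ query_vec  # one dot product per product

print(token_match(["embellished", "dress"], ["sequin", "evening", "gown"]))  # False -> no hit

query_vec = np.array([0.3, 0.9])                       # toy 2-dimensional latent vectors
product_vecs = np.array([[0.2, 0.8], [0.9, 0.1]])
print(dense_scores(query_vec, product_vecs).argmax())   # 0 -> closest product is still returned
```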
We had tried a complex model in a research environment, which was presented at MICES 2018. We had complications migrating it into a production environment and thus we decided to start over with a simpler model. Furthermore, the results of the complex one were inferior to our current model. If we review the neural information retrieval literature, we can actually find two types of prominent neural IR architectures.
The ones that we just introduced are often called representation models. Here the main idea is that queries and documents are processed independently and they only meet in the final matching step. This of course has the advantage that product representations can be computed offline without knowledge of the query.
The other group of architectures are the so-called interaction models. They start directly by computing matching and interaction features between query and documents. These features can be based on string matching or matching of word embeddings, for example.
Those interaction features are then sent through a deep network, which produces a final matching score. There seem to be some indications that these models are much more powerful than the representation ones. However, they have the disadvantage that product representations cannot be computed offline.
Instead, at query time, every combination of query and product needs to be computed and sent through the network, which makes them much more computationally demanding. We have so far only used those representation models, and now Daniel will guide you through our concrete implementation of the abstract representation model.
Thank you Max. As Max mentioned earlier, going directly live with a complex research model turned out to be difficult for us. For that reason, we decided that we want to start with a very simple model that can serve as a baseline.
It should already have the general structure of a neural IR model, but made out of very simple building blocks. Our idea was then to first test this model on some real use cases and to ideally already demonstrate some impact. Then we slowly wanted to extend the model.
As it turned out, we are actually still using this model in production. So apparently it became more than just a baseline model for us in the end. So in this model, the whole deep learning pipeline consists of two embedding layers, one for the query and one for the product.
So these embedding layers are actually simply lookup tables which map every unique query and product to a vector of 100 dimensions. So in this model, we, for example, do not apply any tokenization to a query. Instead, every query string is a unique input.
We actually do this for our top 1 million queries, which represent more than 90% of our search traffic. Also for the products, we are not using product attributes for representing the products, but instead every product is a unique input.
Implementation-wise, this means that products and queries are first mapped to unique IDs. These IDs are then input to the embedding layers. To measure the similarity or to measure a matching score between a product and a query, we are using the dot product. So another way to think about this model is by plotting the coordinates of the embeddings as points in a vector space.
Here we only show two coordinates, but of course this space has much higher dimensions in reality, in our case 100. In the beginning, the coordinates are just randomly initialized. We can imagine that the dot product between two vectors represents some kind of similarity.
In fact, the dot product will be large if the angle between the two vectors is small. It also depends on the length of the vectors, but for the intuition we can say now that the dot product will be large if two points in this space are close.
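As a rough illustration of the model described so far, a lookup-based scorer might look like the following sketch (toy sizes and hypothetical variable names; the production model uses 100-dimensional vectors for roughly one million queries and all products):

```python
import numpy as np

rng = np.random.default_rng(0)
N_QUERIES, N_PRODUCTS, DIM = 1_000, 5_000, 100   # toy sizes

# Two embedding "layers": plain lookup tables, randomly initialized before training.
query_emb = rng.normal(scale=0.1, size=(N_QUERIES, DIM))
product_emb = rng.normal(scale=0.1, size=(N_PRODUCTS, DIM))

# Queries and products are first mapped to unique IDs (here just example integers).
query_id, product_id = 42, 1337

# The matching score is simply the dot product of the two looked-up vectors.
score = query_emb[query_id] @ product_emb[product_id]
print(score)
```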
In the beginning of the training, the coordinates of the embeddings are randomly initialized. Then we are using click logs to update the positions of the embeddings. Each time a customer has issued a search query and clicked the product, we will move the corresponding points a bit closer.
The products that have not been clicked will be pushed away. This all actually happens during the back propagation step of the training. And for this to work, we also have to define a loss function, which we will show later. So we have billions of click logs and for each of them, we will repeat this step.
So over time, we can imagine that some kind of clusters will form in this space where queries and products that have been clicked together will be close to each other. At retrieval, the dot product will give high scores to query product pairs that are close in this space.
As we have just mentioned, we have to define a loss function so that we can update the embedding coordinates during training. The loss that we use is the so-called negative sampling loss, which is well known from Word2Vec. In contrast to Word2Vec, we are using it with query-document or query-product pairs and not with pairs of words.
In this loss, we have two terms. The first one contains the dot product of a positive query product pair, which means a pair that we can find in our training data.
So this dot product should become large so that the overall loss will become small. The second term is a sum over several negative products that we randomly sample from all our products. Here we have again the dot product, but this time with a negative sign.
So this term should become small so that the overall loss becomes small. During training, we compute this loss for each of our several billion training pairs, actually multiple times. And each time the backpropagation step will update the vectors of the embeddings so
that the desired conditions for minimizing the loss will become more and more fulfilled. As I have mentioned before, we are using click logs for training our models. However, we are not using them in the classical sense where only a query and the immediate clicks are considered as positive training pairs.
Instead, we are combining searches and clicks of all search sessions into training pairs. This is important for us, as otherwise we would have no training data for the no-hit queries. So in the example here, a customer was first searching for fluffy slippers, but
he was not seeing any results because our search did not understand the query. Then he reformulated it to warm slippers, our search understood this query and returned some results, and the customer clicked on some of the products.
When we then build training pairs, we take all combinations of queries and products in the session and build training pairs out of all of them. This is a very simple way of capturing possible query formulations, but of course it also adds a lot of noise.
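Putting the loss and the session-based pair construction together, here is a simplified numpy sketch of one training step, using the negative sampling loss L = -log sigmoid(q·p) - sum_i log sigmoid(-q·n_i) described above (hypothetical helper names and a simplified update order, not the production code):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def session_pairs(session):
    """All combinations of queries and clicked products within one session."""
    return list(itertools.product(session["queries"], session["clicked_products"]))

session = {"queries": ["fluffy slippers", "warm slippers"],
           "clicked_products": ["prod_1", "prod_2"]}
print(session_pairs(session))  # 4 positive training pairs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(q, p, negatives, lr=0.05):
    """One negative-sampling update: pull the clicked product closer, push negatives away."""
    # Positive term: gradient of -log sigmoid(q . p)
    g_pos = sigmoid(q @ p) - 1.0
    grad_q = g_pos * p
    p -= lr * g_pos * q
    # Negative terms: gradient of -log sigmoid(-q . n_i) for each sampled negative
    for n in negatives:
        g_neg = sigmoid(q @ n)
        grad_q += g_neg * n
        n -= lr * g_neg * q
    q -= lr * grad_q

dim = 100
q = rng.normal(scale=0.1, size=dim)
p = rng.normal(scale=0.1, size=dim)
negatives = [rng.normal(scale=0.1, size=dim) for _ in range(5)]
sgd_step(q, p, negatives)
```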
Nevertheless, we found this to be really one of the keys to the success of our method. If we are building the data this way, we actually end up with more than 4 billion training samples. As this is really a lot of training data, training time really matters.
What we found is that training our model using machines with multiple CPUs can lead to a strong speed up. For example, going five times over our 4.5 billion training samples using 16 CPUs takes approximately one day. As the model is really simple, we found this to be much more efficient than training using GPUs, where the same amount of data would take one week.
The parallelization is actually based on an approach that is called Hogwild, and the idea of this approach is very simple. It basically means to do parallelization without any locking and thus to accept possible collisions.
This is possible with our model because at every training step it only updates the embeddings of a few queries and products. Here we see some example results retrieved with the model. They come actually all from a tool that we have implemented which
automatically after every training renders the top results for a few hundred queries. Of course, we also have an automatic evaluation which computes various metrics including NDCG. But at least for major model changes, we still manually check those images to make sure that nothing went completely wrong.
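As a side note on that automatic evaluation, a textbook NDCG@k computation looks roughly like this (the talk does not show the actual metrics code):

```python
import math

def ndcg_at_k(relevances, k):
    """relevances: graded relevance of the returned products, in ranked order."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=4))  # 1.0 would mean a perfectly ordered result list
```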
These examples here are all no-hit queries, and the gender here is actually not part of the query but is referring to the section in our catalog in which the query has been issued.
They are all in German, but let me explain the first two. So for example, this one here is referring to some dresses made of some special tattoo-like material, which is called Tattoospitze in German. This is a term that we do not have in our data. But if we zoomed in here, we would see that the dresses returned by the model are indeed all made of this material.
For the second query, we have the top figurumspielend. Figurumspielend, meaning something like figure enhancing, is also a term which we do not have in our product data. Here we have some more examples.
We also see that results are not always perfect. For example, here the second one says pullover patchwork and we see that all the results have some patchwork-like patterns, but they are not all pullovers. We have run various A-B tests with the model.
In almost all of them, we could show some significant KPI improvement. The results that we show here are actually from the first A-B test that we ran, which was almost two years ago. Also, do not be confused by the numbers here; that is because we do not want to show absolute numbers. So we are basically always showing the relative improvement in comparison to the baseline models that we used.
The first test is the no-hit search use case. Here we compared against a baseline which was using the same training data as the neural search, but it simply showed for every query the products that had the most clicks for this query.
Here our neural search could improve the click-through rate relatively by 31%. For the low number of results use case, we compared against our normal search pipeline.
Here, by combining the normal search with the neural search, we could improve click-through rate relatively by 30%. We could also show a significant revenue uplift of 2.5%.
Because of the positive outcomes of the A-B tests, we then decided to productionize the model training and serving. Meanwhile, the model is live in all our 16 country shops and it is serving approximately 20% of our traffic. So we haven't talked much about serving implementation so far.
This was actually one of the main reasons in the beginning why we thought that the whole idea of implementing a neural search might not work out, because basically computing embeddings and dot products at runtime might be too expensive. In the end, we solved this very simply by pre-computing all the results for a very large fraction of our traffic.
We actually found that more than 90% of our traffic consists of less than 1 million different queries. And thus at the moment we simply pre-compute results for all those queries and store them in lookup tables.
For smaller country shops, we use an international model, which means a single model trained on the search data of all the countries in which we sell. For countries where we have little traffic, we could show that this strongly improves
the quality in comparison to a model only trained on the language of that country. This has two reasons. First, many queries are exactly the same in all the countries, for example brand queries. Second, the products are the same in all countries. Thus, the embeddings of the products can benefit from the training data of all countries.
Here are some examples from the international model. I won't read them all out now, but for example, the last one is an English one and here the query is true one shoulder wrap filters. And we can see that the products returned indeed match this query.
So after going live with the baseline model, we now have everything in place for developing more advanced models. And that's actually what we are doing at the moment. Our main goal in comparison to the baseline is of course to achieve better generalization power.
And in particular, we also don't want to be limited to a fixed set of queries and documents. Also, we want to introduce some real text understanding into this network. So this is the model that we are currently experimenting with.
It uses transformers for representing the raw text of products and queries. And we also included some features based on product images into this model. Transformers are a deep learning model that has been originally proposed by researchers from
Google and that has become the state of the art for building NLP architectures. It's strongly focused around an idea called self-attention. So in our model, we have replaced the simple embedding layer of the baseline model with several transformer layers. Input to the transformer layers are text embeddings that we learn on the fly while training the model.
So far, for representing products, we have been using three types of attributes: name, brand, and category.
And additionally, we also include some image-based embeddings that also encode attributes such as color, patterns, and shape, for example. These embeddings have been actually developed by a research team at Zalando, especially for the purpose of representing fashion products.
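A much-simplified PyTorch sketch of such a two-tower setup, with a small transformer encoder over text tokens on each side and the pre-computed image embedding added on the product side (all layer sizes and names are illustrative assumptions, not the actual architecture):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embeds token IDs, runs them through a small transformer, and mean-pools."""
    def __init__(self, vocab_size=30000, dim=128, heads=4, layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.encoder(self.tok(token_ids))      # (batch, seq_len, dim)
        return x.mean(dim=1)                       # (batch, dim)

class TwoTowerModel(nn.Module):
    def __init__(self, dim=128, image_dim=64):
        super().__init__()
        self.query_tower = TextEncoder(dim=dim)
        self.product_tower = TextEncoder(dim=dim)
        self.image_proj = nn.Linear(image_dim, dim)  # project image embedding into text space

    def forward(self, query_ids, product_ids, image_emb):
        q = self.query_tower(query_ids)
        p = self.product_tower(product_ids) + self.image_proj(image_emb)
        return (q * p).sum(dim=-1)                 # dot-product matching score

model = TwoTowerModel()
score = model(torch.randint(0, 30000, (2, 8)),     # query tokens
              torch.randint(0, 30000, (2, 16)),    # product name/brand/category tokens
              torch.randn(2, 64))                  # pre-computed image embeddings
print(score.shape)                                 # torch.Size([2])
```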
Of course, this network cannot be trained using parallelization like our baseline. Instead, here we are using a GPU machine and some sub-sampling of the training data for efficiency reasons. Playing around with this model, we found it to do quite a good job. This is, for example, the query that Max showed initially.
It's coming actually from a real-life example where one of our colleagues saw a commercial from Zalando on Facebook for a jacket. And he somehow wanted to find the exact same jacket again in our catalog. So he described it with his own words as a North Face jacket with a plush hoodie in mustard brown color.
And indeed, the second jacket that our model returned here was the one that he was looking for. This is another search query that I have just made up to show what kind of queries the model can handle.
Of course, the model does not necessarily interpret all the terms in this query, but it somehow automatically picks those that it understands and finds the matching products. We have run initial A-B tests with this model. So far, it can, however, not beat the baseline overall.
We have actually tried two versions of the model. One was trained on exactly the same session-based click data as the baseline. The other one was trained on real click-through data. Analyzing the results in more detail, we actually found the baseline to do better for short and frequent queries.
That the transformer model is not performing so well on those queries might actually come from the fact that we have sub-sampled them too strongly during training. Also, we didn't include an ID-based embedding into the transformer model, which we of course could have done as we did it with the baseline.
It seems that the ID-based embeddings represent some really strong cues, most likely some popularity-like cues, that can be represented neither through text nor through images. On the downside, they only work for queries and products with sufficient training data, which was also the reason why we excluded them from the transformer model. On long and rarer queries, the transformer model showed much better performance. Also, of course, it can produce results for any kind of query, which is not possible with the baseline.
So this brings us almost to the end of our presentation. To give a short outlook on what we are currently working on. As just mentioned, we are working on improving the transformer model and hope that we can run another A-B test soon. Another thing which we are also currently working on and which because of time limits unfortunately
didn't make it into this presentation is using a neural IR model as a signal for ranking. So in all the applications we have described so far, the neural IR model was only used for retrieval. Ranking of the results has then been done by an existing solution based on LambdaMART. So here we actually face the challenge of somehow combining the signals of the two frameworks.
And that's something that we are currently very actively researching. Also, for the ranking to run in production, we finally will have to compute the dot product at runtime. So to summarize our main learnings.
In the beginning of our journey, there has been quite some skepticism among us with regards to the potential of a neural IR model. This has been also justified because at least to our knowledge, neural IR is still more of a research topic.
We at least do not know of many examples where it is applied in practice. In our case, it has now become an integral part of our search pipeline and is meanwhile serving more than 20% of our traffic. This was mostly possible because we started with two isolated use cases, no-hit queries and queries with few results.
And key to our success in a commercial environment was also that we started very simple and early showed some impact. Another finding that really surprised us is how much value there is in session-based customer data.
We were really surprised how much information we can get from this data. It also shows that our customers are actually quite good at finding the products they are looking for. And by doing so, they do a very valuable job for us, which helps us train our models and thus improve our search for our future customers.
Thank you. And now we are ready for your questions. Thank you, Daniel. Thank you, Maximilian. Actually, our talk generated a lot of questions on Slack.
So I remind the audience that you can join the speakers right after the talk on Jitsi. And I apologize if we will not be able to answer all the questions. So as I said, yeah, a lot of questions on Slack.
So the first question is from Andreas, and Andreas wants to know: when you train the model based on click data, aren't you introducing a bias already, because the clicks are limited by the current keyword search? So you're essentially teaching the neural search to behave like the keyword search.
Perhaps I can comment on this. So when Daniel said we use sessions (I think there is also another question about this that comes later), a session for us is a whole day. So one user, one whole day. And so it's not just the search that they did, but also the catalog.
When they go through the catalog and find things by other means than direct search, we use all of these clicks. And via this, we can get more information than our old search somehow can generate. Speaking about sessions, this brings us to the second question. It's a very specific one: how do you identify sessions?
Yeah, maybe let me respond. As Max said, we're using just the whole day at the moment for a session, just all the activity that the customer has been doing in one day. Another way we tried would be to use something like every half an hour of inactivity as a session boundary, and that gives almost the same results.
Indeed, I think most sessions are very short. Most customer sessions that we have are on one intent, that is, a few searches until the customer finds their results. So even though we use this whole day, most of these sessions are very short timeframes.
Great. So Mia noticed that you said that neural search still can't compete with exact brand retrieval. So the question is, do you use a combination of classic and neural search for best performance? Yes, so what we actually use is the neural search as a fallback.
When the classic search can't find anything, then we have the no-hit case, and then we only use the neural search. Or when it can just return fewer than 20 articles, we add more articles, or 50 articles, that depends; currently, I think, if we have fewer than 50 articles, then we add more articles from the neural search. So we only use this if the classic information retrieval pipeline can't find sufficient results.
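As a sketch, the fallback logic described in this answer boils down to something like the following (function names and exact thresholds are hypothetical):

```python
def blended_results(query, classic_search, neural_search, min_results=50, top_up=50):
    """Use neural search only when the classic pipeline finds no or too few articles."""
    classic = classic_search(query)
    if not classic:                        # no-hit case: serve neural results only
        return neural_search(query)[:top_up]
    if len(classic) < min_results:         # low-hit case: top up with neural results
        extra = [p for p in neural_search(query) if p not in classic]
        return classic + extra[:top_up]
    return classic                         # enough classic results, no neural search needed
```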
So I'll jump to a question that had a few plus ones. Why don't you just add the missing queries or keywords to the product data? Yeah, that's of course also something that we could do.
And I think in some sense, it's also similar to this baseline that we have used. In some experiments we basically compared the simple neural model with a counting-based baseline, I think, which does something quite similar. And of course, I think the main problem here is that this data can be very sparse.
So basically, adding these terms to queries only works for frequent queries and products that have been frequently clicked. And even though this might be also some criticism of our baseline model, of course our long-term goal is, and this is already this transformer model that we're using, is of course to have much better generalization. So basically something that works for every query and for every product.
Also if the product hasn't had any clicks. Exactly, especially then we would really have this problem that was mentioned in an earlier question, that we can only replicate what our current search could do. When we only add this term to the products which were clicked with this query, but not to similar products, then we really get this problem, and it gets worse over time.
We have also a question from Uwe, Uwe Schindler, who is in the audience. So he mentions the relevance ranking. So the question is how is the relevance ranking implemented?
Uwe says that one problem with expanding the queries instead of adding the additional terms to your product data is relevance ranking. How is this implemented? Are you using BM25 maybe, or with LTR afterwards? I think this question came before the end of the talk, because I briefly mentioned that
the neural IR model at the moment is only used for retrieval. So the ranking so far is completely done by some LambdaMART. I think also our current relevance ranking is not using any query-dependent signal.
So it's really only based on static product features, and it's based on LambdaMART. Another question from Valentin.
How often do you update the neural models? So we train them once a week. So in the beginning we didn't update them that much at all. But then we saw the degrading of our KPIs and then we decided, OK, we have to train them more often.
And currently we do it once a week. Yes. And a question about short queries with typos. How do they behave there? I think to some degree the transformer model can handle them.
But we also kind of found that we may have to especially add some typos to the training data to make this work even better with typos. That's something that we didn't try so far. But it seems that it can handle some typos, but it might also need some augmentation of the training data.
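For illustration only, such a typo augmentation of training queries could look like this simple character-level sketch; it is not something the speakers said they have implemented:

```python
import random

def add_typo(query, rng=random.Random(0)):
    """Introduce a single random character drop, swap, or duplication into a query."""
    if len(query) < 2:
        return query
    i = rng.randrange(len(query) - 1)
    op = rng.choice(["drop", "swap", "duplicate"])
    if op == "drop":
        return query[:i] + query[i + 1:]
    if op == "swap":
        return query[:i] + query[i + 1] + query[i] + query[i + 2:]
    return query[:i] + query[i] + query[i:]

print(add_typo("fluffy slippers"))  # prints a noisy variant of the query
```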
OK. So as we still have some time. So we have two minutes left. You are very efficient in answering the queries.
You mentioned the CPU versus GPU. So was training on GPU cheaper? Are you using GPUs for dot product computation? We do everything on CPU. So we do the training on CPUs and also for now the evaluation.
For the simple model. For the simple model, exactly. For the complex model, we use GPUs and trained on GPUs. And as was mentioned in the talk, we down-sample there quite a lot. And the training is quite fast for the complex model. But this is due to how we do the sampling of the training data. But for the current model in production, we do everything on CPU.
Even though we do this down-sampling, we kind of found that the more we train, the better the model gets. I mean, on one side we had to do this down-sampling to get this model to work. But it seems like the longer we train, the more data we take, the better the model actually gets.
OK. So I remind you that you can join the speakers right after the session. So we still have 30 seconds left for one last question. Then we can continue the discussion on Jitsi.
So the question is, do you start with a pre-trained model and then fine-tune, or do you have to start from scratch? So we tried this out, especially with the international model. We tried to first train all countries together and then just continue training on the individual countries. But we found that this doesn't work out. So we always train from a completely randomly initialized model.
So there is no pre-trained model at all. Yeah. And also the word embeddings that we use for the transformer model, I think they have not been pre-trained. Also, especially I think because we have many brands and things like this, special
words that you might not necessarily find in models pre-trained on some general corpus. OK. So I think we have to wrap up. So thanks, everybody, for participating. Thank you, Maximilian. Thank you, Daniel, for your talk.