
Offline Ranking Validation - Predicting A/B Test Results


Formal Metadata

Title
Offline Ranking Validation - Predicting A/B Test Results
Number of Parts
56
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Implementing a machine learning model for ranking in an ecommerce search requires a well-designed approach to how the target metric is defined. In our team we validate our target metrics with online tests on live traffic. This requires both long preparation times and long enough runtimes to yield valid results. Having to choose only a few candidates for the next A/B test is hard and slows us down significantly. So what if we had a way to evaluate the candidates beforehand to make a more informed decision? We came up with an approach to predict how a certain ranking will perform in an onsite test. We leverage historic user interaction data from search events and try to correlate them with ranking metrics like NDCG. This gives us insights on how well the ranking meets the user intent. This is not meant to be a replacement for a real A/B test, but allows us to narrow down the field of candidates to a manageable number. In this talk we will share our approach to offline ranking validation and how it performed in practice.
Transcript: English (auto-generated)
Thanks and welcome to our talk on offline ranking validation, or how we try to predict A/B test results. First, a quick overview of what we are going to talk about today. We will start with a short intro to our company, since maybe not all of you know Otto.
Then we will talk about the special things you have to consider when working in an e-commerce search setting. After that we will introduce what we did for offline ranking validation, and we will finish the talk with a short outlook.
About us: Jonas and I are working in the search team at Otto. I'm specifically focusing on improving the ranking of our web shop. Coming to our company, which is Otto, I've brought some key facts about it.
It's a large online shop based in Hamburg, Germany. It was founded over 70 years ago and transformed from a catalog retailer into an online retailer, and now we are transforming again to become a marketplace, selling products from many different sellers, not only Otto products.
Today we have a GMV of almost 7 billion euros and on average 3 million visits to our web page. And as I said, we are currently transforming into a marketplace, which means a lot of growth.
We can put that growth into the four categories you can see here. The first one is that with becoming a marketplace you get an increasing number of sellers on your platform and also an increasing number of products. You can see it in the two graphs: the first graph shows the increasing number of sellers on our platform.
We started with around 1,000 sellers at the beginning of last year and are now already at over 4,000 different vendors, and the same holds for the number of products available on our platform: it almost doubled from April last year, from around 9 million different product variations up to almost 18 million product variations. The second thing that scales with becoming a marketplace is the number of searches on your web page. You get many more search queries in total and also an increasing amount of queries in the long tail; we see around 20 million new queries every six months.
While the search volume increases, we also see a decrease in the quality of the product data we get from the different sellers. They might not add many pictures to their products, or the description is not very good, which strongly influences search performance.
And lastly, if you become a marketplace, you have to shift your focus from improving the business relevance of your own products on your platform to focusing solely on customer relevance, because you want to be a fair marketplace that treats every seller the same way. So if you optimize for customer relevance, you imply business relevance for every seller equally.
Coming to our product, the search on otto.de, I've brought some slightly older key facts: in 2020 we had around 1.7 million search queries per day, and on very busy days up to almost 5 million search queries. In total during the year we had around 600 million search terms, composed of 40 million unique search queries. And coming to the topic of our talk, the ranking on otto.de: we are in the ranking team, and currently the ranking on otto.de is based on manually
curated weighting functions that are defined for different contexts. For example, it can be a broad context like fashion or a very precise context like umbrellas, and there are special ranking functions for each of these. That actually works quite well, so 80% of our customers are really satisfied with our search,
but there is always a but. With the scaling to marketplace that really doesn't work anymore so you can't manually tune ranking functions if you have over 80 million products that are shown to customers.
Also, like I said before, we want to be a fair marketplace, and the contexts were defined in the past together with the seller Otto. So if we want to be fair and treat all sellers equally, we should only focus on customer relevance and not say, OK, the sellers can define some ranking functions that we then put into place. And this is why we are trying to develop a model-based ranking, and this is also where
the motivation for our talk comes from: we are currently developing many different rankers and trying out many different things, and we very frequently ask ourselves which one is the best approach, which model should we go for, which one can we put in a live test. This is why we try to predict what will come out of those tests, so that we have good candidates. And with this question I hand over to you, Jonas. Thank you. All right. So like Andrea mentioned, at otto.de we decided we need some sort of machine learning
driven ranking system in our shop, but that kind of begs the question: out of a number of candidate models, which ones do we put in an A/B test and which ones do we show our customers? There are two schools of thought when it comes to ranking evaluation. The first one, which I'm going to show you right now, is ranking evaluation in a full information setting.
So full information in this context means we have labeled or annotated data. Usually you would get these labels by showing a group of subject matter experts queries and corresponding products, and they would judge the relevancy of those products. And once you have those labels you can calculate any number of information retrieval
metrics on those labels, and you can also train your learning-to-rank (LTR) models on that data. The actual way those metrics are calculated is not that important here. All we need to know is the perfect ordering of our products, and we can then judge how well our ranker would perform. But of course there are some downsides to this approach.
The first one is probably the most obvious one. Manually labeling data is very expensive and time consuming. And especially in an e-commerce context we observe that those labels might not even be aligned with the user behavior that we see in our shop. On top of that those labels cannot really reflect context specific information such
as the time of the day someone browses our shop, the device they use or the layout type that we show. Also those labels are static so especially now where consumer behavior shifts a lot due to economic downturns and crises we would have to collect those labels very frequently
to kind of reflect those shifts. So we thought to ourselves okay we need something better and we turned to the implicit feedback that users leave in our shop. And we wanted to use that implicit feedback to evaluate the performance of our ranking systems. So the implicit feedback can take a lot of forms.
There's stuff like the dwell time on product pages, or whether a customer ordered the product or not. But for now we're going to focus on the simplest one: has a customer clicked on a product for a certain query or not? And if we just assume for a second that a click is a perfect representation of relevance,
we can actually calculate the same information retrieval metrics that we talked about earlier on our log data. The way this would work in practice: for example, we have a query here with four products in a certain order, and two of those were clicked, the product at logging-time position one and the one at logging-time position three. So for this very specific query instance our old ranking system has a sum of relevant ranks of four, a very simple metric. Our new ranker would have switched products A and B around and would have put the clicked product only at position two, which means the sum of relevant ranks would be five, so for this very specific query instance it seems like our new ranker would perform worse.
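(A minimal sketch of that comparison in Python; the product IDs, clicks and rankings are made up to mirror the spoken example, not Otto's actual data model.)

# Sum of relevant ranks: sum of the 1-based positions of the clicked products.
def sum_of_relevant_ranks(ranking, clicked):
    return sum(pos for pos, product in enumerate(ranking, start=1) if product in clicked)

logged_ranking = ["A", "B", "C", "D"]   # what the old ranker showed at logging time
clicked = {"A", "C"}                    # clicks at logged positions 1 and 3

new_ranking = ["B", "A", "C", "D"]      # the candidate ranker would swap A and B

print(sum_of_relevant_ranks(logged_ranking, clicked))  # 1 + 3 = 4 (old ranker)
print(sum_of_relevant_ranks(new_ranking, clicked))     # 2 + 3 = 5 (candidate looks worse here)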
And the reason why I'm emphasizing that it's a very specific query instance is because we're changing the granularity. Before, we were judging ranking performance on just a query string; now we can calculate performance on individual query events. And this allows us to include all this contextual information
that we talked about earlier, device type, time of day, et cetera, in our models and also in our evaluation. And it kind of turns our evaluation into a counterfactual approach, because we ask the question: what if we had shown the user a different ranking? There is a big caveat though, and that of course lies in our assumption that clicks
are a perfect representation of relevance, because this is obviously not true. Clicks are very noisy and they are also biased. Noisy in this context means that users do click on irrelevant products, maybe because they are just curious because we showed some weird product, or maybe it's a misclick,
but they also very frequently just skip relevant products. Maybe they finished their journey, they didn't scroll down that far, who knows. But usually noise is something that you can get rid of if the data set you use is large enough and you can kind of average it out. Bias on the other hand doesn't average out and clicks are also biased.
And when we talk about bias in a ranking context we usually talk about position bias and the position bias states that products on higher positions are more likely to be viewed and thus in turn of course also way more likely to be clicked. So that in practice means that we don't really have an unbiased estimation of our
ranking performance because our old ranker that we had in production at logging time has an influence on our evaluation and that's exactly what we don't want. So how do we get rid of bias, of position bias? We do it through an approach called inverse propensity scoring.
And the core idea is super simple. Once you have an estimate of your observation probabilities per position you just inversely weigh those click events by this probability. So in practice this means if we observe a click on position 30 we would give this click event just a much higher weight than a click that occurs on position 1.
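(As a rough sketch of this inverse weighting, assuming per-position observation propensities have already been estimated; the numbers below are placeholders, not Otto's real propensities.)

# Inverse propensity scoring (IPS): weight each click by 1 / P(position was observed).
propensity = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.45, 30: 0.02}  # hypothetical estimates

def ips_weight(position, floor=0.05):
    # Cap (floor) the propensity so outlier clicks on very low positions
    # cannot dominate the evaluation.
    p = max(propensity.get(position, floor), floor)
    return 1.0 / p

print(ips_weight(1))   # ~1.05: clicks on position 1 are barely upweighted
print(ips_weight(30))  # 20.0: capped at 1/0.05 instead of 1/0.02 = 50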
And yes, that also means we kind of reward models that are able to identify relevant products on lower positions. So we tried it out: we developed two different models, one based purely on order rates, one based on clicks and order rates, and we also threw
a random model in there for sanity checks. We calculated two different IPS evaluation metrics, the DCG and the sum of relevant ranks, and we used two weeks of logging data to do that. We can see that for the DCG there's no clear winner, right? It seems like both our models perform more or less equally.
The random model, luckily, is the worst one. But for the sum of relevant ranks we see that there is a big variance in our metrics, but the click-and-order model has a slight advantage when it comes to the median of the sum of relevant ranks, because in this case, of course, a lower value means a better model.
What we also did is put those models into an A/B/C test, just to verify that our offline evaluation works. But before we get there, let me sum up the IPS approach really quickly. The IPS approach has a lot of pros, right? It allows us to basically evaluate any ranking system as long as the product base that we
evaluate stays the same. And there's also the potential to use IPS scoring directly in the loss functions of your models and thus kind of directly optimizing for those sort of metrics. It's also quite easy to implement, at least the evaluation part. On the downside though we can see that the results are not really that conclusive yet
for those models. They seem to perform quite similarly, and probably the biggest downside is that we really need an accurate estimation of observation propensities, right? And we don't really have that right now; the one we have is quite rudimentary. There are also some practical implications: at lower positions your observation propensities
are very low. So outlier clicks on very low positions can have a huge impact on your evaluation; that's why you need to cap your propensities at a certain point. But this is only one of the approaches we tried for offline evaluation, and Andrea is now going to introduce the second one.
Yeah, thanks, Jonas. So we have this one approach for evaluating how our models will perform in an on-site test. We then looked at data from past A/B tests and computed NDCG values between the ranking that was shown to the users and what our new rankers would have ranked. And we actually saw that if we group by the user interactions, there is a difference in the NDCG values that we get with the models: if there was no click on our web shop, the NDCG value of the model we inspected was lower, and if there was a click, the NDCG value was actually higher. The same held for purchases. This is why we thought: let's use this information and try to see if we can also predict the results from that. So the idea is: if we look for correlations between our on-site KPIs, like conversion rate and click-through rate, and the NDCG values of our models, and we find a correlation, then we can say that the higher the correlation between our on-site KPIs and the NDCG of a certain model, the better the model will also perform in an A/B test.
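(A hedged sketch of that correlation check; the column names and toy values are illustrative, and the talk later confirms that a Pearson correlation is used.)

# Correlate a per-session/per-query KPI with the NDCG a candidate ranker achieves
# on the same query events (column names are hypothetical).
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "ctr":  [0.0, 0.5, 1.0, 0.0, 0.25],     # observed click-through rate per (session, query)
    "ndcg": [0.41, 0.63, 0.88, 0.37, 0.52],  # candidate ranker's NDCG on the same events
})

r, p_value = pearsonr(df["ctr"], df["ndcg"])
print(f"Pearson r = {r:.3f}, p = {p_value:.3g}")
# Working assumption: the candidate whose NDCG correlates most strongly with the
# on-site KPI should also win the A/B test on that KPI.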
And that's also something we then tested, and we did it for the models that we currently have in an A/B/C test. You can see the results in the two tables: the top one shows the click-through rate and the bottom one shows the conversion rate. You can see that also for this approach the order-and-click-based ranking is the
one with the highest correlation. Then comes the order-based model, and the random model has the lowest correlation. You might wonder why the correlations are so low. I think it's because there is usually a lot of noise in the data, but you can also see that the p-value is very low, so there is a small correlation, but it
is significant, so we think we can still make use of that data. The prediction with the conversion rate as a target showed the same tendency. Not as big a difference as for the clicks, but it also favored the order-and-click-based model.
And now I'm going to show you the results of the A/B/C test that we then did to see if our prediction was correct. There we see that, okay, yay, for the click-through rate we actually managed to predict which model would perform best in the on-site test. You can see that we have an uplift of over 1% with the click-and-order model
versus the solely order-based model. So that's good, but again, there's always a but. For the conversion rate it actually didn't look so great. The test is still running, so there is no significant result yet, but currently it looks like for the conversion rate our prediction is not so good: the order model performed better in the on-site test as of now. That might be due to the fact that we don't have as many data points for order events as we have for click events, so probably predicting that is even more difficult
than predicting what will happen with clicks. To conclude what we found out in our experiments with predicting test results: we applied our predictions on data from March and tried to say what would perform better in an A/B/C test, and we also tried it with past models that we had in A/B tests and checked whether the tendency looked right or wrong. And actually, to be honest, it looked good for a couple of them, but some of the predictions were also quite off. So let's say we are still very confident in the approach and we will definitely put more work into it, but we can't tell you we found the perfect solution for predicting A/B test results; that is still to come. I guess you had thought so before. So what we can say for sure are the drawbacks of the approaches that we have.
They will never work for rankers that are very innovative or that put completely new products at the top of the list, because if you haven't seen the products for that query in the past, you can never tell whether they would have been relevant to a customer for that certain query. And for the same reasons this approach also never works for long-tail queries.
It also doesn't take into account what products are surrounding a certain product. The perceived relevance of a product might change depending on what other products it is shown with in a context. If all the other products are way cheaper, it's probably perceived as less relevant than if all the other products are way more expensive, in which case you probably think it's a good one.
So yeah there are certain drawbacks of our approach but we still think it was worth pursuing that idea and we are also going to continue using it. But Jonas will give you a more detailed outlook on what we are planning to do now.
Lastly. Right, so we showed you our approaches, so what's next for us? We just keep doing both and, long term, always double-check with our A/B/C test results to see which approach beats out the other one in the long run. And specifically for the IPS approach we have a lot of faith, but we just need a better
estimation of our observation probabilities, so we will maybe do some tests like perturbations of certain positions, just to see how the observations change. We might also test IPS out for the model training directly, or try other approaches from the literature. But the core focus of our team right now is to find a ranking model to go live
with permanently. We did manage to beat the status quo quite significantly on a subset of queries, but we now want to generalize that uplift to the whole shop. And then hopefully we'll have some news for you next time. So that's it. Here are our sources for the IPS approach.
Thank you for your attention, and we are hiring at Otto, so if you're interested, follow the link. Thank you very much. That was very interesting. Any questions right now, maybe?
Thank you for your talk. Can you go back to the slide about the correlation, where you introduced the correlation concept with NDCG? OK. So you say you look for correlation between the key KPIs and NDCGs, but what do you use
to estimate the relevance used in the NDCGs? Aren't you using the click-through rate or the click/order rate you mentioned in the beginning? So we use the click-through rate and the conversion rate: we group the data we have by a user's session and the query term, and then for a certain query and a certain user
we know how large the click-through rate of that user is on a certain query and that's like the one side of the correlation and the other side is computing the NDCG of the ranking that a user saw on our web shop and the ranking we would have shown
if we had used our new ranker that we want to estimate. So the ordering we take is the one the user saw and then we compute the NDCG towards what the model would have predicted. I can refine.
But to calculate NDCGs you need a ground truth, right? So how do we associate a query-document pair with an estimated relevance? So we just say that the order in which it was shown in the shop is the most relevant one: the first product gets the highest relevance and then we give decreasing relevance values.
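(A minimal sketch of that relevance assignment and the resulting NDCG; the linearly decreasing relevance values and the helper function are illustrative assumptions.)

import math

def dcg(relevances):
    # Standard DCG: relevance discounted by log2(position + 1), positions are 1-based.
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

logged_ranking = ["A", "B", "C", "D"]
# The logged order defines the "ground truth": first product most relevant, then decreasing.
relevance = {p: len(logged_ranking) - i for i, p in enumerate(logged_ranking)}

candidate_ranking = ["B", "A", "C", "D"]  # what the new ranker would have shown
ideal_dcg = dcg(sorted(relevance.values(), reverse=True))
ndcg = dcg([relevance[p] for p in candidate_ranking]) / ideal_dcg
print(round(ndcg, 3))  # below 1.0 because the candidate deviates from the logged order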
Okay. Okay. Thank you. And maybe we can also chat. Yeah, yeah, that's fine. Thank you. I have one question related to the NDCG, because we have done something similar to try to correlate it with A/B testing, and what we have seen is basically that
precision is giving us more information. So one thing is: have you looked at precision and tried to see how it relates, whether, if the precision is good, the click-through will also increase, right? And, yep, that's my first question.
I didn't really get that, we didn't really get that. So have you... It was just a little bit too quiet. Yeah, thanks. Have you also looked at precision, and not only NDCG, and seen the correlation between precision and the... Click-through? Not yet, right. But especially for the IPS approach it's actually quite simple to include any other metrics.
For the second approach it's a little bit more sophisticated, but we might try that out, right? Because we also want to know which metrics to train our models on, and if we then observe that precision outperforms NDCG in terms of correlation, we might do that in the future as well. Okay, thank you.
Mine is a very short one. What type of correlation are you showing? I mean, it's like a Pearson's correlation. Okay, thank you very much. Thank you. Lunchtime.