
Hybrid search: Greater than the sum of its parts?


Formal Metadata

Title
Hybrid search: Greater than the sum of its parts?
Number of Parts
56
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Over the decades, information retrieval has been dominated by classical methods such as BM25. These lexical models are simple and effective yet vulnerable to vocabulary mismatch. With the introduction of pre-trained language models such as BERT and its relatives, deep retrieval models have achieved superior performance with their strong ability to capture semantic relationships. The downside is that training these deep models is computationally expensive, and suitable datasets are not always available for fine-tuning toward the target domain. While deep retrieval models work best on domains close to what they have been trained on, lexical models are comparatively robust across datasets and domains. This suggests that lexical and deep models can complement each other, retrieving different sets of relevant results. But how can these results effectively be combined? And can we learn something from language models to learn new indexing methods? This talk will delve into both these approaches and exemplify when they work well and not so well. We will take a closer look at different strategies to combine them to get the best of both, even in zero-shot cases where we don't have enough data to fine-tune the deep model.
Transcript: English (auto-generated)
Yeah, hi. My name is Lester. I work as a software engineer at Yahoo, previously Verizon Media, previously Oath, previously Yahoo. I think this is my third time at Berlin Buzzwords.
I've had the same job, but I've represented different companies. But anyway, at Yahoo I work for the Vespa team, developing the Vespa platform out of Trondheim, Norway. There I work primarily with machine learning solutions, being able to run inference on machine learning models and evaluate them, primarily in the context of search and recommendation, on the Vespa platform. But I'm not going to talk too much about that today. Today I'm going to talk about hybrid search. And for those of you that caught Jo Kristian's talk earlier today about semantic search, Jo Kristian is a colleague of mine,
there was a question about how do we define hybrid search. And the way that I am going to define hybrid search, at least initially in this talk, is the combination of traditional keyword term-based search like BM25 and so on, with the kind of newer, cooler,
dense-representation, transformer-based, BERT-based search and so on. How do we combine those things to maybe get something out of them that is more than each of them individually? So we're going to see how we do that a little bit later. First of all, a little bit about Vespa, the team that I work on. So Vespa is what we call
the open source platform for low-latency computations over large, evolving data. It's a platform that we've been developing over many, many years, primarily inside of Yahoo, but we open sourced it a few years ago. Today, internally in Yahoo, it's very popular: it's running hundreds of different applications, hundreds of thousands of queries every second, for hundreds of millions of unique users every month. And Vespa, as it has been developed over many, many years, has a very big and robust set of features, like the ability to compute using native tensors as first-class citizens and built-in support for many different types of machine learning solutions. You can string them together and so on, which is kind of a unique feature. But today, in the context of hybrid search, what we're going to be looking at a little bit is the capability of doing search, like keyword-based search, structured search, and combining that with vector search.
And we think, at least, that Vespa is in a unique position to capitalize on those two different approaches. Yeah, at Yahoo it's running a lot of different applications and so on. It's in very big use, since we have a hosted solution inside of Yahoo running 25 billion real-time queries a day and triple that number of updates. And we've recently opened up a cloud offering, so we can use the competence that we've gained in running this internally for external customers as well. But I'm not going to talk too much about that. So, getting
to the problem at hand, today we're talking pretty much about search. And to kind of define the problem a little bit, for those not too familiar with it, we have this set of content that we want to search within. Content can be anything, really. It can be text, it
can be images, video, user profiles, purchase sequences, whatever you want to do. But yeah, it's some sort of content that you have. But most of the talk today will be about text. And whenever you have some sort of documents and so on representing the text,
we need some way of representing it. And then after we have this representation, we need some way of storing it in some system, some database or so on. And the task really is to have some sort of query that comes in. And we need to represent that in some way that's compatible with the documents. And then do a search or do like a query within the system. And
retrieve the most relevant documents for that query. Typically this is like a multi-stage process where first we find all the documents that are in some way relevant. And then we score them according to some sort of scoring function and just ordering all the documents
that we have. And then returning the top k of them. That's the kind of general problem that we have for search. And when it comes to text, we've had this solution for a long time. We call this term-based or sparse retrieval. And we have these documents and we represent them using
some sort of bag-of-words representation. And bag of words just means that we don't really take into consideration the sequence of terms in a document; we just say this document contains these terms. So that's why this is a bag of words.
And we can kind of think about this a little bit, that we have like a huge, huge, huge vector. The length of the vector is the number of unique terms that you have in your corpus. So for the English language, you have probably hundreds of thousands of unique elements in this vector. But only a few of them are actually non-zero, right?
Namely, the terms that are inside this document. So that's why we call it a sparse vector: because only a very small number of these elements are actually non-zero. Of course, we don't represent them as a long vector; we represent them using some other structure that's more applicable to that. However, we can think about it like a
sparse vector. And we put this into an inverted index so we can look up these terms later. So when a query comes in, we do the same thing: we represent it using this bag of words. And then we look up each of these terms, which documents contain these words, and we get a set of documents that are in some way relevant to the query. And then we score them using some
scoring function; typically, these days and over many, many years, that's the BM25 scoring function. It looks kind of complicated. It's not, really. But it's structured in a way that we can pre-compute a lot of stuff. And instead of just saying which words are within which document, we can actually weight these terms a little bit and store that. I'm going to come back to that a little later. And this is kind of like the state of search over many, many years. But it's prone to one thing that's called vocabulary mismatch. And that's basically: if you're searching for something that's not exactly what's inside of your documents, you might miss the relevant documents that you have. And then we have a lot
of tricks that you can use in BM25 or within term-based retrieval, like different types of pre-processing: stemming, lemmatization, stop word removal, and so on. Many different types of tricks that you can use to increase recall. And the terms sparse retrieval, or sparse representations, are not something that was very commonly used before; it's mostly terminology that's been used more recently to contrast it with dense retrieval, which I'll get back to in a little bit. So just as a kind of short example, we have the documents
containing different terms. We have the sparse representation on top there. We can create an inverted index, and the inverted index contains what's called posting lists: for each term, the documents that contain it. We can issue some sort of query. We can look at which terms
connect to which documents and thus get a set of relevant documents. And then we can use different scoring functions, of course, to be able to score how relevant these documents are for the query. So very straightforward. That's the way we do kind of sparse retrieval.
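To make this concrete, here is a minimal, illustrative Python sketch of an inverted index with BM25 scoring. The tiny document collection, the whitespace tokenization, and the parameter values (k1 = 1.2, b = 0.75) are just assumptions for illustration; real engines pre-compute and compress much of this.

```python
# Minimal sketch of sparse (term-based) retrieval: an inverted index plus
# BM25 scoring, assuming whitespace tokenization and no stemming/stop words.
import math
from collections import Counter, defaultdict

docs = {
    "d1": "the body of water between england and ireland",
    "d2": "thoros of myr is played by paul kaye in game of thrones",
    "d3": "cycling between england and ireland is popular",
}

tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
avgdl = sum(len(t) for t in tokenized.values()) / len(tokenized)

# Inverted index: term -> posting list of (doc_id, term frequency).
index = defaultdict(list)
for doc_id, tokens in tokenized.items():
    for term, tf in Counter(tokens).items():
        index[term].append((doc_id, tf))

def bm25(query, k1=1.2, b=0.75):
    """Score only the documents that match at least one query term."""
    n_docs = len(tokenized)
    scores = defaultdict(float)
    for term in query.split():
        postings = index.get(term, [])
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings:
            norm = k1 * (1 - b + b * len(tokenized[doc_id]) / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda x: -x[1])

print(bm25("body of water between england and ireland"))
```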
Of course, many terms or words inside these documents occur with different frequencies and so on. Some are very common, some are not very common. But there are different ways of accelerating this. Queries that contain a lot of common words would pretty much have to scan through the entire document collection to be able to score all these documents. There are different ways of accelerating this. One is WAND. I won't get too much into that, but it's basically a way of skipping through many, many documents by looking at how much a term can actually contribute to the score. It keeps a set of candidate documents and tries to skip through the rest. I'm not going to talk too much about that. I'm just going to mention here, and I'm going to bring it up a little bit later, that there are ways to accelerate sparse retrieval rather than just skimming through the entire document collection (a simplified sketch of the skipping idea follows below).
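The core of a WAND-style accelerator is to keep, for each query term, an upper bound on how much that term can ever contribute to a score, and to skip full scoring of any document whose combined upper bounds cannot beat the score of the current k-th best document. This is a heavily simplified sketch of just that pruning test; the per-term bounds and the threshold are made-up numbers, and real implementations work directly on sorted posting lists.

```python
# Simplified WAND-style pruning: skip documents whose best possible score
# cannot beat the current threshold (the score of the k-th best document so far).
# term_upper_bounds would normally be pre-computed per term at indexing time.
term_upper_bounds = {"england": 1.9, "ireland": 2.1, "water": 2.4, "of": 0.3}

def can_skip(matching_terms, threshold):
    """Return True if a document matching only these terms can never
    reach the threshold, so full BM25 scoring can be skipped."""
    best_possible = sum(term_upper_bounds.get(t, 0.0) for t in matching_terms)
    return best_possible <= threshold

# A document matching only "of" and "england" can't beat a threshold of 3.0,
# so scoring it is skipped; one matching "water" and "ireland" might, so it is scored.
print(can_skip({"of", "england"}, threshold=3.0))      # True  -> skip
print(can_skip({"water", "ireland"}, threshold=3.0))   # False -> score fully
```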
Okay, that was kind of like the state of search up to about 10 or 12 years ago. And then in came the era of deep learning. We had word2vec, and everything started looking interesting and so on. And about three years ago, of course, in came BERT and transformers and kind of changed the landscape a little bit. You probably know the story there. And one of the first usages of these models within search was what's called cross-encoding. Basically it was taking the document collection, or whatever kind of content you have, the words and so on, and breaking it down into tokens. Different transformer models use different forms of tokenization: you have byte-pair encoding, you have WordPiece, SentencePiece, and so on, but they all take the words and break them down into individual tokens. So you have a sequence of tokens. And you can do the same with a query, break that down into a sequence of tokens, and you can put those two
into the transformer model, and now you can calculate how relevant the query is for the document. A cross-encoder works like this: you have a query up on the top there and you have
some sort of separator token. You have the CLS token, which is kind of the start of whatever you want to calculate, you have the sequence of tokens of the query, a separator, and the sequence of tokens for a document. Each of these tokens has its own vector representation, and each of these vector representations is sent through this huge transformer model, typically containing hundreds of millions of parameters. Then you get another sequence of these vectors on the other side. And it's typically common to look at this one token, the CLS token, the classification token, whatever you want to call it, and run some sort of softmax on the other side to be able to calculate what is the
probability that sentence one matches sentence two. Now, you do this by having a data set, for instance MS MARCO, that you want to fine-tune it on. So you send in the query and the documents, and if you know that the document is relevant to the query, that's a signal for backpropagating all this through the transformer model and so on, and thus getting new representations for each token. Of course, this has some drawbacks, because there's no way of indexing this, right? You need to take the query and each and every document that you have in your collection and calculate this for every document. And this of course takes a while,
because this is a big, big model, and it's not really a feasible way of doing this. So it can work if you have a smaller document collection, which means that you can do, for instance, traditional BM25 search first and then do a re-ranking with this cross-encoder; that is a common way of doing it, it works actually very well, and it's a pretty good way of doing it these days. However, there's another way as well, which became popular a little bit later, or pretty much at the same time, just afterwards, which is called the bi-encoder. So just to differentiate
between what a cross-encoder is and a bi-encoder. The bi-encoder has two different transformer models rather than just one. Well they don't have to be different but it's like two at least. And you have one for the query and one for the document. It works the same way and on the other side you have this signal for instance dot product or Euclidean or
cosine, whatever you want to use, for how relevant the query is to the document, or the other way around. And you train this and backpropagate through the two different models. The nice property of this is that the documents only need to be calculated during indexing, not during query time, so you can actually spend
some time on that if you want to. The query transformer, though, needs to be run during the query, so it needs to be calculated at query time. But a transformer model typically has this process called self-attention, and self-attention means that each and every token is aware of each and every other token, so the runtime of a transformer model is quadratic in the length of the sequence that you have. Queries are typically shorter than documents, so transforming the query can actually be fairly efficient; you do have time to do that. Also, you can have a different model for the queries and the documents, so they can be made more efficient. So that's the way we do it these days, and these vectors that come from the documents can be stored inside of a nearest neighbor retrieval system (a small code sketch of the two encoder styles follows below).
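As an illustration of the two encoder styles, here is a hedged sketch using the sentence-transformers library; the two checkpoint names are just examples of publicly available MS MARCO-trained models and can be swapped for others.

```python
# Sketch of cross-encoder re-ranking vs. bi-encoder (dense) retrieval,
# assuming the sentence-transformers library and these example checkpoints.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "what is the body of water between england and ireland"
docs = [
    "The Irish Sea separates the islands of Ireland and Great Britain.",
    "British Cycling is the national governing body for cycling.",
]

# Cross-encoder: query and document go through one model together.
# Accurate, but must be evaluated for every (query, document) pair.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross.predict([(query, d) for d in docs]))

# Bi-encoder: documents are embedded once at indexing time; at query time
# only the query is embedded and compared with a dot product.
bi = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-6-v3")
doc_emb = bi.encode(docs, convert_to_tensor=True)     # offline / at indexing
query_emb = bi.encode(query, convert_to_tensor=True)  # at query time
print(util.dot_score(query_emb, doc_emb))
```

The bi-encoder's document embeddings are what get indexed for nearest neighbor search; the cross-encoder can only realistically be used to re-rank a smaller candidate set.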
So this is how embedding-based or dense retrieval looks. You still have your documents, you have your processing, which is the same tokenization as before, and you transform them into embedding representations. Now, instead of these sparse representations, we have a much shorter dense representation. Short here is in the order of a few hundred elements, typically 768 dimensions for BERT-based transformers, and these are stored inside of your nearest neighbor search engine or database. The same with the query: we calculate the embedding there, and then we have the embedding of the query and can look for its nearest neighbors within the document collection, and hopefully find the documents that are closest to the query. Okay, that's the idea behind dense retrieval. So you can think about dense representations as some sort of way of compressing the huge sparse representation of a document or query down to a much shorter dense representation, and we can
use various types of scoring models. The most commonly used one today is the dot product, but you also have other options, Euclidean and cosine, depending on the model that you're using. Dot product, or inner product, is the most common one, but it has some issues regarding real-time indexing and so on, because it needs to be normalized; but that's another talk. One of the nice things about dense representations, of course, is that you can really represent anything, not necessarily just text: you can represent text, images, sounds, user profiles, purchase sequences, whatever, as this
sequence of numbers. So it enables multi-modal search as well: you can train these so that you have a semantic representation of text together with images, so you can easily create image search applications as well. Basically you're just projecting the image and the text representing the image into the same semantic space. At least, these dense representations enable you to do these kinds of interesting things. So when you have a dense representation of something, you can think about this as having documents in some sort of high-dimensional space, and you have a query in this high-dimensional space as well, and what you want to do is find the closest documents to that query
using the various distance functions that you have available. The problem with that approach is that there is no known way of doing exact nearest neighbor search without going down to a linear scan of all documents. So the only way of doing it is building index structures around it, and they are approximate: we can only find approximately the nearest neighbors, using various solutions. There are a few ways of doing that that can be built upon traditional inverted indexes, using a k-means approach, product quantization and its relatives, locality-sensitive hashing, and so on; it's basically about indexing the clusters, or the centroids of the clusters, in some sort of inverted index. But these days I think most approaches that use ANN for these kinds of applications use HNSW, the hierarchical navigable small world index, which is a graph-based structure that is not really compatible with inverted indexes, so you need your own system for that. It has a lot of benefits as well, but that is also another talk (a small sketch with a standalone HNSW library follows below).
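For illustration, here is a minimal sketch of approximate nearest neighbor search with hnswlib, a standalone HNSW library; the dimensionality, index parameters, and random vectors standing in for document embeddings are made-up examples (in Vespa the HNSW index is built into the content nodes instead).

```python
# Approximate nearest neighbor search with an HNSW index (hnswlib),
# using random vectors as stand-ins for document embeddings.
import numpy as np
import hnswlib

dim, num_docs = 384, 10_000
doc_vectors = np.random.rand(num_docs, dim).astype(np.float32)

index = hnswlib.Index(space="ip", dim=dim)      # "ip" = inner product
index.init_index(max_elements=num_docs, ef_construction=200, M=16)
index.add_items(doc_vectors, np.arange(num_docs))
index.set_ef(100)                               # recall/latency trade-off at query time

query_vector = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query_vector, k=10)
print(labels[0])  # ids of the (approximately) nearest documents
```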
Okay, so comparing these two approaches to each other: this is the leaderboard from the brilliant Sentence-BERT library, which you can easily use to create embeddings both for documents and queries. As we see here, for the MS MARCO data set, BM25 using Elasticsearch has a mean reciprocal rank score of about 17.3; that's kind of the baseline. However, using these other approaches, different ways of encoding dense representations, you see they're much, much better: 34.4, 33.7 for these different types of models, for cosine similarity or dot products and so on. So just looking at this, it looks like there's really no contest between these, right? I mean, BM25 is outdated, all these new ways of representing and using those dense representations are much, much better, so we should just use that, right? But not entirely, because first of all, BM25 is an unsupervised method: there's basically no training, it's very easy to use it for a document collection, there's nothing you need to fine-tune or whatever. However, these other models for dense representations really need to be trained towards this data set, and these have been trained towards MS MARCO specifically to work well for MS MARCO; of
course there's no contest in that direction. However, if we open up a little bit, there are nuances here. This is from the original DPR paper, and I think it's telling of the kind of differences between a sparse or term-based thing like BM25 versus a dense, representational way of doing it. So for example, if your question, or query, is "what is the body of water between England and Ireland", and we do a BM25 search, then we might get something like this about British Cycling, because it contains the words or terms England and body and Ireland and so on, and just using term-based bag-of-words scoring it seems like that's a pretty relevant document to the query; but of course it's not really relevant at all. However, the dense representation is able to find the correct, or most relevant, document about the Irish Sea, because this is a more semantic representation that understands that "body of water" means something like "sea", right? So it's a softer connection to what's actually being queried here. But also, oppositely, "who plays Thoros of Myr in Game of Thrones" is a very specific query, right? You're actually asking for some very specific words in this document, and here BM25 is able to retrieve the correct document, with Paul Kaye, which is the answer to that question, because it specifically contains the words Game of Thrones and Thoros of Myr. But the dense passage retrieval model, this is DPR, is not able to see that at all; it's not within the training set. So it's not worth just throwing away BM25 quite yet, because there might be some benefits in using that as well. And another aspect is what happens when you take a look at
tasks or other data sets that are out of domain from what you're training on. So what we saw earlier was that these dense representations work very well within the context of MS MARCO, which they were trained against, but they don't work that well on domains that are not. The BEIR paper, which came out I think last year, specifically wants to open that up a little bit and measure this, and they came up with 18 data sets and nine different tasks of out-of-domain data sets, to be able to measure in some way how well these dense representations work outside of their competence zone. And by out of domain, I mean there are different reasons why something is out of domain. One is that the documents are different; it's an entirely different type of documents, like medical text versus text about cars and so on. The queries can be different, the actual queries that you use towards the document set, and the query-document relationship can be different: sometimes you're just asking for something very specific, sometimes you're asking questions, and so on. And of course, all of the above. And what this original BEIR paper found was that BM25 is actually a pretty decent baseline for most things. You can see here that this line in the middle is
how well BM25 scores across these 18 different data sets, and the baseline BM25 is better than six out of these nine different models or approaches on average across all these different data sets. The one that works best is BM25 in the first phase with a cross-encoder in the second stage; it's this kind of multi-stage retrieval. So initially it doesn't look very good: it seems these dense representations don't really generalize that well outside of whatever they've been trained on. But we've also seen that BM25 has some properties that work well on some sorts of documents, and we see that dense retrieval also works very well on some different types of documents. So it's very easy to think: let's try to combine these in some way, see if we can get some of the nice properties of BM25 and the nice properties of this dense retrieval, maybe use both of them and try to see if we can extract even better results at the end. So if we set that
up, we have a hybrid retrieval here where we do term-based retrieval on one side and embedding-based or dense retrieval on the other side, and we combine these results in some way. What can we get out of that? There are different ways of combining this; I'm going to show just two, because they are easy ways of doing it. One is: let's say we have these results coming from the sparse side and these coming from the dense retrieval. Then we have the naive way of thinking about this, a linear combination, saying that we have the scores of these different documents given the sparse query and the scores given the dense query, and we combine them in some way. Note that these have different score averages and so on; they might come from different distributions, so we need to be able to combine them sensibly. So the easiest way is to have some sort of linear combination, just selecting some alpha here, which is how much you want the final result set to be weighted towards BM25 and how much you want it weighted towards the dense retrieval. So here, for instance, if we have an alpha of 0.5, we could get this kind of ordering of these documents. And then we have the alternative called reciprocal rank fusion, which is a parameter-less way of doing it. It just doesn't care about the scores at all; we just look at the ordering, and that's basically this function here. It's basically saying that if a document is highly ranked in both of these, then it will get a high rank in the resulting list, and so on. So we're not really caring about scores; we're only looking at the ordering inside of these different result sets. It actually works very well (a small sketch of both ways follows below).
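Here is a small Python sketch of the two fusion strategies just described. The scores are invented, min-max normalization is just one reasonable way to make the two score distributions comparable, and k = 60 in reciprocal rank fusion is the constant commonly used in the literature.

```python
# Two simple ways to fuse a sparse (BM25) result list and a dense result list.
sparse_scores = {"d1": 12.3, "d2": 9.1, "d3": 2.4}    # BM25 scores
dense_scores  = {"d2": 0.81, "d3": 0.79, "d4": 0.55}  # e.g. dot products

def normalize(scores):
    """Min-max normalize so the two score distributions become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo or 1.0) for d, s in scores.items()}

def linear_fusion(sparse, dense, alpha=0.5):
    """alpha = 1.0 -> only BM25, alpha = 0.0 -> only dense."""
    sparse, dense = normalize(sparse), normalize(dense)
    docs = set(sparse) | set(dense)
    fused = {d: alpha * sparse.get(d, 0.0) + (1 - alpha) * dense.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

def reciprocal_rank_fusion(*result_lists, k=60):
    """Parameter-less fusion: only ranks matter, RRF(d) = sum(1 / (k + rank))."""
    fused = {}
    for scores in result_lists:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, doc in enumerate(ranked, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

print(linear_fusion(sparse_scores, dense_scores, alpha=0.5))
print(reciprocal_rank_fusion(sparse_scores, dense_scores))
```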
So we set this up within Vespa; just a few words about how we set this up. Vespa is a platform for doing many different things, but to do search in Vespa you write this application package, which is a declarative package saying how you want to do something: how you want to set up search, how you want to score these documents, whether you want these and those machine learning models in there, whether you want custom code, and so on. You write this application package, it gets sent to the config cluster, and it's distributed across all the different nodes that you have in your system. There are two main types of nodes: the stateless container nodes, which is basically where you run your own custom code, and the content nodes, where all the content is stored. So whenever an HTTP request comes in for a query, it gets sent to one of the stateless container nodes, where we're able to run machine learning models, for instance for embedding the query; then it's sent to all the content nodes, where we do retrieval of the most relevant documents, and it gets sent back to the stateless container and back again. So that's just to
give you an idea of how that works. Setting this up inside of Vespa, having these two different ways of doing it, BM25 on one hand and this embedding-based dense retrieval on the other, is very easy these days, after some development work on our side. Basically, there is a content field which just says enable-bm25, which means it builds up a BM25 index for it, and then we have another field called embedding which basically gives instructions that during indexing we should take the content field, which is up here, and embed it using some sort of embedder which you set up somewhere else (an embed transformer), and it will automatically create an embedding for you. That is what will be stored inside of the index; in this case we want to store it inside an HNSW index, and we want inner product to be the distance metric. With that we have actually enabled both of these types of retrieval, both BM25 and dense, and Vespa will also take care of actually embedding this for you; we have some instructions on how to do that down here. And this is an example of how we would do a linear ranking, basically saying we would do 0.5 times the BM25 score plus 0.5 times the embedding or dense retrieval score, which is called closeness in this instance. The reciprocal rank fusion is basically custom code,
and that needs to be done outside, in the stateless container nodes that we saw, because basically we can't do that within a single document; we have to look at two different document streams, so we need to lift that up a little bit. But I'm not going to show that. So, the results. This first graph here on the left is not actually out of domain, it's actually in domain, but I just want to show that as well to get some sort of impression. This blue line here is the result of doing a linear combination of BM25 and the dense retrieval; we're using sentence-transformers, in this case a MiniLM model. We see here that if we have an alpha of zero, meaning we're only using the dense side, we have a recall at 1000 of around 94 or something, and if we're only using BM25, we have a lower score over here. So in this case it mirrors what we already saw, that dense works better because it's in domain. However, interestingly, it actually benefits a little bit from using BM25, at least with small values of alpha in this case. So actually the results are better if we're able to combine them in some way. This red line here is using reciprocal rank fusion, and it's of course even better; it's a straight line because it doesn't really have this alpha value, and it works very well. So combining these two in domain works better than each of them
individually. This yellow line is just fun to see: it's the same as if you're using BM25 in the first stage and a cross-encoder in the second stage, but then, instead of using only the actual cross-encoder score in the second phase, you're combining BM25 and that score as the value of the second phase. And you see that there's actually no benefit to doing that at all; it's mostly better to just use the cross-encoder score, because it turns out that the cross-encoder pretty much encodes all the information that's in the BM25, so it's not beneficial to combine them there. It's just kind of an interesting result. So, these other two graphs here, Robust04 and TREC-COVID, are both from the BEIR data set. Robust04 is a data set that's pretty close to MS MARCO, not really that different.
But still, you see here that the dense retrieval on this side is lower than BM25, so BM25 works better than dense retrieval here. But if you're able to combine them a little bit, you still get kind of a boost over here, and this reciprocal rank fusion works much better than both of them. TREC-COVID, on this other side, is another data set that has a pretty big domain shift away from MS MARCO; TREC-COVID is of course much more about medical text and so on, and contains a lot of words and content that's pretty much outside of what MS MARCO contains. And you see here that the difference between BM25 on this side and the dense retrieval on this side is much, much larger. So BM25 in this case works better on this out-of-domain data set; of course, it's an unsupervised method, and that's why BM25 is a pretty good baseline for most of the things that you do. But even then, even if you have a dense retrieval method that's trained on something else, it's still beneficial to be able to combine them in some way with this reciprocal rank fusion. So, kind of interesting results, right? Even though your dense retrieval method is out of domain, it's still beneficial when you combine it with BM25; they work better together than both of them individually. Interesting results. Of course, so far we just talked about BM25 on one side and dense representations on the other side. So this setup here
is from the paper by Jimmy Lin called "A Proposed Conceptual Framework for a Representational Approach to Information Retrieval", kind of a long name, but still it's interesting: he's set this up to be able to more properly compare, or see where these different approaches are at. So we've been looking at the sparse bag-of-words method, which is an unsupervised method, which is in this corner here, and the approaches that we compared with were the dense supervised side down here. But of course there are other ways of thinking about this as well. One interesting area is over here, where we still have a sparse representation of things, but not BM25: we've learned a way of indexing these different terms. Basically it's about learning term weights and document expansions; these different approaches use some sort of BERT model or transformer model to learn sparse representations, weights for these different terms, and use different techniques similar to what we did with BM25 to increase recall, such as document expansion and so on. And if we take a look at that, they actually perform fairly well; I don't know if you can see that, but this is the table that comes from this paper, and they work fairly well in domain. But it also turns out, as you can see here, that these methods that use a learned sparse bag-of-words representation also tend to work fairly well out of domain, which BM25 does as well, better than dense vectors alone. And some approaches are a kind of new direction which tries to train both a sparse and a dense representation inside the model itself, combining these two different things; ColBERT v2 is a model, at least one that I know of, that works very well across different domains. Basically, this is all about representation learning (a tiny sketch of how a learned sparse representation is scored follows below).
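As a toy illustration of what a learned sparse representation looks like: a model assigns a weight to each term, possibly including expansion terms that never literally appear in the text, and scoring is still a sparse dot product that an inverted index can serve. The weights below are entirely made up.

```python
# Toy learned-sparse scoring: term -> learned weight, score = sparse dot product.
# A BERT-based term-weighting/expansion model would produce these weights;
# here they are invented, including the expansion term "sea".
doc_terms = {"irish": 1.8, "sea": 2.1, "ireland": 1.2, "britain": 0.9, "water": 0.7}
query_terms = {"body": 0.4, "water": 1.1, "england": 0.8, "ireland": 1.0}

score = sum(w * doc_terms[t] for t, w in query_terms.items() if t in doc_terms)
print(score)  # only the overlapping terms ("water", "ireland") contribute
```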
So, finally, just to sum this up a little bit: BM25 and dense retrieval, as we saw, are complementary, right? They have different qualities that can be used with each other, play off each other, and together form something that might be even better. But this is really just ensemble learning. Within machine learning we have this concept of ensemble learning, where you have different methods, or weak learners as they're called, that can be combined into a better, stronger learner. So you have different sorts of approaches there and you can combine them in different ways, and this is pretty much just an ensemble, right? Because now we just looked at BM25 and dense, but you can think about having many other forms of retrieval and combining them using this reciprocal rank fusion, and maybe get even better results. I don't know, it's an open question,
but there are other ways of thinking about this. We just used this reciprocal rank fusion, which was a parameter-less way of combining these results and works very well, but there are other ways as well. For instance, we saw that this BM25 example worked very well because we were looking for something very specific, this Thoros of Myr example. So we can envision that maybe you have some sort of ensemble logic that says: okay, we have a query coming in for something very specific, the IDF scores are high, so maybe we weight a little bit towards the BM25 retrieval; or the terms are not very specific, so maybe we weight a little bit towards the dense retrieval method; or we can maybe even train this using a machine-learned model and do that instead (a toy sketch of that query-dependent weighting idea follows below).
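A toy sketch of what such query-dependent weighting could look like, with an invented IDF table and threshold; this is only the idea floated in the talk, not something that was shown.

```python
# Toy query-dependent weighting: shift alpha towards BM25 for "specific"
# queries (rare, high-IDF terms) and towards dense retrieval otherwise.
# The IDF table and the threshold are made up for illustration.
idf = {"thoros": 9.2, "myr": 8.7, "game": 3.1, "thrones": 4.0,
       "body": 2.0, "water": 2.2, "england": 3.5, "ireland": 3.8}

def choose_alpha(query, specific_threshold=6.0):
    """Return the BM25 weight alpha in [0, 1] for a linear fusion."""
    terms = query.lower().split()
    max_idf = max((idf.get(t, 0.0) for t in terms), default=0.0)
    # Very rare terms -> lean on exact term matching (BM25); otherwise lean dense.
    return 0.8 if max_idf >= specific_threshold else 0.3

print(choose_alpha("who plays thoros of myr in game of thrones"))  # 0.8 -> BM25-heavy
print(choose_alpha("body of water between england and ireland"))   # 0.3 -> dense-heavy
```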
So there are many different options going forward. Of course, all of this is very easy to set up in Vespa, so the take-away message is: Vespa is awesome. That was pretty much everything I was going to say.
Questions? Yes, thank you. I have a simple question. Let's say you're not starting by using Vespa and you're interested in blending sparse and dense based search methods. Do you have to move first to something like Vespa for both dense and sparse, or can you imagine combining by having Elasticsearch or Solr or whatever for the sparse part, and then, at least as a starting point for a PoC, using something like Vespa or something else for the dense part, and then combining those together? Yeah, of course, I mean, it's up to you. I'm most familiar with Vespa, of course, and I'll say that this is really easy to set up in Vespa, but there's nothing wrong with setting up one system using Elasticsearch, using Faiss on the other hand, and so on, and doing that yourself; you should get the same results, definitely. Okay, thank you. Hi, thanks for the very interesting talk. I have a question
regarding this hybrid approach. Have you seen any places that combined even more solutions and ensembled them together? So for example, having classic retrieval like BM25, having dense contextual semantic retrieval, and then also something like historical data or even manual ranking methods? I haven't personally; I tried to allude to that at the end, that that's a real possibility, and it's definitely something you should get working on.
Thanks. Yeah, so thanks for your talk. I have a question around the engineering concerns of using transformers in production, specifically around dealing with large documents. Are there any implementations dealing with that within Vespa, or do you have any experience with that problem? Yeah, well, of course, these transformers are not cheap, right? Depending on the model you use, there are different sizes of models and so on, but the original BERT-large model that a lot of people used initially is very computationally expensive. But there are two different things, right? One is during document ingestion, during feeding of the documents to your system: you typically have more time there than during a query, so the time spent there is not really that much of a problem. Depending, of course: if you have a really, really large collection of documents and you need to index them within the shortest amount of time possible, then of course you should use things like GPUs and so on. But during the query, that's pretty much when things are very time sensitive, because you want to deliver things with as low latency as possible. In that case it might suffice just to have a regular CPU implementation of the model evaluation, because the queries are not really that long. So even a very complex transformer model can be evaluated within a relatively tight time budget, because, at least for this current generation of transformer models, the complexity of evaluating them is quadratic in the length of the sequence that you're actually encoding, because of the self-attention thing, and queries are typically
very short, so we can do that fairly effectively. So I hope that answers your question somewhat. Any more questions right now? Okay, thank you. Yeah, really great talk. A really powerful slide, I think, was where you had BM25 on one side and dense on the other side, and you see that the combination is actually better than both. What I was wondering, and Jo also touched a bit on this in his talk, is that one of the ways to basically overcome the out-of-domain problem of the dense model would be to find an unsupervised way to train it. I think Nils Reimers has called this Augmented SBERT, where essentially you use a cross-encoder to get pseudo-labeled data. And I'm thinking, since you now have an unsupervised way to basically find better data, if you now use that to train a dense bi-encoder embedding model, and maybe even loop it back in, how would that work? Or have you tried that? That's a very interesting question. I have not tried that, but like I said, it's something you should get working on. Interesting idea, yeah, definitely. Yep. Hi, thank you, that's a very interesting talk.
I have a question about combining the BM25 and the dense vectors with the reciprocal rank fusion: what happens if your result sets from the two are entirely disjoint? I think that in that case it would just alternate, taking every other document, right? Because it's a very fixed equation there, and since it just takes the ordering, it would be, yeah, just every other one. Anyone else? All right then, thank you, Lester. Thank you. Thank you all for attending.