
Vector Search: Ask Me Anything!


Formal Metadata

Title
Vector Search: Ask Me Anything!
Number of Parts
69
License
CC Attribution 3.0 Unported:
You may use, change, and reproduce, distribute, and make the work or its content publicly available in unchanged or changed form for any legal purpose, provided you credit the author/rights holder in the manner they specify.

Content Metadata

Abstract
Get to know about vector search and ask Dmitry Kan & Max Irwin anything you need to know! This session is presented by "Haystack – The search relevance conference" and hosted by Charlie Hull.
Transcript: English (auto-generated)
Hello, good morning, good evening, good afternoon, good day, wherever you are in the world, and welcome to Berlin Buzzwords. My name's Charlie Hull. I'm from Open Source Connections. We're the search and relevance people,
and we're sponsoring Berlin Buzzwords, very happy to do that. Do check out our booth in the partner area. So this talk actually is presented by the Haystack Conference who are partnering with Buzzwords this year. With Haystack, we aim to share great talks on search and relevance and bring the community together.
Currently, we're running a Haystack live meetup every few weeks. I'll drop a link into the chat. You're very welcome to join that. We've got nearly 800 people come along to those talks. And later this year, we're even hoping to start running physical events again, fingers crossed. Do keep an eye on the Haystack website, and I'll paste a link to that into the chat as well.
But anyway, back to tonight's Ask Me Anything. So, vector search, it's the next big thing in search, right? Well, how do you actually do vector search? Why should you consider using it? Does it work? How does it work? Is it fast? Is it slow? Might it be better than good old text search with TF-IDF?
What are the pros and cons? Is it even ready for mainstream use yet? Well, I'm very happy to say, to help answer some of these questions, we have two luminaries of the search world. We have Dmitry Kan of Silo AI, and my colleague Max Irwin from Open Source Connections, who are gonna try and answer these questions.
Dmitry, Max, I'm gonna ask you to introduce yourselves, and also maybe a quick story about how you got so interested in this topic. So, Dmitry, maybe you can kick us off. Yeah, thanks Charlie. Hi everyone, glad to be here. So yes, I'm Dmitry Kan, I go by Dima for short.
I'm currently a principal AI scientist with Silo AI. It's the largest private AI lab in the Nordics. And I'm currently leading a team of AI scientists and search engineers building new experiences for web scale search. So how did I end up in this topic of vector search? It has been a hobby topic for me since August last year.
Well, apparently there was nothing better to do. And from the first experiment I have set out to evaluate vector search from the feasibility and production readiness point of view. And it turned out that both Solr and Elasticsearch support vector search. Well, technically for Solr,
I had to take a custom query plugin into use, and for Elasticsearch, I came across the Elastiknn implementation. Another goal of mine was to publish my findings on Medium, and this helped me attract attention from the larger community, leading to testing a commercial implementation
of vector search on custom APU boards. And according to my most recent experiment, this custom solution was the best and the second best was Elasticsearch. So I'm continuing my experimentation in this area. It's really, really interesting topic for me.
Fantastic, thanks Dima. Max? Hey everybody, I'm Max Irwin. I'm a managing consultant at Open Source Connections. I've been working in the search domain for about 10 years now, maybe a little bit longer. I started learning and using NLP in 2015. My initial area of research
was actually knowledge graph extraction and vocabulary extraction. And I still tinker there occasionally, but I fell into the natural progression of NLP into large language models, the BERT stuff in the past three years. These days, I'm actively working with clients
and trying to bring these models and merge them with the practical tools that we use day-to-day in search technology. I'm also writing a couple chapters for the book AI Powered Search by Trey Grainger with Doug Turnbull. My chapters are about using large language models
with vector search for auto-complete semantic search and question answering. I'm focused specifically again on practical tooling and the use cases for practitioners and trying to bring all this cool stuff that happens in ivory towers of academia and Google and Bing into the hands of just us regular folks who are trying
to ship smaller products day-to-day. Fantastic, thanks Max. So the way this is going to work, you can submit questions on vector search for Dimitri and Max in the usual fashion in the chat. But we also, we thought we'd get ahead of ourselves a little,
and we asked the community a couple of weeks ago to send us some questions to get us kicked off and maybe to inspire some of your questions. So we're going to start with those. Hopefully this will be useful. So we're going to kick off with our first question and forgive me, I may have to read this from the document because it's complicated.
Our first question, and I'm sorry, we don't have the people who asked these questions written down here, but maybe you'll recognize them yourself. A lot of machine learning applications use Faiss, Annoy, or NMSLIB behind a simple web service for approximate nearest neighbor retrieval,
for example in recommender systems. This works well for simple applications, but when efficient filtering is required, it seems you need to take the leap to a fully fledged search system: Elasticsearch, Vespa, et cetera. Do you think there's an unserved niche for a Faiss-plus-filter tool? Or do you think the additional benefits of a search system like Vespa
pays for the additional complexity it brings? I'm going to ask this to Dima. Oh yeah, thanks Charlie. Well, if you are in the Elasticsearch world as I am, you have two options. So I already mentioned the Elastiknn plugin.
It basically implements LSH, the locality-sensitive hashing algorithm. And then if you want to live dangerously, you can also go and check out Open Distro, where they implement a graph method. It's implemented in C++ and runs basically off-heap.
And the Elastiknn plugin supports pre-filtering. It's a difficult use case for many search engines, I would say: you have a number of, you know, parameters that you want to filter your search down on first, and then you run an ANN algorithm on top of that.
I'd say whichever methods you choose, you need to carefully select the hyper parameters that each of these algorithms, you know, offer in order to bring the best, you know, performance in terms of indexing speed versus recall versus like memory consumed during indexing
and during search. And I'd say at least according to the papers, you know, the graph method, the hierarchical navigable small world graph method scales well to multi-core architectures. And it has like a bunch of heuristics there as well to avoid like local minima when it builds the graph
and it builds a well-connected graph as well for like really large set of nodes. But you know, like if you go back to Lucene, building a graph for each segment might become super expensive. And so you should consider merging segments down into one segment before serving queries.
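(A minimal sketch of the kind of HNSW setup Dima is describing, using the hnswlib library. The parameter values here are illustrative assumptions, not the ones from his experiments: M and ef_construction trade index-time memory and speed against graph quality, and ef trades query speed against recall.)

```python
import numpy as np
import hnswlib

dim = 128
vectors = np.random.rand(10_000, dim).astype(np.float32)

# Index-time hyperparameters: M (graph connectivity) and ef_construction
# drive memory use, build time, and the quality of the resulting graph.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), M=16, ef_construction=200)
index.add_items(vectors, np.arange(len(vectors)))

# Query-time hyperparameter: higher ef means better recall but slower search.
index.set_ef(100)
labels, distances = index.knn_query(vectors[:1], k=10)
```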
And so generally I think combining filtering with an ANN in one single, you know, pass is a wise decision because, you know, if you offer like a multi-step retrieval where you will like first retrieve something then filter down and then, you know, re-rank,
this will likely suffer from low speed or low recall, or both. So I think combining this into one single phase is a really nice and wise solution. Max, what do you think? I think that, yes to both. I think there is an opening for a niche
in the Faiss-plus-filter space. But I think that there are, you know, huge things that Elasticsearch and Vespa, for example, bring to the table. So if you're gonna build something on top of, for example, NMSLIB or Faiss or another vector search library,
you're basically doing the same thing as if you were gonna start building a search engine off of Lucene. You can do it, but you're gonna miss out on all the things that you take for granted with DevOps and configurability and deployment and sharding replication and all that stuff. So you probably shouldn't roll your own like that and chuck it into production.
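(For context, this is roughly all a vector library gives you when you "roll your own" — a sketch with Faiss, assuming normalized embeddings so that inner product behaves like cosine similarity. Everything Max lists, like deployment, sharding, replication, and filtering, is left for you to build around it.)

```python
import numpy as np
import faiss

dim = 384
doc_vectors = np.random.rand(100_000, dim).astype(np.float32)
faiss.normalize_L2(doc_vectors)          # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)           # exact search; faiss.IndexHNSWFlat(dim, 32) would give ANN
index.add(doc_vectors)

query = np.random.rand(1, dim).astype(np.float32)
faiss.normalize_L2(query)
scores, doc_ids = index.search(query, 10)   # that's it: no filters, no ops tooling, no sharding
```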
You're gonna have a very, very hard time with that. So Dima has already mentioned, you know, the stuff that Elastic is working on in some areas there. There are some new players that are coming out like Gina AI, VBA, Milvis, Pinecone are a couple examples that are trying to fill that niche.
But those are, you know, they're newer, they're startups, it is risky if you wanna build something, you know, an existing big product on top of one of those newer systems. You can check out Vespa, which is definitely the mature product in this space.
But I think there are a lot of options you can look at and consider, but definitely do the research and make an informed decision. Fantastic. I will mention actually, we've had a couple of these vector search engines presenting at Haystack Live. So you can go back and check those out. They're on our YouTube channel. So I'm gonna move on to the next question.
Sometimes information is encoded in the language people use — I mean, layman's terms for layman's content, or professional terms for professional content. So in medicine, you might have very different results for acute myocardial infarction and heart attack. How do you model these differences as input for a language model?
And are vectors even the answer here, for domains where there's information in the meaning but also in the terms? Max, do you want to kick us off on this one? Yeah, so this is the classic NLP vocabulary mismatch problem — but not even NLP, just the search vocabulary mismatch problem. You have a corpus of text
that contains one type of language. And then you have people searching using a different type of language. There are a couple of things here. So first of all, with vector search, it's not like this magic thing that you're just gonna throw it out there and replace everything that you're doing. There are a lot of tools that you can use
that we've been using for years. And traditionally this has been solved with synonyms and knowledge graphs, right? So you can do a map. So if you see a term that's not in your corpus, you can map to the language that's in your corpus and in your index.
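(A toy sketch of that classic fix — a hand-maintained map from layman's terms to the vocabulary that actually appears in the corpus, applied to the query before matching or embedding. The entries here are made up for illustration.)

```python
# hypothetical layman -> corpus-vocabulary map, maintained with domain experts
SYNONYMS = {
    "heart attack": "acute myocardial infarction",
    "high blood pressure": "hypertension",
}

def rewrite_query(query: str) -> str:
    q = query.lower()
    for layman, expert in SYNONYMS.items():
        q = q.replace(layman, expert)
    return q

print(rewrite_query("heart attack treatment"))
# -> "acute myocardial infarction treatment"
```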
There are, in terms of bringing this stuff into large language models, you can try some hacks of fine tuning by adding additional content to your model that contains the language that you want to include. But no matter what, the large language model was trained on an initial vocabulary.
And that vocabulary is limited. So in BERT, you have like word pieces and the word pieces are like 30,000 initial word pieces. So if your language deviates from that significantly, then even fine tuning may not really help that much. And you can try training your own model and setting your own vocabulary with a merged vocab set and merge content set.
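(To make the word-piece point concrete, here is a quick check with the Hugging Face tokenizer; the exact sub-word splits shown in the comments are approximate and will vary by model.)

```python
from transformers import AutoTokenizer

# bert-base-uncased ships with a fixed vocabulary of roughly 30,000 word pieces
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("heart attack"))
# ['heart', 'attack']  -> both words are in the vocabulary
print(tokenizer.tokenize("acute myocardial infarction"))
# something like ['acute', 'my', '##oca', '##rdial', 'in', '##far', '##ction']
# -> domain terms get shredded into sub-word pieces the model never saw as whole words
```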
But training your own model like that is expensive and typically out of reach for most teams. But if you have the resources, you can try it as a hypothesis, test it, and see how it plays out. Dima, what's your view on this one?
Yeah, I think it's kind of cool when you throw like BERT model, for instance, at search and you type mathematics and it tells you, you know, geometry or linear algebra in response, it's kind of all cool and fancy. But I think when it comes, you know, to a specific domain, you know, like financial or healthcare, whatever you have,
I don't think it will capture it so easily. And so I agree with Max there, like you really need to fine tune your model on the data, which might be super expensive as well, depending on the size of your corpora. But at the same time, do you really wanna go and attack that problem
from the large scale model, or do you want to just go and configure that old fashioned dictionary, which will work quite well, because it's a controlled way, you know, to offer this experience to users and why to pay so much, you know, money to train a model when you don't see an exact application for it.
And yeah, I mean, I think we will also cover some more topics later today, but also: establish a baseline for your search — know how it performs now before you venture into a vector model. Very sensible. Try the stuff that you know works
before you try the stuff that you don't know works. So our next question: somebody's noticed that Instagram music uses some kind of language model now. And if you search for capitaine, the French word for captain, to find French songs called "Capitaine", English songs about captains are actually not relevant.
How do we avoid losing the information contained in the exact words when we search with meaning, as you might with a vector model? Dima? Yeah. So actually, as a matter of fact, I'm building a multilingual search engine with my team.
And we basically have like independent indices for different languages. And so when query comes in, we do our best to detect a language. And so then we will like send the query into the specific index. So there is like a high likelihood
that it will capture the semantics of what you need, even like without vector search. But other than that, I think if you already implemented like vector search, give users control in your user interface. Like if they don't agree with the results
and they clearly see that search engine didn't nail it, just give them tools to go back to like old fashioned lexical search with exact match. So that's what I would recommend. And I guess you could also try some things like language detection. Yeah.
Max, what do you think? Yeah, I think the important lesson is that don't just throw away your existing search stack and replace it with vector search right away. It's another feature that you would use.
So when somebody, and it's important to take a step back and think of the problems that your users have, the information needs that your users have. So if somebody approaches your search bar and they search for something in quotes, or they're looking for an exact term, give them what they want. People have been trained on keyword search
since like the early 90s. So there's a lot of cultural stuff that's embedded in just searching for nouns and not providing any other language. So when that happens, don't throw out the ability to do the exact match.
And do additional things. Use diversity of search results, do some federation maybe, do some stuff to bring in other things. So you get both, the best of both worlds. You get the exact matching where people have very, very fine control over what they're retrieving. And then you also get that juicy semantic meaning
relationship from vector search and combine the two for a better experience. Fantastic. So while we're asking these pre-canned questions, do remember to submit your own questions using the questions tab on the right of this presentation
as you're watching us. And we'll ask our experts here. I've got a quick question here I'm gonna answer myself, actually. Somebody said they have experience of search and keyword search and building taxonomies, and asked how to find work in these fields. I will recommend Relevance Slack, which I'll drop a link to into the chat.
And there's a jobs channel there. So if you wanna do that, maybe go on to working in some of the more advanced fields like vector search, that's a good place to start. So our next submitted question, I'll have to read this one carefully because it's complicated. So we see some patterns that have emerged in the space of dense retrieval, both from the research side as well in the industry.
What are your thoughts on what's coming next in dense retrieval? Where are things heading, and what will people need to do to prepare? Dima, do you want to start us on this one? Oh yeah, for sure. Thanks Charlie. I think there is a lot of development going on in this area. So I would really recommend the BEIR paper
if you haven't read it yet. I'll try to share the link later, but it's an excellent benchmark, where they compare dense methods against re-ranking methods against lexical matching using TF-IDF or BM25. This paper establishes
the baseline for understanding what's going on in this area. And then there have been some really cool papers recently — for instance, training an embedding model at the byte level. So this helps you to solve some daunting issues
with misspellings and other related problems. And then another paper applies the Fourier transform to improve the speed of BERT, and it basically became seven times faster with like 92% accuracy. So I'd say the community is moving ahead
on solving these various issues with embeddings, because the big players actually do use them in production. For the client that I'm working with right now, we are using dense methods. I will not name it, but it basically gives really good results on DCG.
And another thing from the BEIR paper is that, you know, dense retrieval methods will not generalize well — they will beat BM25 only when the model was trained on the same domain as the queries.
And also, what's interesting and important to understand is that when you apply vector search in your domain, depending on the size of the documents, you need to pick the similarity metric very carefully, because, for instance, cosine similarity
will favor shorter documents while dot product will favor longer documents. So maybe you need some kind of combination of these metrics, or a dynamic selection of the metric, depending on the use case or the query intent.
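(A toy illustration of that metric choice, with made-up numbers: the longer document has a larger norm, so the raw dot product prefers it, while cosine similarity normalizes the length effect away.)

```python
import numpy as np

def dot(a, b):
    return float(np.dot(a, b))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query     = np.array([1.0, 1.0, 0.0])
short_doc = np.array([1.0, 1.0, 0.0])   # stand-in for a short, focused passage
long_doc  = np.array([3.0, 3.0, 2.0])   # stand-in for a long document with a bigger norm

print(dot(query, short_doc), dot(query, long_doc))        # 2.0 vs 6.0  -> dot product favors the long one
print(cosine(query, short_doc), cosine(query, long_doc))  # 1.0 vs ~0.9 -> cosine favors the short one
```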
Also, the performance of vector search at large scale is an unsolved issue. So you need to be looking at a bunch of model and configuration parameters that will work best for you. And, sort of my personal advice: pay less attention to the error margins reported by the big players,
because what works for them might not work for you. And I will try to share some papers with you later as well, so. Well, we're lucky to have Jo Kristian Bergum and Josh Devins in the channel, and they'll be our link buddies tonight. So thank you guys for posting links to the papers. So you've got more reading to do.
Max, what do you think? Where are things heading? Yeah, great, great question. So when this stuff first started showing up, like, you know, two, three years ago, it was mostly focused on retrieval and matching and ranking — for nearly replacing, maybe not replacing, but using approximate nearest neighbor
instead of BM25 for your matching and ranking signals. I think that what we've seen recently, and the things that you really should prepare yourselves for, are how this technology and these techniques with vector search are being used for the entire search experience.
So it's auto-complete, it's spelling, query rewriting, it's snippeting and highlighting, you know, question answering, recommendations, personalization, classification and enrichment —
all of these aspects that we think about in a full search experience. We see that Google and Bing are now using these day-to-day. You can do a web search on either of those engines, or any of the engines that use Bing, for example, and if you look at the page rendered, you can tell that it's using this technology.
And the way it always follows with this technology is you pay attention to what the people with the billions of dollars are doing, and sooner or later, it's gonna fold into the little fish, you know, the folks like us that are trying to do the day-to-day stuff.
So I recommend that you focus on the fundamentals of how these technologies work. Read the papers, and if you don't understand the papers or the math, that's fine. Get involved with the community, play around with the Hugging Face stuff, try out some Colab notebooks that people have published, just to get a feel of, you know,
how this technology works. And then apply it to your own problems and explore, you know, and tinker, and see what's possible, and see the problems that you run into. Just keep yourself fresh with experience, because that's, you know, that's how we learn, and that's how we go forward with all these new technologies.
Anytime something comes up, you just gotta keep playing with it. And the state of the art, the SOTA world, will just keep pushing forward, and the community will just keep pushing forward. Just, you know, follow what's going on, follow the community, and see what interests you, and learn where you feel you have gaps.
Thanks, Max. So we've got a question actually submitted online. I'm going to wedge it in here while we work through our pre-canned questions. So somebody asks, isn't the aforementioned pre-filtering counterintuitive, at least for e-commerce?
We cannot know beforehand for unregistered customers whether they want to filter by color or price range. Now, do we know what that refers to in our previous conversation? I might need a bit more color there, but I mean- I think it probably refers to our, I think when we were talking about doing pre-filtering,
maybe for language, I guess, but yeah, I'm not entirely sure. Yeah, I guess if I understand the question right, is that if we apply, let's say we choose between, let's say, price and size, right? So we have two filters in the search engine.
Now we have an option just to run the vector search on everything, and then basically do some post filtering or smart ranking and maybe show two groups of results by size and by color, whatever it is. But the way I see, the way I look at it
is that you will also be bound by the speed of light. So when you execute your vector search, you will face performance issues; there will be a bottleneck. If you look at my blog post — I will share the link as well, sorry, I don't have access to the chat right now — the thing is that it's quite expensive.
It's super expensive — it takes more than a second to run one single search. So you do want to pre-filter this space. Basically, you are entering a new, smaller space of your documents, and now you run your vector search, with some similarity metric,
in that local subspace of your documents. Is it a good experience or not? That's up to your UX, up to what you are delivering in the product.
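(A small sketch of the pre-filtering idea Dima describes, with made-up data: apply the structured filter first, then run the similarity scoring only over that local subspace of documents. A real engine would do this inside the index rather than with brute-force NumPy.)

```python
import numpy as np

num_docs, dim = 10_000, 64
doc_vectors = np.random.rand(num_docs, dim).astype(np.float32)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)
prices = np.random.uniform(5, 500, size=num_docs)        # a structured attribute to filter on

query_vector = np.random.rand(dim).astype(np.float32)
query_vector /= np.linalg.norm(query_vector)

# 1) pre-filter: restrict to the documents that match the structured criteria ...
candidates = np.where(prices < 50)[0]

# 2) ... then score only that subspace with the similarity metric (cosine here)
scores = doc_vectors[candidates] @ query_vector
top10 = candidates[np.argsort(-scores)[:10]]
```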
So maybe still offer tools to the user. If the user disagrees with the results, give them a hint: hey, we applied a method that we think is best, but here are the tools — if you disagree, go and filter it yourself, or maybe don't filter. So that would be my answer. Right, I hope that answers your question. I'm gonna skip on to one of our pre-submitted questions here, because I think that this might cover quite a few people's requests,
but content length is an interesting one, isn't it? Because obviously, we know in text matching, content length can affect the weighting of the various fields we use in our ranking formula. Is there a content length sweet spot where dense vectors have a clear advantage
over sparse vectors and plain TF-IDF? Max. No, there is and there isn't. So if you're getting embeddings for text, if you look at the model, you'll have some limitation on the number of tokens
that you can pass into the model in one step. And so that's an upfront limitation of the model and architecture that you choose. There are a lot of efforts now to remove that barrier to make things longer and longer. I would ask the question,
assuming that there were some, if there wasn't any limit, ask yourself the same question just with BM25. How would you do this? And I think in a lot of cases, it's important to not just generalize, but take a step back and look at the problem. What does it mean to have a relevant document
and a relevant piece of information? My colleague Bertrand, he asked a question. He had a client who was trying to index a document that was hundreds of megabytes. And you're wondering, what does that mean to have a relevant hit on a document
that's hundreds of megabytes in length? That could be, you can contain a lot of Wikipedia in that, there's so much knowledge. So there's this idea of, well, how do you carve up the text for what you want for your domain and your customer's needs?
Where do you draw the line? Are people looking for a specific answer? Are they looking for a passage? Are they looking for entire books or chapters? And that varies from need to need, even in the domain. You might have situations where you say, okay, I'm gonna give you the whole thing back
or I'm just gonna give you this one snippet. So in terms of the technology limitation that exists, but even then, understand how you're cutting stuff up and how you want to surface relevant data. And vector search there is kind of the afterthought.
It's like, okay, well, I have this similarity function and I'm just gonna apply it and get a score for the texts that I have. What do you think, Dima? Do you think there's a sweet spot on content length? I think if there is a sweet spot, it's definitely below 512 word pieces
because all neural approaches have this limit and maybe eventually this will be lifted. But at this point, if you read the paper that I mentioned, they actually mentioned this limitation there. But the question is, again, do you even need that much?
If you take a really long document like Max just explained, I think if it has a really diverse set of topics in it, if you have thousands of pages in that document, it's like a copy-paste from Wikipedia or something. Imagine clustering this.
Let's say you're using the graph method and in the graph method, you will have this really big hotspots and then this document will be connected to a bunch of other documents. And does it help your user? I'm not sure. So what I would do is probably try to dissect your document
in a number of meaningful blocks. So for instance, let's say you have a section which is about a specific topic or introduction or whatever the meat part of that document. And then you could, for instance, go and index those specific sections in a separate field
and then use BM25 as your baseline. You don't even need the vector search there. Then another approach is that you could summarize the document. Then the question is, if you have thousands of pages, can you actually summarize that document? I don't think so. You will probably need a number of summaries and then you could of course encode those as vectors
using whatever BERT model or whatever you would like. So I would say step back from thinking: is it vector search that's gonna solve all of my problems? Is it BM25? Is it some not-yet-invented method? And think about what is in your data.
Ask your domain experts to annotate those documents so you actually have those building blocks at hand, and you can go and experiment with different methods and have some sensible baseline. I think BM25 is a proven baseline at the moment, so play off of that. That, I think, would be my recommendation here.
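(A naive sketch of the "dissect the document into meaningful blocks" idea — here just splitting on blank lines and capping the word count, then encoding each passage separately. The sentence-transformers model named is only one common choice, and real section boundaries annotated by domain experts will beat a word-count splitter.)

```python
from sentence_transformers import SentenceTransformer

def split_into_passages(text: str, max_words: int = 200):
    """Very naive splitter: break on blank lines, then cap each block at max_words."""
    passages = []
    for block in text.split("\n\n"):
        words = block.split()
        for i in range(0, len(words), max_words):
            chunk = " ".join(words[i:i + max_words])
            if chunk:
                passages.append(chunk)
    return passages

model = SentenceTransformer("all-MiniLM-L6-v2")        # small model, 384-dim embeddings
passages = split_into_passages(open("big_document.txt").read())
passage_vectors = model.encode(passages)               # index one vector per passage, not per document
```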
Great, thank you. So I'm gonna just flip to a question from the audience. It's been highly voted, and somebody has asked: how does hit highlighting work in a vector search world? Max, do you wanna take that one? Oh, that's a great question.
So the way hit highlighting works is, so you have a large language model and it's pre-trained. You have a pre-trained model. I'll talk specifically about the question answering aspect of this because this is a form of hit highlighting for closed domain question answering.
There's open domain question answering but this is the technique that I'm using for question answering is you have a fine tuning task where you have the passage, the query and the thing that you want to highlight, right?
So those three things is what you need for training and test data to fine tune your model. So it will learn in the fine tuning task given this passage and given this query, what's the word or words
that I should present and highlight? And it doesn't make up text. It basically gives you positioning which works very similar to highlighting, right? So you fine tune this model on your task, on your data
and then you have the model and then when you are using this thing in production, you get your search results back from ANN or BM25 or whatever and you pass in the passages that come back in your results into this other model and then it returns the positioning for you
and then you can use highlighting there, or you could just call it out and you don't have to highlight it in place. You can just say, here's your answer, right? So that's pretty much how it works. I'm trying to remember — there's a specific data set that's available for this. It
starts with an S, but it's escaping me right now because I'm having a brain freeze while I'm trying to talk and answer questions. But when I remember, I'll chuck it into the chat in the breakout session. Great. Dima, what do you think on highlighting?
Yeah, I think it's kind of challenging because if you sort of like, if you're just entering this area, let's say you, as I gave you an example, very simple, right, you type mathematics and it gives you like linear algebra. Like, can you go and highlight linear algebra
having mathematics? No, so you need to have some way of knowing the distance, right? So like, okay, between mathematics and linear algebra, there is like the smallest distance and the model should tell you that. So like you could apply like a layer in the model, let's say, let's say attention layer
and then see, okay, which of these words correlate best with the query, right? And this is what I think Max alluded to as well. So let's say, document is returned and there are like a number of passages there that highly correlate with the query. So you can go and highlight them. But the question is, should you highlight the whole passage
or can you actually build a method that will actually highlight individual most prominent words that contribute to answering your question? This is what highlighters usually do, right? When you go to Google and you type something, it actually highlights you the actual words to pay attention to. And I think you can apply like an attention layer.
Again, I would need to go and double-check that, but this is the direction in which I would go as well. And I think there are some other methods. Somebody mentioned to me, like, decoding the vectors back to words and then trying to see the overlap, but I'm not sure if that isn't shooting from a cannon.
But I think an attention layer might work better. So, yeah. I did remember the name of the data set — it's SQuAD; that's the task. And one of our link buddies has already posted that in the chat. So thank you, Kevin.
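(A sketch of the closed-domain question-answering style of highlighting Max describes, using a publicly available SQuAD-fine-tuned model as a stand-in — not necessarily the model or data he uses. The key point is that the model returns character offsets into the passage, which is what makes highlighting possible.)

```python
from transformers import pipeline

# any extractive QA model fine-tuned on SQuAD-style data returns start/end offsets
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

passage = ("HNSW builds a layered proximity graph over the vectors and greedily "
           "navigates it at query time to find approximate nearest neighbors.")
result = qa(question="What does HNSW build?", context=passage)

start, end = result["start"], result["end"]
print(passage[:start] + "<em>" + passage[start:end] + "</em>" + passage[end:])
```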
So let's have a look at one of our set questions again. And this is an interesting one: what are good and bad use cases for vector search? Dima, do you wanna start us with that one? Sure.
This is actually an interesting question because I was thinking like in principle, you can represent any object as a vector. The question is, do you have a good model to do that? You know, you could in principle take, well, it's known that you can take image, you can take sound, text,
maybe even like a virus signature, right? If you're like into antivirus world. Then I still think that it's important to pick the right similarity metric because for different objects,
they will have a different structure in the vector — let's say it's a dense vector, or you have a SIFT descriptor; they will have different characteristics. And so again, if you read the BEIR paper, you will see that different models will generalize or not generalize. So you'll have to retrain it for that specific object.
And so I think in general, you know, vector search has like a really wide applicability area. And probably that's why we see so many new interesting startups and open source projects in this field. But when it comes to like implementing sort of coming to the bad side,
like if you throw like some dense vector approach to all of your queries, probably you may end up in a situation that users will be like scratching their head and thinking what's going on here? Like I'm looking for this specific thing and it's telling me about some similar thing that I'm not interested in.
I'm interested in that specific thing. And so this is again a great point to step back and think about establishing the baseline for your search engine. I happen to be a committer on Quepid — it's a great tool, open source — use it
or use some other tool to establish the baseline. You know, I'm doing it with my team currently for a number of languages. And you will learn a ton from establishing the baseline. Trust me. Like ranging from, hey, you have some problems in the formatting of the document,
you know, to source authority or, you know, freshness of some of the documents and so on. So work with your domain experts there. And then consider any of the, you know, ranking methods, like even LTR learning to rank
as a black box. Okay, I have the baseline. Now I can go and apply, you know, different methods one by one and see which one wins. And what another team is doing in our company is that they systematically train dense retrieval methods, tune some parameters, and they have a leaderboard of all of those with respect to a specific score,
like DCG and NDCG. And then they can also compute the same metrics on their rivals. And then that's like a sweet spot where you wanna be. Max, what do you think on this one? Good or bad use cases of vector search?
Yeah, so I'll tell you the biggest leap that I think most people are gonna have to make into vector search mentally is that, you know, if you come from an inverted index, you're very used to a certain way of thinking where you have your index of a set number of terms
and maybe you have some synonyms, and then you basically do a match and a lookup in that index directly. And then you use BM25 for similarity scoring. With approximate nearest neighbor searching and vector searching, it's kind of like this leap
into there's this one step that you do for both. It's like you get stuff back that matches and then there's a score behind it all at once, right? So that's, I think that if you get over that first hump
of switching your brain into that, I think a lot of the use cases, good or bad, may come naturally when you're thinking about how to apply this into your domain. And I can't, unfortunately, I can't really tell your brain how to make that switch. It comes with a lot of playing around and learning and figuring it out.
But I'll give you some do's and don'ts instead of good and bad use cases because I think there's a whole bunch of stuff that you can try and do, and it may end up successful. It's really hard to answer that generally. So I'll say some don'ts is when you have vectors that come from multiple passages,
don't try to just average them together to try to increase performance or reduce the space costs that you have. That's not gonna work. Don't try to do that stuff. Don't use a pre-trained model without first understanding what it was trained on and the vocabulary that it contains,
and then also its limitations compared to your domain. Don't just pick a random pre-trained model and be like, oh, this is what they used in this notebook. I'm gonna use this one, right? Understand the model that you're choosing and using, and then fine tune it. And don't just use vector similarity
as your only feature for ranking. You know that you have a lot of stuff that you can use. We talk about search when you're surfacing results. You don't just use BM25. You use BM25 and you have function scores for like, oh, the recency of the date, you know, the rating on the product,
and all kinds of other features that combine to make up total relevance for a document when somebody's searching for an information need. So those are some don'ts. Some dos: do split up your documents into good passages, in the size that fits your architecture, your domain, your use cases.
And then, you know, you can investigate. Instead of averaging, you can look at things like distillation, summarization, PCA, other techniques, quantization to try to get the performance there. Because you can't just dump raw vectors right now into your index for entire documents.
You just, your compute and your disc will hate you. And your RAM will hate you for that. So you will have to find a way to get there. Learn and fine tune the models. I think this is the most, probably the most important thing when you are solving a problem for your domain.
The use cases for your product will require you to figure out exactly what your starting point will be, and how you tune it for the task at hand. So you may have a bunch of use cases that you wanna use this technology for.
I mentioned some before, like autocomplete, spelling, query rewriting, right? You can do a whole bunch of things. So understand that these may all require different models and different architectures for your need, and different fine-tuning tasks. And yeah, use this as an additional feature
and as an additive thing for your experience and your ranking and your retrieval. It's gonna be additive. It's not gonna be like, I'm gonna replace the whole thing right now with vector search. It's another extremely powerful feature for search. And use what you know and have learned already
and combine it and play around with it and see how it folds naturally into your stack and into your domain. Thanks, Max. So I'm just gonna jump to one of the questions submitted by the audience here. We have a question, what about active learning to improve vector search?
Meaning tagging the user inputs and then updating or filtering the vectors? What do you think, Dima? Yeah, it's a great question. I think you can do that. So, if you get into dense retrieval methods — I actually don't remember the method name,
but everything is in the paper. So one method is that you can combine the document with the queries that naturally match this document. And so you will bubble up the prominence of that document during search.
And so in principle, I'm thinking, why not? You can apply an active learning approach here where you will identify the relevant queries for a specific document. You will need to build some system for that, I guess. You could use Cupid, I guess. And so you would go and update the document vector
with those new queries. And so next time around users are searching with the queries like this, you will have those documents bubbling up to the top. So in principle, yes, I think yes, and you should. Actually in general, I like the idea
of seeing your search engine as an evolving model, as an evolving organism, where you should think creatively, like what else can we do to actually establish this pipeline of ideas and improvements? Because I've seen cases when,
let's say we apply a specific vector space model, like in principle, BM25, right? And then we just stop looking forward. We just think, okay, everything is in the data, but it's not true.
You will notice that in the production, you will see decline in CTR, or you will see decline in what we call exposure, which is probably not like a common metric use, but it's basically how often do we show the results from our search engine. So if you see any decline there, take those queries and throw them into Cupid
or some other system where you can investigate them with magnifying glass, and then looking at all the ways you can encode additional signal from those queries and documents into your model. I hope that answers the question, but yeah.
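(One simple way to implement "update the document vector with those new queries", sketched with NumPy; the blend weight and the plain averaging are assumptions, and a real pipeline would gate this on judgments rather than raw clicks, as the bias discussion below points out.)

```python
import numpy as np

def refresh_doc_vector(doc_vector, engaged_query_vectors, weight=0.2):
    """Blend a document embedding with the centroid of queries that led to
    engagement with it, so similar future queries rank the document higher."""
    if len(engaged_query_vectors) == 0:
        return doc_vector
    query_centroid = np.mean(engaged_query_vectors, axis=0)
    updated = (1 - weight) * doc_vector + weight * query_centroid
    return updated / np.linalg.norm(updated)
```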
What do you think, Max? I totally agree. I just have one very important point that I think you need to consider when doing active learning: you need to be very careful about bias. When a result is presented to you and you judge
whether it's good or bad, and then you want to update the model based on your reaction, it's very easy to just go through it very quickly and forget that you are a subjective and biased individual. Everybody is, in their own way. And if you are taking even crowdsourced data from this
or your product to do active learning, there's gonna be bias there also. So there are teams that are really good with this who are used to dealing with this with learning to rank data sets. But if you're new to this and you're used to kind of capturing judgments
at a small scale, try to remember that you probably want to get more than one opinion on things, and you wanna understand consensus and disagreement and have discussions about them, instead of just you as one person,
like update model, update model, update model, because that's just gonna, you're gonna overfit to your own desires and wishes. And that may drift from what your users actually want. Maybe I also wanted to add, and this is like a topic dear to my heart, try to release more frequently.
Because if you spend a bunch of time thinking about, oh, I have this cool idea, I just need another couple of weeks to polish it. By the time you release it, the season might be away and you just will not nail it. And we see some interesting use case in our search engine right now, when CTR all of a sudden went down for a specific language,
but we've done no release. And we are like just figuring out what's going on here. So when you start from that angle, like and go backwards to your model, like there is like a long path there and you can generate a bunch of ideas what you can try.
Great, thank you. Okay, so we've got a doozy of a question coming up next, submitted by the audience, and I'm wondering who it's from. So I'll read this out, it's a long one. In order to make vector search performant, an approximate nearest neighbor approach is typically applied — so Lucene has HNSW, so it's faster.
Given that these ANN techniques essentially partition the vector space into smaller neighborhoods, it seems natural to shard large indexes by ANN neighborhood, so that queries could be routed to specific neighborhoods on specific nodes. And then, obviously for performance, neighborhoods with more queries
could have more replicas on more nodes. This would prevent the entire index from needing to be searched on every query. Could this be useful for enormous indexes, such as for an open web search engine? And what are your thoughts on how feasible this would be to implement with Solr, Vespa, Elastic, et cetera, using HNSW or similar algorithms? So that's a bit of a mouthful.
Who wants to take that one first? Maybe I'll take a stab at it. So I recently reread one HNSW paper last week and I, because I really want to understand
how this thing works. And it's one of those things where it's casually offhand mentioned in the paper of, well, this thing is you can shard this. I don't think using those exact words, but due to the nature of navigable small world and neighborhoods, you can effectively shard and split this out and then it'll work.
Like, that's how it's kind of presented in the paper. But there's no implementation detail, of course. So I think that you definitely are gonna have to do this. It is possible. I don't know the techniques to do it, and I don't know how Vespa works. I don't know how,
'cause Lucene has HNSW, but Solr and Elasticsearch are going to be responsible for the sharding of the Lucene indices. So I think that that's probably a huge barrier to getting this stuff into Solr and Elastic. I don't know if Josh is working on this right now for Elastic, Josh and his team,
or I think that Jo Kristian might have some comments here about how it works in Vespa, but it should be possible. And I think it's gonna be absolutely necessary. And I'm hoping that — I'm just kind of sitting around waiting for Lucene 9 to ship into Elastic and Solr
and all this stuff to magically appear. I know that's not gonna happen. It requires a lot of hard work from a lot of hardworking people. But I think this is absolutely necessary. Okay. What do you think, Dima? You're working on a big search engine. Yes. Actually think about TF-IDF and BM25.
When you have, let's say, 100 shards and your query comes in, you have some sort of router component which will forward it to the independent shards. Each will search for the documents, score them, and then return you, let's say, the top N from each shard. Are these scores comparable?
I have a strong conviction that they are not. Why? Because every shard will have its own term-level statistics and document lengths, which are local to that shard. And there are some solutions — for instance, you can build a global IDF, right?
So an IDF which will be globally updatable, like a distributed cache, whatever. And you will probably run into some race conditions there, I'm pretty sure. But there are ways to attack it, right?
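(A quick illustration of why per-shard scores aren't directly comparable, using a BM25-style IDF with made-up counts: the same term can look common on one shard and rare on another, so identical matches get very different scores until global statistics are used.)

```python
import math

def bm25_idf(num_docs: int, doc_freq: int) -> float:
    # Lucene-style BM25 IDF
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

print(bm25_idf(num_docs=1_000, doc_freq=900))   # shard A: term is common there  -> ~0.11
print(bm25_idf(num_docs=1_000, doc_freq=5))     # shard B: term is rare there    -> ~5.2
print(bm25_idf(num_docs=2_000, doc_freq=905))   # global statistics across both  -> ~0.79
```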
So I think exactly the same problem exists in our traditional, lovely BM25 search. Now, if you enter graph search, or clustered search — if you pick that paper, the hierarchical navigable small world graph, it's such a mouthful that I keep reminding myself
what the order of the letters is. But that method actually explicitly states in the paper that it's not compatible with distributed search. And they do mention, you know, this famous mathematician from the 19th century: the proof is obvious, and then they die,
and then like 100 years later somebody tries to prove the theorem, and then they, you know, almost die as well. And then they actually say that the previous incarnation of that algorithm, when you remove the H — so it's not hierarchical, it's just a navigable small world graph —
is a perfect match for a distributed search engine. That's what they say. But again, you need to go and check it for yourself. I don't think that you should easily trust everything that's written in papers, you know — go and try it for yourself. Yeah, there were a lot of typos in that paper, so that should be a tip-off also.
So what I'm taking away from this is people in the street don't always trust people in academia. And also that some of these areas of maths are effectively cursed, and you should navigate them with a great deal of caution, because it seems that people die if you get too close. I'm a bit worried by this.
Cursed mathematics, I love that idea. Okay, well, we've got another few minutes. We're actually going to run on a little later than our published end date, end time today, because we're the last session of the day. So we're not going to be jumping into the breakout room. You're welcome to stay with us. We'd love you to stay with us. We're gonna try and get through a few more questions, maybe for an extra 10 minutes at the end of this session
but I must just for completeness mention that at the same time, we have the workshop on digital and ethics running in the machine house. And of course there's a spatial lounge for socializing outside the sessions. But if you're gonna stay with us, we've got some more questions to get through. So let's see what else we have.
Let's have a look. Well, there's a quick one here, and maybe we can answer it quite quickly: how could GPUs be leveraged to improve the vector scoring calculations? And I'm gonna ask that to Dima, because I believe you looked at this — it's the GSI application you looked at that's using a GPU, isn't it? Yeah, so GSI have built their own custom APU board.
So it's an associative processing unit — it's not a CPU, it's not a GPU, it's their own custom implementation. And it's particularly friendly with matrix multiplication and whatnot.
So basically, with this method, I think they do a lot of the work there in the RAM. So they use a lot of RAM, and I'm not sure exactly how this board is structured, but basically they can even ship that board to you and you can try it.
But basically that was the fastest method that I have seen and I benchmarked. You know, if you look into my blog post, I think the scale of difference was like 70 milliseconds versus like 1.5 seconds for the vanilla Elasticsearch vector search. So that's like a huge, huge difference.
But then again, in order to use that method, I had to, and I'm sure the team is going to iterate on this, but I had to prepare like a Numpy array that I would, and it took me like four days to embed 1 million documents into that space
and then ship that array. And so they uploaded that. It didn't take too long to index and then it was super, super fast. But then again, this is kind of like a hybrid approach to my sense because you can run it on premise, you'll have to pay, right? But then can you actually emulate something like this
without an APU? And then for the GPUs, I think the BEIR paper as well mentioned something, if I remember correctly, that you can actually use GPUs to speed up your search engine for the vectors. Yeah. It's funny that this question was asked
because I haven't tried it yet, but two days ago, I stumbled across a repo in GitHub called CUHNSW, which has been, you know, it looks like it's a couple months of coding that claims to use CUDA with HNSW. But again, I haven't installed it or played with it,
but it's something to look into if you want to just tinker and you have an Nvidia card that's CUDA capable. Well, they'll drop their link into the chat if you have it. I don't have access to the chat. I'll drop it into our chat, Charlie, and then you can relay it into a-
There we go. Through the magic of copy and paste, I shall put it in for everyone. No affiliation. I just found this thing and I started to say, I'm gonna look at this later, which we all know with GitHub stars, that never happens. But yeah, if you're really interested, you can go try to install it and see if you can get stuff indexed in search.
Okay, so we've got a quick question here I'm going to ask about... actually, I'm not sure this is vector search related. Somebody has asked about seasonality and fine tuning. How can we infer when to update? Maybe a small decrease in click metrics would be enough, or is there a
better way for this kind of problem? Dima, I'm going to ask, does that apply to vector search? Is that referring back to something earlier? Actually, I think it applies to embedding at large. I will try to give you that paper, but I don't remember from the top of my head.
Basically, the paper was dealing with seasonality change by, you know, basically you can compute like a stable date range with which you can tag the terms and then you embed them. And so the embedding will also have the term as well as the date range, right? And then, you know, when you search, for instance, you can also account for like, okay, what season I am in, you can
definitely know that, which date range you fall in, and then you attach that to your term. It almost sounds like payload based search, if you know what I mean, right? So let's say you have a term and then you can add some characteristics to it. And then during the search, you can pre-filter or like filter the terms that fall into the specific
category set, let's say part of speech tag or something else that you might encode there. So something similar. So that paper is really interesting. I don't know if they have a practical implementation. I think the code is on GitHub. So if you're interested, I will try to
to find that paper and post it as well. Yeah. Okay. So let's go on to our next question from the audience here. So we talked a bit about ANN coming into Lucene 9, and this is an
interesting question. Will this be an easy to use feature when it's exposed in Elasticsearch and Solr? I think it'll be easy to use for people once they can access it that way. Max, what do you think? I honestly think, you know, both of the Elastic team and the Solr community
write amazing software, right? And I think that it will definitely be usable and it should be straightforward to use because at the heart of it is stuff from a user perspective isn't
really that complicated. You know, you can go and you can install NMSlib right now and index stuff in Python and like three lines of code and then you can query it in like a couple lines of code. I think the hard part really is getting the vectors and understanding what you're matching on. So I think that once this is available in Solr and Elastic,
I imagine that it'll be pretty straightforward. It'll probably just be like, you know, you're going to have a dense vector field and you're going to specify the analyzer and the similarity function and you'll be able to query it. I don't see it being much more difficult than that. I think the hard part is the hard stuff certainly rests on the shoulders of the implementation
teams at Elasticsearch and the Solr community who have to think about all the crazy stuff like sharding and performance and memory and heap and all that crazy stuff that, you know, users of the tools don't necessarily have to worry about right away, but that's something that
is going to also fold into your operations and production of like, well, how much memory do I give the JVM if I'm using dense vector search? How should I, what's my sharding strategy going to look like? Is this going to impact, you know, my high availability and disaster recovery
strategies? You know, is it going to be, is it going to make my index huge and my memory really big? So, you know, I'm going to have to really not worry about my budget. I think those are probably probably the questions that are going to come into play and we'll see when we can actually benchmark when this technology is available to us in these engines and we can start
indexing stuff and seeing how to use it. What do you think, Dima? I mean, you've tried these things in your blog post series. Do you think it's going to be easy to use in a practical sense? I like the implementation when it stays inside the JVM, if you're on the JVM,
because if you go off heap, what will happen is that it's so hard to measure like how much memory you should give it and usually these algorithms are super greedy. Actually, for your information, HNSW algorithm is very greedy on RAM. You do have some hyper
parameters that you can tune and kind of lower the RAM consumption at the expense of like the quality of the index. So that's the indexing trade-off and then you have the search trade-off where you can also alter some hyper parameters there. But having said
that, I still like the idea of, let's say, if I'm on JVM, give me every tool that's on JVM. I don't want to go off heap, even though it sounds sexy to go off heap, but I don't think it's super practical. And again, maybe I will be proven wrong in some time, but for now, I would choose this approach versus, let's say, open distro, which offers you off heap implementation of
HNSW, because I've run into a number of issues. I don't want to say that I'm dissatisfied with Open Distro — Open Distro is a nice, great way of solving a bunch of issues and also scaling your system. And also, Elasticsearch, the vanilla one, doesn't have
any ANN algorithm implemented yet. But again — well, maybe it's just tough luck — I wasn't able to index one million vectors with Open Distro. It just crashed on me really, really badly, and I spent multiple days figuring out what was going on.
And maybe I just need to give it a really large machine and then just throw money at the problem, which I don't want to do. So another thing, the practical perspective, and I think Max mentioned that when you will index, when you compute embeddings, don't
choose high dimensionality because it's so appealing to choose 768 dimensions vanilla, uncased BERT model and hope for the best. The problem is that the index size will be super huge. If you look at the BER paper and you compare the BM25 to one of the
dense vectors, dense model, the difference was, yeah, it's here. I made a note. The difference is that the Colbert model is like 900 gigabytes versus 18 gigabytes for BM25.
That's like huge difference, like in terms of cost, in terms of memory, in terms of retrieval. And remember Lucene, it tries its best to cache the fields, but will it be working okay for super large segments and super large dictionaries? Probably not. So be careful there.
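(Back-of-the-envelope arithmetic for that footprint point — raw float32 vectors only, ignoring graph and quantization overhead, and not the exact configuration from the BEIR paper.)

```python
def raw_vector_gb(num_items: int, dim: int, bytes_per_value: int = 4) -> float:
    """Storage for raw float32 vectors, before any ANN graph or quantization overhead."""
    return num_items * dim * bytes_per_value / 1e9

print(raw_vector_gb(1_000_000, 768))         # ~3.1 GB for one 768-dim vector per document
# ColBERT-style models keep a vector per token, so ~200 tokens per document multiplies that:
print(raw_vector_gb(1_000_000 * 200, 128))   # ~102 GB even with 128-dim token vectors
```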
Yes, yes. Well, you can always solve problems with more memory, obviously. So we're coming to the end of our time slot here, and I want to just round us off here.
I'm afraid we haven't got to everyone's questions in the chat here. We didn't get to everyone's questions from our pre-submitted list, but I do want to thank everybody who submitted a question and hope you've got yours answered. Secondly, huge thanks to both Max and Dima for
working so hard on this. We've done quite a lot of work ahead of time to make sure we gave you some really quality content here. So thank you both.