
Neural Search Comes to Apache Solr: Approximate Nearest Neighbor, BERT & more


Formal Metadata

Title
Neural Search Comes to Apache Solr: Approximate Nearest Neighbor, BERT & more
Title of Series
Number of Parts
56
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
The first integrations of machine learning techniques with search made it possible to improve the ranking of your search results (Learning To Rank) - but one limitation has always been that documents had to contain the keywords that the user typed in the search box in order to be retrieved. For example, the query “tiger” won’t retrieve documents containing only the terms “panthera tigris”. This is called the vocabulary mismatch problem, and over the years it has been mitigated through query and document expansion approaches. Neural search is an Artificial Intelligence technique that allows a search engine to reach those documents that are semantically similar to the user’s query without necessarily containing those terms; it avoids the need for long lists of synonyms by automatically learning the similarity of terms and sentences in your collection through the use of deep neural networks and numerical vector representations. This talk explores the first official Apache Solr contribution on this topic, available from Apache Solr 9.0. During the talk we will give an overview of neural search (Don’t worry - we will keep it simple!): we will describe vector representations for queries and documents, and how Approximate K-Nearest Neighbor (KNN) vector search works. We will show how neural search can be used along with deep learning techniques (e.g., BERT) or directly on vector data, and how we implemented this feature in Apache Solr, giving usage examples! Join us as we explore this exciting new Apache Solr feature and learn how you can leverage it to improve your search experience!
Transcript: English(auto-generated)
So, today I'm going to present the story of an Apache Solr contribution. As you can see from the title, there are a lot of buzzwords. I mean, that's in line with the name of the conference, so I was like, okay, I should put as many as possible, right? It's going
to make it more appealing. So, before we start, a quick introduction about myself. So, my name is Alessandro Benedetti. I'm originally from Tarquinia, Italy, an ancient city, pre-Roman actually, so it's an Etruscan city, and I am an R&D software
engineer. In my spare time, I'm the director of my company. I mean, I used to say that, because I actually love engineering a lot, and directing a company implies a lot of other things, which sometimes are a little bit more boring, but I really like
the R&D side of my job. I have a master's degree in computer science from the University of Rome, and I am a member of the program committee of the European Conference on Information Retrieval and of the Special Interest Group on Information Retrieval conference, which are academic conferences, but they normally also have an industry day, so I enjoy taking part in peer reviewing and especially in the reproducibility track. I've been working for a long time with Lucene and Solr. I'm a Lucene and Solr committer, and I am a PMC member of Solr, and I've also been working with Elasticsearch a lot. So, my passion is around integrating artificial intelligence and machine learning technologies with information retrieval. In my spare time, I also do beach volleyball and snowboarding, not in London where I live, but in Italy when I go back for the summer or some short winter
period. A short introduction about my company, Sease. I founded Sease in late 2016, and we are information retrieval specialists. So, the mission of my company is to reduce the gap between academic research in information retrieval and real-world industry applications,
and we decided to do that through open source software. So, that's the reason we are really passionate about open source software, and we are not only using, consulting on, and training about open source software, but we are actively contributing back. So, we are contributing back with ideas, code, and support, and not only through official code contributions, but also with internal projects that we share on our blog and through the mailing lists. So, we try to help as much as possible because we really
love the scientific approach behind information retrieval, and we want to give a hand to the scientific community. We want to improve information retrieval in general. So, of course, we need money from our clients to go ahead, but we are really happy to give it back.
And some of the trends we are working on now are listed in the slide. I won't repeat them, but we are really passionate about integrating machine learning with search, and this is part of our talk today. So, an overview about what I'm going to talk about. So, first of all, we're going to describe some of the problems with lexical search, effectively term-based search. You know, I'm going to use semantic search as a sort of holy grail of information retrieval, right? Being able to always recognize the meaning behind a user's information need and return relevant results. Currently, most search solutions use lexical search, so matching on terms, and there are some problems with it. Then we are going to describe how neural vector-based search works, how it aims to solve those problems, and the Apache Solr implementation. Then a little bit of a description about BERT and how you can integrate large language models with Solr to obtain an end-to-end neural search implementation. I'm going to wrap it up with some
future works to describe our current projects, what's currently in line for us to develop, and what we are going to release and contribute soon. So, the first problem I want to describe
is the vocabulary mismatch problem. This affects lexical search in general, so you may see an example of the vocabulary mismatch problem with false positives. So, the user information need in this example is to find out the population of the city of Rome. We see that one relevant document
is returned by the search engine. Rome's population is 4.3 million, so that's fine, it's great, it's a good result. It has just one query term, so from a lexical perspective
it doesn't sound that great. For this reason, a lexical search engine would actually return as a better result a document containing "hundreds of people queuing for live music in Rome". So, this document has nothing to do with Rome's population, I mean it's not very relevant, but it shares three query terms with the query. So, from a lexical perspective it contains a lot of information which is in line with the user information need. So, that's a false positive caused by the fact that the dictionary used by the user doesn't align with the
dictionary in the corpus. Another example is the false negative kind of results that you can get with a vocabulary mismatch problem. So, 2022, the year of the tiger, so this example is in line
with the Chinese calendar, and the query, the user information need is about the size of the tiger, of the big cat called tiger. So, how big is a tiger? We can get lexically a document, I mean, assuming we applied a little bit of stemming, a nice candidate would be the
tiger is the biggest member of the Felidae family. Now, unfortunately, a result that potentially is not returned would be: Panthera tigris can reach 390 centimeters nose to tail. We can see here that instead of "tiger" the scientific Latin name for the animal is used, and even if it's an interesting and absolutely relevant result, a lexical search engine, assuming it doesn't use any advanced synonym matching algorithm, is not going to return it. In general, semantic
similarity has a problem with the vocabulary. So, you may have user information needs that are extremely similar from a lexical perspective, such as "how are you" or "how old are you". So, clearly they share a lot of query terms, but the meaning is completely different. On the other hand, we can have user information needs that don't share any term at all but have a very similar meaning, such as "how old are you" and "what is your age". So, neural search aims to solve this
problem. We need to take a little step back to describe the vector representations that are used by lexical search and by dense retrieval. So, sparse modeling, which is used by bag-of-words approaches, implies that the dimension of the vector corresponds to the size of the term dictionary. So, each term in the corpus of information corresponds to one dimension in the vector. So, with this structure, we end up having, for any given document, a vector that is mostly zeros, because most of the terms in the dictionary are not in a certain document. Then some of the values, so the terms that appear in the document, will be different from zero. So, it can be just one when the term is present, or potentially we may encode the term frequencies or any kind of term scoring in this representation. On the other hand, with a dense representation, we have a fixed number of dimensions, and normally this is much lower than in sparse representations, so the vectors are shorter. And for any given document, we have a vector that is mostly non-zeros.
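To make the contrast concrete, here is a tiny illustrative sketch (the values are made up; a real dense vector would come from an encoder):

```python
# Sparse bag-of-words representation: one dimension per term in the dictionary, mostly zeros.
vocabulary = ["age", "big", "how", "is", "old", "tiger", "you"]   # toy 7-term dictionary
sparse_doc = [0, 1, 1, 1, 0, 1, 0]   # "how big is tiger" -> 1 where the term occurs

# Dense representation: fixed, much smaller dimensionality, mostly non-zero values,
# produced by a deep neural encoder (these numbers are purely illustrative).
dense_doc = [0.12, -0.48, 0.33, 0.91]
```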
But how can you generate this vector? How can you encode the text and its information into vectors? With the neural search paradigm, we are going to use deep neural networks to encode the text into a vector representation, store the vectors in data structures at indexing time, and then query them. Specifically, we are going to call the element that has the responsibility of doing this encoding a transformer. We're going to see later on how large language models and transformers work at a high level and how you can integrate them with Solr. But for the sake of our workflow description, you can imagine that this component takes text in input and is able to return an output vector. Then, at indexing time, we take the vectors, one for each document, and we build some sort of data structure, and we're going to see the different options we have. We build this data structure at indexing time and then, at query time, we run a search on a vector representation of the query to find the closest vectors to the query. And the similarity between the query and a document, so the similarity score, is effectively translated into a distance in the vector space. So when I say closer, I effectively mean more similar. There are various ways of calculating the distance between vectors.
We are not going to describe the math behind it that much. Just so you know, which kind of distance to use depends on your use case, so the classic recommendation is to experiment. Just as a general idea, the cosine similarity is a distance measure that takes into account the angle between vectors, and this is pretty much a solid option for information retrieval use cases, but you should experiment. Anyway, there are various options supported by Solr regarding that. So how do you query for vectors? You want to find the top k nearest vectors, so nearest neighbors, as they are called, and the acronym you have potentially seen is KNN. What you do is effectively start from your query vector and retrieve what's closer; the closer, the better, the higher the semantic similarity. But running exact nearest neighbor search is expensive. If you take just your query and the vectors from each of your documents and you calculate the distance, and you may have millions of vectors, this takes time and computational resources. So unless you have a small corpus of information, it's probably not the right idea to go with exact nearest neighbor. So researchers ended up finding different approximate solutions.
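As a rough illustration of why exact search gets expensive, here is a minimal brute-force KNN sketch (using cosine similarity; it assumes all document vectors fit in memory as NumPy arrays, which is exactly what does not scale):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_knn(query_vec, doc_vectors, k=10):
    # Exact (brute-force) KNN: score the query against every document vector, O(n_docs * dims).
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# toy usage
docs = {"doc1": np.array([0.1, 0.9]), "doc2": np.array([0.8, 0.2])}
print(exact_knn(np.array([0.2, 0.8]), docs, k=1))   # -> [('doc1', ...)]
```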
So you can lose accuracy but gain a lot from the performance perspective. Normally, going the approximate way means you lose a little bit of information: you are potentially compressing your vectors, you are pre-processing your data and building some data structures that you then reuse at query time. Just to give you some context, there are mainly three families of solutions for approximate nearest neighbor. Tree-based, where you effectively build a partitioning of your vector space and then at query time you just navigate parts of the vector space to find your closest vector or closest vectors. Hashing, where you reduce the dimensionality of your vectors, hopefully avoiding losing information and keeping the differences between vectors, and then group similar objects, so using clustering approaches. Finally, graph-based approaches, which is the one used by Lucene and then in Solr, and specifically we're going to talk about HNSW; first of all, this acronym stands for Hierarchical Navigable Small World graphs. It's one of the top performing solutions among the index-time data structures that you can use for approximate nearest neighbor, and there are a couple of references to the original papers that developed the idea. The latest one is from 2018, but of course the
research team has been working on this even with some additional developments later than that. So what is a hierarchical navigable small world graph? It's a proximity graph, so it models vectors and distances between vectors; specifically, each of the vertices in the graph is a vector, and closer vectors are linked together. Now, why hierarchical? The approach behind the hierarchical navigable small world graph takes inspiration from skip lists. So what you do is model different layers: the higher the layer, the longer the links, so the longer the edges between the nodes, and this is for fast retrieval; if you go down in layers, what you get are shorter edges, or shorter links, for accuracy, so you are able to refine the distances and effectively refine the neighbors. So what you do is go layer by layer with a greedy search looking for the local minimum that hopefully is the global minimum, and the more you go down, the more you refine the minimum. The degree of the vertices is something that you decide when building the graph: effectively, the higher the degree, the lower the probability of hitting a local minimum, because it means that the graph is more connected and you are more likely to end up finding the right neighbors.
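Just to give an intuition of the layer-by-layer greedy descent, here is a heavily simplified toy sketch (this is not the Lucene implementation: real HNSW keeps a beam of candidates rather than a single greedy walker, and builds the layers probabilistically):

```python
import heapq, math

def greedy_search_layer(layer, vectors, entry, query):
    # Greedy walk within one layer: move to the closest neighbour until no improvement.
    current = entry
    improved = True
    while improved:
        improved = False
        for neighbour in layer.get(current, []):
            if math.dist(vectors[neighbour], query) < math.dist(vectors[current], query):
                current, improved = neighbour, True
    return current

def hnsw_search(layers, vectors, entry, query, k):
    # layers[0] is the bottom (densest) layer; higher layers are sparser, with longer edges.
    for layer in reversed(layers[1:]):                 # descend, refining the entry point
        entry = greedy_search_layer(layer, vectors, entry, query)
    candidates = {entry, *layers[0].get(entry, [])}    # neighbourhood on the bottom layer
    return heapq.nsmallest(k, candidates, key=lambda n: math.dist(vectors[n], query))
```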
So how is this implemented in Solr? First of all, the Apache Lucene implementation. Originally, before the end of 2020, vector-based search was doable in Lucene using data structures that were not meant for that functionality. This means that it was not working in an optimal way, but it was possible to achieve vector-based search. In November 2020, for Apache Lucene 9.0, a dedicated codec with a dedicated file format was contributed for navigable small world graphs. That's the first milestone to enable vector-based search and consequently neural search. Then, over the last couple of years, more or less, we got many contributions in that space in Lucene: handling of document deletions, the introduction of the hierarchy in the Navigable Small World Graph implementation, and improvements from the perspective of performance and memory utilization. In March 2022, pre-filtering was also contributed to Lucene, to give the possibility of reducing the scope of your search before looking for neighbors.
Many more issues are actually related to this topic. I tagged them, and you can run a JIRA query on the Lucene and Solr projects from the Apache Software Foundation to find out about all the different contributions if you're curious. So in Lucene, there is a defined set of similarity functions that you can use: the Euclidean distance, the cosine similarity, and, in case you have normalized vectors, so vectors of magnitude one, you can use the dot product, which is effectively an optimization of the cosine distance. So if you normalize before, you don't need to normalize when building the graph and you can use the dot product directly. They're pretty much similar; just keep in mind that depending on the kind of vectors you have, you can use one distance or another. How can you index vectors in Lucene? And
this is actually what is used in the Apache Solr implementation: a dedicated field type called KNN vector field. A KNN vector field takes as input an array of float values, and, as simple as it is, you just add it to your Lucene document and then you push it through your indexing chain; in the codec, the writer and the graph builder, the HNSW graph is built at indexing time. So far so cool, and what about query time? Also at query time, Lucene exposes a nice interface that is used in Apache Solr to model a KNN query. This query takes as input the field, so the KNN vector field you want to run your search in, the query vector, so the target of your search, a simple array of floats, the top K, so the amount of neighbors you want to return, and potentially a pre-filter, so another query, as complex as you like, to reduce the scope of your search before you look for neighbors. So the Apache Solr implementation uses these libraries from Lucene and an additional way
of encoding the stored content, effectively using classic standard float value storage, to make it very transparent and easy to use. It's been released with Apache Solr 9.0 last May, so it's pretty recent. I've been working on that since the beginning of the year, and thanks to the effort of the community, we've been able to release it last month. So also in the case of Solr, you can take a look at the JIRA link in the slides, or in general
searching for vector-based search in Solr to find all the related issues and also future works. So how can you use Solr to index and search vectors? And then I'll run through a full end-to-end neural search. The entry point is your schema, as usual. The schema XML in Solr allows the admin to define your data model. In your data model, you will define the field type, which is the dense vector field type, and a couple of parameters that effectively set the internal Lucene parameters. These parameters are actually quite simple. The vector dimension, so the cardinality of your vector, and this is limited to 1024, not for any particular reason except to be performance-aware. So it's possible, it is just a hard-coded value, because we wanted to give flexibility to the user, but not too much, and then get back complaints like "we are using vectors with 10 million dimensions and this doesn't really work". For this reason, at the moment, it's limited to 1024. We may increase that in the future. If you want to do that, you need to customize your Lucene build and then set it in Solr. Currently, normally, you are fine with vectors smaller than 1024, but in case you need something custom, you can do that. And the similarity function. The three functions that are supported by Solr, as I mentioned before, are Euclidean, dot product, and cosine distance. Then you assign this field type to your field, and then the usual attributes in the schema to index the field, store it, and so on. Currently, with the vector-based field, effectively only stored content and indexing are allowed. Doc values and multi-values are not possible at the moment. I mean, they are current limitations, and doc values just because we don't need them right now. Maybe in the future, if there's any function-score-related thing that can benefit from them, we'll do that. But at the moment, we were focused on providing a nice and easy KNN experience to our users. There are also a couple of advanced parameters that are strictly related to the current algorithm used, so HNSW. These parameters affect the way you build the graph at indexing time. We have the HNSW max connections, which is the parameter that affects the degree of the vertices. This is related to the balance, the trade-off, between performance and accuracy. Of course, the higher it is, the more connected the graph, and this means more computational resources and time for building the graph and for searching, but more accurate results. And also the beam width. Both of these parameters are actually related to specific parameters from the papers. The names are slightly different in the papers, but in the Solr documentation you will find the mapping between the Solr parameters and the paper parameters; the beam width effectively affects the number of nodes per layer. If you're curious, you can access the 2018 paper and explore them in detail. Unless you want to run something very specific, you don't need to change them.
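As a sketch, this is roughly what such a field definition could look like when added through the Schema API instead of editing schema.xml by hand (the collection name, dimension and HNSW values here are just example choices; check the Solr 9 documentation for the exact parameter set of your version):

```python
import requests

SOLR = "http://localhost:8983/solr/my_collection"   # assumed local Solr 9 collection

schema_payload = {
    "add-field-type": {
        "name": "knn_vector",
        "class": "solr.DenseVectorField",
        "vectorDimension": 4,            # must match your encoder's output size (max 1024 by default)
        "similarityFunction": "cosine",  # or "euclidean" / "dot_product"
        "hnswMaxConnections": 16,        # advanced HNSW parameters, usually fine at their defaults
        "hnswBeamWidth": 100,
    },
    "add-field": {"name": "vector", "type": "knn_vector", "indexed": True, "stored": True},
}
requests.post(f"{SOLR}/schema", json=schema_payload).raise_for_status()
```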
So how can you index vectors in Solr? That's simple. It's not that different from a multivalued float field. You just pass in a JSON array, so an array of float values, or an XML representation of your document, which is quite verbose, and, to be honest, I don't know how many people still use the XML representation to push data to Solr, but it's doable: you just have to represent it with multiple XML nodes, one for each element in the vector. And in SolrJ, you can just use Java lists: you add them to the SolrInputDocument and you're ready to go. You can push the document or a batch of documents to Solr. So from the indexing perspective and the retrieval perspective in the search results, again, it's not very different from just a multivalued float. So that's quite simple and nice.
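For example, indexing a couple of documents with a vector field via the JSON update endpoint could look like this (toy 4-dimensional vectors, matching the example field type above):

```python
import requests

SOLR = "http://localhost:8983/solr/my_collection"

docs = [
    {"id": "1", "title": "the tiger is the biggest member of the Felidae family",
     "vector": [0.12, -0.48, 0.33, 0.91]},
    {"id": "2", "title": "Panthera tigris can reach 390 centimeters nose to tail",
     "vector": [0.10, -0.50, 0.35, 0.88]},
]
requests.post(f"{SOLR}/update?commit=true", json=docs).raise_for_status()
```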
From the searching perspective, a new query parser has been introduced in Solr that takes very simple parameters in input: the field, which must be a dense vector field, to run your queries in, the top k, so the amount of neighbors you want to retrieve, and the query vector, that's it, represented as an array with square brackets. You already define everything at indexing time, so you don't need anything else; it's quite simple.
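A minimal sketch of such a query, again assuming the toy collection above:

```python
import requests

SOLR = "http://localhost:8983/solr/my_collection"
query_vector = [0.11, -0.49, 0.34, 0.90]      # produced by the same encoder used at indexing time

params = {
    "q": "{!knn f=vector topK=3}" + str(query_vector),   # knn query parser: field, topK, query vector
    "fl": "id,title,score",
}
for doc in requests.get(f"{SOLR}/select", params=params).json()["response"]["docs"]:
    print(doc["id"], doc["score"], doc["title"])
```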
But there are some limitations at the moment, so let's explore the limitations as well. So, filter queries with vector-based search in Solr. At the moment, you can use vector-based search, KNN, in filter queries, so where the fq parameter uses the KNN query parser and the main query uses a classic lexical search. You can also do the opposite: you can use the KNN vector-based search in your main query and classic lexical search in the filter query or filter queries. But what's going on at the moment is post-filtering. What you do is intersect the document IDs coming from your filters with the document IDs coming from your top k neighbors, and this means that you potentially end up with fewer results than your top k, because you just take the top k and then potentially reduce that set. This of course is a problem, it's a current limitation, but with the new release coming up, Solr 9.1, we are introducing pre-filtering as well. So you apply the filter first and then you look for the top k neighbors.
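The two combinations described above could look roughly like this (remember that on Solr 9.0 the filter is applied as a post-filter, so you may get fewer than topK results; the 'category' field is just an assumed example field):

```python
query_vector = [0.11, -0.49, 0.34, 0.90]

# KNN as the main query, lexical filter query:
params_knn_main = {
    "q": "{!knn f=vector topK=10}" + str(query_vector),
    "fq": "category:animals",
}

# The opposite: lexical main query, KNN used inside the filter query:
params_knn_as_filter = {
    "q": "title:tiger",
    "fq": "{!knn f=vector topK=10}" + str(query_vector),
}
```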
Another limitation is with re-ranking. It's currently possible to use re-ranking with the KNN query parser, but what happens is that you change the score of the first-pass retrieval only for the documents that appear in the top k. So you're not running a one-to-one re-scoring of each of the first-pass retrieval search results; you just effectively intersect them again with the KNN results. Pure re-scoring is another feature that is coming with a future Solr release, so you will be able to select the top k candidates in a lexical way, potentially, and then re-score them, or manipulate and combine their score, with a language-model-based approach.
And something that is actually currently possible, and can be quite interesting, is to combine hybrid dense and sparse retrieval in Apache Solr. There are query parsers in Apache Solr that allow you to combine different query parsers, so different clauses, such as the Boolean query parser, where you can define multiple should clauses that are going to affect how the search results are matched from your index and scored in your ranking. So you can define a clause which is a lexical clause and another clause that is a pure vector-based search clause. Then documents are going to be returned matching, in this example, both the should clauses, and scores are calculated by summing them, or potentially, depending on the way you are combining these clauses, if you are using a dismax for example, by picking the max out of them. And what happens on the Lucene side is that you build your query combining the results of those two different query parsers, in this example.
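A hedged sketch of what such a hybrid query could look like, combining a lexical should clause and a knn should clause through the Boolean query parser, with parameter dereferencing via $lexicalQuery and $vectorQuery (field names and values are example assumptions):

```python
import requests

SOLR = "http://localhost:8983/solr/my_collection"
query_vector = [0.11, -0.49, 0.34, 0.90]

params = {
    "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
    "lexicalQuery": "{!type=edismax qf=title v='how big is a tiger'}",
    "vectorQuery": "{!knn f=vector topK=10}" + str(query_vector),
    "fl": "id,title,score",   # scores of matching clauses are summed
}
print(requests.get(f"{SOLR}/select", params=params).json()["response"]["docs"])
```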
So we've done some initial benchmarks, and what we found out was that on a small index performance is quite nice. Of course this doesn't necessarily scale linearly to bigger volumes, but with an index of more or less one or two gigabytes, we ended up with these kinds of measurements: in terms of time when building the index, it takes more time to build the graphs. But from a query-time perspective, KNN ended up being quite fast, actually faster than classic simple lexical search. And in terms of optimization of your segments in the index, so after a merge of all your segments, we noticed an even bigger improvement in KNN search results. We're going to do additional benchmarks in the future, but this is just to give an idea that it's effectively usable already. So how can you use
BERT with all of this? First of all, effectively, you can encode vectors from text and then use those vectors in Solr. There are various ways: of course, you may already have vectors, or maybe you want to generate vectors from text, and large language models can be an option, but large language models need a lot of data to be trained on. So it's normally difficult for a small enterprise to gather such a big amount of data. For this reason, transformers were quite successful, because they use pre-training on large corpora such as Wikipedia, the web, or a large bibliographic corpus of information, and that is a way for the transformer, for the language model, to achieve a general understanding of the language. Then you will need to fine-tune it with your smaller amount of data, which is domain-related, potentially to achieve a specific task. So you may be looking for text summarization, maybe dense retrieval, maybe translation, so machine translation,
or essay generation, whatever you want to do. And the large language model approach shares similarities with Word2Vec, so effectively Word2Vec was generating a vector per word,
with large language models you can generate vectors per sentence, for example, one for each sentence. And just to give you an idea of how some of them work: they use a masked language modeling approach, where you take in input a window of text with various terms, you hide one of the terms, and the model aims to predict the missing term. So you train it on these large corpora, and then the language model is able to predict missing terms, and you can extract the weights from the deep neural network you produce to effectively get the vectors. So BERT is one of them, and it's actually a huge family of them; it was originally contributed by Google and over time has been refined a lot, and there are many
variations; you can download pre-trained BERT models and then fine-tune them, and the important thing for you to know is that they allow you to pass from text to vectors. And how can you do that with open source software? With open source software, the result is going to be: using a parser for your text, a classic text analysis, and then you take in input a model. This model can be originally pre-trained, and if you just use it pre-trained, it's not going to work that well, so the recommendation is to go through the fine-tuning step. The fine-tuning step effectively takes in input the pre-trained model and refines the weights in the neural network, potentially changing the last layer, to adapt it to your domain, to your task. In the dense retrieval case, what we want to do is to achieve a large difference between the score of a positive document for your query and a negative document for your query. So, effectively, providing these examples, we can fine-tune the model to better recognize this difference, and then we can effectively pick the weights from the neural network, and those are the vectors' values. In this example, we are going to use PyTorch to build and encode the vectors, and, in this case, the model is just an input, so it doesn't matter what the model is. We picked in this example a sentence transformer model that was just downloaded from the available pre-trained models, but as I mentioned,
it's not recommended to use it just as it is, because you want to fine-tune it. But with PyTorch and some Python code, it's actually super simple to move from your document text to vectors. You import your libraries, you may potentially use GPU acceleration or not, depending on whether you have it available, and then what you do is read your documents, fetch sentences, in this example one for each document, and then you push a batch to the sentence encoder to encode each sentence into a vector, and then you push the vectors to Solr. So literally, with 12-13 lines of code, you can achieve vectorization. Of course, if you then want to bring this to production, you also need to take care of performance and everything, but nowadays, thanks to GPU utilization, you can achieve nice performance. Of course, it's expensive to build vectors, but this is the balance: you spend a lot of time generating the vectors at indexing time to then obtain the benefits at query time.
So to wrap it up, some future and current work we're doing. From the Solr perspective, we are working to simplify the configuration in the schema: currently you potentially need to specify the codec you want to use, which is quite advanced, so we want to simplify that. We want to just leave the algorithm as an input parameter, and the advanced HNSW tuning for the graph building as a possibility, but we don't want users to really have to specify the Lucene90, 91 or 92 codec, so this is currently a work in progress. Pre-filtering in Solr: as I mentioned, in March it was contributed to Lucene and has been adopted in Elasticsearch, so we want to do the same for Solr. Actually, this work is almost finished and it's going to be contributed and available from
Solr 9.1, so this will give the ability to run filter queries, reduce the scope of your search and then look for the top k neighbors. And then some Lucene simplifications: I've been working on the vector similarity function simplification, because the Euclidean distance works like the opposite of the others, so effectively it's a distance, whilst the others are a similarity. So the more distant, the less relevant, while for the others, on the other hand, the higher the cosine similarity, the more relevant. This difference in trend was complicating the code a little bit, so I've been working on that simplification. And another contribution we are working on is to provide an update request processor to enrich text at indexing time and get the vectors directly in Solr, and the same for a query parser that takes in input text and a BERT model to do the inference. Of course, from a code perspective it's not that complex, but we are also evaluating all the performance implications, so before contributing it and making it available in Solr, we want to make sure that there's a nice balance in, for example, the memory used by your Solr instance, to do this effectively end-to-end directly in Solr. So some
additional resources from our blog: we wrote a lot about neural search in Solr and all the details of our contribution, and how you can use BERT to improve search relevance, and how you can also tackle this problem with additional AI-related strategies such as document enrichment and potentially changing the term scores, so instead of using term frequencies in the index, using term scores effectively identified through deep learning. So, to finish the talk with some
thanks to the Apache Lucene community, which I'm part of, for all the HNSW goodies and improvements. Elia Varshani, a colleague of mine, who developed the contribution with me. Christine for the accurate review: Christine is a Lucene and Solr committer, and she helped me a lot in reviewing the code of the contribution. Cassandra for the documentation: being a non-native speaker, I ended up having some not ideal documentation sentences, so she helped me a lot to simplify and improve them. And finally Michael for the discussion about dense vectors and how to describe them, and the difference between sparse and dense vector representations, in a nice and easy way in the documentation for everyone to understand. And thank you, audience, for your attention. Thank you very much. Any questions
right now in the room? Great talk, thank you. The vector dimension, do you have any
kind of suggestions on how that should be tuned and where people should start? So I think I would probably recommend starting with a classic, small BERT implementation with 768, which is one of the defaults you get from the pre-trained models, and start from there, keeping in mind the fact that currently in Solr 1024 is going to be the limit. Of course, it also depends on whether you already have vectors, so in that case you may need to adapt, but in case you move from text to vectors, I would start that way and then check effectively through experiments if that makes sense for your domain.
But yeah, I would start simple with a pre-trained 768 BERT, fine-tune it and check how it goes. Great presentation Alessandro, and happy to see ANN search in Solr. Great work. So one way to do hybrid search is to, for example, say okay, I trust lexical search 90% and I trust ANN search 10%, so you could assign weights during the scoring, right, and then your re-ranker could reorder the documents in the way they should be reordered, and the reason here I guess is that dense search is more optimized for recall, or at least vector search, right, and the lexical search could be tuned for precision, so you could have a balance there. So have you thought about this use case and are you planning to implement one? So effectively, if you use combined query parsers and you are aware that the score coming from vector-based search is going to be from zero to one, for example, you may tune it already, because anyway the lexical search part effectively calculates the BM25 score with all the boosts you can add as you prefer, and then if you combine them you're just going to do the sum. So what you mentioned should already be possible; of course there's a problem with the lexical side, which is not probabilistic in Solr and Lucene, so you don't really know if it's going to be, you know, I don't know, 10,000 or one, while the vector-based search, unfortunately, is always going to be from zero to one. So possible, but definitely probably not easy to do right now. Also, probably the ideal kind of scenario I would like is to integrate it in some way with learning to rank, so that these weights can come with a little bit more sense, and making sure that you don't end up dwarfing the vector-based search score, or the other way around. So possible right now, but I would dedicate a little bit more time to this problem. Does
this answer your question? Last one please. Thank you for a great talk Alessandro and I would be interested to know if we have benchmarked or have some stats to differentiate
between the classical similarity and vectors, like does it outperform BM25 in any way? Do we have any stats about it? I mean, I love that you showcased the indexing part, but do we have anything on the precision part of vector search versus BM25? Okay, so I've not done that directly from a quality perspective, but the Solr implementation is quite similar to the Elasticsearch implementation, which is actually quite different from the OpenSearch implementation. So if you take a look at the ANN benchmarks for Elasticsearch, I'm pretty sure they are pretty much similar to the Solr ones. We're going to do quality-related benchmarks for the Solr implementation in the future as well, but Elasticsearch and Solr both use the same Lucene code, so from a quality perspective there are already, so, I don't remember the exact links off the top of my head, but there are some benchmarks for quality, recall-based mostly, for Elasticsearch, combining it with Facebook files and the Best Buy dataset and others, so you can at least have some inspiration and some idea from there, and when we release the Solr ones, I suspect they're going to be pretty similar.
Okay thank you very much Alessandro, that was great. Thank you.