
Document classification search; joins vs payloads


Formal Metadata

Title
Document classification search; joins vs payloads
Series Title
Number of Parts
69
Author
Contributors
License
CC Attribution 3.0 Unported:
You may use, adapt, and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Payloads are a powerful though seldom utilized feature in the Lucene-Solr ecosystem. This talk reviews the existing payload support in Lucene and introduces the new features in Lucene and Solr 9 (LUCENE-9659 / SOLR-14787). The main focus of the talk is to explore real-world search and ML use cases that traditionally utilize a query-time join, and the application of Lucene payloads to solve them. This talk is for search practitioners interested in utilizing machine-learned data in search-based analytics dashboards. Many Solr-based applications attempting to deal with machine-learned classifications are forced to implement a parent-child join relationship between a document and its classifications. This model introduces many additional system constraints and costs at both query and index time to maintain the ability to filter results as desired. New features in the payload span query in Lucene provide applications a way to maintain query flexibility without incurring the cost of performing a query-time join. This greatly simplifies system design and architecture and can provide dramatic improvements to query performance. A reference implementation will be presented that compares the join and payload approaches. The demonstration will show how to search for documents that have classifications above a particular confidence threshold at scale.
Transcript: English (automatically generated)
Hello everybody. Glad everybody could join us today. Very excited to talk about some of the features that we've added to Lucene and Solr, thinly veiled in a talk about image search and indexing the output of neural networks.
So, who am I? I'm the founder of KMW Technology. We've been in operation since about 2010, we're based in Boston, and we primarily focus on Solr, Elasticsearch and Lucene. We provide training, search cluster architecture reviews, and application development, and we perform Solr audits.
And we're very big proponents of open source: we're contributors, supporters and committers. So before we get into some of the approaches that we took to solving this problem, I'm just going to do a quick overview of what payloads are, because I feel like they're often overlooked.
People don't necessarily know what payloads are. Payloads are a piece of binary data that can be stored at a position in a field of a document, and they live in the position and payload files of the index: the .pos file in the Lucene index provides the byte offsets for a term's position within a document into the .pay file, the payload file.
And this allows us to very quickly reference that binary data at query time. There are a couple of Lucene queries that support this, the span payload check query and a few others.
But primarily these are exposed through Solr's query syntax via the payload check and payload score query parsers. Previously, the payload check query parser could only perform a pure equality operation:
does the payload equal a value that's specified? We saw a very simple improvement to make there, which is to support inequality operations, that is greater than, less than, greater than or equal to, and less than or equal to. And this really opens it up to a wider range of use cases with a relatively minimal impact on performance compared to normal payload checks.
So what are payloads in Solr, and how do you configure them? Well, they're a special field type, and the important thing in that field type is that it uses the delimited payload filter in the analysis chain.
What this allows you to do is encode a payload with every value. If you look at the data input format below, you have something like a value, a pipe, and then the payload that you want to associate with it, and that becomes your field data. The current payload encoders and decoders support integer, floating point, and string (or identity, as it's also sometimes called) encodings.
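To make that concrete, here is a minimal sketch of such a field type, roughly along the lines of the delimited_payloads_float type that ships in Solr's default configset; the dynamic field rule is just an illustration:

  <fieldType name="delimited_payloads_float" class="solr.TextField" indexed="true" stored="false">
    <analyzer>
      <!-- split on whitespace, then strip the "|payload" suffix and index it as a float payload -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
    </analyzer>
  </fieldType>
  <dynamicField name="*_dpf" type="delimited_payloads_float" indexed="true" stored="true"/>

With that in place, a field value like "cat|0.75 dog|0.11" indexes the terms cat and dog, each carrying its confidence score as a payload.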
Quick review of the Solr query parsers that support this. The first one, which we won't be talking about much today, is the payload score query parser, which allows you to use this payload information as part of your relevancy or scoring calculation. Useful to know that it's there, but not the focus of this talk. More interesting is the payload check query parser, which can make the determination to match a particular term in a document only if the payload equals a particular value.
I think most of this functionality came out of part-of-speech searching use cases: searching for the word train only if train had been tagged as being a noun rather than a verb.
So it's more granular control, not just term matching, but matching on the payload as well. As mentioned before, what we did is extend this payload check query parser with an additional parameter, the operation op, which is applied when matching against the payloads. Here we specify gt, representing a greater-than comparison.
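As a rough sketch of the syntax (the field names here are just illustrative), the original equality form and the extended form look something like this. An equality check in the part-of-speech style, matching train only where its payload equals NOUN:

  q={!payload_check f=words_dps payloads="NOUN" v="train"}

And with the new op parameter, matching train only where its payload is greater than 0.75:

  q={!payload_check f=classifications_dpf payloads="0.75" op="gt" v="train"}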
So if we imagine the word train indexed with a payload of, say, 0.75, we can now search for documents that contain train only where its payload is greater than a value we choose. All right, so let's talk about our use case motivation. Why did we go down this road in the first place?
The example use case we're going to present in this talk is really dealing with the output of neural network model classifications. If you look at most of the bleeding edge in image recognition, more than likely you come across convolutional neural networks; some popular open source ones are VGG16 and YOLO, which we used in this example. We'll show you a demo of what that output looks like a little bit later in the talk. But generally speaking, with a neural network you have something like an image where each pixel is a value on the input, and that goes through the network.
And then you have an output layer where each one of the outputs usually is a particular label or classification, and the score on that output is like a confidence score. So what we end up with, for any given image that we want to classify, is a list of classifications, and each classification has its confidence score associated with it.
Other models like YOLO, in addition to giving you a category and a confidence score, can also give you bounding box information about what was detected in the image.
So we'll come back to this in the demo at the end of the talk, hopefully we have time. But here's just one way to represent payloads in the index for classifications from a machine-learned model. We have here the vgg16_dpfs field, delimited payload floats, with the trailing s for a multi-valued field,
and we see the labels and the confidence scores encoded there. The YOLO classification gives us the object type; in this situation we have an example of a person, and that there was one person detected.
We have positional information, the X and Y coordinates of where in the image that person was detected, and we can even compute things like how large the person is in the image.
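A sketch of such a document, with the vgg16_dpfs field name taken from the talk and the YOLO field names and values being purely hypothetical:

  {
    "id": "image-000123.jpg",
    "vgg16_dpfs": "pizza|0.92 oven|0.41 person|0.13",
    "yolo_count_dpfs": "person|1.0",
    "yolo_size_dpfs": "person|0.27"
  }

Each term carries its model output as a payload, so one multi-valued payload field per model keeps the entire classification output on a single document.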
All of this boils down to a general data model that looks like a one-to-many relationship between documents and classifications: you have a document with some data on it, that's your parent document, and then you potentially have many classifications, each one with its label and its own confidence score.
So the first approach to indexing something like this, the most naive and simplest approach by far, is to filter at index time. If we want to search for all the documents that had been classified as containing a person with 0.75 confidence or greater, the most straightforward thing we can do is filter it at index time: you do your classification up front and you only tag the document with the labels that were above a particular threshold.
Now the pros of this approach: it's incredibly simple, it's very fast, and the index is very small. But the real trade-off is that you can't change your mind about what the threshold is at query time, because you're throwing that information away at index time. So if you wanted to change your query to say medium or high confidence thresholds, you'd need separate fields to include the different sets of labels, and that complexity just grows as you want to tweak what you consider high, medium, and low. So it's a straightforward approach in terms of simplicity, but it does not yield any flexibility at query time.
And if we look at what a document of this style would look like: obviously you have a document ID, and maybe you have a field like your high-confidence labels, where you tag the document as being a dog or a cat or whatever it is, and a simple term query on that field is going to find things that are tagged as being a cat or a dog, because you did that filtering up front at index time.
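A minimal sketch of approach one, with hypothetical field names:

  { "id": "image-000123.jpg", "labels_high_ss": ["cat", "person"] }

  q=labels_high_ss:cat

The threshold decision has already been made at index time, so the query is nothing more than a term query.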
Not every model has the same sort of confidence threshold, and sometimes you want to expose that confidence threshold to the end users, to let them decide what they consider good output from the model and to adjust their recall. So another approach to setting this up is to use a dynamic field, one field per label that came out of the classification.
You could imagine having a field for cat that is a floating point field holding the score of 0.75 or whatever it is, a field for dog, a field for person, and this can work very well. The nice thing is that at query time this becomes a search on the label's field, a range search for whatever numeric value you want in there. So these are pretty straightforward, very performant queries. But one trade-off is that you might have a lot of labels, and as a result you'll end up with a huge number of fields in your index, which, as it turns out, ends up being extremely expensive in terms of memory usage in a Solr or Lucene index. The other trade-off is that you might not know the labels ahead of time, so you can't necessarily facet on field names, although I guess you could use something like Luke to interrogate the index to get all of those out. It just adds a little complexity there.
So what would a document like that look like? Here's an example document with an ID and a field for each one of the labels, cat, dog, person, each holding the score, and the sample query here is very straightforward and very simple.
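A minimal sketch of approach two, again with hypothetical dynamic field names:

  { "id": "image-000123.jpg", "label_cat_f": 0.85, "label_dog_f": 0.10, "label_person_f": 0.42 }

  q=label_cat_f:[0.75 TO *]

One numeric field per label, and the confidence threshold becomes an ordinary range query.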
Approach number three that we looked at was actually to use a join query, to leverage the inherent query-time join capabilities that Solr and Lucene have: index the parent document and the child classification records associated with it, search through the classification records, and perform a join back to the parent document, returning just the parent documents that had a classification record matching the query.
So the pro of this is that it gives you full flexibility in terms of relational-style queries: return this parent document if and only if a classification record has a particular label and a particular value, and you can do filtering on the classification records as complicated as you like and return just the parent record. Now, the big drawback here is of course that join queries are much slower, and we're going to show some benchmarks later on that really drive that point home. Aside from the fact that you have to do the join, that is primarily driven by the vastly increased number of documents in the index as a result of having all these child documents around, and search response times are generally roughly linear in the number of documents per shard. So that means that when you start going with a join approach, you almost immediately have to think about how you're going to shard or scale up this join. And when you do any sort of sharding in an environment where you are doing a join, you need to make sure that you're routing all of your documents to the same shard based on their join key; otherwise that join query is just not going to work as you expect. So that's a little bit of complexity. If you have the freedom to route by the join key, then it's probably not as much of an issue, but it definitely needs to be thought about when you're going with an approach like this.
And here's an example of what a join query with a parent document and the child classification documents would look like. We have a simple parent document with just an ID and maybe some other metadata on it. For our example, we generated a million parent documents, each one having an average of 50 classification documents, and the classification documents themselves have a pointer back to the parent, a label, and a confidence score. And we see an example join query below that searches the classifications for the label foo with a confidence of 0.75 and up, joining on the parent ID back to the parent document's ID field. So it's definitely doable.
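A minimal sketch of approach three, with hypothetical field names:

  parent: { "id": "doc-1", "title_s": "some document" }
  child:  { "id": "doc-1-c42", "parent_id_s": "doc-1", "label_s": "foo", "confidence_f": 0.82 }

  q={!join from=parent_id_s to=id}label_s:foo AND confidence_f:[0.75 TO *]

The join parser searches the classification records and maps the matches back to the parent IDs, so only the parent documents come back.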
One thing to note is that with this query, only the parent documents will be returned. So the metadata of the matching child classification document is not available to the UI unless you use a child doc transformer to fetch the matching classification documents, which adds additional complexity, not just to the query, but also to fetching the matched metadata from the classification record.
All right, so this brings us to the payload approach that we took. We observed this common use case of being able to search for a label or a term on a document, where that term or label came from a machine learning model, and it's really almost like a one-dimensional join: we always knew we were going to be filtering on a single dimension, in this case the confidence score. So we looked at the existing payload check query parser and were a little disappointed to find that it only supported an equality operation. But understanding that, at the end of the day, this is just paging in a byte array, and that it was previously doing an equals comparison, implementing a comparator on that for greater than, less than, or the or-equal variants really added little to no computational overhead. So we were confident that this approach was going to perform at least as well as normal payload queries do. So what we did is encode the confidence scores as floating point payload values, index them, and extend the payload check query parser to support these inequalities.
Let's take a look at what an example document that uses the payload check query parser would look like. Here we have a single field with the classifications; this is effectively the output layer of the neural network with some human-readable labels, cat, dog, and person, with the pipe delimiting the term, or label, from the confidence score. And you see the payload check query parser below, where you specify the field that you're going to query on, the payload and the operation for the comparison, and the original query term, which is cat in this case, to only find cats with a confidence score of 0.75 or better.
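A minimal sketch of approach four, with a hypothetical field name for the classifications:

  { "id": "image-000123.jpg", "classifications_dpf": "cat|0.85 dog|0.10 person|0.42" }

  q={!payload_check f=classifications_dpf payloads="0.75" op="gt" v="cat"}

A single payload field on the parent document replaces the child records, and the confidence filter is applied against the payload at query time.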
So, four different approaches are great, but you really need to make an informed decision about why you're choosing to go with one or the other, and in my opinion the only real way to do that is actually to do some benchmarking. So what we did is generate representative data and indices to prove out some of these benchmarks. For the indexing benchmarks that we're going to talk about, we had a single-threaded Java application that was just feeding documents into Solr with the formats as described in the previous slides.
We generated 1 million documents, each document having an average of 50 classifications, and those classifications had a random confidence score between zero and one assigned to them. There are 10,000 unique labels in the classification data set that we used to generate them. So we have a million documents with on average 50 classifications, spanning 10,000 different labels, each with a score randomly distributed between zero and one: a reasonably representative data set for what we see when we actually use these neural networks at indexing time.
At query time, we wanted to make sure that we were looking at the raw query performance and not being fooled by any caching going on. So we again have a single-threaded app, and all the filter cache sizes were set to zero.
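As a sketch, zeroing out the caches for a benchmark like this is just a matter of setting their sizes to zero in solrconfig.xml (CaffeineCache being the default cache implementation in recent Solr versions):

  <filterCache class="solr.CaffeineCache" size="0" initialSize="0" autowarmCount="0"/>
  <queryResultCache class="solr.CaffeineCache" size="0" initialSize="0" autowarmCount="0"/>
  <documentCache class="solr.CaffeineCache" size="0" initialSize="0" autowarmCount="0"/>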
So really the only effects of caching that we saw in this benchmark were operating-system-level caches and potentially some of the Lucene field-level caches that get triggered, but none of the Solr caches were enabled for these benchmarks.
Getting into the actual benchmarks, a couple of things jump out at us. With approach one, where we're filtering data out at index time, we're able to index these documents very quickly, 11,000 documents a second. The index was the smallest, and the memory usage as reported through the Solr console on the core, the heap usage, was ultimately pretty small.
Approach two was a bit surprising: with all of the fields, potentially 10,000 fields in the index, it was the slowest to index. Perhaps because the JSON representation of the document is just much larger; it's not as tight a format. We don't really know the exact reason why it is so much slower; perhaps it was because, with so many different fields, the index itself has to pay attention to where it's writing that data out, and that wasn't easy to do quickly. So 209 documents a second versus 11,000 documents a second: clearly this per-field approach two has major impacts if you're going to be doing a lot of indexing. The other big one that really jumped out at me is the reported heap usage required to support this index: nearly a thousand times more memory compared to approach one, really highlighting the impact of having a lot of fields in your index.
Approach three, using the child document join, yielded the largest index overall. We're talking about small indices here, but 2.6 gig, or I'm sorry, 2.6 meg, in this sample index; the overhead of the additional documents in this situation really contributed to the index size. Memory usage was not as far out of whack as the per-field approach two, not too bad. But interestingly, approach four with the payloads yielded a slightly larger index than approach two, while the indexing rate was about 15 times faster than approach two. So it's certainly not as fast as throwing away data at index time, but it's nearly 10 times faster than the join approach and 15 times faster than the per-label approach.
Memory usage, surprisingly, was also less than the original approach one; I think we'll just call those roughly equivalent.
Query benchmarks are also very important to pay attention to, because we're not just concerned with indexing, we're also concerned with querying. Join queries were so slow in this benchmark that we just stopped after a thousand queries; we'll say that up front. Approach one: 600 queries a second, running 10,000 queries, one for each label. No big surprise, because this is just a simple term query.
The range query approach on the per-field index was the second fastest, about 350 queries a second, but notably the average result size for these documents was considerably larger, probably because of the overhead of all the JSON formatting, and I think that's what really hurt approach two in this case. With the join parent-child relationship, queries were taking something like two seconds. Of course we turned off caching, so that's affecting this very negatively, but we're not measuring queries in terms of hundreds of queries per second; we're measuring query rates of about half a query per second, 0.5 queries per second.
Compare that to the payload approach, where we don't have the memory hit that we had on the index side and we're still getting about 250 queries a second. The average result size here is smaller thanks to the tighter JSON that we have. Overall, the query response time is three milliseconds versus one or two milliseconds in the other approaches, so still in the ballpark.
So let's talk a little bit about a quick demo that we'd like to show, to see what this looks like. We indexed the COCO image data set, about 118,000 images, through an open source document processing pipeline that has an image processing sub-pipeline handling things like running OpenCV for blur detection, detecting faces, and also running things like VGG16 classification and YOLO classification. We kind of hinted at this document style before, but here is an example of what the documents look like in our example index.
So let me go ahead and switch over to the index here, and make this a little bit bigger. We have here an index of 118,000 images, and we want to start asking some questions based on the outputs of those models. So here, for example, let me ask: of these images, show me the ones where pizza was classified at 0.75 or above. Great, and we see that we've got 855. Maybe I want more recall; I can decrease this threshold, and now we've got 1,100 pictures of pizza. And let's say I want to find ovens. Here I'm looking for pictures that have at least one oven in them. Maybe I'm interested in ovens and pizza, which is kind of interesting, because now we're leveraging the output of one neural network model, YOLO, and another neural network model, VGG16, at the same time, not just to find pizzas or ovens, but pictures of ovens and pizzas, or pizzas and ovens.
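A sketch of what combining the two models in one request could look like under the payload approach, with hypothetical field names: one filter query per model, each checking its own payload field.

  fq={!payload_check f=vgg16_dpfs payloads="0.75" op="gt" v="pizza"}
  fq={!payload_check f=yolo_count_dpfs payloads="0" op="gt" v="oven"}

The second filter reads as: the oven count payload is greater than zero, i.e. at least one oven was detected.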
Maybe I want ovens and pizzas and people, for example. People with, well, that kind of looks like a stove, but that's an oven. You can kind of see a blurry person over here. Here's a person with a pizza, and there's an oven in the background.
Maybe I'm interested in a person with a laptop. I'll find those. Maybe I'm interested in more than one person or a group of people with laptops. Maybe I'm looking for
people at a bakery. Greater than two people at a bakery. Maybe I want less than two people at a bakery. Well, let's see. Or less than or equal to two. Let me get that equal out of there. And there's only one person at a bakery. What are some other things that we need to consider
going forward with this? In some of this work that we did in Lucene, it kind of appeared that the codec, the encoding and decoding of the payloads, was a little bit fragmented, and it would be a nice improvement to the code to make it a little more extensible. Once that codec library were more extensible, it would be a very short lift to do some things like vector matching: if you encoded not just a single floating point value but an array of floating point values, you could start doing things like computing cosine similarities. And once we have these sorts of classification feature vectors from these neural networks, it's also a small jump to look at the classifications that came out for an image and compose a find-similar query, to find other images that had similar classifications in similar ranges.
Another thing that jumped out at me as a nice-to-have: the syntax for this is a little bit difficult to work with; it's definitely not something that an end user would type in. Having some NLU or NLP sort of front end for query parsing would let you say, in natural language, show me a picture of an oven with some pizzas and at least two people, and translate that into the appropriate query.
So this was all contributed back, and it will be in Solr 9 when Solr 9 is released.
The tickets are LUCENE-9659 and SOLR-14787, with myself as a contributor and Gus Heck and Dave Smiley as committers; a big thank you to them for helping usher this through to the community. And yeah, I think we have two minutes for questions, maybe a few more if I'm lucky.
Great talk, Kevin. And as I can read from here, people feel the same; it was a great talk with a great example that you showcased. We do have some questions; some of them have been answered by the community itself. However, someone asked whether this payload approach can be used in Elasticsearch. The link for this has already been provided, but would you like to expand on it?
Sure. So the span payload check query is at the Lucene layer; as soon as that is included in the latest Elasticsearch build, then at least the Lucene query would be there. You would still very likely need some extension to the existing payload support in Elasticsearch. So, being Lucene under the covers, there's no reason why it couldn't be extended to Elasticsearch, but it's not currently supported.
Great. I think along the same lines was the question of how easy it would be to extend the query parser in Elasticsearch the way it's done in Solr in this talk, and I think you've already answered that question. Yeah, absolutely. So, you know, either with separate plugins or by extending the existing Elasticsearch source, you'd be able to do that.
But again, this is going to require that Elasticsearch is pulling in the latest Lucene from the 9x branch. Correct. Yep. I think Max Erwin also mentioned the same thing; he has probably already tried something of that sort. So he asks whether there is a way to reference the payloads in a Painless script, and if that can be provided. We haven't done that. And actually, I was looking at it a little bit more: it would be nice to extend this payload check support into the payload score query parser, so you could start doing things like that. That would be a very nice future enhancement, for certain.