
Understanding Vespa with a Lucene mindset


Formal Metadata

Title: Understanding Vespa with a Lucene mindset
Number of Parts: 56
License: CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Vespa is no longer a 'new kid on the block' in the domain of search and big data. Everyone is wooed reading about its capabilities in search, recommendation, and the machine-learned aspects augmenting search, especially for large datasets. With so many great features to offer and so little documentation on how to get started with Vespa, we want to take the opportunity to introduce it to Lucene-based search users. We will cover Vespa architecture, getting started, leveraging advanced features, and other important aspects, all in analogies easier for someone with a fresh or Lucene-based search-engine mindset.
Transcript: English (auto-generated)
Good morning, everyone, and welcome today to Berlin Buzzwords. My name is Atita Arora, and I work with OpenSource Connections. Today we're going to be talking about understanding Vespa with a Lucene mindset.
So, the talk is going to be centered around understanding Vespa. I know it's a little repetitive after two sessions from the Vespa folks yesterday, but bear with me. I am on a bit of a time crunch because I have a 20-minute session, which is why I will be sticking to this agenda for today,
where I introduce myself, my company, and Vespa. We will try to understand Vespa a little better, and we will cover some distinctions: how you get started, if you want to; how you create and deploy Vespa applications; how you feed data to it; and how you interact with it. I will also leave you with some references you can consult in the future.
So, without further ado, let's get started. In good times, I look good. I work in the search domain, and I've been working here since 2008 — I mean in the search domain. I am an open-source enthusiast, and I have worked on and contributed to several open-source projects.
I am primarily interested in search relevance and language analysis, because that is what really excites me. I'm also a polyglot developer, and I've done my master's in computer applications. And just to spice things up a little more, I did a master's in strategic business management as well.
Personally, I'm a mother of two boys, and I love to travel and I love to cook. So, that's me. About my company, OpenSource Connections: as the name suggests, we are also open-source enthusiasts, and we are on a mission to empower the world's search teams.
We attend lots of conferences, and we hold relevance trainings as well. We also have a lot of books and blogs — check out the website — and we're hiring.
So, moving on. What is Vespa? There's going to be this little guy who's going to help me explain what Vespa is. Let's hear him. What's that? Oh. It's just the greatest thing that humans ever made.
The Vespa. Okay, that's a bit much of a hype, and sorry. We get it — it's the greatest thing that humans ever made, for sure. But that's not the Vespa I'm going to be talking about today. I'm going to be talking about Vespa the search engine, so let's try to understand what Vespa the search engine is.
So, it's a platform which provides low-latency data ingestion and information retrieval. It does fast, real-time writes. It supports true partial updates, and you can search unstructured data with it.
Along with that, it has integration for complex query-time operations, aggregation, and real-time analytics. You can also use NLP features and complex machine learning models with it, and support for TensorFlow, et cetera, is provided out of the box. On top of that, it also provides you with managed and auto-recovering clusters,
and I'm sure this must be very exciting. I will also try to draw a little distinction to paint the picture for you as to where Vespa stands. So, when Solr came in, it was aimed at the enterprise search market.
The cool thing about it was that it could ingest, and you could interact with, data in various formats. It provides a nice admin UI console with which you can query and also get cluster information. When Elasticsearch came in, it was more analysis-driven.
It was more focused towards logging, monitoring, and scaling use cases. It also provides rich APIs for almost everything and anything in your search cluster. Let's see where Vespa stands here. So, Vespa actually fits into all kinds of use cases.
Its aggregated focus is basically on large-scale data ingestion and information retrieval from it. I think the distinct thing about it is that all the complex stuff you would usually do outside the Lucene-based search engines
is provided out of the box by Vespa. So, you can do all sorts of complex information retrieval and integrate and use machine learning models in Vespa by default. That's kind of a cool thing. I know this is also kind of repetitive — there was a session yesterday where they talked about this architecture.
I'm not going to spend a lot of time on it — again, time crunch. So, we have the application package, which is like what we have in a Solr application with the schema, solrconfig.xml, and other configuration XMLs. That's where the application package stands, and that's where you integrate and put all your machine learning models
and the other components — custom components as well. That goes into the config cluster, which apparently also uses ZooKeeper. Then you have the stateless Java container, which comprises the query processor and all the custom components that you're going to be putting into Vespa. And then the content cluster — I think that's an interesting part,
because this part is implemented in C++. From the perspective of a developer, I feel this is probably one of the reasons that Vespa is so magical and so fast, because for the Lucene-based search engines this stuff is done in Java,
which means it has to be converted into a machine-understandable format. When this stuff is already in C++, it is closer to the machine, which is probably one of the reasons it's that fast. So, by now you must be wondering: how do I get started?
So, for the various tutorials and test use cases in the Vespa documentation, they talk about keeping six gigs dedicated to Docker. I would say — and I've tried a pretty small case — go for 10 gigs. That's safer.
Keep your port 8080 free, and have brew and wget available. Along with that, you would need Python 3, in case you're using Python, because your dataset needs to be converted into a Vespa-friendly format, and that's not the usual JSON — it's a slightly distinct JSON. I will talk about it in my forthcoming slides.
So, how do you create a Vespa application? Just like any other Lucene-based search engine application, you have a schema and other configuration for the cluster. Lucene-based search engines now also provide schema-guessing capabilities, although they are not production-ready per se;
but considering this is a big-data search engine and you're going to be ingesting and interacting with huge amounts of data, that is why the Vespa folks say you should spend some time on your schema, so you can avoid goof-ups. So, you need your schema (.sd) file, which is going to be the schema of your entire application.
You have services.xml — I've provided a snapshot, but it's too small, even for this screen — and that's where the request-handler kind of configuration goes. The hosts.xml is needed only when you have a multi-node cluster setup,
and validation-overrides.xml is more like a circuit breaker, so that in case you change some property which might corrupt your production instance, you have protection against it. In structure, it would look something like this. Once you have all of this set up, you would need to deploy your application.
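As a sketch of what's described here, the package layout and deploy step might look like this (the application name `my-app` and the `music` schema are invented for illustration; port 19071 is Vespa's config/deploy port):

```
my-app/
├── schemas/
│   └── music.sd              # the schema (.sd) file
├── services.xml              # container/content services configuration
├── hosts.xml                 # only needed for multi-node setups
└── validation-overrides.xml  # the "circuit breaker" for risky changes

# Run Vespa locally in Docker, then deploy from the application directory:
docker run --detach --name vespa --publish 8080:8080 --publish 19071:19071 vespaengine/vespa
cd my-app && vespa deploy --wait 300
```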
So, you can run the first command — I know I've gone a little into the details — and if you're using an Apple M1 like me, you might use the second one. Then you go into the directory of your application, you just hit vespa deploy, and bam — your application is deployed. So, the next big task is: how do I feed data to Vespa? So, as I mentioned before,
Vespa supports its own Vespa-friendly JSON format. I've also provided a link to the Python script which I used — I was using the nearest-neighbor example — so you could use the same script or modify it as per your schema or your use case.
This is, again, a little blurry, but if you look at the ID, that's where the entire change is: the ID needs to be slightly different. I've tried to break it down for you: for the ID you need to define a namespace, then your schema name,
and then comes the document ID itself. So, this is probably one of the parts that needs to be tuned or changed. There is no native support for dates, so they need to be converted to long if you have any date-related operations — I actually transformed them to strings
just to keep it mess-free and simple for myself. Bulk data is also supported, but through the Vespa feed client, which, again, you can brew. Another critical part is defining your ranking function — I still have to scratch my head to get that right.
After all of this has been done, you can post your single-document JSON with vespa document and the JSON path, or, if you're using the Vespa feed client, you can hit the feed client with the file name and the endpoint where the application is running.
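As a sketch of that Vespa-flavoured JSON — the `music` schema, namespace, and field names here are invented for illustration; the per-field `assign` operation is what Vespa's partial updates use:

```python
import json

def feed_doc(namespace, schema, doc_id, fields):
    """A 'put' document: the ID follows the id:<namespace>:<schema>::<doc-id> scheme."""
    return {"put": f"id:{namespace}:{schema}::{doc_id}", "fields": fields}

def partial_update(namespace, schema, doc_id, assignments):
    """A partial update: each field carries an operation such as 'assign'."""
    return {
        "update": f"id:{namespace}:{schema}::{doc_id}",
        "fields": {name: {"assign": value} for name, value in assignments.items()},
    }

# No native date type: feed dates as long (epoch seconds) or plain strings.
doc = feed_doc("mynamespace", "music", "track-42",
               {"title": "A Brilliant Song", "release_ts": 1651363200})
upd = partial_update("mynamespace", "music", "track-42", {"title": "New Title"})
print(json.dumps(doc))
print(json.dumps(upd))
```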
So, now we have the configuration, we have the application running, and we have the data available in Vespa. You might as well need to modify or remove something. By now, you must have understood that the ID is kind of the key component here, so you will be interacting with the documents via this ID.
With vespa document, in case you remove, modify, or update, you need to use this ID: for removing, you just remove by the ID, and in case you need to modify, you give the new document JSON. So, the next thing that falls into the picture
is how you query — how do you interact with this data? So, Vespa uses a distinct format, which is not too distinct, actually. It's called YQL, the Yahoo Query Language, which looks a lot like SQL, which I'm sure everyone here understands.
So, I would recommend you use the command line — it works best — but this is the Query Builder. I don't know why this side of the screen is a little blurry, but okay. I was trying to show that you have a lot of tuning knobs. Don't get confused by this; it's available where the application is running, at /querybuilder.
The default query type is AND — all the terms have to match, similar to a match query requiring all terms in Elasticsearch — but you have the provision to change it by providing type equals all, any, weakAnd, tokenize, web, or phrase.
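A query request is just the YQL plus parameters such as `type` and `hits`; as a sketch of the intents covered in this part of the talk (the endpoint, schema, and field names are assumptions):

```python
from urllib.parse import urlencode

def build_query(yql, query_text=None, grammar=None, hits=None):
    """Assemble request parameters for Vespa's /search/ endpoint."""
    params = {"yql": yql}
    if query_text is not None:
        params["query"] = query_text
    if grammar is not None:
        params["type"] = grammar   # all, any, weakAnd, tokenize, web, phrase
    if hits is not None:
        params["hits"] = hits      # number of documents to render
    return params

# Match everything -- the *:* analogue
q_all = "select * from sources * where true"
# Filter on one field, or search the default (all) fields instead
q_title = 'select * from sources * where title contains "brilliant"'
q_default = 'select * from sources * where default contains "brilliant"'
# Change the result ordering
q_sorted = "select * from sources * where true order by year desc"
# "brilliant", but drop documents whose title contains "1987"
q_not = ('select * from sources * where default contains "brilliant" '
         'and !(title contains "1987")')

params = build_query("select * from sources * where userQuery()",
                     query_text="love songs", grammar="weakAnd", hits=5)
url = "http://localhost:8080/search/?" + urlencode(params)
```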
So, moving on, I've also tried to capture some intents that you may have when you're querying the data. In case you want to query all the documents — like a normal *:* or query-all feature from the Lucene-based search engines — you can do that in Vespa with select *,
where true. If you need to do filtering, you provide it in a SQL-like format, where the field contains your designated query. In case you want to query all the fields, you do not specify the field; you query against default,
and that queries all the fields. If you want to render a specific number of documents, you use the hits feature — hits is the number of documents that will be rendered. If you need to change the default order of your documents or results, you can modify that by introducing an order by clause,
and you can specify the field on which you would like to order your results. If you need to filter certain keywords out of your search results — I've tried to provide an example where I query "brilliant" but remove the documents which have title 1987 in them —
you can also specify custom ranking, the ranking function that I spoke about on my previous slide. So that's that. One of the good things is that you can also use it for personalization: you capture a user profile,
and you can use this along with your query, providing it as a user-profile setting. So, say the person's intent is love songs or something from the '80s — you can provide that and add it into the query context. So, moving on, I think by now you must be wondering:
Vespa is about vector search — where is vector search? How do I do vector search? I'm not sure how many people were here when they discussed that, for vector search to work, you need to ingest vector embeddings into your dataset.
You can generate these using SBERT — that's kind of traditional; we saw that yesterday. I've attached the snippet of code which can help you generate these embeddings, and then you can push them into Vespa. In case you want to query by vector,
you can use the snippet to create the embedding of your query. Because I was using the CLI, the command line, I set the exported query value to the embedding that I generated from the code above.
Then, instead of the input query being the normal text query, I put in the vector query. And that's how you use vector search. So, all in all, it looks pretty nice, of course, but I think there are still some pain points:
as I said, the embeddings need to be generated explicitly, for the dataset and also for querying, to leverage these capabilities of any search engine — I'm not talking particularly about Vespa here; this applies to any Lucene-based search engine as well. I think that's something we still need to iron out.
I had a tough time understanding whether it's better than Lucene or not, because vector search is a pretty recent addition to Lucene. So, I think time will tell. There's also the quality of the results we get: usually the similarity search, for which vector search was introduced, is done on an attribute basis —
you use different attributes of a document to understand whether two documents are similar — whereas with the new approach of vector search, we use vectors. So, the quality of these results needs to be proved: whether they are efficient, whether they are worth going forward with.
And another thing that is probably not mentioned on the slides, but that puzzles me, is whether there are any use cases that need only vector search — that can only be resolved with vector search. I am still researching that; I might come up with something more next year.
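The vector-search flow described above can be sketched roughly like this; the embedding field name, the query-tensor name `q`, and the rank profile name are all assumptions for illustration:

```python
def nearest_neighbor_query(query_vector, target_hits=10):
    """Build a nearestNeighbor request: the YQL annotation asks for
    targetHits candidates from the 'embedding' field against query tensor 'q'."""
    yql = ("select * from sources * where "
           f"{{targetHits:{target_hits}}}nearestNeighbor(embedding, q)")
    return {
        "yql": yql,
        "input.query(q)": str(query_vector),  # the embedded query text
        "ranking.profile": "closeness",       # assumed rank-profile name
        "hits": target_hits,
    }

request = nearest_neighbor_query([0.12, 0.56, 0.78], target_hits=5)
```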
Apart from that, I've tried to provide examples, and I just hate that this is a little blurry. But you can see that there are a lot of tuning knobs and a lot of possibilities with the Query Builder as well, and I've also tried to provide an example of when I used the command line for querying Vespa.
All in all, it works pretty neatly. There are some references, as I promised, and I cannot recommend their Slack channel enough — they're very responsive, very nice folks, and they never get annoyed. And you can check out the documentation.
Documentation-wise, it needs a little more work, but it's cool. I think that's about it for my talk today. Thank you — and let me know if you have any questions for me. Thank you, Atita. Wonderful talk.
So, now we still have some time for questions. Are there any questions in the audience? Yeah, we have a question. Just a second, I'm bringing the mic.
Hi, thank you for the great talk. On one slide, you mentioned that Vespa supports partial updates natively. Are there some constraints around that — like in Lucene, where you need to meet certain field configurations
to support partial updates? Not really. As far as I know, partial updates are supported. Why? Because if you talk about partial updates in Lucene, the document is usually deleted and a new document is introduced. But Vespa uses a single segment. I think that's kind of neat, and this probably is not covered on my slides, because it's a huge topic.
One of the other neat things about Vespa is that you do not need to configure any shards, because everything is one segment. And as we know from our very fundamental search aspects, if it's one segment, it ought to be very fast. That's something we also try to achieve with Lucene-based search engines using merge policies, et cetera.
So, partial updates are in place, and that's what I was trying to highlight here — they're supported. Yes. So, when you want to update a specific field in the document, do you still need to pass the whole document? Or is it really possible to update one specific field?
Yes, you can do one specific field. Okay, that's great. Yes. Alright, thanks for the question. So, the time for this talk is up, but of course, if you have further questions, you can meet Atita offline and discuss Vespa.
So, let's thank Atita one more time. Thank you. Great audience.