The Debate: Which Search Engine?
Formal Metadata

Title: The Debate: Which Search Engine?
Series: Berlin Buzzwords 2021
Number of parts: 69
License: CC Attribution 3.0 Unported: You may use, adapt, copy, distribute, and make the work or its content publicly available for any legal purpose, in unchanged or changed form, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/67327 (DOI)
Publication year: 2021
Language: English
Berlin Buzzwords 2021, talk 39 of 69
Transcript: English (automatically generated)
00:08
Hello again, and welcome back to the Haystack conference, guesting as part of Berlin Buzzwords this year. I hope you've had a fantastic week so far. It looks sadly like we're coming to the end of our conference,
00:22
but it's been a great few days. I'm Charlie Hull from Open Source Connections, for those who don't know me. This is, as I said, a talk presented by the Haystack conference series. We focus on search and relevance, and it's a way of sharing great talks on search and relevance with the community. Currently, we're running the Haystack Live meetup
00:42
every few weeks, featuring lots of talks on search and relevance. And later this year, we're hoping to run physical conferences again. So keep an eye on the Haystack website. I'll post a link into the chat, and we'll have news on that as soon as we know it. Also, do join the Relevance Slack group. I'll put a link in for that, where there's lots of great folks hanging out
01:01
talking about search and relevance. So today's main event: Solr or Elasticsearch, Elasticsearch or Solr. What's this Vespa engine that's just turned up? Is it better than both of them, or either of them? Are they all the same? Does it really matter? These are all common questions that we have put to us at Open Source Connections.
01:23
And as any search engineer will tell you, they come up again and again, which will be the right search engine for you. So how do you choose what will be the best search engine for you and your project? Well, today we brought together three search experts to help, and we're gonna take your questions today and try and bring some clarity to this debate.
01:43
This is actually the second time we've done this debate. We did it earlier this year as part of Haystack Live, and it was hugely popular, but in no way did we answer all the questions there might possibly be. And also to give some of you who weren't there a chance to ask them today. So what I'm gonna do is I'm gonna ask all of our panel to give a brief pitch
02:01
for their favorite engine, and then we'll take questions. But as they do that, do think of your questions around Solr, Elasticsearch, or Vespa, and drop them in using the questions tab, which is just to the right of your screen today. Do start submitting them now, thinking about them now. So firstly, I'd like to introduce our expert panel.
02:22
We have Josh Devins, who began working in search at SoundCloud before joining Elastic, where he's a senior engineer working on machine learning. Jo Kristian Bergum has worked in search since the days of Fast Search & Transfer, and is now a senior principal search engineer at Verizon Media working on Vespa.
02:42
Anshum Gupta has worked for Lucidworks and at IBM on the Watson project, and is now at Apple. He's also a Solr committer and on the Solr Project Management Committee, or PMC. So firstly, thank you, all three of you, for agreeing to do this again. And we're very much looking forward to today.
03:01
We had a quick calculation in the Relevance Slack before we started this. So we reckon between the four of us, we've got around 65 years of collective experience of search. So I think you'll agree that's an awesome, or slightly scary, total. It doesn't reside with one of us or anything like that. We're all equally youthful and beautiful.
03:21
So, but do get your hard questions ready. We're really looking forward to seeing them. So first, I'm gonna ask everyone in turn to give a quick pitch on why they think their favorite search engine is the best choice for you. And I will say that last time we did this, everyone was far too nice
03:40
about the other people in the room. So a little bit of needling, a little bit of nastiness is absolutely fine. It'll make a refreshing change. Don't be so nice. But anyway, we're gonna kick off today with Elasticsearch, and I'm going to ask Josh to kick us off. Josh. All right. Thank you. Go, go.
04:01
I don't know if someone has an echo or maybe someone can mute. Everyone else can mute their mics. Cool. All right, echo gone. So first, I am from Elastic. I get to show this legal disclaimer before a pitch and before talking about any future product developments or features, have a read over this.
04:23
Basically, don't make any buying decisions based on things that I say; do your homework and your due diligence. So yeah, that's it for the disclaimer. Thanks, we can hide that disclaimer now. So I am gonna talk about Elasticsearch
04:42
and I will try and be slightly less nice than I was last time, maybe. I'm Canadian, so it's in my blood, I guess. So I wanna start, I guess, just to first talk about the few points about why I think Elasticsearch is a great engine and not all of it is technical.
05:03
So I think one of the first things that comes to my mind, of course, is the sort of the community that we have around Elasticsearch and around the products and the whole stack in Elastic. I would argue that we have the largest worldwide community, but I don't have the numbers to back that up.
05:21
And it's not just a community in the sense of other people, peer support, which we have a lot of through forums and blogs and books, but also there's a lot of third parties developing plugins for Elasticsearch. There's great consulting folks like Open Source Connections, where you can also get professional services and help from
05:42
and chances are, if you have some kind of a problem in Elasticsearch or something that you'd like to do, you're not sure how to do it, if you can't find a forum, a blog or a book, there's gonna be someone who has probably done this already before. So chances are you will find somebody eventually
06:01
through one of the various forums, blogs, books or third parties. And I think that's a huge plus for the great community that we have. I'd also throw in that when you're building a team, a search team or any team using Elasticsearch, we have lots of training programs and hireability is really key these days in particular,
06:22
trying to grow your team either by upskilling people, so going through training or by bringing people in that already have the skills that you need. There's a big hiring pool basically that have Elasticsearch experience. So I put that into the bucket of a worldwide community. And I think another thing to call out
06:41
is the pace of development. So Elastic is a pretty big company nowadays; we have a lot of engineers working on a lot of different products, including not only Elasticsearch, but the whole stack. We do releases, minor releases, every eight weeks, which is a pretty good pace.
07:00
So if you're looking for bug fixes and you're contributing bug fixes through PRs or somebody at Elastic has fixed something, there's a good chance that you're gonna see it coming out fairly quickly and you might not have to wait very long. The other thing obviously to talk about are some of the core foundations of Elasticsearch. So Lucene at the core,
07:22
20 years of fantastically stable search indexing library, experts from Elastic as well, contributing directly back to Lucene. I think it was two of the top five or three of the top five Lucene committers from 2020 were Elastic employees.
07:42
So we have experts at the company. And I think sort of coupled with that, it's like a lot of those other engines, we designed Elasticsearch from day one, from the ground up for scale. So we have a lot of customers running Elasticsearch at extremely large scales,
08:02
both on-premises and in a cloud environment. And it's a great experience as well for the developer to be able to go from doing local testing, even spinning up an Elasticsearch in your JVM to do testing. You can go to one node, just testing on your laptop, all the way up to hundreds of nodes
08:20
in the cloud or on-premises. And it makes it easy to do it. And it kind of grows with you as well, not just through scale, but from basic features to advanced features, we're kind of there with you as you go down your search journey. I think another great thing for developer experience is the HTTP JSON everything approach that we take,
08:45
including administration. So it's very easy to access for developers and operators. And we have, of course, a very fully featured query DSL that lets you describe your queries, either very simply or very complex if you need to.
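The HTTP/JSON-everything approach mentioned here can be sketched in a few lines. This is an illustrative example, not from the talk: the `products` index, field names, and values are invented, and the `bool`/`match`/`range` clauses shown are standard Elasticsearch query DSL.

```python
import json

# Every operation, search included, is a JSON document sent to a REST
# endpoint such as http://localhost:9200/products/_search (hypothetical).
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "wireless headphones"}}],
            "filter": [{"range": {"price": {"lte": 100}}}],
        }
    },
    "size": 10,
}

# Serialize the body exactly as it would go over the wire.
body = json.dumps(query)
print(body)
```

In real code, `body` would be POSTed with any HTTP client or one of the official Elasticsearch clients; the point is that the same JSON structure works for developers and operators alike.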
09:02
I guess the last piece for me, which is really important, especially if you worked in operations or DevOps space, the whole operation around Elasticsearch, I think is really powerful. So it's easy to deploy. It's easy to manage, easy to observe. You can even use Elastic products to observe Elasticsearch, which is a bit meta,
09:21
but it works very well. And that's what we do at scale as well. So you can rest assured that observing your Elasticsearch at scale works well. It's also what we do for our cloud offering. I think another great thing that we have is sort of the story around backups, restoring using snapshots,
09:42
very flexible and very easy to do test environments, dev environments, benchmarking environments, being able to do snapshots on a regular basis, restore them anywhere you want. And recently we introduced data tiers, which gives you the ability to kind of take
10:00
this snapshotting idea to a whole other level. So it's basically the idea that you can search over any data set that is time structured. So logs or chat messages, Twitter, tweets, anything you can sort of orient with the time. You can search now not only indices that are live and hot and ready to be accessed
10:23
with low latency in your Elasticsearch cluster, but you could also seamlessly access indices that are stored in S3, thereby reducing the costs for you to operate, but still having access to effectively unlimited amounts of data to search over.
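A rough sketch of the time-structured idea being described: one index per day, where only recent indices need to stay hot on fast storage while older ones could be searched from cheap object storage. The `logs-` naming and the 7-day hot window below are assumptions for illustration, not Elastic defaults.

```python
from datetime import date, timedelta

def indices_for_range(start: date, end: date, today: date, hot_days: int = 7):
    """Map each daily index in [start, end] to an illustrative storage tier."""
    tiers = {}
    d = start
    while d <= end:
        name = f"logs-{d.isoformat()}"
        # Recent data stays "hot"; older data could live in e.g. S3.
        tiers[name] = "hot" if (today - d).days < hot_days else "frozen"
        d += timedelta(days=1)
    return tiers

print(indices_for_range(date(2021, 6, 10), date(2021, 6, 12),
                        today=date(2021, 6, 18)))
```

A query over a date range would then hit only the indices that overlap it, and the engine can treat hot and frozen indices differently without the user noticing.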
10:41
I think that's about it. I think I'm probably at time as well. So I will end there and I am excited to dive into some questions and debates. Next. Fantastic, thank you, Josh. So from Elasticsearch to probably the most
11:01
well-established search engine in the pack here, which is Apache Solr. Anshum, do you want to give us your pitch about Apache Solr? Sure. So I'm gonna start off with a little bit of history. Apache Solr is obviously very tightly coupled with Lucene. Really, it was a long time ago
11:22
that Apache Solr was created and then open sourced and became part of the Apache umbrella. And for the last 10 years, it was so closely tied to Lucene that it was actually the same project, until recently when the projects split up so each could go its own way. The good part there being the underlying set
11:43
of contributors to these projects is essentially the same; there's a lot of overlap. So all the good things that Lucene offers are still going to be part of Solr going forward, but the split is going to allow users to concentrate on Solr-specific things
12:01
and organize and release stuff that's more pertinent to Solr going forward. In terms of Solr, as I said, it's been out there for a very long time, making it, in my opinion, one of the most mature and scalable solutions out there,
12:22
just because it's been out there and people have been using it in really diverse and varying use cases. They've tried to do all sorts of interesting things that I wouldn't have ever imagined you could get a search engine to do to begin with. And Solr has certainly evolved into being
12:42
more than just a text search engine over the last five, six years with the introduction of analytics, the introduction of things like learning to rank. There are so many new features just on the feature-set aspect of Solr's journey in the recent past. And it doesn't stop with just being search
13:02
because it offers stuff like spatial search, analytics and much more, but a whole bunch of people have concentrated on doing stuff that's not typically feature-related. So making sure that Solr scales. So yes, there are new features that allow for newer use cases,
13:20
but you also need to concentrate on ensuring that the system by itself stays stable and scalable if you're introducing a whole bunch of features. And that's something that Solr has very successfully, or reasonably successfully, managed to do over the years.
13:41
One of the really cool things about Solr is, because it's under the Apache umbrella and because of the way it's structured, there has always been a need for, and by design, Solr is a very plugin-friendly ecosystem.
14:02
where people can plug in stuff that's custom components for them, be it related to things like security, monitoring, machine learning or any related stuff. It allows for plugging in of a whole bunch of stuff. So as Josh just mentioned, releases every eight weeks,
14:22
I think we might be releasing a little more frequently than that, but the better part here is that, because Solr is a community-driven project, if there's something that you feel you really need, you could contribute that back to the community and ensure that that gets released
14:40
even prior to the eight-week window. So you could have more frequent releases only because it's a community-driven project. There's been a whole bunch of emphasis on security of late. If you really look at, if you've been following the user list, if you've been following the changelog, there's a new perspective and a new view
15:01
of how security has been considered for Solr as a project. And we didn't look at it with so much importance, I guess, in the past, but in the recent past, we've gotten more cognizant of what it means for people to be running the system and how important the security aspect of it is.
15:22
And there's a whole bunch of releases that are just security-focused nowadays. One thing that I'd like to also highlight is something that I gave a talk on at Buzzwords, which is HA and DR, high availability and disaster recovery. And I won't dive into HA and DR in particular,
15:40
but I'm gonna say Solr has offered things that support HA and DR in the past, but they've been deprecated now or have evolved into a completely different view or perspective of designing these kinds of architectures, which allow users to have HA and DR through things like pluggable cloud interfaces
16:03
to allow storing of data in a totally different location type. And so more and more features that are being developed right now are being developed with that in mind. A more recent addition to the Solr project,
16:23
not the main code base, is the Solr Operator, which, thanks to all the work by Houston and thanks to Bloomberg for donating it to the community, makes it really easy for people to run Solr on Kubernetes. And in my opinion, that's like a paradigm shift
16:41
in terms of how people are gonna use their infrastructure and set up search, concentrating on using their infrastructure, which is Solr in particular in this case, instead of figuring out how they set up the hardware and how they set up their Solr instances across their existing infrastructure.
17:02
So the Solr Operator is gonna be a very big change in how the community takes Solr and deploys it in the near future. And I'm really excited about that aspect. And with 9.0, the first release that's gonna happen with Lucene and Solr as separate projects,
17:22
there's a lot of cleanup that being independent of Lucene has allowed people in the Solr community to move forward with. And I'm really looking forward to it, because there's been a whole bunch of effort towards cleaning up the code base.
17:41
And what that really means is, even though Solr is a really old project maintained by the community, so there's not one project manager who manages this and has a timeline for when things are gonna be released, the community has taken it onto itself to clean up stuff and make things better, reduce the cruft, and stabilize things before the first release happens
18:02
that's outside of the Lucene umbrella. And all of this ties back into: there are just more and more use cases because of the features that Solr offers, in combination with the stability and scalability that Solr offers. But most importantly, and I think I've mentioned this in the past,
18:20
the project is driven by the Apache way, which in my opinion is one of the strongest factors why Solr has been around for such a long time and has ensured that the project stays healthy through this duration. Just to summarize, it's a project that has a great set of features.
18:42
It's proven itself to be stable, scalable, and mature. It has a great community that has stayed active. If you look at the contributors, we continue to get new contributors. At the same time, the people who've been around for 10 years or more are still around,
19:01
and those people are still excited about the project. And that says a lot about someone wanting to rely on Solr as an infrastructure of choice for them. Fantastic. Thank you, Anshum. So, our last pitch today is going to be from Jo Kristian on Vespa.
19:23
Now, how on earth do you compete with these two very strong search engines, Jo Kristian? Yeah, I don't know. I mean, how can there possibly be a third choice? It's true. It's really great hearing, you know, from two, I would say, industry-leading experts
19:41
in search, you know, to hear their pitch. Luckily, I get to go last. Since we are the new kid on the block here, you know, I get to talk about features and not so much about the community. So, that's one thing. So, maybe not everybody has heard
20:01
about Vespa and what Vespa is. So, I'll quickly talk about what Vespa is. So, we define Vespa as kind of a serving engine for low latency computations over evolving datasets. So, it's not only a powerful search engine, but it's used for a variety
20:21
of real-time serving use cases, also including recommendation use cases. So, search and recommendation are by far the two most common use cases that Vespa is used for. And Vespa was open sourced
20:41
under the Apache 2.0 license in 2017. But actually, development of Vespa goes back to 2003 when the team here in Trondheim was acquired by Yahoo. So, it has a really long history
21:00
and it's been for us and also now for other people, a really battle-proven platform. It's not something new, shiny that's just arrived. It's been there for a long time, but it hasn't been in the open. And we talk about the scale that we operate Vespa inside Yahoo.
21:22
We serve about 25 billion query requests every day. So, that says something about the volume or the scale that we actually operate Vespa at Yahoo. So, I think, and also on the releases,
21:40
so, I think, I just checked actually, because you can run Vespa using Docker, and I just checked Docker Hub. And we have actually had three releases in the last seven days. So, I think we have very incremental releases. So, if you have some issues or feature requests,
22:00
we are pretty responsive to include new features as well. So, but I would like to talk about search and why Vespa is a good choice for building a modern search application. And first, when I talk about search, I'm thinking about the case where there's a user
22:22
that's actually typing a query and wanted to find some information. And that's the kind of context. And in that context, I think there are two kind of primary reasons why Vespa is a great search engine for building a modern search experience.
22:40
And number one is that it supports a broad range of modern retrieval and ranking methods, including pre-trained transformer models like BERT and also vector search. And the second point is that Vespa supports true real-time indexing and true partial updates.
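The first point, having traditional ranking methods and vector search side by side in one engine, can be caricatured with a toy hybrid score. This is not Vespa's API, just an illustration: the linear blend, the weight `alpha`, and the vectors are all made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors (toy embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hybrid_score(lexical, query_vec, doc_vec, alpha=0.5):
    # Blend a keyword-match score (e.g. BM25-style) with an embedding
    # similarity; alpha is a hypothetical tuning knob.
    return alpha * lexical + (1 - alpha) * cosine(query_vec, doc_vec)

print(hybrid_score(1.0, [1.0, 0.0], [1.0, 0.0]))
```

The idea is that a ranking function can consume both signals at once, rather than routing lexical and vector queries to different systems.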
23:01
So, yesterday I gave a talk on the real-time indexing architecture. So, you can go check that into details. But having these large toolbox in Vespa is that you can start off with a simple traditional BM25 ranking model, and then you can build from there
23:22
once you get training data for your domain, and start using more modern techniques to enhance the search experience. And Vespa also handles both structured and unstructured data, and very importantly, vectors or tensors in general,
23:41
they are first-class citizens in the Vespa kind of document model. So, you can use tensors in queries and in documents, and you can also use tensors in ranking functions where you can combine query tensors and document tensors. And when I talk about tensors, you might think that they're only relevant for the new kind of modern
24:01
natural language processing and so on. But the original idea we had, when we wanted to use tensors in the Vespa document model, was actually around recommendation, where you could use the tensors to represent various click feedback features. So, there's actually a blog post on that on our blog, Vespa AI,
24:21
where the homepage team in Yahoo talks about how they use Vespa to recommend articles on the homepage using Vespa and using these click features as tensors. Also, Vespa supports approximate nearest neighbor search, and that's something that Solr or Elasticsearch
24:41
does not have currently. I know it's coming in Lucene 9, but we can talk about that later. And I think having that capability to do vector search in Vespa in one engine is really important because that allows
25:03
a lot of new use cases and also techniques that have demonstrated very good performance on various ranking data sets. So, it's a really good method. And also for recommendation, this is important. I think the implementation
25:21
in Vespa of vector search and approximate nearest neighbor search using HNSW is unique in the industry, actually, because you can combine the vector search with regular search terms. And I think that's really important, and that was one of the key takeaways from the talk yesterday from Max Irwin from Open Source Connections
25:41
when he talked about vector search, and he said if you use a vector search library to power your search engine and someone comes and types in a phrase query, using quotes, you would expect that that user actually wanted to search for that phrase, right? And vector search does not give you that possibility, but in Vespa you have this wide range
26:02
of methods that you can use so you don't have to have different tools for different queries and different use cases. So, I think that's really, really useful. And speaking of vector search and moving to machine learning, so Vespa integrates with a lot of popular machine learning frameworks,
26:21
so like TensorFlow, PyTorch, and also the classic learning-to-rank models like XGBoost and LightGBM, which are of the GBDT family. And so there's no one machine learning framework that solves all the problems, but you have a lot of them in Vespa,
26:42
so you have a lot of flexibility. And finally, on modern search, like we talked about in the last debate, I think that search is actually going through a paradigm shift where we see that these pre-trained transformer models have really transformed search. And it was demonstrated
27:02
on the MS MARCO passage ranking dataset, so when researchers applied BERT on the leaderboard, they advanced the state of the art by 30%. And that basically came overnight, right? Within five days or so in January 2019. So I think it's a really interesting time
27:22
to work in search and see how this is gonna progress, and that's one thing I think is missing from Lucene, that you don't have the ability to represent these new state of the art ranking models in Lucene, which then basically means that Elasticsearch and Solr
27:43
don't have this capability. And yeah, so that's basically on the kind of search and the features in Vespa. And the other thing is, what I think is really important is that Vespa has this, what I call true real-time indexing, so that we don't, like I talked about yesterday,
28:01
is that we have this mutable memory index, which allows you to add and update documents with millisecond latency, and when you get the response back, the actual operation has been applied and is visible in search. And this kind of enables new use cases, because in the case, for example,
28:21
if you have 1 billion documents and you want to change your ranking model, or you want to change some signals, you don't have to re-index all of the documents. You can just do partial updates, updating, for instance, an attribute or a tensor field. And then the ranking models, which are working on this data, they can take that into effect immediately. So I think that's really powerful.
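As a rough sketch of what such a partial update looks like on the wire — the endpoint shape follows Vespa's /document/v1 API, but the namespace, document type and field names here are invented for illustration:

```python
import json

def build_partial_update(namespace, doctype, doc_id, field, value):
    """Build a Vespa-style partial update request: only the named field is
    assigned a new value; the rest of the document is left untouched."""
    url = f"http://localhost:8080/document/v1/{namespace}/{doctype}/docid/{doc_id}"
    body = {"fields": {field: {"assign": value}}}
    return url, json.dumps(body)

# Update a single rank signal on one photo without re-feeding the document.
url, body = build_partial_update("photos", "photo", "42", "popularity", 0.87)
print(url)   # http://localhost:8080/document/v1/photos/photo/docid/42
print(body)  # {"fields": {"popularity": {"assign": 0.87}}}
```

Sending it would be an ordinary HTTP PUT; the point is that the payload names only the changed field.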
28:41
And one of the reasons we did this, actually, and implemented it was back in 2008, when we were running Flickr on Vespa, and they really wanted to, they had a really large corpus, and they wanted to update a signal, but re-feeding all the documents over again, it didn't really scale.
29:01
And that's where we made this partial update feature, and that could really transform their search experience, because they could offline produce this rank feature, which they would update all the photos with regularly. So to summarize, why I think that Vespa is a great choice for building a search application,
29:23
I think it's number one, it has a lot of modern retrieval and ranking methods, and number two, it has this great indexing pipeline for both real-time indexing, and also for doing partial updates. So that's basically what I have prepared. And I'm really, I have to say,
29:41
I'm really super excited of being here. Like you said, Charlie, we figured out there's collectively 65 years of search experience. So I'm really happy to be here and talk about search, and with these industry leaders in search. So I'm really excited. I'm really hoping for a lot of questions and debate.
30:03
Fantastic, thank you, Jo Kristian. So you've heard the pitches from our three experts, three very different engines, some shared history, some shared heritage, even some shared libraries, different models, different ways of perhaps
30:21
managing the search engines. It seems from what you've all said that release speed matters: the quicker you release new updates of your search engine, the better. You've all mentioned that, and that amused me a little. I'm not sure that's the metric we should necessarily hold ourselves to.
30:41
So I'm going to jump into some questions. We've had a few questions submitted, and I'm going to start with one. Let me see. Actually, this is a question for Josh. I think this will be, probably Josh can answer this one. Somebody asks, I'm new to Elasticsearch,
31:03
and I'm looking for the Swiss Army knife of Solr's eDismax query parser in Elasticsearch. How hard is this to re-implement in Elasticsearch? I'm not familiar with the details of the eDismax. Maybe Anshum wants to talk about it as well. In terms of, you're new to Elasticsearch,
31:23
you want a Swiss Army knife, you're doing search over multiple fields, use multi-match. And the default multi-match type, I believe, is best fields. And it does surprisingly well out of the box
31:41
with no tuning, and you can tune it. And this was actually part of a submission to the MS MARCO ranking task that I did, tuning best fields as well as cross fields. And for a short time, it was the best non-neural approach
32:02
in MS MARCO document ranking. So, I mean, I would say just use multi-match. You can try some of the subtypes of multi-match. That's probably the easiest place to start. I'd love to hear from Anshum, eDismax, what's the magic?
32:21
What's the secret sauce there? And maybe there's a better equivalent in the Elasticsearch land. What do you think? I think it's just years of evolution of the parser that kind of came to be based on what everyone wanted.
32:41
But it does a lot of what I call magic, basically, at this point. It's complex. So I'm pretty sure there's a way to get the same implementation in Elastic. I don't know if it exists or the stuff that you just mentioned can do that out of the box as of now.
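For reference, the multi_match query Josh recommended might look roughly like this — the field names and boosts are invented for illustration, though `type` and `tie_breaker` are real multi_match parameters:

```python
# A best_fields multi_match: score each document by its single
# best-matching field, then blend in the other fields' scores
# via the tie_breaker.
query = {
    "query": {
        "multi_match": {
            "query": "berlin buzzwords",
            "type": "best_fields",                 # the default multi_match type
            "fields": ["title^2", "body", "url"],  # ^2 boosts the title field
            "tie_breaker": 0.3,
        }
    }
}
print(query["query"]["multi_match"]["type"])  # best_fields
```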
33:00
It may do better or worse, but I guess it's not the same thing. So I don't think there's anything else that's like the eDismax parser. And in my previous jobs, we've tried to have something that's kind of similar, but you end up using the eDismax parser if that's what you really want to do. So I'm pretty sure- Sorry, go on. I'm pretty sure as well
33:20
there is something in the Open Source Connections repo, actually. I remember seeing an eDismax Elasticsearch query parser, but don't quote me on that. I'm going to go look afterwards. I shall have to ask one of my colleagues. In fact, if there's anyone from OSC sitting in the chat, do drop in a link if you can do a quick trawl through the GitHub.
33:40
Jo Kristian, you're familiar with eDismax, I guess. Is there something equivalent in Vespa? I'm actually not familiar with eDismax, no. I'm not. We have- How about that, Anshum, you give us the very quick, the 30 second introduction to what eDismax does.
34:08
Yeah, it's magic is what I'd say. There's just a lot of tuning parameters that allow you to pass in the terms that you really want to search on. And no, there's no machine learning involved and there's nothing else involved.
34:21
It's just a very standard way of saying, this is how I really want to search on these possible fields that I really want my results to be based on. As I recall, it's about saying, we're looking for matches in various fields and then we're going to take the one that wins in terms of matching.
34:41
So it's a matter of choosing scores. So how does scoring work in Vespa in an equivalent way? Yeah, so by that explanation, we have very good support for that, because you can write your very flexible
35:02
ranking expression, because really in Vespa, compared to both Solr and Elastic, it's not so much about how you build this huge query tree where you want to add some boost here and some weights in the query. Vespa is not really like that. You have the user query, clean and simple.
35:22
And then you have the YQL, which is the kind of application logic with the filtering. If you want to have some age filter, range filter. And then in the ranking, you can write things like, if you want to have the maximum or if you want to have the sum for these kind of basic features over
35:43
and you can iterate over the fields if you want to do that. And you also have these, what we call field sets, which is similar to this new feature that was added to Elasticsearch where you can simply add some query terms. Is it called multi-match or what's the, I don't recall, Josh probably has the details.
36:01
There was a new feature. Yeah, multi-match is the class of multi-field search types and then there's a bunch of types that are more specific below that. Right, yeah. So in Vespa, that translates to what we call field sets. So in the schema, you basically say, okay, I have my title, my body, my URL, the anchor text
36:21
and then you can have a field set default which points to a title, body, URL, anchor text. So when you're searching a free text query, then you will search all of these fields and that determines really what are the documents that are surfaced into the ranking function. And the ranking function is basically
36:42
semi-mathematical notation where you write down what you want. And you can actually combine it with writing ifs, so if the score in the title, and this is used when we use GPT models, for example. So yeah, I think we have something like what you described there on that feature. We have something similar in Vespa,
37:02
not exactly the same wording though. So I was also wrong. The reference I saw was in Relevant Search, the book. There is no plugin. I did a quick Google and that's where I found it. So talking about the eDismax, it's basically an extension of the Dismax query parser that Lucene has, right?
37:21
So it does a bunch of stuff, as I said, which is stuff like allowing for pure negation queries and sloppy searches and allowing you to configure a bunch of things. But at the end of the day, it's just taking a query, breaking it down into multiple pieces. And as Charlie mentioned,
37:40
scoring the same document based on all of those matches and picking the one that was the max score. So instead of summing all those scores up, it would just pick the one that, so it's gonna score based on what was the best match among the subqueries that were formed. Interesting. So we've got three very different approaches here.
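The scoring rule Anshum and Charlie just described — take the best-matching subquery's score, optionally blended with the rest — reduces to a few lines. This is a toy sketch of the idea, not the actual Lucene implementation:

```python
def dismax_score(field_scores, tie_breaker=0.0):
    """Disjunction-max scoring: the winning field's score, plus a
    tie_breaker fraction of the scores from the other fields."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    rest = sum(field_scores) - best
    return best + tie_breaker * rest

print(dismax_score([2.0, 1.0, 0.5]))       # 2.0  (pure dismax: only the max counts)
print(dismax_score([2.0, 1.0, 0.5], 0.5))  # 2.75 (max plus half of the rest)
```

With tie_breaker at 0 a document matching one field strongly beats one matching several fields weakly; raising it rewards breadth of match.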
38:01
We've got the Solr approach, where something's rather evolved over time and it's become a really good sort of general solution to most problems. Any Solr expert will tell you, well, you should probably be using the eDismax unless you've got a really good reason not to. And then you've got Elastic, which has tried to implement some similar things here.
38:22
And from what I understand from Vespa, you've got a very flexible scoring setup, but I hear a lot of you could do, you can do. So maybe that kind of general purpose solution is missing. Would that be true? Across fields you mean? No, I don't think that, no, I wouldn't say that
38:43
because basically you can iterate over the fields in your index and you can say, do I want to have the sum or do I want to have the max, and you can choose, do I want to have the score as BM25 or do I want to use nativeRank or do I want to use any of the other text ranking
39:03
features that we have built into the platform? But I wouldn't say that it's so-so. I think it's pretty damn good actually because it gives you full control. Fantastic, yeah, we're not doubting your ability, your engineering of Vespa, don't worry. You know, it's pretty damn good.
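The flexibility Jo Kristian describes — iterating over fields and choosing sum or max — can be mimicked in plain Python. This is just the idea, not Vespa's actual ranking-expression syntax, and the field names and scores are made up:

```python
def first_phase_score(per_field_scores, combine="sum"):
    """Combine per-field text scores (e.g. BM25 values) with a chosen
    aggregation, the way a ranking expression lets you pick sum() or max()."""
    scores = list(per_field_scores.values())
    return max(scores) if combine == "max" else sum(scores)

# Hypothetical BM25 scores for one document across three fields.
doc = {"title": 3.0, "body": 1.0, "anchor": 0.5}
print(first_phase_score(doc, "sum"))  # 4.5
print(first_phase_score(doc, "max"))  # 3.0
```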
39:21
So I'm gonna move on to a different question now. So let's have a look. Well, I mean, I'm going to ask this again of you, Jo Kristian. Are there use cases when not to use Vespa?
39:42
Yes, I mean, we are not everything to everybody. We're definitely not that. I think for search use cases like that, Elastic is really good at dealing with what I call immutable data. So data that really doesn't change.
40:01
So for example, log data. If you have petabytes of log data, Vespa is not the right engine to use to search that data. Because we are focusing on the real-time aspect, and like I said in my pitch, we really designed the engine to handle evolving data sets.
40:20
And a lot of the design decisions that we made, around some parts of the index being served from memory and so on, don't become that cost effective when you have a huge amount of data. That being said, we also have some interesting options in Vespa that are not that well-known or well-described.
40:41
So what we call Vespa streaming search, which basically doesn't build index structures. It just stores the data. And once you want to search it, we basically scan through it. And I think in recent years, when you see more advanced hardware, like Amazon AWS announced,
41:01
you can get EC2 nodes with 100 gigabits, and you can get terabytes of memory, and 448 vCPUs for these kind of low QPS use cases. It might actually be that streaming through that data
41:22
is actually an option: to load it from S3 and then stream through it. Yeah, so that's definitely one case: with the current Vespa indexed search, which is actually building index structures, I would not use it for terabytes of log data.
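The streaming-search model — store everything, build no index, scan at query time — reduces to a linear filter. A toy sketch with made-up log records:

```python
def streaming_search(documents, predicate):
    """No index structures: every query scans all stored documents and
    keeps the matches. Cheap writes, linear-cost reads."""
    return [doc for doc in documents if predicate(doc)]

logs = [
    {"level": "INFO", "msg": "started"},
    {"level": "ERROR", "msg": "disk full"},
    {"level": "INFO", "msg": "stopped"},
]
hits = streaming_search(logs, lambda d: d["level"] == "ERROR")
print(hits)  # [{'level': 'ERROR', 'msg': 'disk full'}]
```

That linear cost per query is why the approach suits low-QPS cases on big, fast hardware, as the panel notes.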
41:40
I think you're on mute, Charlie. Yeah, well, let's turn that question to Josh. What's not a good use case for Elasticsearch? So Elasticsearch has a lot of things. A lot of things, well. I would say Elasticsearch was built
42:00
with sort of classical search: BM25 scoring, inverted indices, standard Lucene data structures. I love this terminology. I can't remember where I saw this, of the phases, like the life cycles in the search industry. And so there's this classic approach,
42:22
which is 30 years old. Then there's modern search, which brings learning to rank. And luckily we have great contributors from Open Source Connections that have a learning to rank plugin for Elasticsearch, which is fantastic. And I'd argue that brings us to modern search. We are not quite at what I guess
42:43
is called postmodern search, where Jo Kristian has mentioned ANN indices with HNSW and being able to represent deep neural networks in the search engine. We know that Lucene has an HNSW implementation.
43:03
It's coming to Lucene 9. It will not be before Elasticsearch 8. So it will be at least after Elasticsearch 8, timing of that TBD. So we know that's coming. We know that's coming to Elasticsearch. So we're getting on the board, on the playing board for ANN.
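For intuition: what HNSW approximates is exact nearest-neighbour search, which has to score every stored vector. A brute-force baseline, with made-up two-dimensional vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def exact_knn(query, vectors, k=1):
    """Exact k-nearest-neighbour search: score everything, sort, take k.
    HNSW walks a layered graph instead, trading a little recall for
    large speedups on big collections."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]),
                    reverse=True)
    return ranked[:k]

vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(exact_knn([1.0, 0.1], vecs, k=2))  # ['a', 'c']
```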
43:23
And I'm happy to be able to say here that we're also investing in native PyTorch integration in Elasticsearch, which will bring modern NLP, modern information retrieval, sorry, postmodern information retrieval techniques
43:40
into Elasticsearch in a native fashion. And those kinds of, those two investments that we are making, I think will bring us into the postmodern era. And the next couple of years, I think we won't have that gap anymore. I think that gap is gonna close.
44:03
But I think we're thinking about it probably in a little bit different way than Vespa does. So for us, ease of use and approachability is very important as well as supportability. People have to be able to go into our cloud service, click, click, click, and it's gotta work. So there's a huge investment in both of those technologies
44:21
to make sure that it's easily approachable, works out of the box and works well in a cloud environment. Great. So yeah, now in the vector thing seems to be an area where we've got different teams chasing different, all chasing towards that, as you say, postmodern search. I love that term by the way,
44:41
I'm going to steal that one. It's not mine. I got it from someone else. I think it was Sebastian Hofstätter from TU Wien who has an online course. I think I saw it there in some slide deck. I'm not sure, but it's a great description. So I'm gonna keep using it. Brilliant, that's brilliant. Anshum, what's not a good use case for Solr?
45:07
I'm glad I can just piggyback on what Josh has already said, because we're both kind of still relying on Lucene at the back to power search. But in particular, I think Solr is just not designed
45:24
with nested documents in mind. So if you have anything that's deeply nested, most of the sort of features are not guaranteed to work for you, even though it's kind of fuzzy, it's not well-defined, it's not well-documented even, I would say. I think anyone who has a use case
45:41
where the document is so nested and it's very hard to normalize, I think that's something that Solr can't handle, in addition to Solr not being designed as a primary data store. So if you have documents that are really large, where you're trying to store large fields, it's gonna blow up Solr if you're trying to fetch those fields back.
46:02
It's, again, it's just not designed for any of that. Yeah, I think the whole thing about not using your search engine as a primary data store is a repeated theme. And yes, we see it all the time, and please don't do it. Thank you. So let's just come back to another question
46:22
from our panel. And this is gonna come back to Jo Kristian, and it's slightly related. Are there some features in Solr or Elasticsearch that Vespa doesn't support? Yes, there's obviously gonna be feature gaps
46:44
between these three engines. So definitely there are some features that we miss. If you're specific about a feature, I can answer if we have something that is equivalent. So yeah, but there's definitely some feature gaps
47:02
between Vespa and the Lucene engines, for sure. Okay, okay, you can't think of anything specific. Snapshots. Yeah, of course, if you look at the entire, kind of the whole solution, yeah, snapshots, we don't have this snapshot capability
47:22
that Elasticsearch has, where you can freeze the index and then do a snapshot and then put it on S3. We don't have anything like that, no. What about, this is another related question, actually, reverse search support. I mean, in Elastic, we have a percolator. And now it's in Lucene, though it's not quite surfaced in Solr yet.
47:43
We have the Lucene monitor, which I'm proud to say my previous company, Flax, developed and then contributed to Lucene. So do you have a reverse search capability in Vespa? It depends really what you mean by reverse search. You have a set of queries and you're going to basically watch something and see if it matches any of those queries.
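That reverse-search pattern — saved queries matched against each incoming document — can be sketched naively. Real percolators evaluate the engine's full query machinery rather than this all-terms-present check, and the query names here are invented:

```python
def percolate(saved_queries, document_text):
    """Return the names of saved queries whose terms all occur in the
    incoming document (a toy stand-in for real query evaluation)."""
    doc_terms = set(document_text.lower().split())
    return [name for name, terms in saved_queries.items()
            if set(terms) <= doc_terms]

saved = {"vespa-news": ["vespa", "release"], "jobs": ["hiring", "engineer"]}
print(percolate(saved, "New Vespa release announced today"))  # ['vespa-news']
```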
48:01
Right, so we don't really have that, but we have this other interesting feature that we call predicate fields, where you can actually store a Boolean constraint
48:21
in the document. So you can say that this document will only be matched if certain attributes are set, and you can combine that with a Boolean expression. So that is a way that you can, for instance, run some campaigns. So in an e-commerce setting, you can say that this document is only going to be actually matched against the query
48:42
if certain attributes are set. So that can be, actually it's kind of reversed, because the logic of whether you're going to match or not is actually stored in the document. But it's not the same as you described, where you set up a saved query and then you look at the stream of documents
49:01
and say if it matches. But that could be built in a custom document processor, yes. It's similar, but not quite the same, no. Okay, okay. So coming back to comparing features, Anshum, is there anything from Vespa
49:20
that you see would be a really great addition to Solr in terms of something that might inspire a new feature for Solr? Oh, tough one for me. Not the machine learning person in the room, for sure. And so I'm pretty sure, because there's been limited investment
49:41
by only a very few select folks in that field, like learning to rank happened and then there's been a certain set of people who've been interested in that aspect of Solr. I'm pretty sure there's a lot to learn from Vespa along those lines. We've generally been more heavily invested in solving the infrastructure challenges,
50:04
be it in the form of a Solr operator or making sure Solr scales well, and there's been little investment on the machine learning side of things. So I think both plug amenity and just everything else that Vespa offers is a great place for us to learn from
50:22
and kind of bring that cool stuff into Solr. What about this, Josh? Because I guess you know the machine learning side. Are there things that you look at Vespa and go, oh, I wish we could put those in? I like quite a lot the composability
50:41
and the way you describe ranking in Vespa. I think it's very clean. I think it's a very nice approach that gives you a lot of flexibility and power, but it's easy to grasp. I think we could do a better job there.
51:01
We definitely look at things as we wanna give people a good set of tools. So like with multi-match, we have a bunch of different types of multi-match, but if you want to really understand what's going on behind those multi-match queries, you gotta dig into some documentation, you gotta try some things.
51:21
I think for real experts that are really tuning ranking, especially with text-based ranking, I think their approach to describing ranking functions is quite nice. So that's the first thing that comes to mind. Okay, okay, great. Sorry, I might be too nice.
51:41
Being too nice still. Thank you, Josh. Well, maybe if we were actually in the same physical space; it's very hard to throw a punch over Zoom. Anyway, not that I'm suggesting any of you lovely people would ever do that.
52:00
So we've got a question here from our audience and somebody's saying, as a SaaS provider of search, which obviously wouldn't be popular in some places, I won't say anymore, Vespa looks interesting and something we'll be looking into providing to our customers, but the whole deployment package upload seems clunky
52:23
if we want to let our customers do small, frequent changes to their setups through our solution. Elasticsearch excels at this, being driven by a simple API that allows you to make many small and quick changes. Am I missing something about Vespa? Yeah, so the whole,
52:44
if you want to build, so if I understand the question correctly, you want to build kind of a multi-tenant search as a service on top of our cloud offering. In that case, yeah, I can agree that it's not perfect
53:04
because everything in the Vespa world is about having an application package where you have the schema, the deployment, the flavor specification, how many hosts, how many nodes you're gonna use for containers serving
53:22
for the content cluster and so on. So everything is around this application package and that one of the reasons is that then you can push from GitHub actions and so on and you actually have version control over the changes that you're making. So I think we're coming from a very conservative,
53:42
even if we release often, we have a lot of testing and so on, but that's a little bit conservative. We don't want to do a curl parameter to change something. Or some operator, just the index APIs to change something. So everything is around an application package,
54:01
so that's correct. But we are really open for, I mean, to feedback on the process about using our hosted service. So that's definitely, if that's the real pain point, we can probably work something out on that. So we're really, I feel like we are really responsive to feedback that we are really engaging with the community
54:24
and the Vespa Slack space that we have created, and also at the Relevance Slack space, and on Gitter and Twitter and also on GitHub issues. So really bring all the feedback to us. That's the only way that we can learn.
54:44
Yeah. Yeah, I guess the other thing though is you're also, I know that the Vespa team are working, part of your commercial model is the idea that you're providing Vespa as a service. Is that correct? Correct. That's correct. Yeah, so we just announced general availability
55:05
and also we offer free trials for this cloud service. So you can go to cloud- Yes, it's interesting. What you're saying, you're looking at a challenge there, and that leads us to scalability. Anshum mentioned a lot of work
55:23
has gone into this recently in Solr. There's the cross data center replication; I caught some of your talk earlier this week. So this seems a major topic in terms of, how do we run these massive clusters in a solid, reliable fashion?
55:41
Do we think that those problems are gonna be solved any time soon? I mean, I'll direct this to Anshum, cause I know that in the Solr world, there've been some attempts at various things that haven't quite panned out and have been pulled again. Do we think we're settling in on some nice reliable solutions?
56:01
Ah, great question. And I think I've been involved in a few of those attempts pretty much across different organizations. And it's a very hard problem to solve in my opinion, while setting up the infrastructure bits are still relatively easy. The challenge is around search by its inherent nature
56:24
of being a very complex and use case specific setup. So unlike a database where you spin it up and you have an instance and then all you needed to do is define table and push data into it. Search has so many things that you can change
56:44
and configure, custom code that you could deploy, and plugins that you might need, that it makes it really challenging for a multi-tenant system to work. It works really well for the bottom of the pyramid, if that's how you'd let me define the user base for search.
57:03
People who have basic search use cases where they have a bunch of data and they want it searched they're not really bothered about custom plugins. It works reasonably well for them. As soon as complications start coming in be it machine learning stuff or be it just basic indexing pipelines,
57:21
it starts getting more and more complex. So I don't know when this is going to be solved because it's a really hard problem to solve to begin with, it's non-trivial. So I've seen this being attempted a few times and if you have specific use cases you wanna cater to you can still deal with it.
57:41
But if your use cases are not bounded, then this becomes a fairly challenging problem. I don't know if it's gonna be solved in the near future though. Josh, do you think you at the Elastic team are doing any better at this? I think one of the reasons is that
58:02
it's so fundamental to our customers. It's so fundamental to all of the use cases that we support. So being able to do things like cross data center cross cluster replication, cross cluster, cross data center search. These are the things that we've invested in over many years at extremely large scales.
58:24
And we're just like I mentioned at the beginning we've just introduced data tiers which brings kind of a new dimension for how you can manage your data and how you can search across extremely large data sets terabytes, petabytes of data. I think we have a number of tools in the toolbox
58:41
let's say, and they're used quite heavily at large scales, particularly in our logging, observability and security use cases, which rely very heavily on disaster recovery, high availability, and compliance reasons where you have to search across extremely old data sometimes as well.
59:02
This is part of our bread and butter. It's what we think about every day. We don't only think about search relevance, but it's about scale, it's about operations, it's about ease of use, it's about building a rock solid product. So yeah, I feel pretty confident that we've put a lot of investment in that area and we won't stop.
59:21
It's so fundamental to our customers. And I guess, Jo Kristian, I mean, you've come from Vespa, it's being used at Verizon, Yahoo, whatever, over very large scale systems, but do you think you've made strides in this area, or is there still some way to go?
59:41
Yeah, I think, I mean, we operate Vespa in the kind of private Vespa cloud inside Yahoo. We have about 10 production regions. So obviously Yahoo, we are all over the globe.
01:00:00
So we are used to running multi-region setups. So no doubt about that. In the public offering that we talked about, I think we are currently at three or four regions. So yeah, multi-region, high availability, scalability, latency, of course, it's really fundamental, right, for our business.
01:00:24
And so Vespa has really been a battle-proven platform for this. So this is, yeah, it's really a feature that we spent a lot of time building, and also all the experience from running web search
01:00:45
in this team, and we've been developing Vespa now since 2003, so I think a lot of experience on distributed systems has gone into the version of Vespa that we are seeing right now. On the cold-tiering stuff that Josh is talking about,
01:01:05
we don't have that, right? So because we are working on the evolving datasets for real-time serving, those are the use cases that we focus on, so we typically have everything hot.
01:01:22
So I just like to, go ahead. No, I just wanted to clarify a few things so that Solr doesn't lose out any points here. So yes, so we've used Solr as a multi-tenant system, supporting multiple clusters for multiple use cases,
01:01:43
used by different teams, as an internal offering, but the complications arise by definition because Solr is pluggable. To be able to provide an offering that allows you to plug in custom code, and then for you to maintain it, not knowing what's inside that code
01:02:00
and having the power to really be sure it's not exposing a security risk for everyone else who's co-hosted, kind of makes it super complicated. So I didn't mean to say that it's not possible to host multi-tenant Solr, that certainly is done in multiple places very successfully. It's just more complicated with a public-facing interface
01:02:22
where you're allowed to plug in custom components and custom code and make it work. Yeah, and I guess there's implementations like Salesforce, which I know is massively multi-tenant to some ridiculous degree. People are certainly doing it. I guess the difference here is also about the products,
01:02:40
the communities, whatever, because with Solr, you're trying to offer general solutions to people. With Elastic and Vespa, you're in control in some way of the hosted versions of the system that you're providing as products. You can say you can have this plug-in,
01:03:01
but you can't have that one. You can put some limits on the pluggability. Jo Kristian? Actually, the hosted version, the cloud offering that we offer now for general availability, you can use it in the same way
01:03:20
that you're using Vespa on-prem, so all the features. So you can plug in searchers, document processors. We don't offer some cut-down feature version, though we do add some more security constraints and isolation so that it's a multi-tenant system,
01:03:43
but the deployments do not share anything, so they are isolated, so that, in turn, nothing can break out. There is a lot of security around this, but we do allow people to have full access to deploying machine-learned ranking,
01:04:01
custom models, PyTorch models, ONNX, so the full feature set of Vespa, you are allowed to use that, and we have the experience of running it, so I think, yeah, so we do support having the full-fledged version, not some kind of smaller, easier-to-use version.
01:04:22
Okay. Yeah. Cool. Well, I'm gonna change tack here slightly, because we've got a few questions backing up in our chat room, so I'm going to ask some specific questions and try and get a nice brief answer, if you can. So, Josh, first one for you. What is the status of graph search in Elasticsearch?
01:04:41
Is it deprecated? Elastic don't seem to be talking about it a great deal. Pass. It's, I mean, it's supported as it always was. As far as I know, there's no active work, like new development. Unfortunately, that's all I know.
01:05:02
There are some of my colleagues on the chat, and if they would like to add, I'm happy to hear it. Yeah, that's about all I can say. Okay, right. Jo Kristian, how big is the Vespa team? How many committers, in house and community? I think the committers list on GitHub for Vespa,
01:05:21
I don't recall, but I think there's like 70 or 80 people that have committed. The core, I cannot really comment on kind of the size of the core team working on it, but if the global, yeah, I don't think I can quote specific numbers,
01:05:42
but if you look at GitHub and the contributor list there, you can find who's contributing to Vespa. Because we do all the feature development, all the development in the open. So basically, everything of Vespa is open source, and if you go to the GitHub Vespa engine,
01:06:02
you will see it's busing. I mean, it's in the weekends and the afternoons, there's a lot of activity going on. So all the development is in the open. The issues are in the open, so everything is in the open. So 70. Okay, so I've got one for Anshu just to even things out here.
01:06:23
So in the old days, Solr and Lucene were separate projects, and then they became one project. And I remember everyone talking about how that was good and bad at the time, and it became Lucene/Solr. Now it's split apart again. We've got the Lucene project and the Solr project. So will it be a few years and then one project again?
01:06:44
I don't think so. This was discussed on the mailing list at length. And there was some background given as to why they got together: there was a bunch of shared stuff that did not make sense to be worked on separately and synced every now and then.
01:07:03
Those things have been accomplished. And both of these projects have very different management procedures, if you wanna call it that. The way both these projects are organized and maintained are different, even though they were part of the same repository,
01:07:21
and it's the same bunch of people who have permissions to make changes to both of these projects. Even right now, pretty much it's almost a 99% overlap in terms of contributors or committers. I don't think they're gonna go back into being one again, because there's a lot of work
01:07:40
that goes into merging two projects, or splitting them. And that's a non-trivial amount of work. And we're putting in the amount of work that's needed to split these projects into their own build systems, their own release processes, their own everything else. And I'm not saying it cannot happen. I'm just saying I don't see that happening
01:08:01
from where I am right now, because it comes with a lot of effort and I don't see a reason to put in that effort again. So never say never, but you don't see it happening. Yes. Okay. So let's look at a different question here. So another comparative question, really.
01:08:24
Query DSLs. So there are different ways of querying. I mean, in Solr we've got lots of parameters; you've kind of got to know the names of lots of parameters, but you can also do queries with,
01:08:43
there's a JSON API, I believe. Then in Elasticsearch, you've got the Elasticsearch DSL. In Vespa, you've got your own query language. So let's have a look: what are the best and worst things about each of these query languages?
01:09:02
What could you learn from the others? And I'm going to start that question with Josh. Oh man, I really wish you'd picked someone else. I mean, our query DSL, you can do everything. All the query functionality that you need
01:09:22
with Elasticsearch is in the query DSL, you know, including aggregations. Yeah, basically everything you want to do in a query. I think that for, I would say, entry-level engineers, sometimes it can be overwhelming
01:09:41
to be faced with this wall of: how do I build the right query DSL? So I think it's extremely powerful what you can do with our query DSL. I would like to see maybe a second DSL, or a subsection that had some higher-level abstractions,
01:10:01
is what I would call them. And we have them in a lot of places, but I think we could find a balance between giving you all the flexibility, but also maybe an entry point for different levels of user. A meta DSL, I like that idea.
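To make the "wall of DSL" point concrete, here is a hedged sketch of what a fairly ordinary Elasticsearch query body looks like, built as a plain Python dict; the field names ("title", "body", "status", "year") and values are made up for illustration:

```python
# Illustrative Elasticsearch query DSL body as a Python dict.
# Field names and values are hypothetical.
query = {
    "query": {
        "bool": {
            "must": [
                {"multi_match": {
                    "query": "berlin buzzwords",
                    "fields": ["title^2", "body"],  # boost title over body
                }}
            ],
            "filter": [
                {"term": {"status": "published"}}  # non-scoring filter clause
            ],
        }
    },
    "aggs": {  # aggregations ride along in the same request
        "by_year": {"terms": {"field": "year"}}
    },
    "size": 10,
}
```

Even this small example shows the nesting depth an entry-level engineer has to internalize before writing their first non-trivial query.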
01:10:20
Yeah. What about this one, Anshum? I mean, Solr's query language, there are mysterious two-letter codes in there that, once you've done it for a while, you understand and remember, but how could it do things better? Yeah, I mean, Solr has a query DSL as well,
01:10:41
which leaves a lot to be desired. Like, there's a lot of room for improvement on that. It's pretty basic as of now. And there's the JSON query DSL, but the standard way of talking to Solr and querying Solr is, again, something that's evolved over the years and developed a lot of trust,
01:11:01
having been developed by people who already understood how the system worked. And so, as you said, there are too many of these parameters, and I've looked at what Elasticsearch has offered over the years, and it always makes me feel good about something like that existing. I think Solr on that front
01:11:23
can totally learn from Elasticsearch and get to that place, to be able to have such an expressive query DSL where you could do so much without having to worry so much about remembering two-character parameters that you have no idea what they mean. Like, you go digging into the ref guide, which is a great thing, but it's confusing right now.
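For readers who haven't met the two-character parameters being described, here is a rough illustration (field names and values are invented): the classic edismax request parameters next to roughly the same idea in Solr's JSON request API.

```python
# Classic Solr request parameters for the edismax parser: terse names.
# qf = query fields, pf = phrase fields, mm = minimum-should-match.
params = {
    "q": "berlin buzzwords",
    "defType": "edismax",
    "qf": "title^2 body",
    "pf": "title",
    "mm": "2<75%",
    "rows": 10,
}

# Roughly the same request via the JSON request API,
# which is somewhat more self-describing.
json_request = {
    "query": "berlin buzzwords",
    "params": {"defType": "edismax", "qf": "title^2 body"},
    "limit": 10,
}
```

The terse form is compact once memorized, but a newcomer cannot guess what `pf` or `mm` mean without the reference guide.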
01:11:43
So yes, there's a lot to be desired on the query DSL front for Solr. It does offer the JSON query DSL for basic use cases. It's pretty expressive, but it's not there yet. So yes. Jo Kristian, so you've got, I mean, you've obviously spent lots of time building your query language
01:12:01
and I guess it's perfect, surely. It's perfect, for sure. Yeah, where to start? I mean, you can use a JSON query API to post a query to Vespa
01:12:23
where you have your parameters, like the number of hits you want to return, what ranking profile you want to use, and then there's something we call the YQL query language, which is a SQL-like query language where you just say select id, text from tweet
01:12:45
where userQuery(), which then can point to another query language, which is the actual end-user query. So I think it's a clean way of separating
01:13:00
what the actual user has been typing, so that you don't do any tokenization of the input in the middle tier. You want to let Vespa deal with that, because if you don't, you might have some asymmetric behavior between what happens at indexing time and at query time.
01:13:20
And in the user-query part, you can use syntax similar to searching Google: you can use a plus sign, you can use quotes. So you don't have to do a lot of this query parsing in the middle tier and then build a huge JSON DSL API.
01:13:42
So I think actually our APIs are pretty good. I've seen some people just going with the YQL, the SQL-like format, writing the query and then splitting the tokens themselves and so on, which is not a great way of doing it.
01:14:02
I think it's better to use this construct where I can say that I want to have the user query, and then I want to combine it with some other filters. So the SQL-like query language will also be syntax-checked, so you don't have any mistakes in your query, while the simple user query language
01:14:21
is more relaxed in how we parse it. Yeah, so you can POST queries, you can GET queries, a lot of different options there. I mean, you can do something like a SQL query language in Solr, can't you, Anshum?
01:14:41
The streaming API, does that give you... Yes, yes, the streaming API does. It certainly gives you that part. I don't know how many users have adopted that; I don't know how many people use the streaming API for that purpose, but yes, you certainly can achieve that. For basic search, yeah, I'm still a big fan
01:15:02
of the Elasticsearch API. Haven't moved on from there yet. Still being too nice to each other. I'm disappointed. Fantastic. I also would like to add one more thing: you can write your own searcher plugins in a nice way, where you actually build an API
01:15:21
on top of Vespa, and then you can build the query programmatically using Java code. So you don't need to expose all the parameters to the middle tier. You can build your own using the programmatic interface instead.
01:15:40
Ooh, Jo, fear, Solr is amazing at being pluggable, trust me. Yes, but then we have the balance: to be pluggable, you've then got to write the thing you're going to plug into it. So there's also a balance there between something just working out of the box and having all the components you need. So, a question here from somebody.
01:16:02
So, we've got some smaller players in the search engine world now. We've actually been running a series on the Haystack Live meetup looking at some of these. We've got one coming up on Tantivy, the Rust search engine. Now, one of the things a lot of these engines position themselves as is kind of easier to set up. You don't have to do much. You can just get going, you can just get querying.
01:16:22
And the questioner says specifically that they position themselves to be easier to set up than even Elasticsearch. But what can we learn from these smaller players? Well, we've got things like Weaviate, we've got MeiliSearch, there's a couple here,
01:16:41
Toshi, Sonic, that I haven't heard of, Tantivy. What can we learn from these smaller players that are kind of trying to break into the market? And I must say that I'm always really cheered when I see someone writing yet another search engine. You know, you might think that it's a task you shouldn't even attempt because, hey, we've got these amazing, mature, long-lived engines. But now, what can we learn from these smaller, newer players?
01:17:01
And I'm going to ask that question to Jo Kristian to start off. Yeah, so I think, I'm not sure how you pronounce it? Tantivy is, I think, a library. I don't think it's a full-fledged search engine with the APIs, HTTP APIs and so on.
01:17:22
It's more of a search library, as I understand it, right? And I think they have a lot of inspiration from Lucene; they write about this in their GitHub repository. And I know that there's a startup company with the founder of this library.
01:17:42
I think the company is called Quickwit. I don't know what they're building, but I think it's interesting. But I consider it currently a library and not a competitor of Vespa or Solr or Elasticsearch.
01:18:04
You mentioned Weaviate. I think that is one of the newcomers, the vector search, neural search, and I think there have been a lot of those lately. I think there was Pinecone, Jina,
01:18:21
a lot of companies coming up in the vector search space, and they're promising neural search to the rescue. And I think in some cases, these have a problem with over-marketing deep dense vector search.
01:18:43
It's not really proven yet that you can take a generalized model and train it on, for instance, MS MARCO, and then apply it to a different domain and have great results. All the evaluations show that in order to have great results, you really need to train the model on data from your domain.
01:19:01
So I'm not sure about these predefined models that will beat the traditional search paradigm straight out of the box. So I'm really skeptical about that. And also, like Max Irwin mentioned yesterday,
01:19:21
what if the user is doing a phrase search? That capability is missing. And also, in the real world, I think what is unique is that, instead of having a pure vector search engine, actually adding vector search to a real search engine with the traditional inverted indexes and so on makes more sense, because in the real world,
01:19:41
search is really constrained, either by the user itself or by hidden application logic. And I'm not sure if these mentioned pure vector search companies actually support this kind of hybrid evaluation model where you can combine the traditional matching with vector search,
01:20:02
but Vespa certainly does. And when we designed and implemented approximate nearest neighbor search in Vespa, it was really critical for us to be able to combine those two. Yeah, so I welcome all this. There are a lot of things to learn. And there have been a lot popping up,
01:20:22
these vector search engines, but I'm a little bit skeptical at the moment about some of them. Okay, so, I mean, Josh, again, the question specifically says they position themselves as being even easier to use than Elasticsearch. So is there something we can learn from these engines?
01:20:44
Yeah, I agree with everything Jo Kristian said. I think some of them are niche plays as well. So Tantivy, it's a library. And I see it more as an embedded type of solution.
01:21:02
You have very limited resources on a device, or IoT or something, and you need a search index. Maybe that's a good position there. I mean, like Jo Kristian, I'm super excited to be in this field, in this industry, during this time, because so much is evolving right now.
01:21:21
And there is a big paradigm shift happening. I think other things that we can learn are non-search use cases, or mixed, like multimodal search, and recommendations use cases. I think things like Jina will be very interesting players in those spaces.
01:21:42
Yeah, I agree with Jo Kristian. For what we would view as text search, I think a pure ANN play is not feasible today. I mean, we'll just see where it goes in the future. But I think I'm also wondering when,
01:22:01
like all of the ANN type of libraries are also wrapping other libraries that come out of Facebook or Google or NMSlib or something like that. I'm waiting for companies to start innovating in this space as well, and not leaving it to only the big guns
01:22:22
and just wrapping a bunch of libraries. I think that's an interesting space to watch, the innovation in that space of ANN search in particular. So hopefully Jina maybe starts doing that as well. And we learn more about how people are using
01:22:41
dense vector search, for example. Okay, Anshum, what do you think? I mean, if you're looking at Solr and Lucene perhaps being the older, more established player in the space, at least with a very large community, is there still room for these new search engines, and are they doing anything that we can learn from?
01:23:02
There certainly is room for these search engines. And again, having been around when Elasticsearch just started, I remember when it started, it was like, yeah, they're new, they're small, but there was so much to learn from Elastic, and I'm so glad that we did.
01:23:21
I think anyone who used Solr about seven, eight years ago can vouch for how much better or easier to use it has become. And that's majorly because we looked at Elasticsearch as a poster child: okay, we've been doing things a specific way, and it felt right to do so 15 years ago.
01:23:43
But now someone's come out and thought about things that we never thought about. And we can always look at them and learn from them. And yes, they would have done things that did not work out for them, but it's always good to have people coming in with a different perspective,
01:24:00
especially all these post-modern search engines coming up. They're trying to accomplish and do stuff that Solr kind of never thought about, again because we've been here for so long. So it's always nice and refreshing to look at all the new search engines and try and learn, not aggressively,
01:24:21
because you can very quickly lose track of what you really wanna do if you try to concentrate on keeping up with everyone. So wait and watch, see what everyone's doing, what things work, and then learn from it. And don't just try to copy it, but try to do that better. Hmm. I mean, it leads on to maybe a slightly facetious question
01:24:43
that's popped up in our list here, which is, and somebody did vote for it: can we combine everything into one super engine? Could we have the ease of use of Elastic, the long history and community and the flexibility and pluggability of Solr, plus the amazing new vector search features of Vespa? Is someone gonna sit down and try and pull all these things into one engine?
01:25:04
I mean, I think we will see a convergence of features in particular, and I think some things will become commodity, just as, you know, BM25, TF-IDF. I think we do see it, but having competition in the field is really necessary for innovation as well.
01:25:22
And I think because of that, there will never be, like, there will never be one solution to rule them all. I think we will see convergence, but I certainly hope, actually, that there isn't one solution to rule them all, because I think we can also look for players that are in niches, and we can learn from each other,
01:25:42
like these new players. So I'd say, yeah, kind of, but hopefully not, actually. Yes, it's interesting, isn't it? And I do remember those days Elastic first appeared, and some of the people in the Solar community were, oh, this will never catch on. And actually, what it did do is it put the wind up
01:26:01
a lot of people who maybe should have been paying attention to some of the new requirements, and it drove Solar development quite hard for a while. I think that was a positive impact, and I'm hoping that the same thing's happening with Vespa appearing on the scene. Judging from what you both said and what other people have said, this will spur innovation. It makes people look at their failings
01:26:21
and look at the things that maybe they don't do so well, or maybe they should be looking at. And I think that's a really healthy situation. I think it'd be awful to have a monoculture of search. Do you agree, Jo Kristian? I'm not sure if it's gonna converge into one technology, because there are so many different use cases around search.
01:26:42
And like I talked about, Elasticsearch coming in really strong at handling immutable data. Vespa coming in very strong at mutable data where data is changing. So I think there will be different search solutions for different areas of search.
01:27:00
I don't think it will converge into one big solution. And the new players, if you'd call Vespa a new player, we've also been around, and we've had to adapt to everything changing. We've been around since 2003, so we had to go from on-prem to cloud. We had to adapt from running on one CPU with 500 megs of RAM
01:27:23
to the monster machines that we have today. So the hardware development: we used to have 500-megabit network cards. Now you can get 100 gigabits per second. So a lot of this hardware evolution will also change how we do search in the future, especially around low-throughput use cases,
01:27:43
like somebody searching in Kibana. There's very low query throughput. It doesn't really matter if it takes a second. I think there's gonna be a lot of innovation in that area because of the compute changes, the hardware changes.
01:28:01
I think we've lost Jo Kristian. Yeah. Oh dear. We're gonna see some changes there. All right. Sorry, Jo Kristian, you froze for a second. All right. We'll leave it there on that one. So, okay.
01:28:22
So I've got a brace of questions here. I'm gonna ask the first one of these to Josh and Anshum. Somebody says: with both Solr and Elasticsearch based on Lucene, I always assume they don't really differ much on relevance.
01:28:41
Things like stemming, synonyms, et cetera, are provided by Lucene. So is that assumption correct, or are there real differences in relevancy tuning? Anshum. It's kind of true that the underlying building blocks are shared, but there's enough of a layer,
01:29:02
a thick layer, that Solr offers on top of everything that Lucene does, that essentially makes these very different engines to use. Yes, you can accomplish the same thing using both these engines, I guess for the most part at least, but there's enough of that layer
01:29:20
that sort of provides safety, in terms of making sure that the input's correct, to being able to plug multiple things together, and the way they get plugged in together. So they're certainly not the same. And then there are parts of Solr that do not really use the underlying Lucene features, even though Lucene as a library
01:29:41
would offer something. Say facets: Solr has its own implementation of stuff, even though Lucene offers something similar, only because it's open source and someone thought that it would be better or more efficient to do the same thing in Solr
01:30:01
using a very different approach. So they might seem similar, and yes, they share a lot of building blocks, but there's enough of a difference between Solr and Elasticsearch in terms of the implementation itself. What do you think, Josh?
01:30:23
Yeah, I think the power in relevance tuning actually comes at that abstraction layer, so like one layer above, where it's about how you combine all of the fundamental components that Lucene gives you. So if you're doing a multi-field search,
01:30:42
you're gonna use edismax with some magic, or you can use a multi_match query. And we've actually just introduced, in our 7.13 release I believe, a new multi_match-like query called combined fields, which is BM25F. So it's a very principled approach
01:31:00
to doing a multi-field query. And that's stuff that we're building on top of Lucene. So that's a layer on top. So I think, yeah, there are definitely differences. Yes, they're the same building blocks underneath, but also with aggregations for faceting and things like that, there is a lot on top.
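To sketch what the BM25F idea behind combined fields means, as opposed to scoring each field separately and merging: term frequencies are combined across weighted fields first, then pushed through a single BM25-style saturation. This is a simplified toy version for intuition, not the Lucene implementation (length normalization is omitted to keep it short):

```python
import math

def bm25f_score(term_freqs_per_field, field_weights, doc_freq, num_docs, k1=1.2):
    """Toy BM25F for one term: combine weighted per-field term frequencies,
    then apply a single BM25 saturation and an IDF factor."""
    # Weighted, combined pseudo term frequency across fields.
    tf = sum(field_weights[f] * term_freqs_per_field.get(f, 0)
             for f in field_weights)
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return idf * tf / (tf + k1)  # one saturation over the combined tf

# Example: term appears twice in title (weight 2.0) and once in body.
score = bm25f_score({"title": 2, "body": 1}, {"title": 2.0, "body": 1.0},
                    doc_freq=100, num_docs=10_000)
```

The key design point is that saturation happens once over the combined frequency, so a term matching in several fields is not over-rewarded the way a naive per-field sum of BM25 scores would be.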
01:31:22
And I think a lot of it also comes down to the experience as a search relevance engineer, what does it take to build good search relevance? Do you have the right kinds of interfaces? Is it easy to think about making changes? Is it easy to iterate on changes? I think all of those things matter a lot.
01:31:42
And we haven't talked about evaluation, which is, like, don't try doing relevance tuning without doing evaluation as well. I think both of us have things like the rank evaluation API, which can help you do evaluation. But I think that's actually a part of the search story
01:32:02
that I would argue all three of us could maybe do better at: how do we promote good practices for relevance tuning, including good ways to manage relevance data sets and do all those measurements. There are third-party tools to do that, but it's not really part of the search engine.
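On the evaluation point: a minimal offline check, assuming you have graded judgments for a query, can be as small as an NDCG@k calculation like this (the judgments and ranking below are made-up examples):

```python
import math

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k over a list of graded relevance labels in ranked order."""
    def dcg(rels):
        # Gain 2^rel - 1, discounted by log2 of (1-based rank + 1).
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical judgments (0 = bad, 2 = perfect) for one query's top 4 hits.
score = ndcg_at_k([2, 0, 1, 2], k=4)
```

Averaged over a judged query set, a number like this is exactly what lets you tell whether a tuning change actually helped rather than just looking different.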
01:32:21
Yeah, that's what I'd say. Charlie, you're muted. After 18 months, I figured out this remote nonsense. So thank you both. And we have a quick question here for Jo Kristian,
01:32:43
which is about Vespa. How much is the performance overhead in Vespa if you use classic BM25 retrieval versus dense-retrieval-based relevancy ranking? Yeah, so that's really a great question.
01:33:02
I mean, do you want me to be short on this? Because this is a huge topic, right? So I'll just go ahead with the short version, you know. Okay, the short version. So there's a huge debate actually in the information retrieval community about whether the classic inverted index structure
01:33:21
is gonna be replaced with dense vector search, right? Because you can do a lot with dense vectors. And one thing is performance. If you look at inverted indexes, there's one way you can speed up the evaluation of BM25, and that is using something called the weak-AND, or WAND, algorithm. So that's a dynamic pruning algorithm
01:33:42
which basically will look at the inverted index and quickly try to figure out, you know, what are the documents that will be highly ranked using just BM25. Vespa has one implementation, and I know Solr and Lucene have a WAND implementation. So that's actually one of the research questions
01:34:01
that we wanted to answer: what if we replace the WAND algorithm with a dense retriever, and compare the ranking accuracy on MS MARCO? And actually, I'm writing a blog post on this. It's gonna be released next week, where we compare all these different methods,
01:34:20
sparse using WAND, and dense. If you're using the right model, you can actually beat the WAND implementation using a dense retriever, also including the query encoding. So taking the query, encoding it through the transformer model to get the vector representation, then doing approximate nearest neighbor search,
01:34:41
and then basically ranking the documents by the angular distance or the cosine score, we get higher throughput. And another thing that's important is that nearest neighbor search does not have this kind of latency tail. It's much more friendly if the query is very long,
01:35:02
while WAND can struggle if the query is really long, so you can have a much higher 99th percentile and so on. So if you use the correct transformer model, and you can put transformer models into Vespa so you can actually encode the query in Vespa, you don't need Python dependencies
01:35:21
to run the transformer model. So actually, in our experience, the dense retriever can actually out-compete BM25 in terms of performance. Okay, so what do we have left? We're down to our last two questions from the chat,
01:35:43
but I've got to see if I can squeeze in what I can squeeze in now. Josh has dropped in a link for the WAND algorithm there, which is great. Thank you, Josh. So let's see, well, I'll ask you another question, Jo,
01:36:03
very quickly, and I can get a quick answer on this one. Let me just see if I can prune this one down. When I think about normal search, Solr and Elasticsearch do come to mind. However, I feel Vespa has positioned itself way too high, and is only being chosen
01:36:20
if the use case has high requirements for machine learning, language processing. Is Vespa doing this deliberately, or is it just by accident? Well, I'm not sure if I agree with the premise. I mean, we do have all the classic
01:36:41
text ranking functionality, but as we have observed since 2009, search is really going through this paradigm shift. And we have internal customers, external customers, screaming to take on these new techniques that really improve ranking. So, yeah, but in the blog posts,
01:37:02
the things we write about, we don't write about BM25, right? It's there. But we want to illustrate and demonstrate how you can take these state-of-the-art models and how to apply them with Vespa. Okay, okay. Okay, so I've got a couple of questions just to round us off.
01:37:24
Hopefully that will take us to the end of the session. We're finishing at half past the hour, and then there'll be a closing address from Nina and the Berlin Buzzwords team, so do hang around for that. So I'm just gonna go around all three of you here and ask, firstly, and this is an odd one,
01:37:43
If there was a feature of your search engine you could remove tomorrow, which would it be? Josh. I mean, we remove things maybe to people's frustration.
01:38:02
We remove things pretty regularly from Elasticsearch. I remember Doug Turnbull a few weeks ago was complaining about something that we removed, or deprecated at least. If I had to pick something, the first thing that comes to mind: I mentioned combined fields, which is BM25F, a very principled way
01:38:22
to do multi-field searching. It's kind of replacing, not quite replacing, oh, I'm gonna get flak for this. I would say I would remove cross fields, which is a multi-match query type. I think people struggle to understand what it's doing.
01:38:42
So there's like best fields, which is taking the field-centric view of search and then there's cross fields, which takes the term-centric view. So it's like putting all of your fields together, kind of. I think combined fields is a much more principled approach to the term-centric view. So I would remove cross fields.
01:39:02
And I'm gonna see who gets angry at me tomorrow. I have a feeling that won't be universally popular, but hey, let's generate some controversy. Anshum, what about you? What would you kill off from Solr if you could? Yeah, that's a really tough one. I'm gonna give a general answer
01:39:21
and then I'm gonna specify a specific thing. The general answer: Solr, just because it's a community-driven project, has multiple ways of accomplishing the exact same thing, with varying implementations, let me tell you. It's not necessarily the same implementation either. So I would personally like to reduce things to be done
01:39:44
in one way, if you wanted to accomplish one thing. So it wouldn't be one thing, it would be a bunch of things that are just duplicates of each other in terms of the functionality they offer. And the one thing that I'd pick, and we've discussed this among the committers a million times now, is reducing to one way of faceting in Solr.
01:40:04
And that's existed forever now. There are the JSON facets and there are the traditional facets. One of them approximates, the other one doesn't. Users don't generally really know what the difference between them is. And so they stick with whichever they feel has the easier query DSL,
01:40:22
or whichever one's faster, not knowing what the repercussions of using one over the other are. So if there could be one way of accomplishing, or getting, facets out of their datasets, that would be so good for users, to just concentrate on doing one thing one way.
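For context on the two ways of faceting just described, this is roughly what the two request styles look like side by side (the field name is invented):

```python
# Traditional flat facet parameters.
traditional = {
    "q": "*:*",
    "facet": "true",
    "facet.field": "category",
    "facet.limit": 10,
}

# JSON Facet API: nested and composable, but a separate implementation
# underneath, with its own accuracy/performance trade-offs in
# distributed mode.
json_facets = {
    "query": "*:*",
    "facet": {
        "categories": {"type": "terms", "field": "category", "limit": 10}
    },
}
```

Both return term counts for `category`; the duplication being lamented is that users must pick one without an obvious way to know which trade-offs they are signing up for.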
01:40:40
Yeah. Okay. So, Jo Kristian, if you've been doing this since 2003, there must be one bit of Vespa that you just think shouldn't be here. Please, can we just rip it out? I mean, it's a really hard question. So there are some internal things, which don't really impact the external API.
01:41:03
On the external API, I would like to remove some of the text-matching rank features. So there's a bunch of them, a whole list of text ranking features, and users can be a little bit overwhelmed by them
01:41:21
compared to BM25. Yeah, so that's the kind of user-facing feature that I think we could maybe cut down a little bit on. Great, thank you. I know it's a hard one, but I think from my experience, a long time ago now, sadly,
01:41:41
maintaining some of these big systems, there's always a bit that just makes you think: if I just commented it out, maybe nobody would notice. So, our last question here. Again, I'm going to go round and ask you all: what's your favorite coming-soon feature?
01:42:02
Something you know is coming down the pipe, maybe it's not quite ready yet. What do you think is hopefully going to make a big impact, and why do you think it's so great? So I'm going to start that off; I think probably I'll start with you, Josh, on that one. What do you know that's coming down the pipe that you think is going to be really great
01:42:20
for Elasticsearch? So that's what we're seeing, so I don't know if people know about that. I think PyTorch integration, I think it's something that customers have been asking for for a long time. It's going to enable a significant number of search use cases from ranking, dense retrieval,
01:42:45
but we have a lot of customers that also do NLP on ingest to extract structure out of unstructured text. And I think being able to just do NLP tasks at ingest time,
01:43:02
I think from sentiment analysis to NER, zero shot classification, I think this could be hugely powerful for search customers and let's say non-traditional search customers. So people building search for customer service records and things like that. Like these are really common tasks that people have to do
01:43:23
and that today they have to do outside of the Stack. And the PyTorch integration will bring not only inference, but a complete model management solution. So it's about uploading a model, saying the type of cluster that you want, and we take care of the rest.
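To give a concrete flavor of what NLP at ingest time could look like: Elasticsearch already has an inference ingest processor, so attaching a model to an ingest pipeline might look roughly like the sketch below. The pipeline name, model ID, and field names here are illustrative assumptions, not part of any announced API.

```json
PUT _ingest/pipeline/nlp-at-ingest
{
  "processors": [
    {
      "inference": {
        "model_id": "my-sentiment-model",
        "target_field": "ml.sentiment",
        "field_map": { "comment_text": "text_field" }
      }
    }
  ]
}
```

Documents indexed through such a pipeline would get a prediction attached automatically, so the NLP step no longer has to happen outside the Stack.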
01:43:41
So to me, I mean, I'm a bit biased. That's one of the projects that I work on, so I'm definitely biased, but to me, that's probably the biggest one up and coming. Okay, so what about you, Jo Kristian, what's coming next?
01:44:01
Yeah, there are a couple. One thing? One thing. That's hard. Yeah, I mean, we already have really fast indexing, but there are going to be some significant improvements on indexing throughput.
01:44:21
That's especially around partial updates. We currently do 50,000 updates per second per node, and I've seen some really nice numbers coming out from the core team. So that's one feature that I'm really excited about: pushing the performance of Vespa.
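For context, a partial update in Vespa touches individual fields via the /document/v1 HTTP API without rewriting the whole document. A minimal sketch, with a hypothetical namespace, document type, and fields:

```json
PUT /document/v1/shop/product/docid/sku-123
{
  "fields": {
    "price":      { "assign": 1299 },
    "popularity": { "increment": 1 }
  }
}
```

Because only the named fields change, these updates are much cheaper than re-feeding full documents, which is what makes tens of thousands of updates per second per node feasible.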
01:44:42
Okay, and Anshum, what about you? What do you think is coming down the pipe? What's your favorite new thing on the way? Well, Josh has allowed me to be biased towards something that I've been looking at. So I think the cross-DC support for Solr, which has been there for a while, but never really worked successfully.
01:45:03
The new design for that is loosely based on the work that we've been doing at my workplace. So I'm really looking forward to that; it uses a messaging queue in the middle to achieve cross-data-center replication. That's really exciting. That allows people to use Solr,
01:45:22
well, allows Solr to be HA and DR ready, which is just something super exciting to me. Fantastic, thank you. Well, that's great to hear, and it's great to hear you're all so excited about what's coming later. Jo Kristian has asked internally in our little chat here,
01:45:42
he's asked about when Python is going to become integrated into Elasticsearch. So it's not PyTorch as in Python; PyTorch here means TorchScript. So it's native, it'll be a native process. I see. So not quite Python, and not in Elastic. But you can't- Not you.
01:46:03
Fantastic. Let's see if we've got time for one last question. There's a complicated one here, I'm not sure I can ask it in the time, but Jo, I'm going to ask you one last question and you get the last word on this. You mentioned pre-trained models in vector search libraries
01:46:23
won't generalize well to other domains. Does this not apply to Vespa as well? I think the answer is yes. It's just a general problem, isn't it? Yes. So this is a general problem. We have a blog series right now about using transformers for ranking.
01:46:41
So I've written three posts already and the fourth one is coming, where we lay out the challenges of pre-trained models. In domain, they work brilliantly, beating BM25 by a large margin. So there's no doubt that this is going to be
01:47:00
something that's gonna stick, but taking a model and just dropping it into some other domain, we're not selling that. You need to have the ability to train a model for your domain. And that's independent of whether you're using Vespa, or Faiss directly, or Weaviate, or Jina, or any of the other vector search libraries.
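Whichever engine or library does the retrieval, the core dense-retrieval step being discussed is the same: embed the query, score it against document embeddings, return the nearest neighbors. A minimal brute-force sketch in NumPy; real systems use approximate indexes such as HNSW, and the embeddings (the toy vectors below stand in for them) come from a transformer fine-tuned for your domain:

```python
import numpy as np

def cosine_top_k(query, docs, k=2):
    """Return the indices and scores of the k documents most similar to the query."""
    # Normalize so that a dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]  # highest-scoring documents first
    return top, scores[top]

# Toy 2-d "embeddings"; in practice these come from a domain-tuned model
docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query = np.array([1.0, 0.1])
idx, scores = cosine_top_k(query, docs)
print(idx.tolist())  # → [0, 1]
```

The out-of-domain problem lives entirely in how the vectors are produced, not in this scoring step, which is why it hits every vector search library equally.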
01:47:22
Okay. Right, well, we've come to the end of our search engine debate and I do hope you've enjoyed it. We've got through a lot, I think, and I'm hoping that once the video is released by the amazing team here at Buzzwords,
01:47:40
it's going to be a keeper, I think. So I've got a few things to say. Firstly, I'd just like to thank Jo Kristian Bergum, Josh Devins, and Anshum Gupta for standing up yet again for Vespa, Elasticsearch, and Solr, respectively. And to everyone who asked a question
01:48:01
and supplied us with questions, I think you'll agree it's been a great session.