NrtSearch: Yelp’s fast, scalable, and cost-effective open source search engine
Formal Metadata
Series: Berlin Buzzwords 2022 (talk 30 of 56)
DOI: 10.5446/67174
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Transcript: English(auto-generated)
00:07
Hello, everyone, and thank you for being here. I know this is the last session, so most of us are probably somewhat tired, so I really appreciate you all being here. My name is Umesh Dangat. I work for Yelp, mostly on their search infrastructure
00:23
team. I've been with Yelp for around seven years now. During that time, I have led the rewrite of their search infrastructure a couple of times now. And in this talk, we will look into our latest incarnation of our search engine. It's called Near Real-Time Search.
00:41
So Yelp's mission is connecting people with great local businesses. As you can see, search is a very critical component of this, the whole Yelp's mission. Here are some numbers to give you an idea of the scale that we operate at. And these are mostly as of the end of Q4.21.
01:01
We have over 244 million reviews in various languages. All of these are indexed in a real-time way in our search store. There are 33 million unique app devices, and the www (web) flow is separate on top of that. And then we have billions of queries served per year. So back in 2019, I was here to talk about how
01:24
Yelp moved its core search. Back then, it was based on Lucene, but a really old version of Lucene, which didn't really support real-time indexing and such. And we moved it to Elasticsearch. And I'm here today to talk about how we moved our core search to a raw Lucene-based one
01:42
away from Elasticsearch. So what's going on here? Other than we are keeping ourselves employed, I guess. But there is more, as you can see in the talk. So like I mentioned, and you can look at the details in the previous talk from three years ago,
02:00
our search engine back then wasn't really a platform, didn't really have APIs. It was more like a black box, and it was really hard to query. It didn't have extensibility options or plugins. So we had very specific ranking code which we would run, and it didn't really scale out as we had more search-like teams at Yelp: ads, Request a Quote, and others.
02:20
And essentially, all of these teams want to do some kind of filtering and ranking. So Elasticsearch was great because it gave us extensible APIs, allowed us to do searching and indexing on those APIs, and also helped the developers with the operational aspects of it. So troubleshooting was easy and whatnot.
02:41
As a result, it made the search data a lot more accessible to a lot of teams within Yelp. People would query it for just the geo-based stuff, or use simple ranking scripts like Painless, or write really custom, highly extensible native plugins for analysis, highlighting,
03:01
and ranking using ML and whatnot. So this led to much wider adoption and extension. That's the good part, but then we kind of became a victim of our own success. We have high search QPS, high indexing QPS, this tiny team maintaining all these Elasticsearch clusters at Yelp, and the queries got computationally really heavy,
03:22
so we had to keep adding nodes, on top of the on-call pressure. And even after doing that, performance was gradually declining. We'll look at the details of this. So the first aspect is we had cost challenges with Elasticsearch at Yelp.
03:40
So Yelp uses AWS for deployment, and we have a multi-region deployment. And Ops likes to do region failovers. We typically would run an exercise in production where we would fail over an entire region's traffic; for example, in AWS we have a us-west-2 region and a us-east-1 region. They would fail over an entire region's traffic
04:01
onto another region. So guess what this would do to our search clusters? We needed to over-provision capacity for peak traffic, which meant it cost us more than it should have. And auto-scaling for Elasticsearch didn't really work for us, because as you can guess, it takes on the order of minutes for the nodes to come up,
04:23
the data to migrate, the shards to migrate, and we would be serving 500s to our customers for minutes, which is not good for revenue, for the happiness of customers, and other reasons. Then the other issue: Yelp, if you have used it, is mostly search,
04:41
but most of the search queries are geo-based. Like you would search for a burrito in San Francisco, or a burrito in New York. And as you can imagine, we index at the application level based on geography. So for example, not necessarily what we actually do in production, but you could have an index for San Francisco versus New York versus maybe Europe.
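To make that application-level geo-sharding idea concrete, here is a tiny, purely illustrative sketch in Java; the region names and index names are made up and are not Yelp's actual scheme:

```java
import java.util.Map;

/** Hypothetical illustration of application-level geo-sharding:
 *  the caller chooses an index (or micro-cluster) per region. */
public class GeoIndexRouter {
    // Made-up region-to-index mapping; a real deployment would derive this from config.
    private static final Map<String, String> REGION_TO_INDEX = Map.of(
        "san_francisco", "businesses_us_west",
        "new_york", "businesses_us_east",
        "berlin", "businesses_eu");

    /** Returns the index to query for a given region, with a catch-all fallback. */
    public static String indexFor(String region) {
        return REGION_TO_INDEX.getOrDefault(region, "businesses_global");
    }
}
```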
05:02
But some nodes got really hot, meaning the shards of, let's say, San Francisco would end up on the same node, and then all of a sudden during San Francisco lunch hour those nodes would be hot. And we had the inverse problem where nodes would be cold and wouldn't get enough traffic. So although we were spending all this money, there wasn't really efficient usage of the CPU.
05:24
Coming to the performance challenges we had at Yelp with Elasticsearch. Elasticsearch, as most of you probably know, uses document-based replication. So every node essentially is an indexer and a searcher. A node gets a request, the one which is the primary, it'll index it, it'll see how many replicas it has to go out to
05:42
and send that request there, and then they will do the indexing as well. So higher indexing pressure means your search is slower, and we often have a lot of these batch jobs which our search data teams run in order to improve their signals.
06:02
So you can see we have these bursts of indexing all through the day. So the instinct initially is, oh, search is getting slower, let's bump the replicas. But that actually exacerbates the problem because now you need to send that indexing message to more replicas, which means more indexing. And what makes matters worse is segment merging.
06:23
Because segment merging is a CPU- and IO-heavy operation and every node is doing it. So often at crunch time you would wonder what's going on, why is this node in the red territory, and a jstack would most times reveal: oh, it's a segment merge. Besides this, we had maintenance challenges, like I mentioned before.
06:41
We were a small team maintaining a lot of Elasticsearch clusters now for various clients, like ads, search, Request a Quote, and others. Meanwhile, Yelp has a dedicated infrastructure team which supports stateless deployment of containers and microservices; as of today it's built on Kubernetes.
07:00
So we couldn't really use this. We had to rely on our own team, primarily because we're not deployed in a stateless ecosystem. Back when we were doing this, Elasticsearch did have the K8s operator, the Kubernetes operator. But the issue we had was we needed to persist the data. So we had to make a stateful deployment, which meant in the Amazon world,
07:23
we had to use the block store, EBS. And we had severe IOPS issues in the past with this, because like I said, our search is heavily model-based and we use doc values in Lucene for searching, so it got really slow. So that wasn't an option. So the question really was: can we deploy our search clusters
07:42
in pure stateless mode? So if you take a step back, the problems Elasticsearch solved for us were the search functionality, essentially the filter (the recall) and the sort (the rank), and the cluster orchestration and management, which solves the whole scalability,
08:01
redundancy, and availability problem. And really, those problems should be taken care of now by a container management system like Kubernetes. For this to happen, we had to answer two questions in the affirmative. One: can we truly have stateless/ephemeral instances? And two: can we continue doing indexing in real time?
08:24
So this is where Lucene really shined for us. As most of you might know, Lucene's write-once-only architecture is really good because once a segment is written, it never changes. And eventually, to do garbage collection of files,
08:42
there's merging of files happening in the background to produce larger segments. And Lucene supports a near-real-time replication API, which Elasticsearch does not use as of now because it does document-based replication. So in a sense, to propagate the changes, we only need to copy new files from server A to server B.
09:01
And really, once you communicate which files need to be copied, how to copy the raw bytes over is left up to the implementer. This is where we thought we would build a gRPC server on top of Lucene from the ground up, and we open sourced it. It's called near-real-time search, or nrtsearch for short.
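As a rough illustration of that idea (this is not nrtsearch's actual replication code, and the FileStore destination is a made-up placeholder), plain Lucene APIs are enough to list the files of a commit and ship only the ones the other side is missing, assuming the writer was configured with a SnapshotDeletionPolicy:

```java
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.Directory;

public final class IncrementalCopy {

    /** Hypothetical destination (another server, or an S3 bucket); not a real nrtsearch API. */
    interface FileStore {
        boolean contains(String fileName);
        void upload(Directory dir, String fileName) throws IOException;
    }

    /** Ship only the files of the latest commit that the destination does not have yet. */
    static void replicateLatestCommit(IndexWriter writer, FileStore remote) throws IOException {
        // Assumes the IndexWriterConfig was created with a SnapshotDeletionPolicy.
        SnapshotDeletionPolicy sdp =
            (SnapshotDeletionPolicy) writer.getConfig().getIndexDeletionPolicy();
        IndexCommit commit = sdp.snapshot();              // pin the commit so Lucene keeps its files
        try {
            Collection<String> files = commit.getFileNames(); // includes the segments_N "state" file
            for (String file : files) {
                if (!remote.contains(file)) {             // write-once: existing files never change
                    remote.upload(writer.getDirectory(), file);
                }
            }
        } finally {
            sdp.release(commit);                          // let Lucene garbage-collect the commit again
        }
    }
}
```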
09:23
So we also have to keep in mind there are certain things we do not want to solve with nrtsearch. For example, we do not want to implement distributed leader election algorithms to pick a new primary when the current primary has crashed. Nor do we want to do load balancing to route the load to the least loaded replicas.
09:41
This is taken care of by the Kubernetes systems for us. And we are not a database, we don't pretend to be one. We do not have transaction logs so the durability concerns are pushed back to the clients and they would need to maintain their offsets. So between two commits, if you crash, if the primary crashes, your data is lost
10:01
so you've got to replay offsets from the last commit. So with this in mind, this is a very simplified view of our architecture, where the primary is the only one taking indexing requests, and on commit/refresh intervals we upload the segments to remote file storage, which in our case is AWS S3.
10:21
The replicas are different Kubernetes pods. These are stateless containers. When they come up, they download data from S3, and for the real-time part the primary syncs with the replicas; we currently use a gRPC implementation for the NRT updates. So what are the performance characteristics
10:41
of this architecture in production, now that it has been running for a while? We use the S3 VPC endpoint for better performance, because it cuts down the congestion on the network, especially for the amounts of data we are downloading. For the write to disk, we wanted the bottleneck to be the network host capacity and not the write speed
11:03
of the disk, so we had to use SSDs. This allows us to basically go as fast as the network host allows, and we can bootstrap new nodes in seconds. Optionally, for some workloads, we do use compression, LZ4. It doesn't compress that much, but it's pretty efficient on the CPU, which is the reason we chose it.
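Purely for illustration, a replica bootstrap along these lines might look roughly like the sketch below, using the AWS SDK v2; the bucket name, key layout, and state-file format are assumptions, not nrtsearch's actual ones:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class ReplicaBootstrap {
    public static void main(String[] args) throws Exception {
        String bucket = "search-index-snapshots";     // hypothetical bucket name
        Path indexDir = Paths.get("/data/index");     // local SSD-backed volume
        Files.createDirectories(indexDir);

        try (S3Client s3 = S3Client.create()) {       // resolves credentials/region from the pod environment
            // 1. Fetch a small state file that lists the segment files of the latest commit
            //    (assumed here to be one file name per line).
            Path state = indexDir.resolve("current_files.txt");
            s3.getObject(GetObjectRequest.builder()
                    .bucket(bucket).key("myindex/current_files.txt").build(), state);

            // 2. Download each listed file. Segments are write-once, so anything
            //    already present locally from a previous run can be skipped.
            for (String file : Files.readAllLines(state)) {
                Path dest = indexDir.resolve(file);
                if (!Files.exists(dest)) {
                    s3.getObject(GetObjectRequest.builder()
                            .bucket(bucket).key("myindex/" + file).build(), dest);
                }
            }
        }
    }
}
```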
11:21
Now to the performance characteristics on the indexing side: we do use incremental indexing. Again, thanks to Lucene's write-once-only architecture, you really only need to upload the new files, and then you need to maintain a state file which gives you the current list of the file names, and we upload it to S3 on commit,
11:43
like I mentioned before. Another interesting part is that segment merging only happens on the primary. And the NRT API itself has the option of doing the segment merge without installing the merged segment yet, and letting the replicas know beforehand: hey, this new merged segment is available,
12:01
you might want one of these, and once they say, okay, we have it, that's when you switch the index reader to use the new segments. Now finally, the performance characteristics on search. As you can see, we've taken away all the indexing work from the search nodes, so the CPU is mainly devoted to serving search requests,
12:21
apart from copying over the NRT updates from the primary, and we can truly scale the replicas as part of an auto-configured microservice by just changing our configs. And then the other thing we added, which Lucene again supports but Elasticsearch did not
12:41
(I pulled up this ticket, and last I saw it's still open), was concurrent searching over segments. What happens in Elasticsearch is that a search within a shard is single-threaded. A shard itself is essentially a Lucene index which has multiple segments, but Lucene does allow you to search in parallel over those.
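In plain Lucene, this is just a matter of handing the searcher an executor; a minimal sketch (the index path is made up):

```java
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.FSDirectory;

public class ConcurrentSegmentSearch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);  // one task per segment slice
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/data/index")))) {
            // Passing an executor makes Lucene search the leaf slices in parallel
            // and then reduce the per-slice results into one top-N.
            IndexSearcher searcher = new IndexSearcher(reader, pool);
            System.out.println(searcher.search(new MatchAllDocsQuery(), 10).totalHits);
        } finally {
            pool.shutdown();
        }
    }
}
```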
13:00
The way Lucene does it by default is you divide the index into segment slices, execute the slices with a thread pool, and then reduce the results. That makes our CPU usage even better for the replicas. So how did this work out in practice? The graph on the top is our Elasticsearch deployment.
13:22
I have erased the axes because we cannot show the details, but the x-axis is time and the y-axis is traffic over time. As you can see, our traffic varies a lot by time of day and day of the week, and the red line is the capacity
13:41
at which we had to provision our Elasticsearch cluster, so the gap between the troughs of the waves and the red line is all the wasted capacity. At the bottom you can see our initial deployment of nrtsearch, where the traffic is closely matched by the capacity. So this has helped us reduce cost
14:01
by as much as 50% in production. And while reducing the cost, we are also able to get up to 50% faster across various percentiles; I have a timings graph on the next slide. In addition to that, deploying stateless containers is now a breeze, it's a microservice push.
14:21
So things like JVM upgrades and OS upgrades are no longer month-long projects, and when we have security bugs like Log4j, it's a simple change of the jar and a code push. So going back to the timings: these are timings for P50s, P95s, and P99s.
14:40
The green line is the nrtsearch timings, the blue line is the previous Elasticsearch timings, and the y-axis essentially plots the difference in percentage, again because we cannot show the actual latency. And you can see it's as much as 50%, but at least 30%.
15:00
Of course it was not all rosy; we had issues in practice. While the concurrent search works well, it had bad higher percentiles, the reason being that the default way of executing this didn't really take the work distribution into consideration: it put the segments randomly into slices.
15:23
So you could get unlucky and have a really large segment on a single thread doing its thing while all the other threads are not doing much. So Andrew from our team came up with virtual sharding, and what this tries to do is divide the work more uniformly by using the live doc count. So essentially we split the segments into buckets
15:43
such that each bucket gets a roughly even share of live documents. So several smaller segments might get lumped together in one bucket versus one large segment sitting alone in another. This meant we had to change our merge policy: you would not really merge down to one segment; instead,
16:00
if you have N buckets, you keep at least N segments. But this gives us more predictable timings across different percentiles.
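This is not Yelp's actual virtual-sharding code, but the bucketing idea can be sketched as a greedy partition of segments by live doc count, assuming one search task is then run per bucket:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;

public final class VirtualShards {
    /** Spread segments over n buckets so each bucket gets a roughly equal live-doc count. */
    static List<List<LeafReaderContext>> bucketize(List<LeafReaderContext> leaves, int n) {
        List<LeafReaderContext> sorted = new ArrayList<>(leaves);
        // Largest segments first, then always drop the next one into the currently
        // lightest bucket (the classic longest-processing-time heuristic).
        sorted.sort(Comparator.comparingInt(
            (LeafReaderContext ctx) -> ctx.reader().numDocs()).reversed());

        List<List<LeafReaderContext>> buckets = new ArrayList<>();
        long[] load = new long[n];
        for (int i = 0; i < n; i++) buckets.add(new ArrayList<>());

        for (LeafReaderContext ctx : sorted) {
            int lightest = 0;
            for (int i = 1; i < n; i++) if (load[i] < load[lightest]) lightest = i;
            buckets.get(lightest).add(ctx);
            load[lightest] += ctx.reader().numDocs(); // live docs, deletions excluded
        }
        return buckets;
    }
}
```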
16:22
The other issue we had was the fan-out factor, because like I mentioned, our current method of getting the primary to talk to the replicas for the NRT updates is just synchronous communication using gRPC. Some of our workloads have a high fan-out factor: the primary would send the update to replica one, replica two, and so on, so if you have a really high number of replicas, the network capacity of that host becomes an issue. But it's straightforward to fix, and we're working on it, in that we use the channel,
16:41
the gRPC channel, only for the metadata; the actual bytes, like I mentioned, Lucene leaves up to the implementer, so we plan to pull those from S3 because we already have them there. The other issue was the indexing model. Like I mentioned before, the indexer now has to be aware of pulling the data from somewhere, for example Kafka, and doing its commits back to Kafka.
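A minimal sketch of that client-side indexing loop, assuming a Kafka topic of JSON documents; the SearchIndexClient wrapper is hypothetical and stands in for the engine's indexing and commit RPCs:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IndexerLoop {
    /** Hypothetical wrapper around the search engine's indexing RPCs; not a real nrtsearch API. */
    interface SearchIndexClient {
        void addDocument(String json);
        void commit(); // durably flushes (and, in this setup, uploads) the new segments
    }

    static void run(SearchIndexClient index) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // assumed broker address
        props.put("group.id", "business-indexer");
        props.put("enable.auto.commit", "false");       // offsets are acknowledged manually
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("business-updates"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    index.addDocument(record.value());
                }
                // Acknowledge Kafka offsets only after the engine has committed, so a
                // primary crash between commits just means replaying from the last offset.
                index.commit();
                consumer.commitSync();
            }
        }
    }
}
```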
17:01
And we saw not all our clients within Yelp were really comfortable doing this. So the better model would be more of a push: you have an ingestion plugin, the offsets are configured and maintained internally in the plugin, and just the configuration is left to the client. We also had some fun rough edges and bugs, both in NRT and gRPC,
17:23
like buffer limitations and configurations. Like I mentioned throughout, we have a lot of custom plugins. We have this gRPC-based API, plus a gateway to do REST, but it gets very verbose, kind of similar to Elasticsearch almost.
17:40
It would be nice to simplify that, and we're working on it for some of the clients. So where are we going with this? For the future within Yelp, like I mentioned, we continue to migrate some of our lagging ES-based workloads to nrtsearch. And there's a lot of feature work going on, like I mentioned in the previous slide, and we're supporting other things:
18:01
we already added support for vector search within nrtsearch this past quarter, and now we're working on neural nets as plugins as well. And outside Yelp, I've spoken to a few engineers at a few companies, and they've shown interest in doing a similar approach, as they were seeing their Elasticsearch clusters
18:22
becoming more expensive and wondering if they could really deploy them in a stateless manner. And one of the common grounds we came to was that the onboarding of this could be, I won't say overwhelming, but maybe a little cumbersome, in that we have our own Kubernetes operator to deploy. But what I found out while playing with this was
18:41
you could easily get this done by using Helm charts. I have a link there. And Helm charts, for those of you who might not be aware, are essentially a package manager for Kubernetes. So you simply upload your CRD and configuration files: the primary and replica specs, the readiness probe, liveness probe, volume mounts, and whatever other spec you want.
19:00
And then you can download those from a common repository and just override them with your own values.yaml file. That works out pretty well. The next thing we're thinking of doing is going one step further and almost making it a cloud-native offering by deploying this as a service within different marketplaces.
19:22
These are the resources: there's the code, and the engineering blog post we wrote, which was a year or so ago, but most of it is still relevant. We're also hiring. And yeah, that's all the information there. And that's all I had. Thank you.
19:46
Do we have time for questions? We have time for questions. Anyone who would like to start? Come on, don't be shy. We have a big crowd here. Oh, there is one in the back.
20:05
Hi, I'm just wondering what sort of index size you're working with, and how does that affect cold starts and restarts of readers, and I guess the writers as well? So, the index size is per cluster. Yeah, like I said, this is a simplified version.
20:24
Most of our workloads, like I said before, are geo-shardable, so we essentially deploy this as a micro-cluster per geo-shard. So you can think of it like, oh, all of, I don't know,
20:40
US West could be in one micro-shard, and then we have a gateway in front of that to route the traffic accordingly. That's one mode. And then there's the other mode, where geo-sharding isn't really possible. In either case the indexes are in the gigabytes. We have used hosts with up to around 25 gigabits per second of network, and they don't have a problem keeping up with the data. But the indexes only go up to a certain size:
21:01
we try to stick to an index size where the limiting factor is the RAM available on the machine, so like maybe 60 gigs or 120 gigs, because like I said, we use a lot of doc values and Lucene is very happy to use the disk cache. So that's the limiting factor there. And then for other workloads which are not geo-sharding friendly, we have a scatter-gather mode,
21:21
which is typically what ES does. So we have another node in front which would scatter and gather, that kind of thing. Thanks. Hi, thanks. Very, very nice talk. Thank you.
21:41
I have a question about how often you are pulling the NRT updates, because the main idea behind doing the indexing on all the replicas in Elasticsearch is that you can reopen the index much faster, like once per second.
22:02
But the problem here is, if you for example have a large merge, it might take a lot of time to transfer those segments over the network. So what is your intended time to refresh the index then? Because it's not really near-real-time anymore in that case, or it depends on the definition.
22:20
So if I'm understanding the question, you're saying the bottleneck is going to be our segment merge time, right? As far as I know, within the NRT API there is an option to pre-warm the merged segment. So whenever the primary is ready, it will merge the segments and do that. And I want to say we've worked with a refresh interval of seconds, like order of seconds,
22:43
and that seems to work fine for the most part. Okay, thanks. Hi, thanks for the talk. I have a question related to the auto-scaling. And when there is a new pod started on Kubernetes,
23:03
the JVM is usually cold, and there is a latency spike; instead of things getting better, they are getting worse because the JVM is cold and you cannot quickly execute the queries. Do you do some sort of warm-up? So, no. We used to do this in the past
23:23
with the previous Lucene version from three years ago. And I think the issue back then was twofold. When we started up a node, our heap size was big, so we did all this loading, and we had stored fields, which were very heavy on the JVM heap. That's not the case now. Now we have doc values, which are basically on disk.
23:42
And there is a preload option for doc values, which we use for those file types. And the heap is really small now; we don't have many stored fields, so there is not much garbage collection happening at the start, so that's not a reason for slowness. The S3 download and, like I said, the write to the disk, that's the really important part.
24:01
But the JVM process itself is happy, and with doc values, if you preload them, the initial queries are not that slow. And we don't really need to warm; I want to say we might have a warming application here and there, but typically we don't need warming. But it's not about the warming of the caches and IO and things, it's more about the JVM itself.
24:21
It's just interpreting the bytecode instead of compiling and running it. We didn't seem to have an issue with that. You're talking about an order of seconds, minutes here? What's the typical issue? The first queries usually take much more time than the 1,000th query.
24:44
I don't see us having that issue, because we also have this liveness probe in Kubernetes. Our liveness probe itself has a bunch of things going on; it doesn't do the warming specifically, but it does a bunch of pre-configuring things.
25:02
Thanks. Yeah, hi. Great talk. My question is regarding the ML plugins that you mentioned. How easy or hard was it for you to replicate the same behavior with,
25:20
like earlier with Elastic and now with Lucene? Was it easy or hard to do, or did it take a lot of effort? So, I mean, the TLDR is it was not too hard, because we already had plugins. When we built this stuff out, one of the design goals was that we need to deploy all this code. And we have plugins not only for ranking,
25:40
but for analysis, and for special things like highlighting and collecting logs and whatnot. So what we have is an extensible plugin architecture. The plugins themselves are owned by the developing teams, not the search infrastructure team, but the APIs are owned by us. So when they start their cluster, they would just have to drop the jar
26:00
in the right classpath. And we have different class loaders loading this, but the API is in nrtsearch itself. Other questions? Then I have a question. How did you decide that raw Lucene is faster than Elastic?
26:21
Did you make this evaluation on a bunch of data, or did you put both of them in production and then decide? So what's the question, whether the Elasticsearch query is faster or slower? Not how, but how did you decide that raw Lucene is faster than Elasticsearch?
26:41
At query time, you mean? Yes, yes. So we dark launched all this. We launch it like a new flow; it's a microservice, like I mentioned. So whatever gateway was talking to Elasticsearch now talks to this. We would dark launch it. Dark launching means we would send the query to the new thing, but we don't take the response; we just collect the timings with the collectors.
27:02
And so these timings are from that. And we have to dark launch not only for performance reasons, but also for functionality: we need to check whether we're actually producing the same results or not. So that itself takes about as much time to roll out in production as building the stuff out. Okay, thank you.
27:21
Anyone else? Okay. I guess a slightly related question, but what do you think is the main driver of the performance difference between the Lucene version and the Elasticsearch version? Because I guess Kubernetes mainly does the management.
27:44
What is the main performance factor, what is it in Lucene? So it's a few things, like I mentioned, right? One is having a dedicated primary/replica segregation, where the replicas are only doing reads now
28:00
and no other work. So it's a collection of various things; that's one, the segregation. Another thing was not doing segment merges on every replica, for example. The third thing was concurrent searching across segments, so you better utilize your CPU rather than having one thread run per query. The next thing, which I didn't mention, was that
28:22
we have a geo-sharding-based approach, so we don't have the hot and cold issue, because like I told the other gentleman, now we have multiple of these micro-clusters, right? So you can have, let's say, San Francisco lunchtime: this micro-cluster could have more replicas, and then the Europe one would have scaled down.
28:40
So you're better utilizing the CPU. So overall, that's cheaper. And yeah, writing to SSDs. So it's a collection of a lot of things, I guess.
29:01
One question about your Kubernetes pods and the nodes. Are all the nodes on Kubernetes, or only the search nodes? Even the indexing nodes,
29:20
are they also in Kubernetes? Yeah, everything is a microservice. Because we use Kubernetes, it's on Kubernetes, but it could be anything. So the primary is a pod with one instance, and the replicas are an auto-scalable pod. And indexing pods are also auto-scalable? Sorry, which one? The indexing pods, are they also auto-scalable?
29:43
No, so the indexer is only one, right? So if it crashes, Kubernetes will bring it up again. Okay, so that means it's only one, okay. We thought about having a shadow one, but then we'd have to have something like ZooKeeper to keep state of what's going on. So yeah, and it's pretty fast
30:01
for us to bring up a new one. Otherwise you could almost have used Solr. Sorry? You could almost have used Solr in this case. Solr? Yes. So back then, a few things: I'm not sure how we could do the write to S3, for example, and have a completely stateless architecture,
30:21
like what we have here. Right, we would still need persistence. Okay, and just another question, because I'm not at all a specialist in AWS. You said at the beginning that Kubernetes with Elasticsearch was not an option because of EBS storage.
30:44
What is the link between Kubernetes and EBS storage? Can you not use SSDs in this case? Local SSDs, I mean. So we are using SSDs now. Yes, but here. So if you used SSDs here and the node went down,
31:02
we would lose the data, right? We still need the data. Okay. Yeah, thanks for the talk. Just one question about the fan-out. You mentioned that you had a problem with fan-out.
31:21
Yeah. But at the same time, you said some of the queries at least are executed concurrently. Doesn't that hurt the fan-out problem, because you're gonna have more variance, right? If you execute the query concurrently and you have more variance, the fan-out is gonna suffer. So, the fan-out is an issue for the primary, right?
31:41
Where's the diagram? Yeah, so the fan-out is the issue for the blue line, the primary sending to multiple replicas, and the issue there really is the network capacity of the host. But the concurrent searching happens within the replica itself, so that's the better utilization of the CPU for the replica, so it's an IO-bound thing.
32:02
Sorry, it's a CPU-bound thing, but the fan-out is a network-bound thing on the primary. Anyone else? One last question. Yeah, I wanted to clarify that ES issue you called out. Just so I understand what you said,
32:23
it's that it's one thread per shard, and you can't parallelize across segments within a shard. Right, so I don't know if ES has implemented this now; I think the ticket is still open. The search that happens in an ES index is single-threaded per shard for every search request.
32:43
The unit of parallelism is shards over there. But a shard essentially is a Lucene index, which is made up of multiple segments, and you can't search across segments in parallel in ES. So you would maybe shard more than the guidelines suggest, to take advantage of all the threads you have.
33:04
You would increase the number of shards to match the number of threads available, in a way, if you're gonna try and maximize all that CPU. So that is how you would resolve this in ES? Yeah, then you would have to have more shards. But I don't know if ES does auto-sharding now; in ES before, you had to decide the number of shards at the beginning,
33:24
and the data can grow afterwards. And each shard actually runs on a separate node, so it's on a different machine, whereas this is utilizing the cores on the same machine, kind of like vertical scaling. Thank you. Thank you, let's give Umesh a big round of applause. Thank you. Thank you.