Scaling an online search engine to thousands of physical stores

Formal Metadata

Title
Scaling an online search engine to thousands of physical stores
Title of Series
Number of Parts
56
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
An online e-commerce search engine is easy to put in place. Scaling it to serve millions of users, adding a marketplace to provide thousands of products, supporting multiple offers, prices and stocks on the same product are additional challenges more difficult to address. And what if, in addition, you mix your online search engine with the activity of thousands of physical stores? In this talk we explain how we addressed all these challenges in the context of the largest retail group and online grocery store in France. The constraint of multiple physical stores backed by the online search engine introduces additional challenges that we emphasize and address in detail. Our point of view, as we explain the challenges and solutions, is both technical and functional.
Transcript: English (auto-generated)
So yes, we'll be speaking about scaling when it's about physical stores, which is not exactly the same as full web search and full web stores.
I will start by introducing myself and the whole team at Adelean, because we are almost all here today at Berlin Buzzwords. So my name is Aline. I am the CTO of this wonderful team. We are experts in search engine technologies.
We are consultants, mostly with Elasticsearch and Solr, but whatever has search in it is interesting for us. So yeah, we do search. And we are also editors. Sorry, I just closed something. This is not my computer.
You have a lot of things open, Omar. So yes, we are also editors of a solution named A2, which is a product that we sell and license, specifically for e-commerce and enterprise search.
So we will speak about data when it's about the physical world. Then we'll talk in more depth about how it is indexed. Then we will see the story of an e-commerce competitor that grows, speak about architecture, and about running the infrastructure at the end. If you have any questions, maybe it's better to keep them for the end of the talk. I started a bit late, so I will try to speed up. Don't hesitate to interrupt me or tell me that I'm running out of time.
So let's go deep into the subject. We are in the physical world. So we are in stores with people in them, with their specificities. We are talking about any physical store that is going online, that must do a digital transformation or wants to, and with the recent events, everyone needed to get on the web at some point. It's a specific case because you have a high diversity in what is in the shop: the prices, the stocks. Whatever is in one store will not be in the other one. So this is specific. Even if it's all under one brand, grocery and physical retail are very specific.
Just to recall the expected features of e-commerce search. You probably know a lot about this, but maybe not. We expect search, of course. We need to find what we are looking for, find a product. We also expect autocomplete, because we need to get very quickly to what we are looking for. We also need to display a product card. We need to navigate through a tree, through the different departments. When there is no product for what we searched, we need similar products. We need suggestions. If we misspell something, we need a "did you mean". So we need a lot of features around search.
I will now explain what is happening when a shop needs to go online. A shop handles a lot of data, which can be very varied, in any form. You have files, maybe in folders on someone's computer. You have databases, you have the network, you have streams, you have different levels of maturity in any store, and in such companies you can also find services. And the services can all have their own story, their data, their differences.
And when it comes to the search engine, you have to build it on top of all that. You have to deal with a lot of complexity coming with this data. But that's okay, because a search engine can really help a company that needs to go online in its digital transformation, which is good.
You need that, because when you have a lot of data that is very diverse and complex, you need to go very quickly, and the search engine helps with that, because it eats almost anything and exposes it as a unified, very common API or platform, which is very easy to use. It masks all the complexity, and for me, search can put the whole digital transformation of a physical company one step ahead and move it forward a lot.
So now a bit of technical things. In the physical world, we need to handle how we index the different data we have around the products, the product data in every form. So we need indices, and we had to think about how to organize them. It's not quite straightforward. The first idea, of course, would be to have one index, and you put your whole product referential in it.
Then you know that you have specificities for offers. An offer is, for example, a product put on promotion, a price tag, whatever. So you have different offers, several per product, and you have different stores, and each store makes offers for products.
So you have somewhat relational data, with a model in it. You have to denormalize it, put it in an index, and then you're done. So it's easy: you have one index, a lot of products, a lot of offers, a very big index, and maybe it's a good solution, we'll see.
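To make this first option concrete, here is a minimal sketch of denormalizing a product referential and its per-store offers into documents for one single index. The field names (`product_id`, `store_id`, `price`, `stock`) and the sample data are illustrative assumptions, not the model used in the talk.

```python
from collections import defaultdict

# Hypothetical sample data: a product referential and per-store offers.
PRODUCTS = {
    "p1": {"name": "Tomatoes 1kg", "category": "vegetables"},
    "p2": {"name": "Potatoes 2kg", "category": "vegetables"},
}
OFFERS = [
    {"product_id": "p1", "store_id": "s1", "price": 2.49, "stock": 30},
    {"product_id": "p1", "store_id": "s2", "price": 2.79, "stock": 0},
    {"product_id": "p2", "store_id": "s1", "price": 1.99, "stock": 12},
]


def denormalize_single_index(products, offers):
    """Build one document per product, embedding the offers of every store."""
    offers_by_product = defaultdict(list)
    for offer in offers:
        # Drop the foreign key: the offer now lives inside its product document.
        entry = {k: v for k, v in offer.items() if k != "product_id"}
        offers_by_product[offer["product_id"]].append(entry)
    return [
        {"id": product_id, **fields, "offers": offers_by_product.get(product_id, [])}
        for product_id, fields in products.items()
    ]


single_index_docs = denormalize_single_index(PRODUCTS, OFFERS)
```

Each document grows with the number of stores selling the product, and any price change in any store touches that shared document.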
Another idea, on the right, would be to have one index with the referential of products. So we keep only the ID, the tag of the product, and then all the offers go into separate indices: one index per store, with the offers and the reference to the referential. Some people who work with search engines technically may already see something that we do not really like.
So to summarize the first option: it's good, it's compact, it's unified. You denormalize the data, but you can still find your data model in it, you can stick to the data you expect to have. The cluster state stays small because you only have one index, and everything is in one place.
But we found that you lose a lot of search performance because you have to search through a large model. So this is the first problem we had. You retrieve too much information: when you get a product, you get a very big product card with all the offers, and you have to find in it what you need to display. Any small error in a product breaks the whole thing: if one store puts bad information on one product, it is replicated for everyone. There are also frequent updates, because every time one store wants to change a price or do something, you have to update the same document. And in case two, you saw there was a join.
Joins in search, we don't really like that. So what we tried to do was to put one index per store. Let's do that. We duplicate, and we accept that we duplicate the common elements, the data that is common to everyone.
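As an illustration of the one-index-per-store layout, the sketch below copies the common product fields into each store's own documents, so a search for one store never needs a join. The index naming (`products-store-<id>`) and the field names are assumptions made for the example.

```python
# Hypothetical sample data: a product referential and per-store offers.
PRODUCTS = {
    "p1": {"name": "Tomatoes 1kg", "category": "vegetables"},
    "p2": {"name": "Potatoes 2kg", "category": "vegetables"},
}
OFFERS = [
    {"product_id": "p1", "store_id": "s1", "price": 2.49, "stock": 30},
    {"product_id": "p1", "store_id": "s2", "price": 2.79, "stock": 0},
    {"product_id": "p2", "store_id": "s1", "price": 1.99, "stock": 12},
]


def build_store_indices(products, offers, prefix="products-store-"):
    """Return {index_name: [documents]}, duplicating product fields per store."""
    indices = {}
    for offer in offers:
        product = products[offer["product_id"]]
        doc = {
            "id": offer["product_id"],
            **product,  # common data, deliberately duplicated in every store index
            "price": offer["price"],
            "stock": offer["stock"],
        }
        indices.setdefault(prefix + offer["store_id"], []).append(doc)
    return indices


store_indices = build_store_indices(PRODUCTS, OFFERS)
```

A price change in store `s2` now only touches a small document in `products-store-s2`, at the cost of storing the product name and category once per store.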
So we have a lot, a lot of duplicated data. We accept that. We put one index per physical store so that we can keep all the products, all the offers, the configuration. We can also go very deep in the settings: the index settings are also per physical store. So we really fine-tune everything for the physical store. With that, the cluster state grows, because we have a lot of indices and updates happening, but the search performance is the best and it's really easy to get the product we need. Each part keeps its independence, and in the physical stores world this is very important, because you cannot tell all the store directors, the people who manage the stores, "you will do the same as your colleague"; they don't like that. So everyone needs to keep their independence. And you have fewer updates per index: you have a lot of updates, but they are spread across the different indices.
So now, let's see another part. Let's say we have solved the problem of the physical stores. We assume that we are responding to all the specificities.
We are able to respond to the needs of every physical store. Everyone is happy. And of course, what happened was really accelerated by the COVID-19 crisis in our case. The digital transformation went much faster. Every company wanted it and they matured very quickly. So what happened? Well, it is good, but any company like that began to compete with the real pure players, the e-commerce pure players, the big ones. So there are other challenges that we'll see.
So this is the most beautiful schema of my presentation. I'm very proud of it. It's not mine. I should have credited it. Imagine we are in the situation of a company that bought the services of a wonderful company, a wonderful architecture consultant, IT services, an IT design consultant, an architect. You are finally maturing, converging. You have a lot of replication. You have architectures, you have services. You are in the pure players' world. You have achieved your digital transformation.
You give plenary conference talks and you can say you are one of them. So the systems converge, and this is good. You are really thinking as a digital company. You are in the fully online world, and this is normal.
What happened with online shopping is that, of course, when you start thinking like this, it opens a lot of doors in the mind. Anyone who succeeded in putting physical stores online can now aim to compete directly with Amazon or Rakuten or Alibaba or whoever, country-wide or even city-wide. Why not? So this is a real challenge. You can also imagine that your search engine will now serve data and products from a very small store. So you still have that. But at the same time, if what you need is not there, you can also search in a marketplace. You can search in, I don't know, a warehouse that provides fast delivery for non-food items or other stuff you need. So the idea would be that the search engine masks even more complexity. And now you have a search that gets really powerful, like the Amazon search, but including the physical world. Also imagine that with everything we have in our hands now, as we saw in the keynote before, you add a bit of personal data here, personalization, and it gets even more complex and more powerful. So the search engine is really very central.
So yeah, we will have a search engine platform with one index per store. I put a rough idea of how many products you would find: 20K products here. A marketplace would be provided as one data stream coming from another service, so you may have 1 million or 2 million products in it, unlimited.
A warehouse contains products that can be delivered. So 500,000 or 1 million, why not: big indices. And here you have another problem, because these are not the same types of products.
Here you have, for example, tomatoes, potatoes, kids' stuff, whatever. Here we will find iPads, iPods. Here we can find a garden table, for example. So very different products coming from any data source, and you need to put a search engine platform above them that still performs well. So what we have in mind is to do the work before indexing, because we cannot add complexity to the search part: we need to keep it fast. So we imagine a common schema. We prepare our data before indexing it so that we have a bit of structure, and the goal behind that is to be able to search across different indices, so multi-index searches.
So once we have it, it has to work. And we saw that we have big indices and a lot of differences.
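A rough sketch of the common-schema idea: each source keeps its own field names, and a small mapping step before indexing brings the shared properties (identifier, name, description, price) into one shape so that a single query can target several indices. The source field names (`ean`, `label`, `sku`, `title`, and so on) are hypothetical. The last helper relies on the fact that Elasticsearch accepts a comma-separated list of index names in the search URL.

```python
# Hypothetical mapping from common-schema fields to each source's own field names.
FIELD_MAPS = {
    "store": {"id": "ean", "name": "label", "description": "desc", "price": "price"},
    "marketplace": {"id": "sku", "name": "title", "description": "body", "price": "amount"},
}


def to_common(record, source):
    """Map a raw source record to the common schema, keeping the rest under 'extra'."""
    field_map = FIELD_MAPS[source]
    doc = {common: record.get(original) for common, original in field_map.items()}
    doc["source"] = source
    mapped = set(field_map.values())
    # Source-specific fields stay available in the index but outside the shared part.
    doc["extra"] = {k: v for k, v in record.items() if k not in mapped}
    return doc


def multi_index_search_path(index_names):
    """Build the URL path of a search that targets several indices at once."""
    return "/" + ",".join(index_names) + "/_search"


store_doc = to_common(
    {"ean": "123", "label": "Tomatoes", "desc": "Fresh", "price": 2.49, "aisle": "A3"},
    "store",
)
market_doc = to_common(
    {"sku": "mk-9", "title": "Garden table", "body": "Teak", "amount": 149.0},
    "marketplace",
)
path = multi_index_search_path(["store-042", "marketplace", "warehouse"])
```

Because both documents expose `id`, `name`, `description` and `price`, results coming from different indices can be displayed with the same product card.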
So we will now see a bit of how it can be built and how it can live in an information system. First, let's look at the needs in terms of infrastructure and what we need to address technically. First thing, we are in the mobile world. Anyone shopping with a mobile phone gets tired very quickly. When it takes about one second, you search and it doesn't work, you move away and check your Facebook updates or TikTok. So it needs to be very, very quick. And for the search part, we are challenged today to be under half a second for a response.
Then you have a lot of ingestion coming in because, for a marketplace for example, you have batches of products coming all the time, all day long, every minute, every five minutes. You also have updates, you have product removals, you have moderation, because the data is coming from basically anywhere.
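One way such batches of updates and removals are commonly sent to Elasticsearch is through its bulk API, which takes newline-delimited JSON: an action line, followed by a source line for `index` actions and nothing for `delete` actions. The sketch below only builds the request body; the index name and documents are made up, and nothing here is claimed to be the talk's actual ingestion code.

```python
import json


def bulk_body(index_name, upserts, removed_ids):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for doc in upserts:
        # Action line, then the document itself on the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    for doc_id in removed_ids:
        # Delete actions have no source line.
        lines.append(json.dumps({"delete": {"_index": index_name, "_id": doc_id}}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body(
    "marketplace",
    upserts=[{"id": "mk-9", "name": "Garden table", "price": 149.0}],
    removed_ids=["mk-3", "mk-4"],
)
```

Grouping a few minutes of changes into one such request keeps the number of round trips low when products arrive all day long.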
There is a lot of data processing, so you have to handle this ingestion. You need to provide search, and you don't know what for. I mean, you have to build a platform. I put "fully integrated with the main front end", meaning the main website, the main application, but you have to be open so that anyone at any moment can develop something that uses search across all the services. So it is important to make an API, to build it as a platform.
And to finish, you have to handle security, robustness, isolation of data. We'll talk about this quickly. So I'm sorry, it's a bit small. I don't know if you can read it.
It's not very well written, but this is an example of the basic architecture using Elasticsearch. So this is what we do. We have Elasticsearch with its configuration, the different mappings and settings for every index.
You will find one index per store, you will find the marketplace and other indices. You also have a configuration index, and it contains one config per index, per store, per specificity or per group.
To edit the configuration, we have a tool which we call the Business Console, with the best Business Console developer just in front of me. I have to mention him because he saved the presentation. So big up to the Business Console. With it you can fine-tune the configuration. You can apply boosts, define the facets and all that.
It's very central, and it is important to spend time on it because this is the entry point for all the business people, the people who know about the products. And then you have the two parts, with a very, very important one, which is before indexing.
So there is data coming from anywhere. I didn't draw it, but you will find the files and everything else here. Before indexing, you have to do whatever will speed up the search.
So all the optimization, any processing, dictionaries, anything that will help the index be clean and searchable very quickly, even if indexing takes a bit more time. It can be complicated because, as I told you, for the marketplace for example, indexing has to stay fast too. So there is a bit of a challenge here to index fast while still doing all this processing, but the most important thing is that the search is very fast, always.
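A small sketch of the kind of work that can be moved to indexing time: lowercasing, accent folding and dictionary expansion are computed once per document and stored in a dedicated field, so the query side stays simple. The `SYNONYMS` dictionary and the `search_text` field name are hypothetical.

```python
import unicodedata

# Hypothetical business dictionary, maintained outside the code.
SYNONYMS = {"pate": ["terrine"], "tomatoes": ["tomato"]}


def fold(text):
    """Lowercase and strip accents, e.g. 'Pâté' becomes 'pate'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def enrich(doc):
    """Return a copy of the document with a precomputed 'search_text' field."""
    tokens = fold(doc["name"]).split()
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, []))
    return {**doc, "search_text": " ".join(expanded)}


enriched = enrich({"id": "p7", "name": "Pâté de campagne"})
```

The extra cost is paid once when the product is ingested, not on every search.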
In the search part, you can also do some operations, for example collapsing results, but keep it very simple and very fast. So there are these two parts, and the user searches. It comes here, easy.
As I said before, when we have a lot of indices, the problem is the cluster state: as you handle a lot of index information, the cluster state gets very big.
So we needed to be multi-cluster. We built several clusters, put a subset of the store indices on each cluster, and replicated the common indices on all the clusters.
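With store indices spread over several clusters and the common indices present on each of them, every search first needs a lookup from store to cluster. A minimal sketch of such a lookup table follows; the class name, cluster URLs and index names are invented for the example and do not describe the actual router.

```python
class ClusterRouter:
    """Static table mapping each store to the cluster that hosts its index."""

    def __init__(self, cluster_urls, store_to_cluster, common_indices):
        self.cluster_urls = cluster_urls
        self.store_to_cluster = store_to_cluster
        # Common indices are assumed to be replicated on every cluster.
        self.common_indices = list(common_indices)

    def route(self, store_id):
        """Return (cluster URL, indices to query) for a store."""
        cluster = self.store_to_cluster[store_id]  # raises KeyError for an unknown store
        indices = [f"store-{store_id}", *self.common_indices]
        return self.cluster_urls[cluster], indices


router = ClusterRouter(
    cluster_urls={"c1": "http://es-c1.example:9200", "c2": "http://es-c2.example:9200"},
    store_to_cluster={"042": "c1", "107": "c2"},
    common_indices=["marketplace", "warehouse"],
)
url, indices = router.route("107")
```

Since the common indices exist on every cluster, one request to one cluster is enough to serve a store's search.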
And we handle the search across clusters ourselves. Basically, we built a handmade router that has a table, and when we ask for a store, it routes you to the right cluster. That's it. So it's a big infrastructure, but it guarantees you, again, the quick search. To finish, some things you have to think about when it comes to running this. The first important thing is to monitor what is happening on the search engine.
So we have to monitor the most frequent search terms and filters. Which searches gave zero results? What were the clicks on the second page?
When you search for a product, where was it in the page? You also need histograms through the day: when is it busiest? And the response times. It teaches you a lot about how the search is used, and this is also part of scalability, because it's not only the fact that we have to grow. It will always grow. Today we have two or three thousand physical stores. Maybe we'll have 10,000 in three years. We don't know.
Everything is growing, but we also know that the load is not the same throughout the day or across the days of the week. So we have to know about that in advance so that we are prepared when it's needed.
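The indicators listed above can be derived from a plain search log. The sketch below assumes a hypothetical log format (timestamp, query, hit count, response time in milliseconds) and computes the most frequent terms, the zero-result queries, an hourly histogram and the average response time.

```python
from collections import Counter
from datetime import datetime

# Hypothetical search log entries.
SEARCH_LOG = [
    {"ts": "2022-06-13T09:15:00", "query": "tomatoes", "hits": 12, "took_ms": 80},
    {"ts": "2022-06-13T09:40:00", "query": "tomatoes", "hits": 12, "took_ms": 100},
    {"ts": "2022-06-13T18:05:00", "query": "garden table", "hits": 0, "took_ms": 120},
    {"ts": "2022-06-13T18:30:00", "query": "ipad", "hits": 3, "took_ms": 100},
]


def summarize(log, top_n=3):
    """Aggregate a search log into a few monitoring indicators."""
    terms = Counter(entry["query"] for entry in log)
    zero_results = sorted({entry["query"] for entry in log if entry["hits"] == 0})
    per_hour = Counter(datetime.fromisoformat(entry["ts"]).hour for entry in log)
    average_ms = sum(entry["took_ms"] for entry in log) / len(log)
    return {
        "top_terms": terms.most_common(top_n),
        "zero_result_queries": zero_results,
        "searches_per_hour": dict(sorted(per_hour.items())),
        "avg_took_ms": average_ms,
    }


report = summarize(SEARCH_LOG)
```

The hourly histogram is what makes it possible to anticipate load peaks instead of discovering them.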
So this is just a quick view of what we're also working on. This is a proof of concept we did, pushing scalability as far as we can. So we built our solution. You see, if I go back to this one, most of this part does not depend on the data. It is algorithms and processing, and this part can be seen as a microservice that we are able to scale.
So the idea is to have different runners. We tried to do this using Kubernetes pods so that we can really scale up and down and see how, with a bit of separation of concerns, we can be even finer and more precise in the scalability.
So that was a quick view of what we have to think about. Here's a checklist of what we just said. First, it's very important to keep adapting to the place we are in, the physical stores, because the specificities are very important, physical stores still exist, and it's important to stick to them. We can still use a schema to structure the data when we include all the data sets coming from the web.
The response time is non-negotiable. This is really important. It's also important that your indexing is precise and that your data stays coherent, but response time is the most important.
It's important to monitor, to learn from what you monitor, and to prepare for scaling even more. Try to think microservices for what can be scaled, what does not depend on the data. It can always be important. And I didn't speak a lot about security because, for us, a search engine is really in the back end. It's behind VPNs and so on. Security is important, but we handle product data only, not personal data. So for us, the most important thing when we talk about security is to guarantee data integrity and to protect against error spreading and index corruption, which can also come from inside the system. So, that's it.
Thank you all. I think I can breathe now, because I just came down from upstairs, so it was a bit complicated, but thank you for your attention. I'm open to any questions you have. Have a great day. Thank you.
Thank you. You didn't mention testing of search relevance, for example. Where does that fit in? Is this part of your search console?
Like testing boosts, search configuration?
Just whether the results are relevant to the query.
Well, we have different parts that include testing. The first is code testing. Obviously you need to check integrity; when you write algorithms, you have to test them. Then another big, important part is integration testing. Also, when you are configuring at a very fine level with the search console, which is here, you are doing very important things that can be dangerous or can even corrupt indices, because you can also work directly on the Elasticsearch settings. So we have only manual integration testing at the moment, with pre-production and integration environments.
In the configuration part it's difficult to automate integration testing. So we do manual pre-testing: we click, we try searches and check the results.
That's it. Yeah, thank you.
I have a question about the very important response time that you mentioned. So, one second: is there any granularity to that? I mean, is it that everyone leaves at one second, or maybe half of them, or something like that?
I don't see what you mean.
So, one second: is it the time where everyone leaves, or do people already start to leave at half a second or something like that?
Well, I don't have an exact study of what happens at one second and at half a second. For me, it's more of a general rule: when you're above a second, everyone leaves. As for half a second: when I say one second, for me it's the response time of the whole chain, from the front end to getting the results displayed and all that. So our team works in partnership with the front-end team, and the whole thing must not take more than one second, but the search part must be even thinner because you have all the display on top of it. This is why we say that for search, it must be half a second.
Here's one more.
Yeah, hi, thanks for the talk. If I understand correctly, you're querying three indexes at the same time for each store.
So, the store index, the warehouse, and the...
Yeah.
How do you combine the candidates from the different indices? Combine and rank.
Yeah, luckily, we have some common properties. Mainly the product identifier and the product card: the main name, identifier and description are more or less the same. So we are able to get that and put it back into the results. This is why I talked about the schema. The common schema is important because you need to be able to combine information.
So the search terms will be searched in any specific field of a product in its specific index, but when we retrieve the results, we will only be able to display a small part, which is what is common between the three indices.
You need to have a bit of a common structure between the indices to do that.
Okay, next question here.
Yeah, hi. It's nice to have separate indexes, but I wanted to understand how you organize your code. Do you have separate code bases for each store, or is it the same code base organized in a certain structure? Because there may be differences that are very specific to a store. So how do you manage that?
We have the same code base. We really try to keep a single one, but of course the drawback is that you have a huge configuration part. That is where we currently are: one code base, a big configuration part, and then you can fine-tune specificities, and the configuration will differ depending on the index you are in. So we can still do it with one code base.
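The "one code base, large configuration" approach from this answer can be sketched as a default configuration plus per-index overrides: the code reads the same keys everywhere and only the values change. The configuration keys, the store name and the boost values below are hypothetical; the `field^weight` notation is the one Elasticsearch accepts in the `fields` list of a `multi_match` query.

```python
# Hypothetical defaults shared by every index.
DEFAULT_CONFIG = {
    "boosts": {"name": 3.0, "brand": 1.5, "description": 1.0},
    "facets": ["category", "brand"],
}
# Hypothetical per-index overrides: same keys, different values.
INDEX_OVERRIDES = {
    "store-042": {"boosts": {"brand": 4.0}},
    "marketplace": {"facets": ["category", "seller"]},
}


def config_for(index_name):
    """Merge the defaults with the overrides of one index."""
    override = INDEX_OVERRIDES.get(index_name, {})
    merged = {**DEFAULT_CONFIG, **override}
    # Boosts are merged key by key so an override can change a single field.
    merged["boosts"] = {**DEFAULT_CONFIG["boosts"], **override.get("boosts", {})}
    return merged


def boosted_fields(index_name):
    """Render boosts as 'field^weight' strings for a multi_match query."""
    return [f"{field}^{weight}" for field, weight in config_for(index_name)["boosts"].items()]


store_fields = boosted_fields("store-042")
```

An index without overrides, such as a newly added store, simply falls back to the defaults, so adding a store does not require new code.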
I do not exclude changing that. You can say that a marketplace is a different business from a grocery store, so it's not excluded that someday we would need plugins and very specific things. We also have a version where we use plugins and connectors specific to each stream. So there are different possibilities. Currently, we try to keep one code base, one product.
Yeah, here.
Hi, thanks for the talk. It's very interesting, but I have a question similar to the previous one. You mentioned that you have so many indices, one per store, right? How are you maintaining them? When you scale, let's say you have thousands of stores as your customers, how do you maintain your settings? Let's say you have separate settings and mappings for each index; how do you handle Elasticsearch updates? Let's say you have to change some mapping in an index. How are you maintaining these indices?
Well, it's a bit of the same answer. You handle it with configuration. In the store indices, you still have a lot of similarities. You have the same structure.
Not everything is different. The data is different. The settings and mappings, I would say, are very similar. So you can assume that your code will be the same, that you will be able to handle them the same way on the code side.
You see, you apply the boosts: the values will be different, but the keys will be the same. I don't know if that answers your question. Maybe not exactly. We can talk about it later if you want.
Okay, maybe a last question.
Hi, thank you so much for your talk.
Pretty interesting point of view. I have a couple of questions. As you know, when you have a sales event, like Black Friday, there are a lot of events. So do you autoscale Elasticsearch? Do you have any recommendations or a custom metric to do it?
So, until now, for Black Friday, the infrastructure handled it. We still don't have the Kubernetes thing, so it worked with the multi-cluster setup. The clusters we currently have are not full. They can handle more, and it worked. We currently have five clusters with a lot of people sending requests, and the load was maybe more on the indexing part: all the corrections, everything that was happening just before Black Friday, all the ingestion. As for the queries, they stay very light, very simple. So with infrastructure alone, you can handle the load, currently. The router was also handling everything, and we could do it. For Elasticsearch, I would say it's a bit the same. The CEC also has a scalable thing. I think that Elasticsearch is also available, for example, on Kubernetes, so you can scale up and down for such events.
Okay. So thank you. Thank you.