
How Booking.com serves Deep Learning model predictions


Formal Metadata

Title
How Booking.com serves Deep Learning model predictions
Author
Sahil Dua
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
How Booking.com serves Deep Learning model predictions [EuroPython 2017 - Talk - 2017-07-13 - Anfiteatro 1] [Rimini, Italy] With so many machine learning frameworks and libraries available, writing a model isn't the bottleneck anymore, but putting your models in production is still a challenge. In this talk, you will learn how we deploy Python deep learning models in production at Booking.com. Topics will include: deep learning model training in Docker containers; automated retraining of models; deployment of models using Kubernetes; serving model predictions in a containerized environment; optimising serving predictions for latency and throughput.
Transcript: English (auto-generated)
Hi. Is this good? Yeah. So I will be talking about how we serve deep learning model predictions at Booking.com. And before I start, I would like to give a brief introduction about myself. So I would like to tell you what I am and what I'm not
so that we have a better understanding of each other to meet the expectations. So I'm a backend developer working on developing the infrastructure for deploying the deep learning models at Booking.com. And I'm also a machine learning enthusiast, so both of these things just match well for me.
And I'm also a big open source fan, and I'm a contributor to a couple of projects like the Git tool that probably most of you have used already. I'm also a contributor to the pandas library, as well as Kinto by Mozilla, and the go-github project
by Google, and a bunch of other projects. And I'm also a tech speaker. So let me talk about what I'm not so that we have the expectations at the same level. I'm not a data scientist, and I'm not a machine learning expert. So if you have some specific questions about how things work
from a data scientist point of view and really about something related to deep learning or machine learning, I might not have the best answers right now. But I will be able to point you to where you can find the answer, or we can talk about that after my talk. So let me start with the agenda, what
I'm going to talk about. I'm going to start by mentioning a couple of applications of deep learning that we saw at Booking.com. And then I will talk about the lifecycle of a deep learning model from a data scientist's point of view, like what this model looks like and what the different stages of a deep learning model are.
And next, I will talk about the deep learning production pipeline that we have that we have built on the top of containers and Kubernetes. And yeah, let's begin. So starting with the applications of deep learning at Booking.com. The first application that we saw at Booking.com,
so before I talk about the applications, I would like to talk about the scale, because I mentioned we work at a large scale. We have over 1.2 million room nights reserved every 24 hours. And these reservations come from more than 1.3 million properties, which are across 220 countries.
So we have this large scale. And this provides us access to a huge amount of data that we can utilize to improve the customer experience of our users. So the first application that we saw at Booking.com was image tagging. The question here is, what do we see in a particular image?
Like, for example, if you see this image, what do we see in this image? And this is a really easy question as well as a difficult one. So if you ask this question to a person, to a human, it's easy, because we know when we look at an image, we can identify the objects in the image. And this is easy for a human. But when we talk about this question being answered
by artificial intelligence, by machine learning or deep learning, it's not a very easy one. So for example, if we pass this image to some publicly available model, like one trained on ImageNet or something else, this is what we get. We get results with different classes:
oceanfront, nature, building, penthouse, apartment, and all this stuff. But when we ask this question, what is there in that image, it really depends on what context we are talking about. From Booking.com's point of view, this is what we are concerned about. We are concerned about whether there is a sea view from the room or not, whether this photo is of a bed or not,
if it's a photo inside a room or not, or whether there is a balcony or a terrace. So there are a couple of challenges associated with this type of problem. First of all, this problem is not just
an image classification. It's image tagging. That means that there will be multiple labels, multiple classes for a particular image. And also, since our context is different from what other publicly available models may provide, we need to make sure that we come up with our own manual labels so that we can tag these images.
And the next challenge is that there is going to be a hierarchy of the labels. For example, we know that if there is a bed in the photo, the photo will be of an inside view of a room, unless you are in such a place where there is no room,
but only a bed. So yeah, once we know what is there in an image, we can use this information to improve the experience of the users. For example, if we know that a user is looking for a swimming pool in the property
that they're going to book, we can recommend or show them the hotels which we know have a swimming pool, because there is some photo which is tagged with swimming pool. Or similarly, if we know, based on previous history, that some customer is looking for a breakfast buffet, we can show the hotels or properties
which we know have some photos tagged with breakfast buffet. So this way, we can make sure that we are improving the experience of the customers and helping them find the hotels or properties that they want easily and quickly. Another application that we saw was recommendation system.
So this is a classic recommendation problem. We have a user X. They booked a hotel Y. Now we have a new user, user Z. We want to recommend some hotels that user Z is more likely to book. So the problem statement here is we want to find the probability of one user
booking a particular hotel. And what features do we have? We have some user features, like the country and language of the user. And then we have some contextual features, like what day of the week they are looking on, or what season it is: winter, spring, and so on.
And the next set of features we have is item features, features of the property that we are looking at, like price of the property or the location of the property or other information about that particular property. So once we realize that there are some set of applications
where we could achieve better results using the deep learning, we started exploring this field. And that's a credit to my colleagues, Stas, Gherkin, and Imra, who is a data scientist, who actually started with exploration of deep learning on different applications. And now we are actually using it in production successfully.
So next, let's talk about the lifecycle of a model, what it looks like for a particular model from the start of the idea to when it actually is used in the application, in your application, which may be anything. So these are the three steps, code, train, and deploy.
The first step is when a data scientist writes a model. They experiment with different kinds of embeddings, different kinds of features, or different numbers of hidden layers; they test and experiment with different kinds
of model architectures. And once they are happy with it, once they see good results, they move towards training on production data, and then they deploy. At Booking, we use the TensorFlow Python API, which is a high-level API that provides easy-to-use
functions to write a model architecture in Python. So when we talk about the production pipeline, these are the two steps that we have in what we call the production pipeline: training of a model on production data, and deployment in containers so that it can be served to any application.
So you may wonder why training of a model is a part of a production pipeline. You may also use your laptops to train your models, right? But this is why it is not a good idea. So if you try to train your model on your laptop, this is what you may end up looking like.
There are a couple of reasons for that. One reason is your data may be too large that you can't use your laptop efficiently. Or another reason is that your laptop, in most of the cases, will have limited resources, will have some limited number of cores or may not
have a very powerful GPU. So these are the reasons why you may want to do only the testing and experimenting with the model on your laptop. But then once you are sure that this is the model you want to go ahead with, it's a good idea to use some heavy servers or some specialized servers with GPUs or with a high number of cores, so that you can speed up the process
and speed up the process of deployment when you actually get the model ready. So this is what the training of a model looks like. We use our servers. We have huge servers which have a lot of cores and sometimes GPUs as well. We wrap the training. So this is the training script for a particular model.
And we run that on our huge servers, which are production servers. But there are going to be multiple data scientists who are going to train their models. And sometimes there are going to be multiple models being trained on the same server or multiple servers at the same time. And we may not be able to provide
independent environments if we do it this way on a single server. So what we do is we wrap this training inside a container. So what is a container? A container is a lightweight package of software which you can run on a host machine. And it includes all the dependencies
that your application may need. So we wrap this training script inside a container. We spin up a container every time we want to train a model. And this also provides us easy versioning of TensorFlow, because once we have a particular model written in TensorFlow, let's say version 1.1,
and then a new model comes up and a data scientist wants to use a newer version, we can easily have that new container carry the new version and use it. So basically, on the same machine, we can have different versions of the dependencies. And that's why we're using containers: to make sure that we have these independent environments for all
of these trainings. And it also helps with GPU support: these containers can utilize the GPUs on the big servers that we have. So this is what it looks like. We have this Hadoop storage where we have all the production data that we want to use for training our models.
We spawn up a new container when we want to train. It has a training script. And it fetches the data from the Hadoop storage. It runs the training. Once the training is done, we want to make sure that the model checkpoints, the model weights are stored somewhere so that we can utilize them later in production when we deploy them.
So what we do is we save the model checkpoints back to Hadoop storage. And the container is gone. So what can be more selfless than a container? It takes birth to do what you want it to do. And then it dies. That's the entire life of a container.
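A minimal sketch of that spawn-train-save-die lifecycle, assuming the Docker SDK for Python; the image name, training script, and HDFS paths below are hypothetical stand-ins:

```python
# Sketch: launch one training run in a throwaway container.
import docker

client = docker.from_env()

client.containers.run(
    image="registry.example.com/dl-training:tf-1.1",  # pinned TensorFlow version per model
    command=[
        "python", "train.py",
        "--data", "hdfs://namenode/training/hotel-images",          # fetch production data
        "--checkpoint-dir", "hdfs://namenode/models/image-tagger",  # save weights back
    ],
    runtime="nvidia",  # expose GPUs, assuming nvidia-docker is set up on the host
    remove=True,       # the container is removed as soon as training finishes
)
```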
So once we have this training done, we have trained our model on production data. And we have stored the model checkpoints on Hadoop storage, which we can utilize now. Now, deployment is putting that model in production somewhere, on servers, where you can utilize that model to get predictions
by your different applications that you may have. You may have your web application. Or you may have your app, Android, iOS, any app. And you want to make sure that you can utilize that model from those applications. So what we did was we have a Python app, which
is a basic WSGI HTTP server. What it does is it takes the model weights from the Hadoop storage, and it loads the model in memory. So when we want to load a model, it needs two things. It needs a model definition as well as the model weights.
So we have the model definition already when we have this Python app running. And we get the model weights from the Hadoop storage. We combine these, and we load the model in memory so that it is ready to serve the predictions now. And on the top of this, it also provides a nice URL, a nice, easy to use, easy to remember URL
to get the predictions. So basically, it all boils down to sending a GET request with all your parameters and getting the prediction back. This is what it looks like. Again, we have this app running in a containerized environment so that it's independent and it carries all the dependencies with itself.
And there are no problems like "it runs on my machine" or "it runs on this version of the OS but not on that one." It contains all the dependencies that it needs, and it can run on any server where you can run Docker containers. So basically, we use Docker for the containers.
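A minimal sketch of what such a serving app can look like, using Flask on top of the TensorFlow 1.x API; the toy model, feature names, and checkpoint path are hypothetical stand-ins rather than the actual Booking.com setup:

```python
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

graph = tf.Graph()
with graph.as_default():
    # The model definition lives in the app's code...
    features = tf.placeholder(tf.float32, shape=[None, 3], name="features")
    weights = tf.get_variable("weights", shape=[3, 1])
    bias = tf.get_variable("bias", shape=[1])
    prediction = tf.squeeze(tf.matmul(features, weights) + bias, name="prediction")
    saver = tf.train.Saver()

# ...while the trained weights come from storage and are loaded once, into memory.
session = tf.Session(graph=graph)
saver.restore(session, "/models/hotel-ranker/model.ckpt")

@app.route("/predict")
def predict():
    # e.g. GET /predict?price=80&rating=4.1&distance=1.2
    values = [float(request.args[name]) for name in ("price", "rating", "distance")]
    score = session.run(prediction, feed_dict={features: [values]})
    return jsonify(prediction=float(score[0]))
```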
So this is what it looks like. We have the containerized serving of our model. And we can have any kind of clients, which will just send us the input features and get back the predictions. But as I mentioned earlier, we have a huge scale that we operate on. And when we have thousands of requests or millions
of requests per second, we can't just have one server. So what we do is this. We spawn a lot of containers, put them behind a load balancer, and the client doesn't know how many servers are actually serving. You just send requests to a load balancer IP, and load balancer takes care of all the scheduling,
all the routing of the requests. Since we operate at such a large scale, we have plenty more containers. So as we keep on increasing the number of containers that we have for one application, we want a way to be able to manage these containers.
Because it's possible that sometimes we want to increase the number of containers, or sometimes we want to decrease the number of containers, and we see that there's less traffic. Also, we even want to diagnose some of the containers when something goes wrong with the containers. Or let's say we want to kill some of the containers and spawn them again because there's
some error or something. So for this, we use Kubernetes. Kubernetes is a container orchestration platform which helps us in scheduling, maintaining, and scaling applications using containers. So Kubernetes is a really nice tool
by Google which provides us a really nice, flexible way to scale up or scale down any application at any time. We can create new instances, new containers, put them behind the same load balancer, and those containers will be now serving the applications to the request from the clients. Or we can scale down easily with just one command.
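That one-command scale-up or scale-down can also be done programmatically; a minimal sketch with the Kubernetes Python client, using a hypothetical deployment name and namespace:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

# Ask Kubernetes for 50 replicas of the serving deployment.
apps.patch_namespaced_deployment_scale(
    name="model-serving",
    namespace="default",
    body={"spec": {"replicas": 50}},
)
```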
And also, Kubernetes makes sure that if we say that we want to have, let's say, 50 containers for an application, then even if one, two, or, let's say, 10 of those containers die because of some error, at any moment it's going to retry and create new ones, so that we don't have to care if something goes wrong,
unless there is something seriously wrong and it can't create new containers. So basically, it will try to maintain the number of containers at the particular level that we have set. Once we know how we have deployed the models, we also need to be able to measure the performance of these models in production, where we have a lot of requests
coming in at a rate of many thousands of requests per second. So this is what it looks like. Let's say you have your model, and it takes some computation time to compute the prediction for a set of input features.
But that is not going to be the time that your client is going to see. Your client is also going to have some request overhead because of networking latency, depending on where your app is hosted and where your client is coming from. So this is what it looks like: the total prediction time is the sum of the request overhead
and the computation time. And if you have n instances that you predict in one request, you just multiply the computation time by n, and this is what you get as a rough estimate of your prediction time from the client's point of view. And we can see that if we have some simple models,
like logistic regression or linear regression, where we have a small set of features and it's a small model, the request overhead will be the bottleneck, and the computation time will be almost negligible compared to the request overhead.
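Written as a formula, that rough client-side estimate for a request carrying n instances is:

$$ t_{\text{prediction}} \approx t_{\text{request overhead}} + n \times t_{\text{computation}} $$

For small models the overhead term dominates; for heavy models the computation term does.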
So once we know this is the kind of performance that we can expect, there may be two things. Either you may want to optimize for latency or throughput. Let's talk about latency. Latency is the amount of time it takes to serve one request. So you may have some applications,
like let's say you have a web application which needs to be served as soon as possible. So you want to optimize for latency there. And these are some of the ways that you can use to optimize for latency. First way is don't predict in real time if you can pre-compute. This is a simple way when you can pre-compute all the results
that you know are going to be needed. You can just save them in a lookup table and serve from that lookup table, and you will be really fast, because you won't spend any computation time at request time.
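For the cases where pre-computation does apply, the idea is as simple as it sounds; a toy sketch with a stand-in function instead of a real model:

```python
def precompute(model_fn, all_known_inputs):
    """Run the model once for every input we ever expect to see."""
    return {inputs: model_fn(inputs) for inputs in all_known_inputs}

# Hypothetical toy "model" and inputs, just to show the shape of the idea.
table = precompute(lambda features: sum(features) / len(features),
                   [(80, 4.1), (120, 4.6), (95, 3.9)])

# At request time, a prediction is a dictionary lookup with no model computation.
print(table[(80, 4.1)])
```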
But we understand that this is not always possible; in most applications, we need to predict in real time. What we can do there is reduce the request overhead. And one of the ways we can do that is to have the model embedded in the application, so that there is no latency in accessing the model and getting the predictions back. That's what we do as well.
We keep the model in memory in the container that is serving the app, so that it's able to predict and return the response quickly. Next is: predict for one instance per request. This is useful when your computation time is huge compared to the request overhead.
When you know that your computation time is the bottleneck for your request, you should send as many requests as you have instances. So let's say you want to predict for a set of 10 instances. You should send 10 requests, because you know that the request overhead is not the bottleneck here, and you don't need to reduce it.
You just want to make sure that you send the requests as soon as possible and get the results back. And you can also apply some techniques like quantization. What that means is that you convert your float32 values to a fixed 8-bit type. And how it helps is that now your CPU can hold four times
more data in the same processor, and hence it becomes faster at processing that data than it would be computing with the float values. And there are some TensorFlow-specific techniques, like freezing the network. Freezing the network means that when you have a computation graph, you have
some TensorFlow variables, and if you convert those variables into TensorFlow constants, you get some boost in the performance and the speed of computing the predictions. And another thing is you can optimize for inference. What that means is you can remove all the unused nodes from the graph, and that will help in speeding up the computation again.
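A sketch of what freezing and optimizing for inference looked like with TensorFlow 1.x-era tools; the checkpoint path and node names are hypothetical, and these APIs moved around in later TensorFlow releases:

```python
import tensorflow as tf
from tensorflow.python.tools import optimize_for_inference_lib

with tf.Session() as sess:
    saver = tf.train.import_meta_graph("/models/hotel-ranker/model.ckpt.meta")
    saver.restore(sess, "/models/hotel-ranker/model.ckpt")

    # Freezing: bake the variable values into the graph as constants.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["prediction"])

# Optimizing for inference: strip nodes that only matter during training.
optimized = optimize_for_inference_lib.optimize_for_inference(
    frozen,
    input_node_names=["features"],
    output_node_names=["prediction"],
    placeholder_type_enum=tf.float32.as_datatype_enum)

with tf.gfile.GFile("/models/hotel-ranker/optimized_frozen.pb", "wb") as f:
    f.write(optimized.SerializeToString())
```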
Next is we may want to optimize for throughput. Throughput means the amount of work being done in one unit time, maybe one second, one minute, depending on what your use case is. If you want to get a lot of work done per unit time,
the first thing is, again: do not compute in real time if you can pre-compute. If you can, have a lookup table with all the computations and use it when the request comes. And another thing you can do is batch the requests. When you want the maximum amount of work done per unit of time, you want to reduce the request overhead
as much as possible. So if you send a lot of predictions together in one request, let's say a thousand of them, you're going to get a performance boost of those 1,000 times the request overhead, which you no longer pay, as compared to sending those requests one by one. And you can also parallelize the requests
and use asynchronous requests. Instead of waiting for one response to come back before sending the next request, you can just send them all in parallel, let the service do its work, asynchronously collect the responses, and make sure that you get the maximum work done per unit of time.
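A minimal sketch of batching plus parallel requests, using the standard library and the requests package; the endpoint URL and payload shape are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://model-service.example.com/predict"
instances = [{"price": p, "rating": r}
             for p, r in [(80, 4.1), (120, 4.6), (95, 3.9), (60, 3.5)]]

def predict_batch(batch):
    # Many instances per request: the request overhead is paid once per batch.
    return requests.post(URL, json={"instances": batch}).json()

# Split into batches and send them in parallel instead of one by one.
batches = [instances[i:i + 2] for i in range(0, len(instances), 2)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(predict_batch, batches))
```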
So let's try to summarize what we talked about. First of all, we talked about training of models in containers. We spawn a new container. It fetches the data from our Hadoop storage. It can be MySQL as well. It really depends on the application.
And it runs the training script in an independent environment in a container. Once the training is complete, it makes sure that it stores the model checkpoints back in the Hadoop storage, and it dies. That's the entire process of training a model in a container. The next is serving these models
from containers using Kubernetes. We spawn as many containers as we need, depending on how many requests we have for that particular application. And we let Kubernetes do its job of load balancing as well as maintaining and managing the containers, and providing us an easy interface to diagnose any problems that we may have.
And the next is that we optimize the serving of these apps for latency or throughput, depending on what the application is. If you have a cron job or something which has a lot of work to do in one burst, you can use the techniques to optimize for throughput.
Or if you have a real-time application in which you just need to show the result right away to the user, you can optimize your serving for latency. We have all these options available in our pipeline. To work on all these cool things and a lot of other things like MapReduce, Spark,
recommender systems, and a bunch of other things, we are hiring, especially for software developer roles as well as data scientist roles. So yeah, if you are interested in working on these things, you may check out this link, or you may get in touch with me on LinkedIn, Twitter, or GitHub.
I go by the name sahildua2305 on most of the social media websites. That's it. Thank you. Thank you, Sahil. Please raise your hand if you have a question.
Thank you. So you use Kubernetes, and you can scale the number of replicas up and down, right? As you mentioned, what do you use to decide whether you should scale the replicas up or scale them
down? What algorithm is behind the load balancer? Do you do it manually or automatically? Sorry, I didn't catch your question. So, what do we use as a metric to decide whether we want to scale? Yeah, yeah, exactly. You have a number of replicas, like five. Now you have load, and you want
to decide if there should be 10 replicas, or to scale down. Yeah, so Kubernetes out of the box provides support for a few metrics, like CPU usage, disk, memory, as well as the traffic that we get in terms of the number of requests. So it really depends on the kind of application
that we have, because in some of the areas, we want to use the CPU usage metric, which tells us how busy our CPUs are in a particular container. Or we may also want to use the WSGI queue size, because once we have a lot of requests coming to the containers, we want to make sure that those queues are not full.
And once those queues are getting full, we want to spawn more containers so that the traffic can be distributed and those queues are not dropping requests. So it really depends. WSGI queue size is one of the metrics that we are looking at. OK, thanks.
And my second question is, how do you annotate your data for model training? Do you have some team of annotators, or how do you do it? Your question is, how do we come up with this data, or what? How do you annotate data, like the images? Do you have some team of annotators who say, this is a bed, this is a chair, this is a window?
Oh, yeah, OK. So when we started writing this model for image tagging, we outsourced this tagging. We had a huge number of images which were tagged manually by people, by humans. And we used that kind of data to train our model.
And is it some company, or how did you hire them? Sorry? Is it some external company that has these annotators, or how did you arrange it? Your voice is not clear, sorry. OK, I will come to you afterwards then. OK, so thank you very much for your talk.
I think this is one of the main problems with Python machine learning, deploying it.
I would be interested, if you want to use machine learning, you usually have to do some feature engineering, like you get some input data, and then you have to crunch some numbers. Where do you actually do that? Do you do that in the app, and you tell the app, OK, you have to provide this data? Or do you do that in the container? Or do you do it on the Hadoop when the data comes in, and you just kind of like send a pointer to the Hadoop data?
Yeah, so that's something that I didn't cover. So what we do is we have event data that is being logged for all the activities that we have on our website. And once we have the data, we have some workflows or cron jobs which deal with the data and prepare it into the kind of data
that we want to use in our models. So we have a separate workflow which takes care of managing and preparing the data, basically, for these models. I was wondering if you could talk a little bit about how you iterate on your models.
And let's say, I'm not sure whether that's the case, but if you have some new training data, you want to take it into account, you want to retrain your models, and then check whether they're still performing well or not, how do you deal with these kind of things? So are you asking about how we deploy new models or the performance testing of new models?
Both. OK. So once we know that we have a new model, that a data scientist wants to update the model, what we do is, so we use OpenShift on top of Kubernetes, which gives us a graphical interface for managing the entire infrastructure. So once we know that there is a new model,
we update our deployment with that new model. And we can use A/B testing to see what kind of results we get. And we have proper monitoring, which tells us the distribution of our feature sets or the distribution of our outputs for a particular model. And we can use that information to decide whether the model is good or not,
whether we want to keep it or move back to the previous version. In your talk, one of the ways to improve throughput and latency was to cache, or to have a hash table of previous predictions. How did you implement that, per container?
Or do you have something centralized? And what technology do you use for that? So the thing that I mentioned about caching and keeping the predictions in lookup tables, that is something that really depends on your use case.
So honestly, we haven't found that kind of use case, where we already know in advance which predictions we are going to need. So what we do is we don't use the lookup tables; we predict in real time. We use the other techniques that I mentioned to optimize for latency as well as throughput. We just don't yet have an application where we could employ the lookup tables.
OK, so that's it. Thank you, Sahil. Please give a warm round of applause to Sahil.