
How Booking.com serves Deep Learning model predictions


Formal Metadata

Title
How Booking.com serves Deep Learning model predictions
Author
Sahil Dua
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
How Booking.com serves Deep Learning model predictions [EuroPython 2017 - Talk - 2017-07-13 - Anfiteatro 1] [Rimini, Italy] With so many machine learning frameworks and libraries available, writing a model isn't the bottleneck anymore, but putting your models in production is still a challenge. In this talk, you will learn how we deploy Python deep learning models in production at Booking.com. Topics will include: deep learning model training in Docker containers; automated retraining of models; deployment of models using Kubernetes; serving model predictions in a containerized environment; optimising serving predictions for latency and throughput.
Transcript: English (auto-generated)
Hi. Is this good? Yeah. So I will be talking about how we serve deep learning model predictions at Booking.com. And before I start, I would like to give a brief introduction about myself. So I would like to tell you what I am and what I'm not
so that we have a better understanding of each other to meet the expectations. So I'm a backend developer working on developing the infrastructure for deploying the deep learning models at Booking.com. And I'm also a machine learning enthusiast, so both of these things just match well for me.
And I'm also a big open source fan, and I'm a contributor to a couple of projects like the Git tool that probably most of you have used already. I'm also a contributor to the pandas library, as well as Kinto by Mozilla, and the go-github project
by Google, and a bunch of other projects. And I'm also a tech speaker. So let me talk about what I'm not so that we have the expectations at the same level. I'm not a data scientist, and I'm not a machine learning expert. So if you have some specific questions about how things work
from a data scientist point of view and really about something related to deep learning or machine learning, I might not have the best answers right now. But I will be able to point you to where you can find the answer, or we can talk about that after my talk. So let me start with the agenda, what
I'm going to talk about. I'm going to start by mentioning a couple of applications of deep learning that we saw at Booking.com. And then I will talk about the lifecycle of a deep learning model from a data scientist's point of view, like what this model looks like and what the different stages of a deep learning model are.
And next, I will talk about the deep learning production pipeline that we have that we have built on the top of containers and Kubernetes. And yeah, let's begin. So starting with the applications of deep learning at Booking.com. The first application that we saw at Booking.com,
so before I talk about the applications, I would like to talk about the scale, because I mentioned we work at a large scale. We have over 1.2 million room nights reserved every 24 hours. And these reservations come from more than 1.3 million properties, which are across 220 countries.
So we have this large scale. And this provides us access to a huge amount of data that we can utilize to improve the customer experience of our users. So the first application that we saw at Booking.com was image tagging. The question here is, what do we see in a particular image?
Like, for example, if you see this image, what do we see in this image? And this is a really easy question as well as a difficult one. So if you ask this question to a person, to a human, it's easy, because we know when we look at an image, we can identify the objects in the image. And this is easy for a human. But when we talk about this question being answered
by artificial intelligence, by machine learning or deep learning, it's not a very easy one. So for example, if we pass this image to some publicly available model, like one trained on ImageNet or something else, this is what we get. We get results with different classes:
oceanfront, nature, building, penthouse, apartment, and all this stuff. But when we ask this question, what is there in that image, it really depends on what context we are talking about. From Booking.com's point of view, this is what we are concerned about. We are concerned about whether there is a sea view from the room or not, whether this photo is of a bed or not,
if it's a photo inside a room or not, or whether there is a balcony or a terrace. So there are a couple of challenges associated with this type of problem. First of all, this problem is not just
an image classification. It's image tagging. That means that there will be multiple labels, multiple classes for a particular image. And also, since our context is different from what other publicly available models may provide, we need to make sure that we come up with our own manual labels so that we can tag these images.
And the next challenge is that there is going to be a hierarchy of the labels. For example, we know that if there is a bed in the photo, the photo will be of an inside view of a room, unless you are in such a place where there is no room,
but only a bed. So yeah, once we know what is there in an image, we can use this information to improve the experience of the users. For example, if we know that a user is looking for a swimming pool in the property
that they're going to book, we can recommend or show them the hotels which we know have a swimming pool, because there is some photo which is tagged with swimming pool. Or similarly, if we know, based on previous history, that some customer is looking for a breakfast buffet, we can show the hotels or properties
which we know have some photos tagged with breakfast buffet. So this way, we can make sure that we are improving the experience of the customers and helping them find the hotels or properties that they want easily and quickly. Another application that we saw was recommendation system.
So this is a classic recommendation problem. We have a user X. They booked a hotel Y. Now we have a new user, user Z. We want to recommend some hotels that user Z is more likely to book. So the problem statement here is we want to find the probability of one user
booking a particular hotel. And what features do we have? We have some user features, like the country and language of the user. And then we have some contextual features, like what day of the week they are looking on, or what season it is: winter, spring, and so on.
And the next set of features we have is item features, features of the property that we are looking at, like price of the property or the location of the property or other information about that particular property. So once we realize that there are some set of applications
where we could achieve better results using the deep learning, we started exploring this field. And that's a credit to my colleagues, Stas, Gherkin, and Imra, who is a data scientist, who actually started with exploration of deep learning on different applications. And now we are actually using it in production successfully.
So next, let's talk about the lifecycle of a model, what it looks like for a particular model from the start of the idea to when it actually is used in the application, in your application, which may be anything. So these are the three steps, code, train, and deploy.
The first step is when a data scientist writes a model. They experiment with different kinds of embeddings, different kinds of features, or different numbers of hidden layers; they test and experiment with different kinds
of model architectures. And once they are happy with it, once they see good results, they move towards training on production data, and then they deploy. At Booking, we use the TensorFlow Python API, which is a high-level API that provides easy-to-use
functions to write a model architecture in Python. So when we talk about the production pipeline, these are the two steps that we have in what we call the production pipeline: training of a model on production data, and deployment in containers so that it can be served to any application.
So you may wonder why training of a model is a part of a production pipeline. You may also use your laptops to train your models, right? But this is why it is not a good idea. So if you try to train your model on your laptop, this is what you may end up looking like.
There are a couple of reasons for that. One reason is your data may be too large that you can't use your laptop efficiently. Or another reason is that your laptop, in most of the cases, will have limited resources, will have some limited number of cores or may not
have a very powerful GPU. So these are the reasons why you may want to do only the testing and experimenting with the model on your laptop. But then once you are sure that this is the model you want to go ahead with, it's a good idea to use some heavy servers or some specialized servers with GPUs or with a high number of cores, so that you can speed up the process
and speed up the process of deployment when you actually get the model ready. So this is what the training of a model looks like. We use our servers. We have huge servers which have a lot of cores and sometimes GPUs as well. We wrap the training. So this is the training script for a particular model.
And we run that on our huge servers, which are production servers. But there are going to be multiple data scientists who are going to train their models. And sometimes there are going to be multiple models being trained on the same server or multiple servers at the same time. And we may not be able to provide
independent environments if we do it this way on a single server. So what we do is we wrap this training inside a container. So what is a container? A container is a lightweight package of software which you can run on a host machine. And it includes all the dependencies
that your application may need. So we wrap this training script inside a container. We spin up a container every time we want to train a model. And this also provides us easy versioning of TensorFlow, because once we have a particular model written in TensorFlow, let's say version 1.1,
and then a new model comes up and a data scientist wants to use a newer version, we can easily have that new container carry the new version and use it. So basically, on the same machine, we can have different versions of the dependencies. And that's why we're using containers: to make sure that we have these independent environments for all
of these trainings. And it also helps with GPU support: these containers can utilize the GPUs on the big servers that we have. So this is what it looks like. We have this Hadoop storage where we have all the production data that we want to use for training our models.
We spawn up a new container when we want to train. It has a training script. And it fetches the data from the Hadoop storage. It runs the training. Once the training is done, we want to make sure that the model checkpoints, the model weights are stored somewhere so that we can utilize them later in production when we deploy them.
So what we do is we save the model checkpoints back to Hadoop storage. And the container is gone. So what can be more selfless than a container? It takes birth to do what you want it to do. And then it dies. That's the entire life of a container.
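A minimal sketch of that spawn-train-save-die lifecycle, assuming the Docker SDK for Python; the image name, training script, and HDFS paths below are hypothetical stand-ins:

```python
# Sketch: launch one training run in a throwaway container.
import docker

client = docker.from_env()

client.containers.run(
    image="registry.example.com/dl-training:tf-1.1",  # pinned TensorFlow version per model
    command=[
        "python", "train.py",
        "--data", "hdfs://namenode/training/hotel-images",          # fetch production data
        "--checkpoint-dir", "hdfs://namenode/models/image-tagger",  # save weights back
    ],
    runtime="nvidia",  # expose GPUs, assuming nvidia-docker is set up on the host
    remove=True,       # the container is removed as soon as training finishes
)
```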
So once we have this training done, we have trained our model on production data. And we have stored the model checkpoints on Hadoop storage, which we can utilize now. Now, deployment is putting that model in production somewhere, on servers, where you can utilize that model to get predictions
by your different applications that you may have. You may have your web application. Or you may have your app, Android, iOS, any app. And you want to make sure that you can utilize that model from those applications. So what we did was we have a Python app, which
is a basic WSGI HTTP server. What it does is it takes the model weights from the Hadoop storage, and it loads the model in memory. So when we want to load a model, it needs two things. It needs a model definition as well as the model weights.
So we have the model definition already when we have this Python app running. And we get the model weights from the Hadoop storage. We combine these, and we load the model in memory so that it is ready to serve the predictions now. And on the top of this, it also provides a nice URL, a nice, easy to use, easy to remember URL
to get the predictions. So basically, it all boils down to sending a GET request with all your parameters and getting the prediction back. This is what it looks like. Again, we have this app running in a containerized environment so that it's independent and it carries all the dependencies with itself.
And there are no problems like "it runs on my machine" or "it runs on this version of the OS but not on that one." It contains all the dependencies that it needs, and it can run on any server where you can run Docker containers. So basically, we use Docker for the containers.
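A minimal sketch of what such a serving app can look like, using Flask on top of the TensorFlow 1.x API; the toy model, feature names, and checkpoint path are hypothetical stand-ins rather than the actual Booking.com setup:

```python
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

graph = tf.Graph()
with graph.as_default():
    # The model definition lives in the app's code...
    features = tf.placeholder(tf.float32, shape=[None, 3], name="features")
    weights = tf.get_variable("weights", shape=[3, 1])
    bias = tf.get_variable("bias", shape=[1])
    prediction = tf.squeeze(tf.matmul(features, weights) + bias, name="prediction")
    saver = tf.train.Saver()

# ...while the trained weights come from storage and are loaded once, into memory.
session = tf.Session(graph=graph)
saver.restore(session, "/models/hotel-ranker/model.ckpt")

@app.route("/predict")
def predict():
    # e.g. GET /predict?price=80&rating=4.1&distance=1.2
    values = [float(request.args[name]) for name in ("price", "rating", "distance")]
    score = session.run(prediction, feed_dict={features: [values]})
    return jsonify(prediction=float(score[0]))
```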
So this is what it looks like. We have the containerized serving of our model. And we can have any kind of clients, which will just send us the input features and get back the predictions. But as I mentioned earlier, we have a huge scale that we operate on. And when we have thousands of requests or millions
of requests per second, we can't just have one server. So what we do is this. We spawn a lot of containers, put them behind a load balancer, and the client doesn't know how many servers are actually serving. You just send requests to a load balancer IP, and load balancer takes care of all the scheduling,
all the routing of the requests. Since we operate at such a large scale, we have plenty more containers. So as we keep on increasing the number of containers that we have for one application, we want a way to be able to manage these containers.
Because it's possible that sometimes we want to increase the number of containers, or sometimes we want to decrease the number of containers, and we see that there's less traffic. Also, we even want to diagnose some of the containers when something goes wrong with the containers. Or let's say we want to kill some of the containers and spawn them again because there's
some error or something. So for this, we use Kubernetes. Kubernetes is a container orchestration platform which helps us in scheduling, maintaining, and scaling applications using containers. So Kubernetes is a really nice tool
by Google which provides us a really nice, flexible way to scale up or scale down any application at any time. We can create new instances, new containers, put them behind the same load balancer, and those containers will be now serving the applications to the request from the clients. Or we can scale down easily with just one command.
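That one-command scale-up or scale-down can also be done programmatically; a minimal sketch with the Kubernetes Python client, using a hypothetical deployment name and namespace:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

# Ask Kubernetes for 50 replicas of the serving deployment.
apps.patch_namespaced_deployment_scale(
    name="model-serving",
    namespace="default",
    body={"spec": {"replicas": 50}},
)
```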
And also, Kubernetes makes sure that if we say that we want to have, let's say, 50 containers for an application, then even if one, two, or, let's say, 10 of those containers die because of some error, at any moment it's going to retry and create new ones, so that we don't have to care if something goes wrong,
unless there is something seriously wrong and it can't create new containers. So basically, it will try to maintain the number of containers at the particular level that we have set. Once we know how we have deployed the models, we also need to be able to measure the performance of these models in production, where we have a lot of requests
coming in at a rate of many thousands of requests per second. So this is what it looks like. Let's say you have your model, and it takes some computation time to compute the prediction for a set of input features.
But that is not going to be the time that your client is going to see. Your client is also going to have some request overhead because of networking latency, depending on where your app is hosted and where your client is coming from. So this is what it looks like: the total prediction time is the sum of the request overhead
and the computation time. And if you have n instances that you predict in one request, you just multiply the computation time by n, and this is what you get as a rough estimate of your prediction time from the client's point of view. And we can see that if we have some simple models,
like logistic regression or linear regression, where we have a small set of features and it's a small model, the request overhead will be the bottleneck, and the computation time will be almost negligible compared to the request overhead.
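Written as a formula, that rough client-side estimate for a request carrying n instances is:

$$ t_{\text{prediction}} \approx t_{\text{request overhead}} + n \times t_{\text{computation}} $$

For small models the overhead term dominates; for heavy models the computation term does.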
So once we know this is the kind of performance that we can expect, there may be two things. Either you may want to optimize for latency or throughput. Let's talk about latency. Latency is the amount of time it takes to serve one request. So you may have some applications,
like let's say you have a web application which needs to be served as soon as possible. So you want to optimize for latency there. And these are some of the ways that you can use to optimize for latency. First way is don't predict in real time if you can pre-compute. This is a simple way when you can pre-compute all the results
that you know are going to be needed. You can just save them in a lookup table and serve from that lookup table, and you will be really fast, because you won't spend any computation time at request time.
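For the cases where pre-computation does apply, the idea is as simple as it sounds; a toy sketch with a stand-in function instead of a real model:

```python
def precompute(model_fn, all_known_inputs):
    """Run the model once for every input we ever expect to see."""
    return {inputs: model_fn(inputs) for inputs in all_known_inputs}

# Hypothetical toy "model" and inputs, just to show the shape of the idea.
table = precompute(lambda features: sum(features) / len(features),
                   [(80, 4.1), (120, 4.6), (95, 3.9)])

# At request time, a prediction is a dictionary lookup with no model computation.
print(table[(80, 4.1)])
```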
But we understand that this is not always possible; in most applications, we need to predict in real time. What we can do there is reduce the request overhead. And one of the ways we can do that is to have the model embedded in the application, so that there is no latency in accessing the model and getting the predictions back. That's what we do as well.
We keep the model in memory in the container that is serving the app, so that it's able to predict and return the response quickly. Next is: predict for one instance per request. This is useful when your computation time is huge compared to the request overhead.
When you know that your computation time is the bottleneck for your request, you should send as many requests as you have instances. So let's say you want to predict for a set of 10 instances. You should send 10 requests, because you know that the request overhead is not the bottleneck here, and you don't need to reduce it.
You just want to make sure that you send the requests as soon as possible and get the results back. And you can also apply some techniques like quantization. What that means is that you convert your float32 values to a fixed 8-bit type. And how it helps is that now your CPU can hold four times
more data in the same processor, and hence it becomes faster at processing that data than it would be computing with the float values. And there are some TensorFlow-specific techniques, like freezing the network. Freezing the network means that when you have a computation graph, you have
some TensorFlow variables, and if you convert those variables into TensorFlow constants, you get some boost in the performance and the speed of computing the predictions. And another thing is you can optimize for inference. What that means is you can remove all the unused nodes from the graph, and that will help in speeding up the computation again.
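A sketch of what freezing and optimizing for inference looked like with TensorFlow 1.x-era tools; the checkpoint path and node names are hypothetical, and these APIs moved around in later TensorFlow releases:

```python
import tensorflow as tf
from tensorflow.python.tools import optimize_for_inference_lib

with tf.Session() as sess:
    saver = tf.train.import_meta_graph("/models/hotel-ranker/model.ckpt.meta")
    saver.restore(sess, "/models/hotel-ranker/model.ckpt")

    # Freezing: bake the variable values into the graph as constants.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["prediction"])

# Optimizing for inference: strip nodes that only matter during training.
optimized = optimize_for_inference_lib.optimize_for_inference(
    frozen,
    input_node_names=["features"],
    output_node_names=["prediction"],
    placeholder_type_enum=tf.float32.as_datatype_enum)

with tf.gfile.GFile("/models/hotel-ranker/optimized_frozen.pb", "wb") as f:
    f.write(optimized.SerializeToString())
```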
Next is we may want to optimize for throughput. Throughput means the amount of work being done in one unit time, maybe one second, one minute, depending on what your use case is. If you want to get a lot of work done per unit time,
the first thing is, again: do not compute in real time if you can pre-compute. If you can, have a lookup table with all the computations and use it when the request comes. And another thing you can do is batch the requests. When you want the maximum amount of work done per unit of time, you want to reduce the request overhead
as much as possible. So if you send a lot of predictions together in one request, let's say a thousand of them, you're going to get a performance boost of those 1,000 times the request overhead, which you no longer pay, as compared to sending those requests one by one. And you can also parallelize the requests
and use asynchronous requests. Instead of waiting for one response to come back before sending the next request, you can just send them all in parallel, let the service do its work, asynchronously collect the responses, and make sure that you get the maximum work done per unit of time.
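A minimal sketch of batching plus parallel requests, using the standard library and the requests package; the endpoint URL and payload shape are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://model-service.example.com/predict"
instances = [{"price": p, "rating": r}
             for p, r in [(80, 4.1), (120, 4.6), (95, 3.9), (60, 3.5)]]

def predict_batch(batch):
    # Many instances per request: the request overhead is paid once per batch.
    return requests.post(URL, json={"instances": batch}).json()

# Split into batches and send them in parallel instead of one by one.
batches = [instances[i:i + 2] for i in range(0, len(instances), 2)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(predict_batch, batches))
```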
So let's try to summarize what we talked about. First of all, we talked about training of models in containers. We spawn a new container. It fetches the data from our Hadoop storage. It can be MySQL as well. It really depends on the application.
And it runs the training script in an independent environment in a container. Once the training is complete, it makes sure that it stores the model checkpoints back in the Hadoop storage, and it dies. That's the entire process of training a model in a container. The next is serving these models
from containers using Kubernetes. We spawn as many containers as we need, depending on how many requests we have for that particular application. And we let Kubernetes do its job of load balancing as well as maintaining and managing the containers, and providing us an easy interface to diagnose any problems that we may have.
And the next is that we optimize the serving of these apps for latency or throughput, depending on what the application is. If you have a cron job or something which has a lot of work to do in one burst, you can use the techniques to optimize for throughput.
Or if you have a real-time application in which you just need to show the result right away to the user, you can optimize your serving for latency. We have all these options available in our pipeline. To work on all these cool things and a lot of other things like MapReduce, Spark,
recommender systems, and a bunch of other things, we are hiring, especially for software developer roles as well as data scientist roles. So yeah, if you are interested in working on these things, you may check out this link, or you may get in touch with me on LinkedIn, Twitter, or GitHub.
I go by the name sahildua2305 on most of the social media websites. That's it. Thank you. Thank you, Sahil. Please raise your hand if you have a question.
Thank you. So you use Kubernetes, and you can scale the number of replicas up and down, right? As you mentioned, what do you use to decide whether you should scale the replicas up or scale them
down? What algorithm is behind the load balancer? Do you do it manually or automatically? Sorry, I didn't catch your question. So, what do we use as a metric to decide whether we want to scale? Yeah, yeah, exactly. You have a number of replicas, like five. Now you have load, and you want
to decide if there should be 10 replicas, or to scale down. Yeah, so Kubernetes out of the box provides support for a few metrics, like CPU usage, disk, memory, as well as the traffic that we get in terms of the number of requests. So it really depends on the kind of application
that we have, because in some of the areas, we want to use the CPU usage metric, which tells us how busy our CPUs are in a particular container. Or we may also want to use the WSGI queue size, because once we have a lot of requests coming to the containers, we want to make sure that those queues are not full.
And once those queues are getting full, we want to spawn more containers so that the traffic can be distributed and those queues are not dropping requests. So it really depends. WSGI queue size is one of the metrics that we are looking at. OK, thanks.
And my second question is, how do you annotate your data for model training? Do you have some team of annotators, or how do you do it? Your question is, how do we come up with this data, or what? How do you annotate data, like the images? Do you have some team of annotators who say, this is a bed, this is a chair, this is a window?
Oh, yeah, OK. So when we started writing this model for image tagging, we outsourced this tagging. We had a huge number of images which were tagged manually by people, by humans. And we used that kind of data to train our model.
And is it some company, or how did you hire them? Sorry? Is it some external company that has these annotators, or how did you arrange it? Your voice is not clear, sorry. OK, I will come to you afterwards then. OK, so thank you very much for your talk.
I think this is one of the main problems with Python machine learning, deploying it.
I would be interested, if you want to use machine learning, you usually have to do some feature engineering, like you get some input data, and then you have to crunch some numbers. Where do you actually do that? Do you do that in the app, and you tell the app, OK, you have to provide this data? Or do you do that in the container? Or do you do it on the Hadoop when the data comes in, and you just kind of like send a pointer to the Hadoop data?
Yeah, so that's something that I didn't cover. So what we do is we have event data that is being logged for all the activities that we have on our website. And once we have the data, we have some workflows or cron jobs which deal with the data and prepare it into the kind of data
that we want to use in our models. So we have a separate workflow which takes care of managing and preparing the data, basically, for these models. I was wondering if you could talk a little bit about how you iterate on your models.
And let's say, I'm not sure whether that's the case, but if you have some new training data, you want to take it into account, you want to retrain your models, and then check whether they're still performing well or not, how do you deal with these kind of things? So are you asking about how we deploy new models or the performance testing of new models?
Both. OK. So once we know that we have a new model, that a data scientist wants to update the model, what we do is, so we use OpenShift on top of Kubernetes, which gives us a graphical interface for managing the entire infrastructure. So once we know that there is a new model,
we update our deployment with that new model. And we can use A/B testing to see what kind of results we get. And we have proper monitoring, which tells us the distribution of our feature sets or the distribution of our outputs for a particular model. And we can use that information to decide whether the model is good or not,
whether we want to keep it or move back to the previous version. In your talk, one of the ways to improve throughput and latency was to cache, or to have a hash table of previous predictions. How did you implement that, per container?
Or do you have something centralized? And what technology do you use for that? So the thing that I mentioned about caching and keeping the predictions in lookup tables, that is something that really depends on your use case.
So honestly, we haven't found that kind of use case, where we already know in advance which predictions we are going to need. So what we do is we don't use the lookup tables; we predict in real time. We use the other techniques that I mentioned to optimize for latency as well as throughput. We just don't yet have an application where we could employ the lookup tables.
OK, so that's it. Thank you, Sahil. Please give a warm round of applause to Sahil.