
#bbuzz: Deep learning in production: Serving image models at scale


Formal Metadata

Title
#bbuzz: Deep learning in production: Serving image models at scale
Series Title
Number of Parts
48
Author
License
CC Attribution 3.0 Unported:
You may use, adapt, copy, distribute, and transmit the work or its content in unchanged or adapted form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Deep learning achieves great performance in many areas, and it’s especially useful for computer vision tasks. However, using deep learning in production is challenging: it requires a lot of effort to develop and run the infrastructure needed to serve deep learning models at scale. In this talk, we present a system for classifying images on one of the largest online classified advertising platforms. The main requirement for this system is to classify tens of millions of images daily and be able to operate reliably even during peak hours. It took a year and lots of trial and error to arrive at the system we currently use. We present the details of this journey and tell our story: how we approached it initially, what worked and what didn’t, how it evolved and how it’s working right now. Of course, we also walk you through the technical details and show how to implement a similar system using Python, AWS, Kubernetes, MXNet, and TensorFlow.
Transcript: English (automatically generated)
Hello, my name is Alexey, and today I'm going to talk about serving deep learning models in production. First, a few words about me.
I have been working as a software engineer professionally for more than 10 years, and six of them I spent working with machine learning systems. Right now I work as a lead data scientist at OLX Group, and you can see the OLX Group logo in this picture.
OLX Group is a group of online classifieds companies. Maybe you've heard of some of them: Avito, OLX, letgo, and a couple more brands. The main idea of online classifieds is a place where you can share,
where you can sell something, where you can buy something. So this is OLX India. This is the place where people can come and sell things they don't need anymore, or where people can come and buy used things for cheaper prices.
OLX India is one of the biggest websites we have. There are a couple of them, OLX Ukraine, OLX India, and we also have a presence in Africa, Asia, and South America.
So on this slide you see OLX Ukraine, and typically for online classifieds, pictures are very important. When you want to buy something and you browse through the catalog, you see many, many pictures of things,
and having good pictures is very important for deciding whether you want to learn more about something, whether you want to contact the seller and arrange a meeting to buy it. On OLX we have a lot of images: 10 million images are uploaded by our customers daily,
and the challenge we are going to talk about in this presentation is how to apply deep learning models to 10 million images per day. So in this talk we'll first discuss the motivation: why are we doing this, why do we need it?
Then we'll spend some time talking about how we train models. After training is done, there comes the next step: how to actually serve the model. We'll talk about the evolution of model serving at OLX,
so how we started initially, how it evolved, what was the original architecture, what were some drawbacks of this architecture, how we changed it and simplified it. So imagine you have a car and you want to sell it.
So what you do is you go to OLX, you create a listing, fill in some details, and then you take a picture, and what happens next? This picture is uploaded to our image hosting.
Our image hosting is based on S3. S3 is a service in AWS for storing files, so this is basically a thing where you can put files and then get files back. And every day 10 million images are uploaded to this image hosting. It means that we have billions and billions of pictures in our S3 buckets. What we want to know about these images is some information about them. How good are these images?
Because as mentioned earlier, for people who want to buy things, it's very important to have a good picture, to get a good impression of what the item might look like, and decide whether they want to contact the seller or not.
And if a picture is good, then it maximizes the chances that they will decide to actually contact the seller and buy the item eventually. So here we have two pictures. One picture is better quality, the other picture is a bit worse.
So we want to know which pictures are good and which pictures are bad. And in case a picture is not the best quality, we want to contact the seller and suggest some ways to improve the image and the overall listing. Then we are also interested in what is on these images.
What are the objects on these images? As you see, there are many, many things that people can post and upload. Bikes, cars. Sometimes people can upload something they want to...
They can try to sell something that they aren't supposed to sell, like a weapon. We also want to know that. So in this case, we have somebody trying to sell a machete; of course, we should prevent this from happening.
So from images, we want to know what is on the image. So for each image, we want to know whether it's an image of a truck, whether it's an image of a fridge, or it's an image of some weapon. So the idea is simple. So how can we extract this information? Of course, using machine learning and deep learning.
So we get all these images that we have, send them to a machine learning model, and then store the output somewhere in a database. So this can be a metadata database, and all these labels, all these categories are the fields in this database,
basically the information about each image. So the plan is clear. We want to use machine learning. How do we actually train models? So training models is quite simple these days. There are many services, cloud services,
that make the job a lot easier than a few years ago. We use Amazon SageMaker. With SageMaker, it's quite simple. So all we need to do is upload the data to S3 and run a SageMaker job. The SageMaker job gets the data from S3, trains the model with the parameters we specify,
and saves the results. So it looks like that. So we have a way to collect the data. So it can be our own service for labeling, or we can also use Amazon Mechanical Turk. And then for each image we know,
for example, image quality, how good the image is. We get all this data and save to S3. And in S3 we have the actual images plus the labels. Then we can run a SageMaker job. SageMaker job fetches the images from S3, trains the model,
and then saves the results again to S3 as a model file. Quite simple. And we can have a model in no time. So let's say we spend some time, a week or two, training the model. We have it. The model is quite good.
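(For illustration, launching such a training job with the SageMaker Python SDK can look roughly like the sketch below; the IAM role, container image, bucket names, instance type, and hyperparameters are placeholders, not the actual setup used at OLX.)

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder IAM role

estimator = Estimator(
    image_uri="<training-container-image>",  # e.g. a TensorFlow or MXNet training container
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/models/image-quality/",  # the model artifact ends up here
    hyperparameters={"epochs": "10", "learning-rate": "0.001"},
    sagemaker_session=session,
)

# The labeled images live in S3; SageMaker pulls them, trains the model,
# and writes the resulting model file back to S3.
estimator.fit({
    "train": "s3://my-bucket/data/train/",
    "validation": "s3://my-bucket/data/validation/",
})
```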
What's next? What we can do next is, for example, since we use SageMaker for training, we can also use SageMaker for model serving. So what we want to do is just take this model and put it behind a SageMaker endpoint. But we don't just want to put this model behind an endpoint: we want to get all the images that we have in S3, run them through the model, and save the results in a metadata database, so that the users can benefit from this.
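(Continuing the sketch above, deploying the trained model behind a SageMaker endpoint could look like this; the instance type and endpoint name are made up.)

```python
# Deploy the trained estimator behind a managed HTTPS endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
    endpoint_name="image-quality-model",  # hypothetical endpoint name
)

# predictor.predict(...) would then return model outputs for a preprocessed image batch.
```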
So we need a way to get the images, run them through the model, save the results, and make the users happy. Now we'll talk about how to actually do this. To do it, we created a special service called the metadata service. The clients of this service, typically the other teams that need the predictions of our models, communicate with it. The metadata service talks to a model wrapper: a very simple service that fetches the images from S3 and sends them to SageMaker. SageMaker processes the images and returns the predictions, the model wrapper returns the predictions to the metadata service, and the metadata service saves the results to the database and responds to the user. Now the user can use these predictions
to do whatever they want. And this first version was quite good: we could already analyze the quality of the images we had and educate our users; in cases where an image is not good, we can say what is wrong with it and suggest ways to improve it. There were some problems with this initial architecture, though. First of all, SageMaker turned out to be quite expensive: when we simply deployed to our own Kubernetes cluster instead of using a SageMaker endpoint, we could reduce costs about four times. Then we noticed that it's quite difficult to deal with sudden spikes of traffic, when during some days a lot of users suddenly try to upload images. To gracefully scale up and scale down, we made our services asynchronous: we basically added a queue between the metadata service and the model wrapper. With this queue, it was a lot easier to actually scale our models, because we didn't need to process everything immediately; we could just wait a bit, scale our models up, and then work through the peaks of traffic. We had two models, and for each model we had a separate instance of the thing we previously called the model wrapper,
which could fetch images from S3, talk to TensorFlow Serving or MXNet model serving, and run the models. So let's walk through the process. First, a client submits a request. It can be: for these files, I want to know the category of the objects in these images. That means we want to run a classification model. The metadata service responds immediately, saying that the request is enqueued: wait, and we will tell you when it's finished. Now the metadata service checks the database to see if we already have results for some of the files. If we don't, we submit this request to the queue. Then the image category model wrapper listens to this queue and pulls messages from there, typically in batches of 10 images. It gets the images from S3 and does some pre-processing: we need to resize the images, convert them to NumPy arrays, and normalize the arrays. Then it takes the arrays and converts them to protobuf. Once we have protobuf, we use gRPC to talk to TensorFlow Serving. We send the request, and TensorFlow Serving replies with the results, again over gRPC.
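(As a rough sketch of what such a model-wrapper worker can look like end to end, including the response-queue step described next; the queue URLs, message format, model name, and output tensor name are placeholders, and the exact preprocessing depends on how the model was trained.)

```python
import io
import json

import boto3
import grpc
import numpy as np
import tensorflow as tf
from PIL import Image
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

REQUEST_QUEUE = "https://sqs.eu-west-1.amazonaws.com/123456789012/image-requests"    # placeholder
RESPONSE_QUEUE = "https://sqs.eu-west-1.amazonaws.com/123456789012/image-responses"  # placeholder

channel = grpc.insecure_channel("tensorflow-serving:8500")  # TensorFlow Serving gRPC port
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)


def preprocess(image_bytes):
    # Resize and normalize; the details depend on the model.
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((224, 224))
    return np.asarray(image, dtype=np.float32) / 255.0


def predict_batch(batch):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "image-category"  # placeholder model name
    request.inputs["images"].CopyFrom(tf.make_tensor_proto(batch))
    response = stub.Predict(request, timeout=10.0)
    return tf.make_ndarray(response.outputs["probabilities"])  # placeholder output name


while True:
    messages = sqs.receive_message(
        QueueUrl=REQUEST_QUEUE, MaxNumberOfMessages=10, WaitTimeSeconds=20
    ).get("Messages", [])

    for message in messages:
        body = json.loads(message["Body"])  # e.g. {"id": ..., "bucket": ..., "keys": [...]}
        images = [
            preprocess(s3.get_object(Bucket=body["bucket"], Key=key)["Body"].read())
            for key in body["keys"]
        ]
        probabilities = predict_batch(np.stack(images))

        # Put the predictions on the response queue for the metadata service to pick up.
        sqs.send_message(
            QueueUrl=RESPONSE_QUEUE,
            MessageBody=json.dumps({"id": body["id"], "predictions": probabilities.tolist()}),
        )
        sqs.delete_message(QueueUrl=REQUEST_QUEUE, ReceiptHandle=message["ReceiptHandle"])
```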
Then the model wrapper puts the results to the response queue, the metadata service listens to this queue, gets the results, saves them to the database, and finally responds to the client. This is done using a callback, saying: hey, for these IDs, these are the categories. And this worked quite well. We could use these models for many use cases, like the one I already mentioned: detecting images of not-so-great quality
and then suggesting to the sellers ways to improve them. The other case was moderation: when somebody is trying to sell something they are not supposed to sell, we could catch these images and not let them go live on the platform.
There were unfortunately some drawbacks in this architecture. It was pretty inconvenient for the clients, because when a system is asynchronous, it's quite difficult to work with. Instead of simply sending a request and getting a response, they also need to keep track of what the requests were and provide a callback. It means that they need to make this callback endpoint publicly available. So it's a bit difficult for the clients.
Then it's also not really real time because in some cases when we have peaks of traffic it's difficult to, let's say, at the same time 10,000 users uploaded the images. So of course there will be some delay
in working through this backpack of images and it means that if somebody is trying to sell a gun at this point we might need to wait 5 or sometimes 10 minutes before we can actually catch this case and remove the ad. In these cases we of course want to react in real time
and already know about guns at the moment they are uploaded to the platform. And it's also too much asynchronous. We need to have a lot of queues and just following through the system
to see how requests are propagating is sometimes quite difficult. It's also expensive. We use SQS for the queues and it works through polling so the clients, the services who listen on the queue
they simply ask the queue hey, are there new messages, are there new messages? And they keep doing this. Of course there is some delay but eventually for each request we have to pay a certain amount of money and then at some point half of the costs for the infrastructure
was simply SQS polling. That was too expensive. Then we also had some duplicated logic, because these model wrappers both need to talk to S3 and do some pre-processing, so when creating these services we needed to somehow put this shared logic into a library.
It was a bit difficult to maintain. Then, the database that we had was not good for analytics. We can of course store the results there and use them for responses, but when people wanted to analyze the results of the models, we couldn't simply use it for that. We also used MySQL, and with our traffic we found out pretty fast that it just grows too much, and it's a bit of a burden to maintain: making sure that the database is not overloaded, that it has a sufficiently large instance, and things like that. And finally, when we need to add a new model, it turns out that there are too many places we need to modify. When we want to add a third model, we need to change something in the metadata service, add two more queues, create a new service for the wrapper, and put some logic there for getting the files from S3 and sending them to TensorFlow Serving or MMS. So it's a lot of work just to add a new model.
That's why we tried to make it a bit simpler. This is something I'll talk about now. So this is what we have. And we tried to see how we can improve it. So the first step that we did was to get all these model wrappers and put them into one service.
So now we don't have multiple services, it's just one thing. We called it ImageModelService, because this thing contains the logic for getting the data from S3, processing the images, and then talking to TensorFlow Serving, to the actual models. With this setup we no longer needed the metadata service and all these queues, and the clients could simply send HTTP requests to IMS itself.
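(For illustration, a synchronous client call to such a service could look like this; the URL and payload format are hypothetical, not the actual IMS API.)

```python
import requests

# Hypothetical IMS endpoint and request format, just to show the synchronous flow.
response = requests.post(
    "http://image-model-service.internal/v1/categories",
    json={"images": ["s3://images-bucket/ads/1234/photo-1.jpg"]},
    timeout=5,
)
print(response.json())  # e.g. the predicted category per image
```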
We also weren't happy with MMS, the MXNet Model Server, so we removed it as well. With this, our architecture becomes quite simple. In retrospect, this is what we should have started with: just a simple thing that accepts requests and then forwards them to the actual models. Of course, we need to have a caching layer on top of that to make sure that we don't score our images multiple times.
The way we do it: we use DynamoDB, a key-value store, and for the key, instead of using the file name, we use the MD5 hash of the file. We have quite a few duplicates on our platform, people often upload the same image multiple times, so in this case we don't want to send the image to the model and score it again: we can see that for this MD5 hash we already have results, and we can simply return them to the user immediately. We don't even need to calculate the MD5 hash ourselves, we can simply use the ETag from S3, because the ETag in most cases is the same as the MD5 hash.
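(A minimal sketch of that caching step, assuming a DynamoDB table keyed by the image hash; the table, attribute names, and scoring function are made up for the example.)

```python
import boto3

dynamodb = boto3.resource("dynamodb")
cache = dynamodb.Table("image-predictions")  # hypothetical table name
s3 = boto3.client("s3")


def get_predictions(bucket, key, score_image):
    # For non-multipart uploads the S3 ETag is usually the MD5 hash of the object,
    # so we can use it as the cache key without downloading and hashing the file.
    etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')

    cached = cache.get_item(Key={"md5": etag}).get("Item")
    if cached:
        return cached["predictions"]

    predictions = score_image(bucket, key)  # run the model only on cache misses
    cache.put_item(Item={"md5": etag, "predictions": predictions})
    return predictions
```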
Then we needed to add a few more models. The first model detects artificially embedded text on pictures; in some cases it's prohibited on our platforms to add such text, so we want a model that detects when this is the case and helps us remove it from the platform. So there was another model that simply checks whether there is text or not. Adding this model was quite simple, we didn't need to touch many places: we just needed to adjust the code of IMS to add a new pre-processing class for it, and add another instance of TensorFlow Serving with this model. Then we had another problem that we also wanted to solve with deep learning:
we wanted to detect nudity, so we trained another model for classifying whether an image is safe for work or not. For that model we used MXNet, again trained using SageMaker. And for serving it we wrote our own MXNet serving, because we didn't like MMS: a simple wrapper around MXNet that gets a compressed NumPy array in the request, applies the model, and returns the response back. It is pretty similar to TensorFlow Serving, but instead of protobuf it uses HTTP.
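(A rough sketch of such a wrapper, assuming a Flask app and a payload produced with np.savez_compressed; the model files, input shape, and route are placeholders, and the model-loading code is just one possible way to do it.)

```python
import io

import mxnet as mx
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained network; checkpoint prefix, epoch, and input shape are placeholders.
sym, arg_params, aux_params = mx.model.load_checkpoint("nsfw-model", 0)
module = mx.mod.Module(symbol=sym, label_names=None, context=mx.cpu())
module.bind(data_shapes=[("data", (1, 3, 224, 224))], for_training=False)
module.set_params(arg_params, aux_params, allow_missing=True)


@app.route("/predict", methods=["POST"])
def predict():
    # The client sends a compressed NumPy array (np.savez_compressed) in the request body.
    batch = np.load(io.BytesIO(request.data))["images"].astype(np.float32)
    module.forward(mx.io.DataBatch([mx.nd.array(batch)]), is_train=False)
    probabilities = module.get_outputs()[0].asnumpy()
    return jsonify({"probabilities": probabilities.tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```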
That was our final architecture, and it was quite simple. One caveat here was that tuning it was more difficult than in the asynchronous case, because with asynchronicity we can gracefully react to peaks in traffic: when there is a sudden peak and 10,000 images are uploaded, we can simply put them in the queue, scale out, process through the backlog, and then scale down, and we don't need to worry that something gets lost. Here we need to be more careful, because if we are synchronous and need to respond immediately, it means that to react to these peaks of traffic we sometimes need to overprovision instances, to have some instances that are idle; once they start receiving traffic, we add more instances. This way we can react to peaks of traffic with fewer problems, but again, it took quite a while to actually tune it to be able to process a lot of requests at the same time. In the asynchronous case it was easier.
Then just one last thing: we have analysts, and the analysts are quite interested in analyzing the results of these predictions. Of course, they often want to see, for example, how many images contain text, how many images were pornographic, how many images contained guns, or how many cars there were among the images. With the previous setup this was difficult, so let's briefly talk about how we made it easier for the analysts.
The moment a user uploads an image to S3, S3 can generate an S3 event notification saying: hey, there is a new file in the bucket, do something with it. And it's possible to put these notifications into a queue. Then what we can do is simply listen to this queue, to all the events for all the newly uploaded images, and send them to IMS, to our image model service. The model service responds with results, we put them into a Kinesis stream in our case, and then eventually save them to S3. When we have the data in S3, we can simply put it into the Glue catalog and have the data in Athena tables. Athena is basically a managed SQL engine based on Presto, and you can use it to query all the results. So analysts could use SQL, a tool they know and love, to query the results of our models, and they were very happy about this.
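(As an illustration, a small consumer for that pipeline could look like the sketch below; the queue URL, service URL, and stream name are placeholders, and the real setup also involves configuring the S3 bucket notification and the Glue/Athena tables.)

```python
import json

import boto3
import requests

sqs = boto3.client("sqs")
kinesis = boto3.client("kinesis")

EVENTS_QUEUE = "https://sqs.eu-west-1.amazonaws.com/123456789012/s3-image-events"  # placeholder
IMS_URL = "http://image-model-service.internal/v1/predict"                         # placeholder
STREAM_NAME = "image-predictions"                                                  # placeholder

while True:
    messages = sqs.receive_message(
        QueueUrl=EVENTS_QUEUE, MaxNumberOfMessages=10, WaitTimeSeconds=20
    ).get("Messages", [])

    for message in messages:
        # Each S3 event notification describes a newly uploaded object.
        for record in json.loads(message["Body"])["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            # Score the new image through the image model service.
            predictions = requests.post(
                IMS_URL, json={"bucket": bucket, "key": key}, timeout=10
            ).json()

            # Push the result to Kinesis; from there it is persisted to S3 for Athena.
            kinesis.put_record(
                StreamName=STREAM_NAME,
                Data=json.dumps({"key": key, "predictions": predictions}).encode("utf-8"),
                PartitionKey=key,
            )

        sqs.delete_message(QueueUrl=EVENTS_QUEUE, ReceiptHandle=message["ReceiptHandle"])
```

(Once the prediction records land in S3 and are registered in the Glue catalog, analysts can query them with plain SQL in Athena, for example something like `SELECT category, count(*) FROM image_predictions GROUP BY category`; the table and column names here are only illustrative.)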
That was it. So we talked about the motivation: why we wanted to build our model server and why we needed to serve deep learning models. Then we briefly talked about how we trained models, and then discussed our architecture for doing this. I just want to summarize the whole talk into the main takeaway points. We use deep learning to extract metadata from images: we have a couple of models, we run them, and we save the results in the metadata database. We use AWS SageMaker, and it makes it very easy to train deep learning models: we simply need to specify the location of the data and the parameters of the model, press a button, and it trains the model and saves the results to S3.
Serving with SageMaker is not as nice as training, so it still requires a bit of code to actually get the images from S3 and process them the way we need, and it gets harder when we need to process a lot of images at the same time. Working through a lot of images during peaks of traffic is easier when the system is asynchronous, but the downside is that such systems are more complex, less convenient for the users, and not always real time, because when there is a backlog of items in the queues, we need to process it first. We use the S3 event notifications mechanism as a non-intrusive way to connect our systems: to get all the images from the image hosting, process them, and put the results into Athena tables, and that made it a lot easier for analysts to analyze all the results. Finally, it's quite an easy-to-maintain solution for analytics: it's scalable, you pay only for the data you scan, and you can simply put all the data in S3 and let the analysts play with it.
That is almost all from my side. This talk is based on two blog posts in our blog; you can check them for more details and more information. In part 2 there are also some details about the way we serve MXNet models, so if you are interested, go check it. And then finally, I am working on a book called Machine Learning Bookcamp; the idea is to teach machine learning through projects. If you are interested, you can check it out, and you can get a 40% discount with the code here. I will appreciate it if you give me any feedback on the talk: whether you found it interesting, or maybe it was too slow or too fast. If you want to do that, you can scan this QR code or use this link and give me some feedback. There you'll also find the link to the slides, and if you are interested in a free copy of Machine Learning Bookcamp, you can leave your email address and get a chance to win a free copy. That is all from me. Thank you for your attention. Please let me know if you have any questions, I'll be very happy to answer them. Thank you.