Train. Serve. Deploy! Story of a NLP Model ft. PyTorch, Docker, Uwsgi and Nginx
Formal Metadata
Title: Train. Serve. Deploy! Story of a NLP Model ft. PyTorch, Docker, Uwsgi and Nginx
Author: Shreya Khurana
Number of Parts: 130
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/49959 (DOI)
Language: English
Transcript: English (auto-generated)
00:07
So, our next speaker is Shreya Khurana. Actually, we met at GeoPython 2019 last year. Yes, we did. Now at a different conference.
00:21
She will talk about train, serve, deploy, story of an NLP model, with PyTorch, Docker, uWSGI and Nginx. Please start screen sharing. Alright, thanks Martin. Let me start presenting.
00:41
Okay. Alright. Hello everyone and welcome to this talk. Today we're going to be talking about how we can get a machine learning model into production. And we're going to be talking about certain technologies like Docker, uWSGI and Nginx.
01:03
Okay, first a little bit about myself. I am a data scientist at GoDaddy. I've been working with unstructured language data and I build models based on deep learning. I am into dancing, hiking and I've recently gotten into giving talks. So as Martin said, I actually participated in last year's GeoPython.
01:22
I also gave a talk at this year's PyCon. And this is an interest that I've recently developed, and so this is why I'm here today. But the other interesting thing that you need to know about me is I am a huge meme collector and I really love The Office. And some of that you will be able to see in the coming slides.
01:42
Okay, so when we talk about machine learning, we usually talk about two things and they are training and testing. And coming from an academic background, usually we see this whole process like we get the data, we preprocess, clean it. We get it to a stage where we can actually start training. We have all of these state-of-the-art models.
02:02
We try different hyperparameters, train it, evaluate it on a holdout set and then go back again and repeat the process. And in research and academic settings, that works really well. But about a year ago, when I joined GoDaddy, I realized something. That people actually use these machine learning models for something.
02:22
And it's not that you are just going to be using it for your research. It's just that other people are going to be calling this machine learning model. Which means you have to get it to a stage in which you can present it to them. And it has to be really secure. It has to be really stable. It has to be able to handle all of those requests.
02:40
And this is where machine learning production comes into the picture. So, about a year back when I started getting into this, I had all of these terms being thrown at me. Like Docker, Flask, Django, which is a framework, Kubernetes for cluster management, cloud platforms like AWS, GCP. And uWSGI and Nginx for request management.
03:04
And I did not know any of this. And which is why this talk is happening right now. Because I thought it would be a good idea to introduce people who have trained a model before. But who don't know much about machine learning production. Or how we can actually get a machine learning model into production.
03:22
So, for the purpose of this case study, we'll assume that you're familiar with training a model. And the whole process. The only thing that we'll be covering is all of these new technologies. Like Docker, uWSGI and Flask.
03:41
And for this case study, we're going to assume that we'll be training a sequence to sequence model. So, if you're not familiar with it, it's basically just a machine translation system. So, in that what we do, we give it a sequence of tokens, which is just like a sentence. And then in this talk, we'll be assuming that we have, or we'll be training, a model that can do this.
04:06
We'll just take an input sentence in German and we'll be translating it to English. All right. So, this is the data that we'll be working with. So, this is a set of TED talks that have been transcribed in both German as well as English.
04:24
And they contain all of these transcriptions based on a variety of talks. And it's a relatively small data set. So, it doesn't really need too much of training. It can just get to a very decent accuracy level in a few epochs, which is why this was really good for prototyping.
04:44
And the framework through which we've trained this model is FairSeq. Again, this is only for the purpose of very quick prototyping. So, FairSeq is this AI toolkit to build sequence to sequence models by Facebook AI research. And it's built on top of PyTorch.
05:02
And basically, it contains a set of Python scripts that you can easily run to preprocess and train it. All right. Now, there's a lot of documentation available on FairSeq. But just to actually introduce you to the major steps, or a few of the steps, that we need to do before we actually get to production level.
05:26
So, the preprocessing step is like you just learn a set of merge operations. So, now we're going to be using BPE. BPE is byte pair encoding. If you ever work with NLP or any set of language models, you know what BPE is.
05:45
But BPE, for other people who are not familiar with it, is just a set of vocabulary words. It's just a set of operations that you're learning. So that you know that a few sequences are much more likely to occur than the others.
06:01
So, it's a way for us to figure out the patterns or the sequences in the vocabulary that have a high likelihood of occurring. So, once we have our training set, which is just both our German as well as English data, we learn what operations, what words are most common in that data through this script called learn_bpe.py.
06:26
And then we have a training set, validation set, test sets, right? So, we just apply all of these learned operations to them and we store it in a file. So, this is where we are till preprocessing.
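A minimal sketch of that preprocessing step, assuming the subword-nmt style scripts learn_bpe.py and apply_bpe.py and hypothetical file names:

```sh
# Hedged sketch of the BPE step described above; file names are placeholders.
# Learn the merge operations from the combined German + English training data.
cat train.de train.en | python learn_bpe.py -s 10000 > bpe.codes

# Apply the learned operations to the train / valid / test splits and store them.
for split in train valid test; do
  for lang in de en; do
    python apply_bpe.py -c bpe.codes < $split.$lang > $split.bpe.$lang
  done
done
```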
06:41
Now, FairSeq also gives us a really good one-line command through which you can actually just start training it. And the machine learning model is usually like a sequence-to-sequence model or it could be another model built on a different framework. But because we're working with sequence-to-sequence here, you can give it the architecture, the dictionary files that you're loading,
07:08
the optimizer, the learning rate, whether you want it to have a dropout or not, what is the batch size. So, batch size can be given in terms of number of sentences or number of tokens, which is the number of subwords.
07:24
And how many epochs you want to train it for. Now, there are a lot many hyperparameters that you could experiment with, but this is just a subset of them because we want to train a model really quickly.
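Such a one-line training command might look like this; a minimal sketch, assuming the data was already binarized with fairseq-preprocess into a hypothetical data-bin/iwslt14.de-en directory, with illustrative hyperparameter values rather than the speaker's exact settings:

```sh
# Hedged sketch of the fairseq training command described above.
# --max-tokens sets the batch size in tokens (subwords) rather than sentences.
fairseq-train data-bin/iwslt14.de-en \
    --arch transformer_iwslt_de_en \
    --optimizer adam --lr 5e-4 \
    --dropout 0.3 \
    --max-tokens 4096 \
    --max-epoch 10 \
    --save-dir checkpoints
```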
07:44
Okay, so now that we have our model, let us see if it can actually predict to a reasonable level. So, in this, we'll assume we have a command called fairseq-interactive. And again, what we just do is we give it a path to load it from. So, this is the checkpoint or the model you've trained, what preprocessing technique you're using, and what is the beam size.
08:02
So, beam size is how you're predicting at each step. So, like, let's say you were predicting a word after the previous one. How many candidates should you span across? How many candidates should you check for the best candidate? So, it is sort of like defining how big your search space is.
08:23
And you can give it all these parameters. And if it works correctly, it will print out all of them in a name space. It will tell you how many words there are in each of your dictionaries in German and in English. And then, this is a utility in which you can actually type your sentence and you can get the translations.
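A minimal sketch of that interactive check, reusing the hypothetical checkpoint and BPE paths from the earlier steps:

```sh
# Hedged sketch of the fairseq-interactive command described above.
fairseq-interactive data-bin/iwslt14.de-en \
    --path checkpoints/checkpoint_best.pt \
    --bpe subword_nmt --bpe-codes bpe.codes \
    --source-lang de --target-lang en \
    --beam 5
# Type a German sentence at the prompt and it prints the English translation.
```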
08:45
So, remember, this is a machine learning model that's taking an input of German sentence or just words and it's translating it into English. So, till now, we have the CLI command that can do that. We have a model that can do that.
09:00
But, as I mentioned before, like, this is okay if you're working on it alone. If you're just the only one or if you're just working with a team that has access to the CLI. But, in practice, what happens is this is not the case. In practice, we have a lot of requests coming in from other people. From, possibly if you're working in the industry, you have all your customers hitting this API which is calling the model.
09:25
Now, an API is an application programming interface. And we'll be using Flask for this. So, Flask is this web application framework that is written in Python. And it actually started as a very humble code base.
09:42
It just started as an April Fool's prank. And then, it blew up to be the most widely used Python-based web application framework. And, the reason that it has gotten to this point is because it's fairly easy to use. And, it comes with a development server. So, almost everyone who starts with making apps out of their machine learning models or of anything that they've built in Python starts with Flask.
10:07
And, that is what we're going to be doing as well. So, with this framework, we already have a model. We're just going to be loading it and then trying to predict it using certain API endpoints.
10:21
To do that, we just need to do a few things. First, we load the Flask modules that we think will be important. So, these are all functions that will help us create the response of this app in a way that can be understood over HTTP. So, we'll assume for our purpose that HTTP is the protocol that we'll be using.
10:44
And so, basically, in this, we need to make sure that we can create a response that is in JSON. Then there is FairSeq's TransformerModel, which will just help us load the pre-trained model. So, what we do is we load this model checkpoint into our memory.
11:06
We give it the dictionary path, what pre-processing technique to use. German is a source language. Target language is English. What is the beam size and whether you want it to use the CPU or not.
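A minimal sketch of that loading step, assuming fairseq's TransformerModel.from_pretrained API and hypothetical checkpoint and dictionary paths:

```python
# Hedged sketch of loading the model into a Flask app, as described above.
from flask import Flask, request, jsonify, abort
from fairseq.models.transformer import TransformerModel

app = Flask(__name__)

# Load the trained de->en model once, at startup, so every request reuses it.
model = TransformerModel.from_pretrained(
    "checkpoints",                               # directory with the checkpoint
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/iwslt14.de-en",  # dictionaries from preprocessing
    bpe="subword_nmt",
    bpe_codes="bpe.codes",
)
```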
11:20
So, till now, we have a machine learning model that is loaded into memory. Now, we have to be very specific in the way we're going to be calling this model. Now, remember we were working with CLI earlier, but to actually work in a much more efficient way, we will define an endpoint for which the model will respond.
11:42
So, to do that, we simply just define the endpoint keyword. So, like, we want it to respond to translate. So, whenever we are hitting the endpoint and it has the keyword translate at the end, we want the model to predict. And the way we do that is once you define this endpoint, let us define the function that will actually do that.
12:03
So, here we just define the timer. So, like, if you want to see how much time it's taking for each inference, you can do that. You get the query. So, request is this Flask object that will help us get the parameter from the HTTP request.
12:20
So, the HTTP request comes from someone, some other person, who is wanting to get translations from your model. And that person will probably send something like q= followed by a sentence, which means that they are just trying to get translations for that particular sentence. So, if there is no query, you raise a bad request, which means it's empty,
12:43
and then a very simple function called translate, and you give it the query parameter, and then you have your translation. The rest that is left is that we'll be parsing our result into JSON, we'll make this into a protocol it can identify, and then we'll just return it.
13:01
It's a very simple function. And the way to run this app is very simple as well. You do just app.run, you give it the host IP that you're trying to host it on. So, if you're doing it on local host, you can do that, and a specific port on which this app is going to run.
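A minimal sketch of that endpoint and of running the app, continuing the app and model objects from the previous snippet; the route name and the q parameter follow the talk, the rest is illustrative:

```python
# Hedged sketch of the /translate endpoint described above.
import time

@app.route("/translate")
def translate():
    start = time.time()                    # simple per-request timer
    query = request.args.get("q")          # ?q=<German sentence>
    if not query:
        abort(400, "Empty query")          # raise a Bad Request
    translation = model.translate(query)   # fairseq hub interface
    return jsonify({"translation": translation,
                    "time": time.time() - start})

if __name__ == "__main__":
    # Flask's built-in development server: fine locally, not for production.
    app.run(host="0.0.0.0", port=5000)
```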
13:21
Okay. So, now what do we have? We have this really good Flask server that can load the model into memory, that can make predictions. We have a model that is able to do that. Now, the only thing is that our Flask server is a development server. It's not a production-grade server because it's not as stable, not as efficient,
13:40
it's not as secure. And these are all the things we want to exist in an HTTP server. So, what do we do? We actually use this new thing called uWSGI. So, uWSGI will actually help us make this Flask app much more secure, much more stable. And the way to do that is that we wrap our app in a WSGI file.
14:06
So, now we have this Python file, you just import the app, and you give it the name application, and then you just run it.
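A minimal sketch of that wrapper, assuming a hypothetical wsgi.py next to the Flask app; uWSGI looks for a callable named application by default:

```python
# Hedged sketch of the wrapper file described above (hypothetical wsgi.py).
from app import app as application  # assumes the Flask app lives in app.py

if __name__ == "__main__":
    application.run()
```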
14:20
But remember, this is a production-grade server, which means you can do all of the scalability things that you might not have been able to do with Flask. And to actually do that, we have this configuration file called uwsgi.ini, and we'll be giving it certain parameters. So, all of these config files are used to actually give certain arguments to this uWSGI server, telling it how to run.
14:41
So, first we load the module. So, like, in the WSGI Python file, you wanted to run this application. So, you give it this name. How many requests you want to listen to. So, like in the queue at one point of time, how many HTTP requests can you actually hold, whether you want to disable logging.
15:00
And this is suggested the first time you do this, because you want to log everything. You want to see, just in case an error arises, what happens. The file you want to log it to. This is just a file path, and lazy apps. Okay, so the way uWSGI works is a master and worker framework. What I mean by that is, so if your app is being loaded by uWSGI,
15:24
it will either load in the master first, and the master can give the order to load it into workers later, or lazy apps. Lazy apps allows us to actually load it in the worker itself, without the master needing to initialize it first. So, that obviously depends on how much memory you have,
15:43
if you have a lot of memory, if you have a lot of computation power, that you can load each of the apps and each of the workers. You enable the master process, and the number of workers you have. You can also give it a multi-threaded application type of thing.
16:01
So, this again depends on how well your code is written. Obviously, you'll have to account for deadlocks, in case one of the threads gets delayed. And the buffer size. So, right now we're working with HTTP requests. All of these requests, they are HTTP,
16:20
which means that they have headers, and they have a request size. So, what is the maximum size that request can take? A socket. So, a socket is useful, because we are running a multitude of programs on one machine, and the way for two programs to interact is through a socket. Now, this socket is going to be used by uWSGI and Nginx,
16:42
which we'll be covering in the next slide. But for now, just think of it as a temporary file for uWSGI and Nginx to interact with each other. What permissions do we want to be giving the socket, whether we want to enable threads, and whether we want the logging to happen in a separate thread or not.
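A minimal sketch of such a uwsgi.ini; the option names are real uWSGI settings, but the values and paths are illustrative, not the speaker's exact configuration:

```ini
; Hedged sketch of the uwsgi.ini described above.
[uwsgi]
module = wsgi:application   ; the wsgi.py file and its "application" callable
listen = 1024               ; size of the request queue
disable-logging = false     ; keep logging on while you are still debugging
logto = /var/log/uwsgi.log
lazy-apps = true            ; load the app in each worker, not via the master
master = true
processes = 3               ; number of workers, each with the model loaded
threads = 1
buffer-size = 32768         ; max size of an incoming request's headers
socket = /tmp/uwsgi.sock    ; Unix socket shared with Nginx
chmod-socket = 666
enable-threads = true
log-master = true           ; delegate logging to the master process
```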
17:02
Now, there are a lot more configuration options available, but these are just a subset of them, because right now it's a very simple application. All right. So, now we have this secure, stable server with us that is serving the model.
17:20
We know the model can make predictions. The only thing that is left is let's say you were based out of California, but you have like a thousand other requests or a million other requests coming in from all over the place. Now, that place or those requests, they're going to be coming not one at a time, but they're going to be coming at the same time sometimes,
17:42
and they'll have a specific load carrying with them. What I mean by that is often in production, we talk about QPS, which is queries per second. So, you have to make sure that your HTTP requests are being routed off properly or in a very efficient manner to your server, and that is what Nginx does.
18:01
Nginx will route all of these requests to your server, and we can actually tell Nginx how to do that, again, with the way of a config file. So, Nginx actually comes with its config file itself, but we just modify it to suit our case. So, this is nginx.conf.
18:21
You give it a file path in line one where you want to log the errors to, and because we're working with HTTP requests, we will tell it, yeah, that use this HTTP bracket or block. Now, if all of the requests are coming, how do you want to log it or where do you want to log it?
18:44
And because right now we assume that Nginx is going to reside on localhost, the server name is that, and you are listening to a specific port. Now, this port is different than the one that we use with Flask, and this port, 8002, is actually what we're going to be listening when we create our Docker container out of this.
19:06
Okay, so now Nginx knows that it has to listen to certain requests coming in on this server, this port, but then which endpoint does it actually listen to? That you can define through these specific endpoints, so like all of the paths that are arising from this home path,
19:21
it knows that it has to use the uwsgi parameters, and again, remember how we talked about this socket? So Nginx and uWSGI will interact through this Unix-based socket, which is uwsgi.sock. It's a temporary file, and the reason we use this Unix-based socket
19:40
is because these two, uWSGI and Nginx, are on the same machine or on the same computer. They are going to interact with each other very fast, and the Unix socket allows us to do that. So that's it for Nginx.
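A minimal sketch of such an nginx.conf; the directive names are real Nginx directives, but the paths and port are illustrative and follow the talk's description:

```nginx
# Hedged sketch of the nginx.conf described above.
error_log /var/log/nginx/error.log;

events {}

http {
    access_log /var/log/nginx/access.log;

    server {
        server_name localhost;
        listen 8002;                          # the port the container exposes

        location / {
            include uwsgi_params;             # standard uWSGI parameter set
            uwsgi_pass unix:/tmp/uwsgi.sock;  # Unix socket shared with uWSGI
        }
    }
}
```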
20:02
But what we have till now is Nginx routing all of the HTTP requests, and a very stable and secure server that is serving our model. The only thing that we haven't done is look at the big picture. Excuse me, sorry, my voice is just a little sad.
20:23
So let's assume that all of them are running on the same computer. We still haven't looked at the big picture. Like, let's say your processes or the machine gets killed or something, or it's out of memory. What happens in that case? And how does uWSGI interact with Nginx,
20:41
or which goes first, which gets killed first, which gets started first, and all of these things that actually come when you are working with a system that is in production. And the way we're going to look at that is through supervisor. Again, with supervisor, we have this config file
21:01
in which we can actually define certain programs. So at the top of the file, you want to be writing the supervisord section and nodaemon, which means that it's not going to be running in the background, but in the foreground. Which file do we want to log the supervisor logs to?
21:21
And then another program called uwsgi. What command should we run it with? What is the stop signal? What is the time it should wait before actually stopping that program? Priority. So priority means, in the list of programs that we're asking supervisor to manage.
21:41
So supervisor is going to be this process management system, right? And it's managing all of these programs under it. So what is the priority of each program? The lower the number, so like three over here, it means it's a very high priority system. It gets started first, gets shut down last, and that kind of thing. So what is a log file
22:01
for anything that you're printing on stdout? What is the max size? And here I've set it to zero, which means by default, just take it to be the max size that supervisor allows us to do. And similarly, we start nginx, we have this program command that we're going to be starting it with, and the similar things that we did with uWSGI.
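A minimal sketch of such a supervisord.conf; section and option names are real supervisord settings, while the commands, paths and priorities are illustrative:

```ini
; Hedged sketch of the supervisord.conf described above.
[supervisord]
nodaemon = true                 ; run in the foreground, not as a daemon
logfile = /var/log/supervisord.log

[program:uwsgi]
command = uwsgi --ini /app/uwsgi.ini
stopsignal = QUIT
stopwaitsecs = 30               ; how long to wait before killing the program
priority = 3                    ; lower number = started first, stopped last
stdout_logfile = /var/log/uwsgi.stdout.log
stdout_logfile_maxbytes = 0     ; 0 = no size limit / rotation

[program:nginx]
command = nginx -g "daemon off;"
stopsignal = QUIT
priority = 5
stdout_logfile = /var/log/nginx.stdout.log
stdout_logfile_maxbytes = 0
```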
22:24
All right. So supervisor is going to manage uWSGI and Nginx on its own. And now with all of the previous things that we've done before, you will be able to run it on your machine. The only thing is when we actually run this somewhere, we don't run it on our local machine.
22:41
We either go to the cloud, so like GCP or AWS, we run it on a VM there, or we actually use some on-prem machines. But whatever the case is, all of these are very, like these are very dependency-based systems, right? Remember, we installed all of these programs, we installed all of the Python libraries,
23:03
we installed uWSGI and Nginx. So what we want to be doing right now is actually following this, but in a very isolated way. So we want to create a snapshot of whatever we've done right now, and then just load that snapshot onto some machine where we don't have to do anything,
23:21
and it just starts running on its own. And that is what Docker helps us to do. So it's a virtualization software, and it will help us to create these containers on our own. And with Docker, what we do is just we write a very simple Docker file. So here we have this OS that we're loading,
23:42
so like we're using Ubuntu, it's the base image from which we create this snapshot. So just like you would do with any of your own machines, like anytime you're starting a new project, you install certain programs. So like here we install supervisor, nginx, vim, git, g++, curl, and zip utilities.
24:05
So these are basically all the utilities starting from scratch that you would build for a system. Okay, so then we define certain environment variables. And then we just, just like for a Python project, we have dependencies, we install them,
24:23
copy everything to the working directory, make working parent directories. Remember this Unix socket we talked about? We just create this file, give it certain permissions, and then we have the, we set the working directory so that whenever the container starts, you know which directory to be in.
24:41
And then we just have this entrypoint that is set, in which we have all of the commands that we want to run. So for our case, it's just going to be the supervisor, because it'll run uWSGI and Nginx on its own. And remember, this port is the one that Nginx was listening to. So we expose this port over here. And we've created a Dockerfile.
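A minimal sketch of such a Dockerfile; the structure follows the talk, but the base image tag, package list, paths and entry files are illustrative assumptions:

```dockerfile
# Hedged sketch of the Dockerfile described above.
FROM ubuntu:20.04

# System utilities, installed from scratch as the talk describes
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    supervisor nginx vim git g++ curl unzip python3 python3-pip

ENV LANG=C.UTF-8

# Python dependencies, then the application code
COPY requirements.txt /app/requirements.txt
RUN pip3 install -r /app/requirements.txt
COPY . /app

# The Unix socket that uWSGI and Nginx will share
RUN touch /tmp/uwsgi.sock && chmod 666 /tmp/uwsgi.sock

# The directory the container starts in
WORKDIR /app

# Supervisor starts uWSGI and Nginx on its own
ENTRYPOINT ["supervisord", "-c", "/app/supervisord.conf"]

# The port Nginx listens on
EXPOSE 8002
```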
25:01
To build the image, it's fairly simple. You just do a docker build, you give it a name, and you tell it to take everything from the working directory as the build context. docker run actually helps us to run this image. You give it a port mapping, a name. And once you see all of these logs, right,
25:21
like supervisor is running, FairSeq is loading the model, uWSGI and Nginx are working, it means that your model is ready, or at least the programs that are required to do that are ready. But the way to actually check it is just through a very simple curl command. All you do is, remember, we defined this translate endpoint.
25:41
So you just give it a particular query and you get the result. Now, how do we check if something's not working correctly? You just do docker logs. And that will give you the logs from when the Docker container was starting.
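A minimal sketch of that build, run and check cycle; image and container names, the port mapping and the sample query are illustrative:

```sh
# Hedged sketch of building, running and checking the container as described.
docker build -t nlp-translator .

# Map host port 8002 to the port Nginx exposes inside the container
docker run -d -p 8002:8002 --name translator nlp-translator

# Hit the /translate endpoint we defined earlier
curl "http://localhost:8002/translate?q=Guten%20Morgen"

# If something is off: startup logs, or a shell inside the container
docker logs translator
docker exec -it translator /bin/bash
```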
26:01
And then, if you sometimes want to enter the container and see what's happening inside the container, you want to check the log files, you can do that. Certain other practices are unit tests. We haven't written any for this because this was a very simple application. But generally, for each piece of code, we write unit tests, just so we know where we're going wrong. And caching, which means that anytime someone's going to be hitting your model, the queries are not always going to be unique,
26:21
which means you can actually save some computation power and time by just storing those results earlier. So it's something definitely to look out for. All right. And with that, I'm done with my talk. This is the Discord channel and I also post a few other links
26:41
that might be useful in that channel. So I'm open for questions. Thank you very much. That was a nice talk. Are there any questions? Please use the Q&A button. Actually, you can already write questions during the talk
27:00
because now we are online. Don't forget about that. So you have to type there. Okay, so if there are no questions yet, I do have a question, unfortunately a little bit technical one. I didn't quite understand. In uWSGI, you use the configuration with the three processes and one thread.
27:23
Does that mean you have your model loaded three times in total? And then in Nginx, you have these worker connections. It's over 8,000. So if 8,000 people connect, is this distributed to the three processes? This I really don't quite get.
27:41
Oh, okay. So the number of threads is the number of threads per process. So essentially, it's going to be three workers. So three workers will have the same app loaded. And that will actually help us by calling, if someone calls the model at three QPS,
28:00
so like three queries per second, it will all be translated at once. So that is the point of workers, but number of threads is different. Number of threads is number of threads in each process. So like in each way that your app is loaded, your app is translating, how many different threads exist.
28:21
But Nginx is another thing. In Nginx, that 8,000 there is the port. It's not the number of connections that it's mapping. It's the port that it will be listening to, to listen to all the HTTP requests. Okay, so, but this still means in the worst case, your model has to be reloaded every time there is a request
28:42
or do I see something wrong? No, no, no. Okay. Yeah, yeah, yeah. Your model will not be reloaded every time. The model will be loaded just once and then you can serve it multiple times. That is the whole advantage of this system. Okay, great. Yeah, so that's the most important part.
29:00
I didn't yet use uWSGI, but I will certainly do that to solve exactly that problem. Nice. So are there any questions? I see in the Discord chat, there are also no questions yet. You can go to the Discord, you see it online, talk NLP model.
29:22
You can use control or command K and enter NLP and you will find the channel. We are unfortunately out of time, so thank you very much again, Shreya, for your wonderful talk.