
Applied MLOps to Maintain Model Freshness on Kubernetes


License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Abstract
As machine learning becomes more pervasive across industries the need to automate the deployment of the required infrastructure becomes even more important. With data velocity increasing every day it becomes more and more important to keep models fresh. Combined with the ever growing popularity of Kubernetes, a full-cycle, containerized method for maintaining model freshness is needed. In this talk we will present a containerized architecture to handle the lifecycle of an ML model. We will describe our technologies and tools used along with our lessons learned along the way. We will show how fresh training data can be ingested, models can be trained, evaluated, and served in an automated and extensible fashion. Attendees of this talk will come away with a working knowledge of how a machine learning pipeline can be constructed and managed inside Kubernetes. All code presented will be available on GitHub.
Transcript (English, auto-generated)
So to get started about us: my name is Jeff Zemerick. I'm an engineer at Open Source Connections, I'm the current chair of the Apache OpenNLP project, and I'm a member of the Apache Software Foundation.
In a previous life, I was primarily a cloud engineer on AWS and GCP. Open Source Connections is sponsoring Berlin Buzzwords, so please stop by the partner booth. We're also hiring, so if you like this kind of stuff, please definitely stop by.
You can always get in touch with me by email or join the Relevance Slack; I'm always available there as well. David? Yeah, hi everyone. My name is David Smithbauer. I'm excited to be here with all of you today. Appreciate the opportunity. I'm an engineer with Leidos where, for the last few years,
I've been focused on DevOps. I'm a fan of Kubernetes, containerization, all things DevOps. So again, thanks for being here for the talk, and if you have any questions, please feel free to reach out to us. Jeff? Cool, thanks. So, an introduction of what our talk is about.
I know it has a lot of words in the title and, this being Berlin Buzzwords, we had to get a lot of buzzwords in there. But what we want to show is an illustration of how we can build a containerized system that does ML, how it can evolve over time, and how we can monitor it to increase performance and relevance.
This isn't intended to be a how-to guide for one specific tool. It's intended more to illustrate the underlying concepts, so you can take those away. And if you want to use a tool, or if you want to roll your own stuff, you're free to make that choice on your own. We also wanted to describe MLOps,
what it is, the problem it's trying to solve, and the challenges it brings along with it. So what is MLOps? We see it all the time; in, I would say, 90% of the blog posts that come out there's some mention of MLOps. And at its heart, it's intended to be the intersection
of DevOps and machine learning. So we're taking the successful parts of DevOps from the past few years, and we're trying to apply those to machine learning concepts. The ultimate goal of MLOps is to get a model into production. Simple to say, but not as simple to do.
So we want to increase our automation. We want to apply the software development lifecycle, CI/CD, all those best practices from DevOps over to MLOps. So, the first thing about AI and ML: whether you've done any,
or you haven't, whether you've just read about it, whether you're a beginner or a seasoned veteran, you're probably aware that these things are easy to deploy but very hard to maintain. In the past few years, in NLP, AI, and other areas, the technology has become so democratized
the technology has become so democratized that the learning curve to get something running, to get something usable is small, and that's absolutely wonderful. But the part that we can't ignore is how to maintain it once we have something that we think works. And so when we try to apply MLOps to this,
we realize that things are all of a sudden more complex because now we're not just dealing with DevOps issues, we also have ML issues that come into play as well. And so some of these are like DevOps issues, and some of them are issues on their own.
For example, treating models as black boxes. For the coders out there: when you get something that works, maybe you wrote it, maybe someone else wrote it, maybe you don't know who wrote it, but you know it works. It takes some input, it produces some output, and you're tasked with plugging that into your system.
And we know that the code that we write on the input and the code that we write on the output, it doesn't take real long for it to become glue, and it becomes hard to maintain, and it can turn into a disaster quickly. And with machine learning models,
the treatment of those as black boxes is even more common because it's hard to see how an ML model is working. Changes in the data. We make a model today on data that we have today, and it works great. And so we deploy it, and we put it to use.
And what we sometimes don't think about is what happens to that data tomorrow? Will that data be the same? Will the data change? Maybe we don't know. So we have to keep in mind that everything is changing all the time. And another good example is that good software engineering practices don't always apply.
We're taught in pretty much all of our software engineering courses about design patterns and abstraction, and trying to apply those to ML models can be difficult or near impossible; implementing some sort of abstraction in a machine learning model for reusability
can be a task on its own. And these issues, along with others, come from a 2015 paper, cited at the bottom of the slide, titled Hidden Technical Debt in Machine Learning Systems. It's a very interesting read, especially when you consider that it was written six years ago. Maybe things have just changed in that time period,
or maybe the paper was a little ahead of its time. But reading it today, it definitely brings to light issues that are certainly still there. So it's a very good read. So, like software engineering: as we develop code and we develop models, we accrue technical debt.
And so this is another area we need to be aware of when we're applying MLOps. The first one is entanglement. Everything is somehow connected, and we have the phrase: changing anything changes everything. Nothing is independent. So you can imagine in an MLOps pipeline
that's doing ETL or some other type of data processing, one change somewhere is going to change everything. Or if you're creating your ML model, changing your hyperparameters, changing your features, it may seem like an innocuous change. But the impacts of that change could be extremely dramatic.
Pipeline management itself. ETL is three short little letters, but it's a daunting task in itself. And so it's easy to acquire technical debt around your ETL pipeline. Hidden feedback loops, such as different models or different services
can unknowingly interact with each other and affect each other's operation. And these things may not be obvious immediately; it may take months or years of use before you realize it. And so we can't really apply the Marie Kondo approach of: does it bring you joy? If not, throw it out.
does it bring you joy if not throw it out? Otherwise we would just throw out all of our technical debt forever. So spoiler there, don't throw out your technical debt. We have to address it. And so machine learning brings its own technical debt that we have to address just as we would from software development.
So back to MLOps. Lots of AI projects don't make it into production. If you try to Google that and find statistics, you'll see studies saying anywhere from 53% up to 87% of AI projects don't make it into production. Whatever the reasons are, they vary.
But we can take away from those numbers that that's a decent amount of projects. So for the ones that fail for technical reasons, MLOps is trying to put standards and practices around those projects to help make them more successful. An example of that is Kubeflow, which started in 2018.
And there are eight areas, known together as the big eight: data collection, data processing, feature engineering, data labeling, model design, model training, model optimization, and deployment and monitoring. So these are the eight areas that MLOps tries to address. And if you're like me, you say, oh hey, it can't be that bad, but then you look.
And then, for example, deployment and monitoring. That is a lot more than just three words; there's a lot that goes into it, and it's the same for each of those areas. So MLOps becomes a broad and daunting task when it tries to address all eight. So for our project here, to illustrate some of this,
what we're going to build is an ML-powered system that uses trending hashtags from Twitter to influence our search results. So we are going to monitor Twitter hashtags, and we are going to use those hashtags in a naive algorithm to determine what's trending
to influence our search results. So in this example here, you can see I have a Christmas tree there on my movie. So at the end, what we want to do is if we search for, say, family movies at Christmas, if Christmas is a trending hashtag, then our search results will be influenced
by those trending hashtags. So here's some example search results to help illustrate it a little bit better, a little more concrete. So on the left-hand side is 10 movies that were retrieved from a search index that contained the TMDB data set, a subset of the data set.
And so we indexed these movie documents into the index, and then just ran a family search. And these are the movies that came back. And so on the right side is where we want to get. So let's say that the hashtag Christmas is trending,
and we do that search now. Then, as you can see, we get 10 family movies that in some way deal with Christmas, whether they're set at Christmastime or are otherwise closely related to Christmas. So the components of our system:
is going to be a consumer to read tweets from Twitter and collect those hashtags. And for that, we're going to use an Apache Flink application. And we're going to index our movies into Elasticsearch. What search engine you use doesn't matter.
Concurrently, there's a debate going on in one of the other Berlin Buzzwords rooms between Elasticsearch, Solr, and Vespa. So what you want to use is up to you; this example doesn't dictate any of that. We're going to use a natural language classifier, a zero-shot learning classifier, to be able to assign our movies to categories.
And the categories are the hashtags. And we're going to use part of Kubeflow to help with that. To be able to gauge our search results, we're going to use Quepid, which I know I've heard referenced a few times throughout the conference. It's a tool for making judgments on search results.
And we have to store everything, so we're going to use MySQL and a Redis cache just to make everything available. So the architecture diagram of this, to give you a better idea: we have our tweets that come in, we have our stream consumer, our Flink application, and from those tweets it pulls out hashtags.
As I said before, our trending hashtag algorithm is rather naive. It simply counts the hashtags and orders them from most to least. Whatever is most frequent we call trending, and whatever is least is not trending. And we pass those hashtags along with movie summaries to our classifier.
And that tells us some probability that that movie has something to do with that hashtag, for example, Christmas. And so then the user can execute their search against the search engine and get those ranked movie results back influenced by the trending hashtags.
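As a concrete sketch of how that influence could be wired into the query, assuming the classifier's per-hashtag probabilities are indexed as numeric fields on each movie document and using the elasticsearch-py client (the index name and field names here are illustrative, not taken from the talk's repo):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
trending = ["christmas"]  # e.g. read back from the Redis hashtag counts

query = {
    "function_score": {
        # The base relevance query the user actually typed.
        "query": {"match": {"overview": "family"}},
        # Boost by the classifier's probability fields for trending hashtags,
        # e.g. a hypothetical numeric field "hashtag_christmas" on each document.
        "functions": [
            {"field_value_factor": {"field": f"hashtag_{tag}", "factor": 2.0, "missing": 0}}
            for tag in trending
        ],
        "boost_mode": "sum",
    }
}

# Newer clients take the query directly; older ones use body={"query": query}.
results = es.search(index="movies", query=query, size=10)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```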
So to describe the data set: it's a subset of a listing of movies. It has a whole bunch of fields; these are just a few to give you an idea. So for each movie, it lists the genres, the language, the title, and the overview. And I chose Die Hard here because I think we all agree
that Die Hard is a Christmas movie, so I had to include that one. And it's in JSON, so it's easy to get and index. Our stream consumer, as I think I said before, is just a Flink job. It just maintains a map of hashtags to their count of occurrences.
As it accumulates this map, it persists the counts to Redis, so that we can easily pull out what we want and sort it. So it's not very smart, per se, and it just works on raw hashtags. In the real world, you might want to do some manipulation of the hashtags.
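The actual consumer is a Flink job, but the counting idea itself is tiny; here is a minimal Python sketch of the same naive approach, assuming the redis-py client and a Redis sorted set (the key and function names are illustrative, not taken from the talk's repo):

```python
import re
from collections import Counter

import redis  # assumes the redis-py client is installed

HASHTAG = re.compile(r"#(\w+)")
r = redis.Redis(host="localhost", port=6379)

def consume(tweet_text: str) -> None:
    """Count the hashtags in one tweet and accumulate the counts in Redis."""
    counts = Counter(tag.lower() for tag in HASHTAG.findall(tweet_text))
    for tag, n in counts.items():
        # A sorted set keeps the hashtags ordered by their running count.
        r.zincrby("hashtag_counts", n, tag)

def trending(top_n: int = 10) -> list:
    """Return the top-N hashtags, most frequent first: our 'trending' list."""
    return [t.decode() for t in r.zrevrange("hashtag_counts", 0, top_n - 1)]
```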
And hashtags aren't always nice, friendly words like Christmas; they could be other things. So how do we classify those movies based on some hashtag that comes through? If you think about it, the first problem that comes to mind is that we don't know what's going to be trending tomorrow.
We don't even know what's going to be trending later today. In 2019, before December, we had no idea that pandemic was going to be more than trending in 2020. So how can we classify those movies based on labels that we don't know? It's really hard to train some type of classification model
with labels that we don't know. If we had some predefined set of categories, sure, we could do that. But without those labels, it becomes more difficult. So we need something else, and for that, we're going to use a zero-shot classifier. So a zero-shot classifier is built
on top of natural language inference, NLI. It's a type of NLP task where, given a premise (some sentence) and another sentence (the hypothesis), you determine the relationship between those two sentences. And the relationship can be entailment, contradiction, or neutral.
This task is sometimes referred to as recognizing textual entailment, RTE. So if you see that, it's referring to the same thing. So in the examples here, the premise, a soccer game with multiple males playing. Given a hypothesis, some men are playing a sport.
Then we would label those as being true, entailment, versus the other example where a man inspects the uniform of a figure in some East Asian country, and the hypothesis, the man is sleeping. Those are contradiction. So we label them as such. So the training data that's used for this model
is just that data. Sentence pairs along with a label for entailment, contradiction, or neutral. So in this example, we have our premise, we have our hypothesis, and we have our label. In the Hugging Face ecosystem, there are several NLI datasets available for training
on your own models. This example was taken from SNLI. So if you go to the Hugging Face datasets website and search for NLI, you can find this dataset and others to train your own models with. So, using this model as a zero-shot learning classifier,
we want to be able to classify our text, which is our movie summaries, into one or more categories. And like I said before, we don't know what those categories are going to be, because hashtags just come and go. And to step back a little bit, as an illustration of transfer learning versus zero-shot learning: we hear a lot about BERT and transformers
and about fine-tuning our models. So the illustration in the bottom corner down there shows the difference between using a pre-trained model, where you do your fine-tuning for some task such as sentiment analysis, versus a zero-shot model, where we have our model
and we just throw text at it and use it directly. So, for classifying our movie overviews, the hypothesis that we have is "this text is about ___". And we are going to take our sequence, which is our movie overview, and our candidate labels, which are our hashtags.
So for each candidate label that we get, our model will say: this text is about hashtag one, this text is about hashtag two. And each time it does that, it gets back a probability, somewhere between zero and one, that indicates the model's belief that this text can be classified as that category.
So if you're doing multi-class classification, each score will be between zero and one on its own. Whereas if we just do a single class, the probabilities will sum up to one. So, using the model, we're just going to take each movie's summary, and we are going to take
the trending hashtags, and we are going to throw them at the classifier. And you'll see in a little bit that we just wrapped the classifier as a REST service to be able to take that text and return back some numeric probabilities for us. And then we can use those fields in our sorting.
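As a minimal sketch of that classification step, assuming the Hugging Face zero-shot-classification pipeline with an NLI model such as facebook/bart-large-mnli (the overview text and hashtags here are made up):

```python
from transformers import pipeline

# Zero-shot classification is built on an NLI model; no fine-tuning on hashtags needed.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

overview = "A retired officer travels home for a Christmas party that goes very wrong."
hashtags = ["christmas", "sports", "politics"]  # today's trending hashtags

result = classifier(
    overview,
    candidate_labels=hashtags,
    hypothesis_template="This text is about {}.",
    multi_label=True,  # score each hashtag independently (multi_class in older versions)
)

# result["labels"] and result["scores"] are aligned, e.g. christmas -> 0.92
for label, score in zip(result["labels"], result["scores"]):
    print(label, round(score, 3))
```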
For model training, we use Hugging Face Transformers. It makes it really easy to do and it's readily available. We use DVC to store our models along with our source code. So one of the things about applying DevOps to MLOps
is having a way to keep everything under source control. And DVC, which I have a slide on next, I think, is a really good way to be able to do that. Everything runs in containers, so we can deploy it to our cluster, run it locally, or wherever we need to. So, model versioning with DVC.
So DVC is a really good tool, if you're not familiar with it, for capturing artifacts like models in a Git repository. It stores the metadata about the files in Git while pushing the actual models, which can of course be many gigabytes in size, to some backend,
whether it's S3, an SSH file share, NFS, or some other place. So you can do a git push and a git pull and get your models, and persist your models right alongside the source code. Whatever code or data you used to train your model, you can now have it right there with the model itself.
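On the consuming side, a DVC-tracked artifact can also be pulled from Python through DVC's API; here is a hedged sketch (the repo URL, revision, and file path are placeholders, not the talk's actual layout):

```python
import dvc.api

# Open a DVC-tracked artifact directly from the Git repo that versions it.
# The metadata lives in Git; the bytes are fetched from the configured remote (S3, SSH, ...).
with dvc.api.open(
    "models/zero-shot/config.json",                 # placeholder path inside the repo
    repo="https://github.com/example/mlops-demo",   # placeholder repo URL
    rev="v1.0",                                     # a Git tag, branch, or commit
) as f:
    config = f.read()
```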
So, evaluating the model. After we make a model, we need to know how well it's working. There are two bullet points: trends come and go, and new movies get released. Put those together, and we end up with a stale model. It's probably not going to do what it needs to do
a few days, months, or years from now. So we have to keep improving it over time. And we have to have a way to be able to judge the performance of our model. One way to do that is to use human judgments of movies against a few categories. So in that example search at the beginning, you can go through and you can score each result
as somewhere between one and four for how good of a search result it is. You can save those judgments, and you can compare them against the search results you get using your model. How well does it perform? Is it better than human judgment? Is it worse?
How does it compare to our human judgments? That can give us a baseline performance going forward. So as we evaluate new models, we can test them against our human judgments as well and hopefully see some improvement over the last one. And this is where we use the Quepid tool to make those judgments. So here's an example of our human-judged movies on the left.
Most of the movies shown there are very much about Christmas. The ones that we gave threes to are only a little bit about Christmas; they just take place during Christmastime. And for the movies on the right, I took some liberty on some of those Christmas movies,
just to illustrate the point a little bit more. Predators certainly did not come back as a search result. But you can give it a zero, as it has nothing to do with Christmas, same as Die Hard. It only takes place during Christmas. It's not actually about Christmas. And so those judgments are less than the ones on the right.
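To turn graded judgments like these into a single number that can be tracked from one model version to the next, a standard choice is nDCG; here is a minimal sketch (in the actual setup, Quepid computes this kind of relevance metric for you):

```python
import math

def dcg(grades):
    """Discounted cumulative gain over a ranked result list's judgment grades."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(grades):
    """Normalize against the ideal (best possible) ordering of the same grades."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Judgments (1-4, as in the talk) for the ten results returned for "family"
# while #christmas is trending; higher nDCG is better, compare across models.
print(ndcg([4, 4, 3, 1, 4, 2, 1, 3, 1, 1]))
```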
So as our model improves, we hope that the scores of our search results on the right will show improvement over time. So, deploying our model: we use KFServing. It's part of the Kubeflow project, and KFServing is really nice because it encapsulates
a lot of stuff that you would normally have to write yourself. For example, you could certainly take your model, wrap your own REST service around it in just a couple of hours, call yourself done, and deploy it. But you'd be missing out on some of the stuff that KFServing gives you, such as autoscaling,
networking, health checks, and so on. So just by using KFServing, you can have all of that available to you. And once you do, your model becomes available for inference over a REST API. So using a curl command, such as the example shown, we can send sentences and labels over to the model and get the output back pretty easily.
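The same call can be made from Python as well; here is a sketch assuming the standard KFServing /v1/models/&lt;name&gt;:predict endpoint and a request body shaped however our custom predictor expects it (the hostname and JSON keys below are illustrative):

```python
import requests

# KFServing exposes the model behind a versioned predict endpoint.
url = "http://zero-shot-classifier.default.example.com/v1/models/zero-shot-classifier:predict"

payload = {
    "sequence": "A retired officer travels home for a Christmas party that goes very wrong.",
    "labels": ["christmas", "sports", "politics"],  # the trending hashtags
}

response = requests.post(url, json=payload, timeout=30)
print(response.json())  # e.g. {"christmas": 0.92, "sports": 0.03, "politics": 0.01}
```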
The model itself: we use DVC at training time, and at production time you can keep using DVC. You can also use the new Hugging Face Model Hub; you can persist the model up there and just pull it automatically if you choose.
So once you have a good new classifier, it's ready to be used. A little bit more about KFServing, briefly: it does model inference on Kubernetes and supports a lot of ML frameworks, regardless of what you use. And you can use a custom Docker image.
So you can use pretty much any type of model, as long as you implement the interface. And here is an example of the interface that you implement for KFServing: you just implement the init, load, and predict functions inside your Python class. Your init sets things up, load loads your model,
and predict is where the REST service hands you your input and you do whatever you need to do. So simply by implementing those functions, you get to take advantage of everything that KFServing gives you.
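Here is a minimal sketch of that interface, assuming the kfserving Python package (KFServing 0.x; the project has since been renamed KServe) and a Hugging Face zero-shot pipeline inside the predictor; the class and model names are illustrative:

```python
import kfserving
from transformers import pipeline

class ZeroShotClassifier(kfserving.KFModel):
    def __init__(self, name: str):
        super().__init__(name)   # init: set things up
        self.classifier = None
        self.ready = False

    def load(self):
        # load: pull the model into memory (could come from DVC, the Model Hub, etc.)
        self.classifier = pipeline("zero-shot-classification",
                                   model="facebook/bart-large-mnli")
        self.ready = True

    def predict(self, request: dict) -> dict:
        # predict: the REST service hands us the parsed JSON body.
        result = self.classifier(
            request["sequence"],
            candidate_labels=request["labels"],
            hypothesis_template="This text is about {}.",
            multi_label=True,
        )
        return dict(zip(result["labels"], result["scores"]))

if __name__ == "__main__":
    model = ZeroShotClassifier("zero-shot-classifier")
    model.load()
    kfserving.KFServer(workers=1).start([model])
```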
Deploying with KFServing is done through a custom resource in Kubernetes, so you can just define your image in that resource and deploy it to the cluster without much trouble. So, just like with DevOps, MLOps requires monitoring. We need to be able to monitor pretty much everything, all the time.
And MLOps is no different. At training time, you need to capture the metrics that go along with the model. At deployment: how well is the model performing? Is it responsive? In our case, inference is not done at search time; it's done when we update our index.
So that's a little bit better, but inference times are still important to cut down on. And then effectiveness: monitor the click results on our search page to make sure that the search results users are getting back are what they expect, what they're looking for.
If people are continually going to page two and three, then maybe we're not doing as well as we should, and our model needs improvement. So, just like with DevOps, it's important to monitor everything that we can. All of this code is up on my GitHub under the Berlin Buzzwords repo.
It all runs and works to give you an idea of how things are set up. So feel free to clone that project. Just follow the steps in the README, and you can run each component using Docker Compose, if you wish, for simplicity. But you can set it up to consume tweets and grab hashtags, index the movies,
throw the movies at the classifier, and then do a search and see how the results are influenced by some hashtags. And you can also short-circuit the hashtags a little bit and make your own hashtags, if you want to cut down on the time it takes
just to experiment with it. So please feel free to give it a go. Now, if you're familiar with Kubeflow, one question you'll probably ask is: why not just use more of Kubeflow? The short answer is, you totally could. It is a bit heavier, and it does have a little more of a learning curve. If you already have ML pipelines in production,
or you're experimenting with them, and you're going to have multiple, then you probably should. Rather than managing everything one-off, using Kubeflow to do all of those is probably your best bet. And again, this presentation was more designed to show principles. So now that we have this knowledge
of what we need from MLOps, we can look at Kubeflow or some other orchestration framework to see if it's what we need. So, to summarize: MLOps may be thought of as just applying DevOps to machine learning, but that's not all it is.
It brings a lot of new challenges. We have to address our ML tech debt; we can't just say it doesn't bring us joy and throw it away. There's another paper by the same authors called Machine Learning: The High Interest Credit Card of Technical Debt, from 2014. Again, it's a very good read for insight
into how we must manage ML tech debt. A zero-shot classifier is a wonderful tool to help us label text with categories that were not known at training time. It has a lot of use cases, and I think it will probably be widely used going forward. KFServing makes life easy
when deploying models to Kubernetes, even without using all of Kubeflow. And so with a lot of care, we can get an ML project to production. We can hopefully increase the percentage of projects that make it from dev or experiment into production. And lastly, we need to look at the whole ML project ecosystem holistically.
Remember that everything affects everything else, and the data today may not be the data tomorrow. So just take the 10,000-foot view when working on it. Lastly, thank you. Again, really happy to be here
presenting at Berlin Buzzwords. If you're working on something similar, I'd love to hear about it. I'm on the Relevance Slack; just look me up there, or email me, either way. And thanks again.
Thanks for that, Jeff. That was a lot of content; I really, really enjoyed listening to that. I don't see any questions coming in from our viewers, but I have one if we have a few moments, though I think we're short on time: how would you incorporate A/B testing,
experimentation, segmentation, all of those kinds of fun things in ML? What's the approach, and how does that get included in all of this MLOps? Sure, yeah, great question. With a system like the one we built,
the A/B part, you can really just apply how you would typically do A/B testing. With this one, with your search results and your classifier, you don't want to make changes to your index on the fly while people are using it, so you kind of need a separate system to test things there. And for A/B testing on your search results,
you can apply or capture different metrics. You don't have to make it available to everyone at the same time; you can roll it out slowly, or however you need to based on your demographics, to try to measure results that way.
And I'll jump in here too. I think Jeff mentioned KFServing, right? So we can utilize some of the capabilities inherent to Kubernetes as we serve up those models using KFServing and put them into the cluster. We can use some of the other mechanisms and approaches that come with Kubernetes, as far as being able to route percentages of traffic
to different models that are being served, using KFServing to serve them up in the Kubernetes environment and route percentages of traffic that way, too.