
The state of Machine Learning Operations in 2019


Formal Metadata

Title
The state of Machine Learning Operations in 2019
Subtitle
This talk will cover the tools & frameworks in 2019 to productionize machine learning models
Number of Parts
118
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
This talk will provide an overview of the key challenges and trends in the productization of machine learning systems, including concepts such as reproducibility, explainability and orchestration. The talk will also provide a high-level overview of several key open source tools and frameworks available to tackle these issues, which have been identified while putting together the Awesome Machine Learning Operations list (https://github.com/EthicalML/awesome-machine-learning-operations). The key concepts that will be covered are: * Reproducibility * Explainability * Orchestration of models The reproducibility piece will cover key motivations as well as practical requirements for model versioning, together with tools that allow data scientists to achieve version control of model+config+data to ensure full model lineage. The explainability piece will contain a high-level overview of why this has become an important topic in machine learning, including the high-profile incidents that tech companies have experienced where undesired biases have slipped into data. This will also include a high-level overview of some of the tools available. Finally, the orchestration piece will cover some of the fundamental challenges with large-scale serving of models, together with some of the key tools that are available to ensure this challenge can be tackled.
Transcript: English (auto-generated)
All right. So, I think we're going to get started. So, thank you very much to everybody for coming. I'm quite excited today to give you an insight into the state of production machine learning in 2019. This talk is going to be a high-level overview of the ecosystem.
And it's going to tackle and dive into three key areas, the ones that I personally am focusing on the most. So, to tell you a bit more about myself, I am currently the chief scientist at the Institute for Ethical AI and Machine Learning. And also, engineering director
at this open source, open core startup called Seldon Technologies based in London. To tell you a bit more about both of my roles, with the institute, I focus primarily on creating standards as well as open source frameworks that ensure people have the right
tools and infrastructure to align with all those ethical principles that are coming out as well as industry standards. So, it basically asks the question of what is the infrastructure required so that reality matches expectation. If there's a regulation like GDPR that demands
the right to explainability, it's really questioning what does that mean from an infrastructure level and what would it be required to even enforce it. And then from the day-to-day, so I lead the machine learning engineering department at Seldon. Seldon is an open source
machine learning orchestration library. So, you would basically use Seldon if you want to deploy models in Kubernetes and basically manage, you know, hundreds or thousands of models in production. And some of the examples that I'm going to be diving on are actually going to be using some of our open source tools. You can find the slides as well as
everything that we're using in that link on the top right corner. The link's going to be there, so, you know, don't rush to take a picture. So, let's get started. In terms of small data science projects and just data science projects in general, they tend to boil down into two different steps. The first one is model development.
The second one is model serving. In the first one, you know, the standard steps that you would go through is basically getting some data, you know, cleaning the data based on some knowledge, defining some features to transform the data, then selecting a set of models with hyper
parameters. And then with your scoring metrics, you would then iterate many, many times until you're happy. And once you're happy with the results of the model that you've built, you would want to persist this model, and then you would go to the next step, which is to serve it in production. That's when unseen data is going to pass through the
model, and you're going to get predictions and inference on that new data. That is basically, you know, a very big simplification, but, you know, we're going to be using this throughout the talk. However, as your data science requirements grow, you know, we face new issues.
You know, it's not just as simple as you keeping track of the features and the different algorithms that you use at every single stage. You know, you have an increasing complexity on the flow of your data, right? You perhaps had a few cron jobs running the models that you pushed in production, and now that you have quite a few, you go into cron job
hell, right? I mean, I don't know who uses that, you know, color palette for the terminal, but I guess each data scientist has their own set of tools: some love TensorFlow, some love R, Spark, you know, you name it, and, you know, good luck trying to take
them away, not just because they just really like them, but also because some are more useful for certain jobs than others. So you're going to see a lot of different things that you're going to have to put in production. Serving models also becomes increasingly harder, so you actually have multiple different stages that have their own complexities in themselves,
you know, building models, hyperparameter tuning, those in themselves become, you know, one big theme themselves. And then when stuff goes wrong, you know, it's actually hard to trace back, right? If something goes bad in production, you know, is it because of the data engineering piece or the data scientist or the software engineer, right? You always have
like the Spider-Man meme, you know, pointing fingers. So basically the takeaway from that is that as your technical functions grow, so should your infrastructure. And this is what we refer to today as machine learning operations or just production machine learning concepts. In this
case, you know, it is that green layer that involves that model and data versioning, orchestration, and, you know, really it's not just, you know, those two things. And the reason why it's challenging is because we are now seeing an intersection of multiple roles. This is basically software engineers, data scientists, and DevOps engineers, which are
all condensing into this role of machine learning engineer. And, you know, the definition of this role in itself is quite complex because it does require expertise across those areas. And you see that when you look at a job description, right? You know, these AI startups are hiring for this PhD with, you know, 10 years experience in software development, you know,
maybe three years McKinsey-style consulting experience for a salary of an intern, right? I mean, that's basically what you have a lot of the times. And, you know, the reason why it is challenging is because we're now seeing things like, you know, data science at scale. And you have the requirements for the things that you would normally follow in the sort
of like data science world to also apply in some of the, in certain extent in the software engineering and DevOps world. And, you know, when I say it's challenging is because it actually breaks down into a lot of concepts. And we've actually broken down the ecosystem in an open source, awesome production machine learning list, which, you know,
we would love for you guys to contribute to if you see one of the tools that is missing. It is one of the most extensive lists specifically focused on production machine learning tools. So basically, you know, just the explainability piece has, you know, an insane
amount of open source libraries. But the ones that we're going to be diving to today, not saying that, you know, they're not, the rest are not as important for sure, but it's the ones that I myself work mostly on a day-to-day basis, are orchestration,
explainability, and reproducibility. And for each of these principles, we're going to be diving into the conceptual definition of what they mean together with an example, a hands-on example showcasing what is the extent of the ways that you can address this challenge, as well as a few shout-outs to other libraries that are available for you to check out.
So to get started, model orchestration. So this is basically training and serving models at scale. And, you know, this is a challenging problem because you are really dealing with, I guess, you know, in a very conceptual manner handling, you know, an operating system challenge
at scale, right? You need to allocate resources as well as computational hardware requirements. For example, if you have a model that requires a GPU, then you need to make sure that the model executes in the area where the GPU is available. So it is really hard. So it's
important to make sure that you are aware that this complexity involves not just the skill set of the data scientists, but also it may require the sysadmins and infrastructure expertise to be able to tackle it. And the reason why it also gets hard is because having stuff in production that is dealing with real-world problems, you know, also
dives into other areas. So you have this already ambiguous, you know, role of machine learning engineering, and it's currently intersecting with the roles of, you know, industry domain expertise as well as policy and regulation to create this sort of like centralized industry standards. This already introduces that ambiguity of how do you have
that compliance and governance with the models that you deploy in production. And, you know, this is kind of like the very, very high level, but, you know, for some of the DevOps engineers, they may say, well, the standardization of metrics, right? If you're in a large organization, you may actually have to abide by certain SLAs. And with microservices,
these SLAs are quite standard. They are uptime, you know, they could be latency. But when it comes to machine learning, you may actually have some metrics that you have to abide by, like accuracy, and things that you need to be aware of, like model divergence. And of course these may differ for every single one of your deployments, but, you know, to a certain extent, it is necessary to be able to standardize and abstract these concepts on an infrastructural level. And that's what we're going to be diving into at a certain level today. And it's not only metrics, you know, as you would know with any microservice or web app that you
would, you know, deal with in production, but it's also logs and errors, right? If you have an error with a machine learning model, the error may not just be a Python exception, right? This may be an error because the new training data was potentially biased towards a specific class, right? So you had a class imbalance with more examples
in one class than in the other. That could be in a way leading to errors that are not specifically, you know, exceptions, right? So you may not get notified because something failed, but, you know, you may see stuff failing because of that. And it is also how do you standardize the stuff that comes in and out of the models? How do you
track this? And then also, for example, if you have images coming into a model, you know, you can't just go into your log, you know, Kibana dashboard and just see that like binary dump of the data, right? So it's really understanding what to log in those cases. Now, when you actually deal with machine learning in production, you
also dive into complex deployment strategies, right? So normally you may imagine just putting a text classifier in production, but perhaps you may want to reuse components. Or maybe you want a more complex computational graph where you have some sort of routing based on some conditional cases. You may have some, you know, multi-armed bandit optimizations that route to different models at the end. Or you may have other things like explanations. In a bit, you know, we're going to dive into that, but explanations are a big thing in the machine learning space. And you may
want to have those things in production so that your domain experts can make sense of what's currently deployed. And again, you know, yes, you could actually do this custom for every single thing, but the reason why you wouldn't want to do that is because if you have a manual work with every single model, what you're going to end up having
is, you know, each data scientist having a maximum of, say, for example, 10 models that they can maintain in production at one possible time. So if you want to deploy more models, you're going to end up having to hire more staff, right? So you actually want to avoid that linear growth of your resources, technical resources, with your
actual internal staff. Okay. And this is where, you know, the concept of GitOps comes in. And this is the concept where you use your GitHub repo or your version control system as your single source of truth. And whatever actually
gets updated there will reflect what you have in production. This may not be only limited to the code of your application, but may also, you know, reach the extent of the configuration in which your cluster may actually be currently following. And in this case, you know, we're going to be showing an example where we are going to first start
with a very, very simple model. We're going to be taking a, you know, very common data set that you're probably used and followed a tutorial with, which is the income classification data set. And we're going to basically assume that, you know, we're taking this, you know, data set of, you know, people's details like, you know, your number of working
hours per day, your, you know, working class, et cetera, et cetera. And we're going to train a machine learning model to predict whether that person earns more or less than 50K. And in essence, in this example, we're going to assume that we're using this prediction for approving someone's loan, right? If it predicts
more than 50K, it would be approved, otherwise rejected. You know, I don't recommend anyone to do this in production. This is just an example. And what we're going to be doing is we're going to be wrapping this Python model and then deploying it and, you know, seeing how we can get some of this, like, standardized metrics, getting some of this standardized logging, et cetera, et cetera. So in this case, all of these examples are
actually open source and they're all available on the link. So you can actually go and try them yourself. Within an hour, I mean, we're only going to be able to cover them in a high-level perspective. So in this first part of the example, we're
only going to be creating a Python model, then we're going to be wrapping it, and then we're going to be deploying it in a Kubernetes cluster, right? So it's going to be containerized with Docker, and then it's going to be exposing the internal functionality through a RESTful API. So the way that we would do it is we would set up our environment, which basically requires you to have a Kubernetes cluster running. You know, I'm not going
to be trusting the internet for that to actually help us today. So, you know, I already have everything set up in tabs, as you can see, just in case. So what we're going to be doing in this case, we're downloading in here the dataset. So this dataset contains,
you know, in this case, applications of people and whether they get approved or rejected. We do a train test split, as you normally would, and in this case, you know, you would have, let's actually have a look at the dataset. Yeah, so you basically have already
a normalized dataset where you have, in the first column, the age of the people, and then the categorical classes for the rest of the features. And then we actually can print the
labels as well. We can see the feature names, and that's basically the order in which we have the age, the working class, education, et cetera, et cetera. Perfect. So the first thing that we're going to be doing, we're going to be using scikit-learn. So just
to get a bit of an understanding, I mean, who here has used scikit-learn? Let's see a show of hands just for tutorial. Okay, perfect. Awesome. So what we're doing here is we're just building a pipeline. We're going to be scaling our numeric data points, as well as, you know, creating a one-hot encoding of our categorical data points, and we're
going to be transforming the data with that. So now that we've actually, you know, fit our preprocessor, we're going to be then training a random forest classifier with that sort of data set so that it, you know, takes the preprocessed data and then predicts
whether a person's loan would be approved or rejected. Once we actually train our model, we can use the test data set to see how it performs. You know, we can see that in terms of accuracy, it has about, you know, 85 percent, you know, precision, recall, et cetera. So now we have a trained model, right?
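In code, the preprocessing, training and persistence steps just described might look roughly like the sketch below. This is not the talk's exact notebook: it assumes the income data is already loaded into NumPy arrays X and y, and the numeric/categorical column indices shown are only illustrative.

```python
# Sketch of the preprocessing + training step (X, y assumed to be loaded).
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = [0, 8, 9, 10]                  # illustrative: age, capital gain/loss, hours
categorical_cols = [1, 2, 3, 4, 5, 6, 7, 11]  # illustrative: workclass, education, ...

# Scale numeric features and one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(preprocessor.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(preprocessor.transform(X_test))))

# Persist both artifacts so the serving wrapper can load them later.
joblib.dump(preprocessor, "pipeline/preprocessor.joblib")
joblib.dump(clf, "pipeline/model.joblib")
```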
We have, with our scikit-learn, you know, CLF is our random forest classifier, preprocessor is basically our pipeline of our standard scaler and the one-hot vectorizer. So then what we're going to do is we're going to actually take this model and containerize it. And what we're going to do for this is we're going to first just dump those two models
that we've created. So for that, you know, preprocessor and classifier, so we're dumping them in this folder and, you know, we can actually see the contents. So we can see that we basically dumped it there.
Once we have those two models that have already been trained, we basically create a wrapper, and this wrapper is just going to have a predict function that will take whatever input comes in. This predict function will be exposed through a RESTful API, but basically whatever input comes in, we pass it through the preprocessor, we pass it through the classifier, and then we return the prediction, right? So this is very simple: we load the models and then we just run whatever is passed into this predict function and return the predictions, right? Super simple. This wrapper is basically the interface that we require so that we can actually containerize it.
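As a rough illustration, such a wrapper might look like the sketch below; the file name, class name and artifact paths are illustrative, and the exact method signature expected by the wrapper image depends on the Seldon version in use.

```python
# Model.py: sketch of the thin serving wrapper described above.
import joblib

class Model:
    def __init__(self):
        # Load the artifacts persisted during training.
        self.preprocessor = joblib.load("pipeline/preprocessor.joblib")
        self.clf = joblib.load("pipeline/model.joblib")

    def predict(self, X, feature_names=None):
        # Whatever arrives through the REST/gRPC endpoint is preprocessed
        # and passed to the classifier; class probabilities come back out.
        return self.clf.predict_proba(self.preprocessor.transform(X))
```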
So for the next one, for the containerization, we just need to define any sort of dependencies. In this case we actually don't need the image library, just scikit-learn, and then we just define, you know, the name of our wrapper file and run the S2I CLI tool. What it basically does is take our standard builder image, which exposes and wraps this model file through a RESTful API and a gRPC API, right? So once we actually have this container,
so just to get a bit of an understanding in the room, who here has used Docker before? Perfect. So here you just have, oh great, awesome. So here you basically just have a Docker image called loan classifier 0.1. This Docker image, when you run it,
the input command is basically just going to run a Flask API that exposes the predict function. Whatever you send to that predict endpoint, you know, will be passed through basically, you know, your wrapper. So that is basically what it would be doing, right? So once we have that, we would just, you know, specify it in our Kubernetes
definition file. So this is, you know, just saying like the container that we're going to have is this loan classifier and your computational graph in this case just has one element, which is the loan classifier and that's all basically you would have. Once you define that, if it's built, now you can actually deploy it. Here you can actually
see that it's being created in local Kubernetes cluster. So I think it is downloading it, which is not great, but basically what you would then see is this model is now deployed in our Kubernetes cluster. It's going to be listening to any requests. So it's basically
as if it was a microservice, right? And then as any other RESTful endpoint, we can actually interact with it, in this case with cURL. So in this case, we're actually just sending it, you know, one instance to actually perform an inference.
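For reference, the same request could be made from Python instead of cURL; the endpoint path and the row of feature values below are purely illustrative and depend on how the deployment and ingress are configured.

```python
# Sketch of querying the deployed model over REST.
import requests

payload = {"data": {"ndarray": [[27, 4, 1, 2, 1, 4, 1, 0, 0, 0, 40, 9]]}}  # illustrative instance
resp = requests.post(
    "http://localhost:8003/seldon/default/loan-classifier/api/v0.1/predictions",  # path depends on your setup
    json=payload,
)
print(resp.json())  # e.g. {"data": {"ndarray": [[0.86, 0.14]]}}
```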
The response is an array covering the positive and negative labels. In this case, it predicted the negative label. So in this case, what we've done is we've actually wrapped a model with a very, very simple thin-layer wrapper and put it in production. The wrapper itself also
exposes a metrics endpoint, which for the people that have used Prometheus or Grafana in the past, you know, Prometheus, you can actually hook it up to this metrics endpoint. And you're able to get some metrics out of the box. In this case, let's see if I can actually show it. Here is basically our income classifier that we have deployed. And out of the box you get,
you know, in this case, this is a Grafana dashboard. You would get basically all of the requests per second. You get, you know, the latency for that specific container, et cetera, et cetera. And we're actually going to be diving a bit more into some of the metrics
in a bit. You also get some of the logs. So again, this is basically just the output of the container is just being collected with a fluent D server and then, you know, stored in an elastic search database. So for the ones that have used Kibana in the past, this is just basically also us querying the elastic search for the logs. And for
tabular data, this is basically what, you know, we actually expose out of the box. But basically, that is an initial overview of, you know, the orchestration piece. The benefits of actually, you know, containerizing your models, of course, it's obvious in terms of like making it
available for business consumption. But the core thing from this is the push towards standardization. Right? If you were to have, you know, a hundred models in production, you would be able to interact with them as if they were microservices. Right? And what this allows you to do, you know, we have just covered a very, very simple example,
but what this really allows you to do is to leverage this GitOps structure that I was talking about earlier. And just to see here, who here is familiar with PyTorch and with PyTorch Hub? Okay. So PyTorch Hub is basically a new initiative from PyTorch where they
encouraged people to save trained models like BERT or VGG where you can actually submit your models to a Git repo. And what that allows you to do is to have a central sort of like standardized interface towards your already trained models. So in this case,
basically you're able to define any model. In this case, it's ResNet. And you say, this is how you load it. And this is where the trained binary is located. So there's an initiative from PyTorch Hub. And what we have been able to do is to actually create an integration to PyTorch Hub where anytime that you actually point a new sort
of like configuration deployment towards a repo, what it would do is a very thin layer wrapper that just downloads that model. Because the actual code to load it is standardized by the actual deployment. And to be more specific, the way that we actually do it is a wrapper
where you basically take the repo and the model name as input parameters that you can pass through the config files. And then when it actually loads, it downloads the model from PyTorch Hub. So you basically have a new ability to dynamically publish any sort of
like BERT or VGG-like models. I mean, anyone who has actually tried using BERT or one of those state-of-the-art models would know the pain of often setting them up. So there's a lot of benefit of actually trying to standardize the way not only to define them, but also to deploy them. And again, you can actually jump in and try these examples.
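As a rough sketch of the pattern being described (the actual Seldon integration lives in the linked examples), a wrapper that pulls its weights from PyTorch Hub at load time might look like this; the default repo, model name and parameter plumbing are assumptions for illustration.

```python
# Sketch of a thin wrapper that loads its model from PyTorch Hub at start-up.
import torch

class HubModel:
    def __init__(self, repo="pytorch/vision", model_name="resnet18"):
        # torch.hub.load fetches the entrypoint defined in the repo's hubconf.py.
        self.model = torch.hub.load(repo, model_name, pretrained=True)
        self.model.eval()

    def predict(self, X, feature_names=None):
        # Run a forward pass on the incoming batch and return a NumPy array.
        with torch.no_grad():
            return self.model(torch.as_tensor(X, dtype=torch.float32)).numpy()
```

In a real deployment the repo and model name would arrive as configuration parameters rather than hard-coded defaults, which is what makes the deployment definition reusable across models.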
So that is basically a high-level overview of the orchestration part. Before we jump into the explainability piece, some other libraries to watch: one of them is MLeap Serving. Their approach is they actually have a single server that allows you to load a standardized sort of serialization of models. So if anyone is familiar with the ONNX sort of serializable definition of models, you know, you'd be able to have a single server that loads your trained binaries and exposes them through, again, an API. And then another one that is also one to watch is DeepDetect, which unifies a lot of these Python-based models behind a standardized API. And these are two of, you know, a large number of libraries to check out. I definitely would advise you to have a look at the entire list. It's quite extensive.
All right, so the second piece, oh, it should actually be explainability, so we're going to jump on that one. Explainability tackles the problem of black box and white box model situations, where you have a trained model
that you want to understand why did the model predict whatever it predicted, right? And, you know, the way that we tackle it requires the people tackling this issue to go beyond the algorithms. And the reason why is because this is not just an algorithmic challenge, it does take a lot of the domain expertise into account. You know, and the way that we
actually emphasize this is that interpretability does not equal explainability. You may be able to interpret something, but that doesn't mean that you understand it. And of course, you know, in terms of like, you know, the English definition of those words, there is not that
conceptual perspective in place, but we tend to push that sort of way of thinking about it, because it's not just bringing the data scientist to address these challenges. It may require also the DevOps software engineer, but also the domain expert to be able to understand how the model is behaving. And we actually did a three and a half hour tutorial at the
O'Reilly AI conference. So each of these things, you know, we could actually dive into in an insane amount of detail. But just for the sake of simplicity, today we're going to do a high-level overview. The standard process that we often suggest to follow actually extends the existing data science workflow that we showed previously,
and it adds three new steps, which they're not really new, but they are, you know, three steps that are explicitly outlined for explainability. These are, you know, data analysis, model evaluation, and production monitoring. Production monitoring being the one that we're going to dive into today. In terms of data assessment, you would want to explore things
like class imbalances, you know, things whether you're using protected features, you know, correlations within data, you know, perhaps removing a data point may not mean that, you know, you are actually removing a hundred percent of the input that is actually being
brought by that, as well as data representability, right? This is how do you make sure that your training data is as close as possible to your production data. And this is, you know, a very well-known problem. The second one is model evaluation. You know, this is asking questions of what are the techniques that you can use to evaluate your models, things like feature importance, whether you're using black box techniques or white box techniques, whether you're using
local methods or global methods, you know, whether you can actually bring domain knowledge into your models. And this is important because, you know, what your models are doing, they're learning hidden patterns in your data, but if you can actually give those
patterns upfront as features or as, you know, combinations of your initial features that leverage some of the domain expertise, then you're able to actually have much simpler models doing the processing at the end, right? One of the use cases that we had is in
NLP, so automation of document analysis. We actually have been able to leverage a lot of the domain expertise of lawyers, right? Asking like meta learning questions of how do you know this answer is correct or what is the process that you go into finding an answer,
right? Things like that allow you to actually build smarter algorithms and not just in the machine learning models but in the features as well. And then the most important one is the production monitoring, right? How can you then reflect the constraints that you introduced in your experimentation and make sure that you can set those in production, right?
If you think that precision is the most important metric and that you should not exceed a certain number of, you know, false positives or false negatives, then you need to make sure that you're able to have something in production that allows you to enforce that and monitor that, right? So evaluation of metrics, manual human review, you know, not forgetting that you can
leverage humans too, right? Like that is also something that with machine learning you can definitely do. And the cool thing about this is that, you know, with the push that we have into the Kubernetes world, we're able to convert these deployment strategies from just things like
explainers into design patterns. So instead of just having a, you know, machine learning model in production, you can have deployment strategies where you may have another model that is deployed in production whose responsibility is to explain and, you know, reverse engineer your initial model, right? And this may get into a little bit of inception, but
this is actually a pattern that has been seen quite effective and a lot of organizations are starting to adopt, which we named the explainer pattern, which is not very original. But this is what we're going to be doing now. We have already our model deployed in production. We're saying that this model is predicting whether someone's loan should be approved or rejected.
And assuming that this is a black box model, we're now going to deploy an explainer that is going to explain why our first model is behaving as it is, right? So that's what we're going to be doing now. And we're going to be using that same example that we were leveraging.
So now we have our initial model in production. We can actually reach it through this URL. So what we're going to do now is we're going to actually leverage this explainability library for which there are actually many of, but this is one that we maintain. It's called Alibi,
and it offers basically three main approaches to black box model explanations. The first one is Anchors. And Anchors, it answers the question of, from the features that you sent to your model for inference, what are the features that influence that
prediction the most? And the way that it does it is by actually going through all the features and replacing a feature for a neutral value, and then seeing which one affects the output the most. So this is Anchor, and this is what we're actually going to be using. But there's another very interesting one called Counterfactuals. And Counterfactuals are
basically the opposite, well not really the opposite, but conceptually it's the opposite of Anchors. It asks the question of what is the minimum changes that I can add to this input to make that prediction incorrect, or at least different to what it was. So if you were
actually approving someone's loan, the question would be what are the changes that you can make to that input so that the loan is rejected. So this basically allows you to understand things like, for example, with MNIST, you can ask questions of, well, what are the minimum changes that you can do to make that four not a four? But more interestingly, you can actually go from
one class to another. You can say, what are the minimum changes that I can do to this four to make it a nine? So what we're going to be doing is first Anchors on our dataset. So here we're actually just using our Seldon client to also get the prediction. So we're
literally just sending a request and this is the response, which is the same as with the cURL request earlier. But yeah, so we're going to create an explainer and we're going to be using Alibi and the AnchorTabular explainer. So for this, what we're going to be doing is we're going to take our classifier, so that classifier that we trained, that random forest predictor
that we trained before, and we're going to actually expose the predict function and we're going to feed that into our AnchorTabular explainer, because it's going to be interacting with the model as if it was a black box model. It's only going to be interacting with the inputs and outputs. You don't need the training data when using text or images, only when using tabular data. The reason why is because with tabular data you need to ask the question of what would be the neutral values that you would use as replacements. In this case, for numeric features, you take the minimum and the maximum and then you say, well, I want the quartiles or something like that. That's the only reason why you would use the training data. But yeah, so you would fit it and then you would actually look at the input that we're going to be sending. We are actually sending this one, somebody of age 27, and we just predicted it as negative
and we're going to actually explain it. And it basically says, well, what makes this prediction what it was is the feature marital status of separated and gender of female. So that's what basically your explanation for this instance is.
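A minimal local version of that explanation step might look like the sketch below; it reuses the clf, preprocessor, feature names and training data assumed in the earlier sketches, and the exact shape of the returned explanation differs between Alibi versions.

```python
# Sketch of a local anchor explanation with Alibi.
from alibi.explainers import AnchorTabular

def predict_fn(x):
    # The model is treated as a black box: raw features in, predicted labels out.
    return clf.predict(preprocessor.transform(x))

explainer = AnchorTabular(predict_fn, feature_names)
explainer.fit(X_train)   # training data is only used to derive bins for the numeric features

explanation = explainer.explain(X_test[0])
print(explanation)       # e.g. an anchor like "Marital Status = Separated AND Sex = Female"
```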
And what is now starting to get interesting is that we're now going to actually use our local explainer on our model that we basically deployed already. So in this case, that predict function that we basically had, we're now going to be using that remote model.
So we're actually going to be sending the request to the model that is currently in our Kubernetes cluster. And when we actually request the explanation, we're going to get the same result. The only difference is that we're now actually reaching out to that model in production.
And now we're going to actually follow the same things. We're going to just containerize the explainer and we're going to put the explainer in production. So again, we actually create a wrapper. The wrapper has a predict function. The predict function just basically takes the input and runs explain and returns the explanation.
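A sketch of what that explainer wrapper might look like is below; the in-cluster model URL, the payload handling and the way the fitted data is persisted are all assumptions for illustration rather than the talk's actual code.

```python
# ExplainerWrapper.py: sketch of the "explainer pattern" wrapper described above.
import joblib
import numpy as np
import requests
from alibi.explainers import AnchorTabular

MODEL_URL = "http://loan-classifier:9000/predict"   # assumed in-cluster endpoint

def remote_predict_fn(x):
    # Ask the *deployed* model for predictions instead of a local classifier.
    payload = {"data": {"ndarray": np.asarray(x).tolist()}}
    probs = requests.post(MODEL_URL, json=payload).json()["data"]["ndarray"]
    return np.argmax(probs, axis=1)

class ExplainerWrapper:
    def __init__(self):
        # Rebuild the anchor explainer against the remote model, reusing the
        # feature names and fit data saved at training time (assumed file).
        data = joblib.load("explainer_data.joblib")
        self.explainer = AnchorTabular(remote_predict_fn, data["feature_names"])
        self.explainer.fit(data["X_train"])

    def predict(self, X, feature_names=None):
        # Explain a single incoming instance against the model in production.
        return str(self.explainer.explain(np.asarray(X)[0]))
```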
So we've containerized the explainer and deployed it, and now what we have in production is an explainer as well: we have our loan classifier explainer as well as our initial model. And what's interesting is that now you can actually send
to one of these components a request to do an inference and you can send another request to explain that inference by interacting with that model in production. And we can actually visualize it here. If you remember, with our income classifier, if we actually have a look at the logs, these are all the predictions that have gone through the model through basically as
requests. So what we can do now is we can actually take one of these and send a request for the explainer to explain what's going on. And look, this is the exact
same thing that you just saw in the other one, but flashy, shiny, and colorful. This just basically says, for that other explanation, you still have that marital status of separated influencing your prediction by this much, gender female by this much, and capital gain by this much, and then you also can see predictions that are similar
or different. But in essence, you're still getting the same insights as if you were using it locally, but again, you're getting those sort of standardized metrics. So that explainer also has the metrics exposed, also has the logs exposed, et cetera, et cetera. So you get that benefit. And that's basically the example to the explainers. And now we're actually going
to go one level deeper. But before that, I want to give some libraries to watch in the model explanation world. These are ELI5, which stands for "explain like I'm five"; this is a very cool project, and they do a lot of different techniques. SHAP, which you've probably come across if you're in this space or have looked at model explanations. And XAI, which is one where we really specifically focus on the data, so techniques for class imbalance, et cetera, et cetera. And then again, as I mentioned, there's tons,
right? I mean, with this black box model explanations, you can actually dive into so many different libraries. It's a very exciting field. So I do recommend to actually have a look. Now, for the last part, for the last part is on reproducibility. So reproducibility, this
basically answers the question of how do you keep the state of your model with the full lineage of data as well as components? And this really breaks down into the abstraction of its constituent steps. For every single part of your machine learning pipeline, you're going to have a piece of code, configuration, input data, right? And for each of those
things that you have, you may want to actually freeze that as an atomic step. And the reason you may want to do that is because you may want to perhaps debug something in production or for compliance have audit trails of what happened, when it happened, and what did you
have in there. And the reason why it's also hard is because it's not only the challenge in an individual step. The challenge also goes onto your entire pipeline. So each of the components on your pipeline, each of those reusable components may actually require to have that level of standardization. And you saw it with the configuration definition
that we had in the previous example, where we actually had a graph definition. There you can have multiple different components, which are Docker containers, which are containerized pieces of your atomic steps. And one thing is to actually be able to keep those atomic steps, and another one is to actually be able to keep the understanding of the metadata of the
artifacts that are within each of these steps, right? Because metadata management is hard, right? And now we're getting into a point where it's not only metadata management, but it's metadata management on machine learning at scale. And it's doable, it's just that it requires, in some areas, you know, a new way of thinking. And what we're going to be diving
into here is basically the point that we haven't covered. We talked about models that are already trained, but what we haven't talked about is potentially the process of training models. And we're actually currently contributors to this project called Kubeflow, which I'm not sure if
you've heard about, but Kubeflow focuses on training and experimentation of models on Kubernetes. And what it allows you to do is to actually build reusable components. What we're going to be diving into in this last example is going to be a reusable NLP pipeline in Kubeflow. And what this is going to be more specifically,
let me actually open it, is going to be this example, which I have the Jupyter notebook. You can try it yourselves, but we're going to be actually creating a pipeline with these
individual components. If you guys have ever done NLP tasks, we're going to be doing a, let's call it sentiment analysis, where you would find the usual steps, cleaning the text, tokenizing it, vectorizing it, and then running it through a logistic regression classifier.
The first step is just going to download the data. And we're basically using the Reddit hate speech dataset: from r/science, all the comments that were deleted by mods have been compiled. And yeah, so basically what we have here is these components, and we would want to actually create this computational graph in production that uses them as separate entities. And the reason why you want that is because maybe you want to reuse your spaCy tokenizer for other projects. And you perhaps want to keep, you know, your own feature store, right, where you actually just pick and choose different things, you know, that ultimate drag-and-drop
data science world. But yeah, so basically this is what we're going to be doing in this example. You know, from a high level perspective, what it's going to consist of is five repeats of wrapping models. But in this case, it's just wrapping scripts in that same process
that we did previously. For example, the clean text step is basically, again, just a wrapper class called Transformer with a predict function that takes, you know, the text in as a NumPy array. Or, in the case of the TF-IDF step, it runs the vectorization and then returns the actual vectorized output, right? And then in terms of the actual interface to it, it's just like a CLI.
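As an illustration of what one of these components might look like (not the repo's exact code), here is a sketch of a TF-IDF step wrapped in the same way, with a tiny CLI entry point so it can run as a pipeline step.

```python
# TfidfStep.py: sketch of a reusable pipeline component.
import argparse
import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class Transformer:
    def __init__(self, max_features=10000):
        self.vectorizer = TfidfVectorizer(max_features=max_features)

    def fit(self, texts):
        self.vectorizer.fit(texts)
        return self

    def predict(self, X, feature_names=None):
        # X is an array of strings; return the TF-IDF feature matrix.
        return self.vectorizer.transform(X)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-path", required=True)
    parser.add_argument("--out-path", required=True)
    parser.add_argument("--max-features", type=int, default=10000)
    args = parser.parse_args()

    with open(args.in_path) as f:
        texts = np.array(f.read().splitlines())

    transformer = Transformer(args.max_features).fit(texts)
    joblib.dump({"model": transformer, "features": transformer.predict(texts)}, args.out_path)
```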
But once we have these components, then, you know, we're able to define our pipeline and we can upload this pipeline into Kubeflow, which then looks like this, right? It's basically all of the steps with all the dependencies. The only difference is that it uses a volume that is attached to each
of the components to pass the data from one container to the other, right? So for each component, the volume is attached. And the interesting thing here is that you can actually create the sort of like experiments through your front end. You know, you can actually choose what parameters you expose. And you know, here I can actually change the number
of TF-IDF features, et cetera, et cetera, and then just run our pipeline. And then you can actually see your experiments. You can see which ones have run. And then for each of the steps, you can actually see the input and output as you print it for each of the components.
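Stepping out of the demo for a moment, the kind of pipeline definition being uploaded here might be declared roughly like this with the Kubeflow Pipelines SDK of that era; the image names, mount paths, arguments and parameter plumbing are illustrative, and the full example in the linked repo is more complete.

```python
# Sketch of an NLP pipeline declared with the kfp dsl: each step is a
# container, and a shared volume passes intermediate artifacts between them.
import kfp.dsl as dsl
from kfp import compiler

@dsl.pipeline(name="nlp-pipeline", description="Clean, tokenize, vectorize and classify Reddit comments")
def nlp_pipeline(tfidf_max_features: int = 10000):
    vop = dsl.VolumeOp(name="workspace", resource_name="nlp-pvc", size="1Gi",
                       modes=dsl.VOLUME_MODE_RWO)

    download = dsl.ContainerOp(name="download", image="nlp/download:0.1",
                               arguments=["--out", "/mnt/raw.csv"],
                               pvolumes={"/mnt": vop.volume})
    clean = dsl.ContainerOp(name="clean-text", image="nlp/clean:0.1",
                            arguments=["--in", "/mnt/raw.csv", "--out", "/mnt/clean.csv"],
                            pvolumes={"/mnt": download.pvolume})
    tokenize = dsl.ContainerOp(name="spacy-tokenize", image="nlp/tokenize:0.1",
                               arguments=["--in", "/mnt/clean.csv", "--out", "/mnt/tokens.csv"],
                               pvolumes={"/mnt": clean.pvolume})
    vectorize = dsl.ContainerOp(name="tfidf-vectorize", image="nlp/tfidf:0.1",
                                arguments=["--in", "/mnt/tokens.csv", "--out", "/mnt/features.joblib",
                                           "--max-features", tfidf_max_features],
                                pvolumes={"/mnt": tokenize.pvolume})
    dsl.ContainerOp(name="train-lr", image="nlp/lr-train:0.1",
                    arguments=["--in", "/mnt/features.joblib", "--out", "/mnt/model.joblib"],
                    pvolumes={"/mnt": vectorize.pvolume})

if __name__ == "__main__":
    # Compile to an archive that can be uploaded through the Kubeflow UI.
    compiler.Compiler().compile(nlp_pipeline, "nlp_pipeline.tar.gz")
```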
So here we can see the text coming in and then the tokens coming out from the other side. And then the last step, you know, is a deploy. And what that basically does is it puts it again in production, listening for any requests. For this specific demo, what we've done is, you know, you can see the deployed model here. So it's an NLP
Kubeflow pipeline. And you know, you can see that there's actually live requests through each of the components. So you can see that the clean text, the spaCy tokenizer, the vectorizer, et cetera. What we're sending it live, and this is actually quite funny, we're actually sending all the tweets related to Brexit. Do you guys know what Brexit is?
Yeah? So it's actually doing hate speech classification. So the funny thing is, that doesn't matter what side you're in, there's a lot of hate. And we can actually see, you know, here we have this sort of like nice looking logs. But you know, as I mentioned,
you can also jump into the Kibana. And here you can see like, you know, Celtic, Brexit, Spring. Yeah, well, I don't know. I don't want to read them out loud because there are some that, you know, are not very appropriate. But yeah, so basically, now we have just this like production, you know, Brexit classifier that can actually be
trained with different data sets, and it just gets exchanged automatically through these steps. And the objective here is to actually just show the sort of complexities of this reproducibility piece, and how there are different tools trying to tackle it. This dives more into the experimentation and training part. And I haven't even dived into
the pieces around the complexity of tracking metrics as you run experiments, right? This is basically: I run 10 iterations of the model, I want to know which performed better, how do I keep track of my metrics, as well as the models that I used? So, you know, each of these things that I've covered has so many different dimensions to tackle it from. And, you know, we actually have talks online, where we have,
an hour, an hour and a half on just one of these; today was more of a high-level overview. And, you know, other libraries to watch: data version control, DVC, which is basically a Git-like sort of CLI that allows you to, you know, run the usual commit and push workflows, but for those three components of your code, configuration, and data. And another one is MLflow from Databricks, which focuses on experiment tracking; we actually have some examples where we integrate with it.
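As a quick illustration of the experiment-tracking idea (using MLflow's public tracking API, not the talk's own integration), each run can log its parameters, metrics and model artifact so the iterations can be compared later; the arrays and preprocessor reuse the assumptions from the earlier training sketch.

```python
# Sketch: track several training runs with MLflow so they can be compared.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

for n_estimators in (50, 100, 200):
    with mlflow.start_run():
        clf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
        clf.fit(preprocessor.transform(X_train), y_train)        # preprocessor fitted earlier
        acc = clf.score(preprocessor.transform(X_test), y_test)
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_metric("accuracy", acc)
        mlflow.sklearn.log_model(clf, "model")                   # keeps the trained artifact with the run
```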
And then there's Pachyderm, which dives into full compliance. So as you can see, you know, the ecosystem for this is so broad, but it's also at the same time super, super interesting. And yeah, so I'm going to wrap up and jump into questions just in case anyone has any questions on this or any other libraries. But before that, I'll just, you know, give a few words on this sort
of stuff. You know, we covered, you know, three of the key areas that, you know, I have been focusing on. These are orchestration, explainability, and reproducibility. But as I mentioned, you know, the content is, you know, insanely broad. Things that I actually
haven't talked about, which is also insanely interesting, are things like adversarial robustness. You know, as you saw, some of our explainability techniques have an approach to explain through adversarial attacks, kind of. So it's also interesting to see how there is a lot of overlap across each of these areas. And also not only overlap, but also
different levels at which some fit into the other categories, right? You know, privacy is one that is super interesting that, you know, we haven't covered, that dives into privacy-preserving machine learning, which is an interesting area in itself. Storage, serialization,
function as a service, et cetera, et cetera. So with that, you know, I have been able to give a high-level overview of the state of production machine learning in 2019. It wasn't exhaustive, but, you know, it does feel like it was. But yeah, so if we have some
questions, I'm happy to cover them now or later at the pub. Thank you very much, guys. It was a pleasure. Thank you very much for your talk. I'm actually chairing your session. So do we have questions? Please come ahead. Come to the microphones. It's working.
For your questions, please. Hi, excellent talk. Thank you. I was mostly inspired by this explainability idea and have two questions about it. So first, let's assume we have
a lot of features and they produce, like, a huge space of variants. So it seems that when I try to explain this black box, I need to iterate over all these features, all variants of these features, and it seems like there's a performance issue here. How can it be solved? So that's the first question. And the second one: some models themselves have some information about feature importance within them, like this random forest. Have you compared some results from this explainer with the internal results of the model itself?
Yeah, those are two really good questions. The first one was basically: you have a lot of features, so what is the computational complexity around that and how do you deal with it? And the second one was, what was the second one? Internal importance, yes: comparing internal importance to black-box explainability. So let's dive first into the computational challenge. That is 100% correct, and with anchors as a technique we are conscious that explaining a black-box model as a whole often becomes quite expensive. The way we have been able to tackle it is by separating how you request explanations from how you request predictions. For explanations you may not want something that runs in real time for every single prediction that goes through; instead, you dive deeper into one or a few of the inference predictions you have. Perhaps something went wrong and you use explanations to debug how the model behaved, or if the threshold you set for accuracy was 90%, you only request explanations for the predictions that fall under it when you assess them. That is one side of it. On the other side, interestingly enough, our data science team published a paper this week that proposes a way to deal with the computational cost of counterfactuals specifically, and of contrastive explanations, by using the concept of prototypes, together with neural networks, to reduce the dimensionality of the features themselves. That paper is on arXiv and you can check it out, but there is a lot of research in that space to make this more feasible without sacrificing the power of the explanations. Unfortunately there is no silver bullet, so I do acknowledge that it is a challenge.
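To make that on-demand pattern concrete, here is a minimal sketch assuming Seldon's open-source alibi library for the anchors explainer; the model, the data and the 0.9 confidence threshold are placeholders, and the exact call signatures may differ between alibi versions:

from alibi.explainers import AnchorTabular
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)
predict_fn = clf.predict_proba

# Fit the anchors explainer once, up front.
explainer = AnchorTabular(predict_fn, feature_names=data.feature_names)
explainer.fit(data.data)

# Request the (expensive) explanation on demand, e.g. only for low-confidence predictions,
# rather than for every prediction served in real time.
for x in data.data:
    if predict_fn(x.reshape(1, -1)).max() < 0.9:   # hypothetical confidence gate
        explanation = explainer.explain(x)         # result fields vary by alibi version
        print(explanation)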
But that is also why there is a benefit in leveraging white-box models in the situations where you actually can, and, following on to your second question, you can leverage some of the internal structure of models such as random forests or neural networks, for example using the weights of the network, to explain much more easily. It is also worth mentioning that some of the explainers are themselves optimization problems; for example, we use gradient descent in some of the explanation techniques for counterfactuals. Now, on the second piece, leveraging the internals and seeing how they perform against the black-box explanations: we have not actually run benchmarks of how they compare, but that is definitely something we would be interested in. If you are interested in it, it is open source, so we would love a pull request or an issue against our documentation on that. But that is a really good question. Okay, thank you.
Thank you. Do we have more questions? Please go ahead.
Hello, I have two questions. The first is: what are your views on setting up a kind of feedback loop, a pipeline where, after your model has gone into production, you get a result at the end of the day saying that for these records you had correct predictions and for this number of records you had wrong ones? How do you go about retraining, or incrementally retraining, your model after it has gone out into production?
Yeah, that is an excellent question, and it was one of the key things I discussed in what I guess they called a three-hour workshop; I call it a three-hour rant, because I was trying to push how important that piece is. Unfortunately, again, there is no silver bullet: you can't just deploy a model and have that feedback loop out of the box, because you don't always have data that gets relabeled in production. I'll give you a specific example. If you are automating the routing of support tickets, then the tickets will be resolved at some point, so you are getting data that is being labeled in real time and you can get that feedback in real time. In other cases, where labeling data is expensive, you may not have that benefit but you may still want that feedback loop, and then you may need to establish it manually. That would mean, say, every week, every month or once a year evaluating the performance of the model on a set of random data, perhaps over a balanced set of classes, that is labeled by hand and then compared with what the model predicts, to see how it performs. So that feedback loop should definitely be in place.
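A minimal sketch of that periodic hand-labeled evaluation; the two lists below are placeholders for the human labels and the deployed model's predictions on the same sampled records:

from sklearn.metrics import classification_report

# y_hand_labeled: labels a person assigned to a (roughly class-balanced) random sample
# of recent production records; y_pred: what the deployed model predicted for them.
y_hand_labeled = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred         = [0, 1, 0, 0, 1, 0, 1, 0]

# Per-class precision and recall surface degradation that a single accuracy number can hide.
print(classification_report(y_hand_labeled, y_pred))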
How it should be set up differs depending on the use case. There is also another kind of feedback, which is not a feedback loop on labels but feedback on the real-time performance of the metrics; as one of the things I mentioned in, I think it was the orchestration part, you may have three different models whose routing you want to optimize in real time, and that is another type of feedback. So in the API, in the SDK that we build, we actually have an endpoint called feedback that allows you to send that information back.
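As an illustrative sketch of that routing-style feedback, here is a hypothetical epsilon-greedy router in the spirit of a Seldon-style wrapper; the route and send_feedback method names and arguments are assumptions for the example, not the exact SDK API:

import random

class EpsilonGreedyRouter:
    """Hypothetical epsilon-greedy router choosing between several deployed models."""

    def __init__(self, n_models=3, epsilon=0.1):
        self.epsilon = epsilon
        self.rewards = [0.0] * n_models
        self.attempts = [0] * n_models

    def route(self, features, feature_names):
        # Explore occasionally; otherwise exploit the best-performing model so far.
        if random.random() < self.epsilon or not any(self.attempts):
            return random.randrange(len(self.attempts))
        means = [r / a if a else 0.0 for r, a in zip(self.rewards, self.attempts)]
        return means.index(max(means))

    def send_feedback(self, features, feature_names, reward, routing):
        # 'routing' identifies which model handled the request and 'reward' scores the
        # outcome (e.g. 1 for a correct prediction, 0 otherwise), sent via a feedback call.
        self.attempts[routing] += 1
        self.rewards[routing] += reward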
But the word feedback can mean many things; on those two specific points, that would be my thought.
And just one thing more: you mentioned production monitoring, and you said that a data scientist has to maintain a certain number of models in production. What would actually trigger a manual action on a particular model? What are the KPIs? Prediction accuracy is one of them, but what actually tells you that yes, something is wrong with this model and the data scientist needs to go and evaluate it from the ground up?
So I think it's not as explicit as the manual time only existing because things go wrong. The manual effort actually starts from the moment the data scientist says, my model is ready, I want to put it into production for business consumption. From that moment the data scientist has to think: maybe I need to expose a RESTful API, so he or she needs to write the code to wrap the model in a Flask server and expose the endpoints, and because those endpoints are quite custom and not standardized across the models that other data scientists put into production, he or she also needs to assess how it's performing. If something goes wrong, the data scientist needs to jump in and work out why. If it needs to be retrained, again, the data scientist retrains it. So it's a lot of little things that require not just manual input but continuous thinking, because the responsibility for the model, beyond the point where it is ready, still falls on the data scientist. What we are pushing towards is that, once a model is done, it should become similar to a microservice to a certain extent: as a software engineer you still have to jump in and debug it, but once the model is ready it largely becomes a sysadmin or DevOps challenge. Then you can have hundreds of models under the same metrics, rather than individual people assessing their own things in production. You have the same thing in software engineering when you deploy microservices: you want to avoid that and standardize it.
Thank you. We have two minutes left for questions. Do we have more questions for our speaker today?
Yes. Do you have generic components to ensure that the confidence levels output by the models are calibrated in one way or another?
So we don't have a standardized metric per se, but you are able to expose custom metrics. What is standardized is the way these metrics are collected: they are exposed through a metrics endpoint, collected by Prometheus and then consumed by Grafana. From there it is very easy to set thresholds and get notified, so it is possible to set a threshold on any such accuracy metric. But then again, what 90% accuracy means varies from use case to use case, and accuracy is often not the right measure, because sometimes a false positive has more impact than a false negative. So what we try to standardize is how the metrics come out and how they can be evaluated, as opposed to which metrics should be evaluated, if that makes sense.
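As a rough sketch of exposing such a custom metric from a model wrapper, loosely following the Seldon Core Python wrapper convention of a metrics() method; the exact method signatures and dictionary keys are assumptions and may differ between versions:

class MyModel:
    def __init__(self):
        # Hypothetical value; in practice it would be updated from feedback or evaluation.
        self.running_accuracy = 1.0

    def predict(self, X, feature_names=None):
        # Call the underlying model here; returning X unchanged is a stand-in for the example.
        return X

    def metrics(self):
        # Exposed on the wrapper's metrics endpoint and scraped by Prometheus; a Grafana
        # alert can then fire when the gauge drops below a chosen threshold.
        return [{"type": "GAUGE", "key": "running_accuracy", "value": self.running_accuracy}]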
Yeah. My question was more specifically on, for instance, classification, as in the example you gave. The model can output: I'm confident there is an 80% chance this is negative. But maybe it actually isn't: if you take 100 predictions and bin them by confidence level, you could see that the fraction of negatives in each bin does not actually reflect the confidence levels output by the model, and depending on the model you use you might have different calibration issues. I was wondering whether calibration is something like a generic tool you could put in your pipeline, whether it is something requested by users, how to leverage calibration, or whether it is maybe not addressed yet. That's it.
No, no, I think calibration is definitely one of the important things. We do have some open source work that exposes not only things like the multi-armed bandit but also techniques such as outlier detection that you can use. We don't have a generic calibration piece, but that is not because there is no demand, it's just that we don't have enough hands. So we would love that: again, it's open source, open an issue, and if we get enough thumbs up we will definitely prioritize it. We also have a bunch of examples, and we would love another Jupyter notebook example showcasing how you would do that. But that is definitely a good point and a very interesting area in this space.
Awesome, thank you.
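The binning the questioner describes is essentially a reliability diagram; a minimal scikit-learn sketch, where the two lists stand in for held-out labels and the model's predicted positive-class probabilities:

from sklearn.calibration import calibration_curve

# Placeholder held-out labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1]
y_prob = [0.10, 0.30, 0.70, 0.80, 0.40, 0.90, 0.60, 0.20, 0.75, 0.95]

# Fraction of actual positives per confidence bin versus the mean predicted probability;
# a well-calibrated model lies close to the diagonal y = x.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)
for observed, predicted in zip(prob_true, prob_pred):
    print(f"predicted ~{predicted:.2f} -> observed {observed:.2f}")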
We have time maybe for one last question. Okay, if we don't have any further questions, let's have a very warm round of applause for Alejandro. We'll start.