The state of Machine Learning Operations in 2019


Formal Metadata

Title
The state of Machine Learning Operations in 2019
Subtitle
This talk will cover the tools & frameworks in 2019 to productionize machine learning models
Title of Series
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
2019
Language
English

Content Metadata

Subject Area
Abstract
This talk will provide an overview of the key challenges and trends in the productization of machine learning systems, including concepts such as reproducibility, explainability and orchestration. The talk will also provide a high-level overview of several key open source tools and frameworks available to tackle these issues, which have been identified while putting together the Awesome Machine Learning Operations list (https://github.com/EthicalML/awesome-machine-learning-operations). The key concepts that will be covered are: * Reproducibility * Explainability * Orchestration of models. The reproducibility piece will cover key motivations as well as practical requirements for model versioning, together with tools that allow data scientists to achieve version control of model + config + data to ensure full model lineage. The explainability piece will contain a high-level overview of why this has become an important topic in machine learning, including the high-profile incidents that tech companies have experienced where undesired biases have slipped into data, as well as a high-level overview of some of the tools available. Finally, the orchestration piece will cover some of the fundamental challenges with large-scale serving of models, together with some of the key tools that are available to ensure this challenge can be tackled.
Keywords: Architecture, Data, Data Science, Deep Learning, Machine-Learning
All right, I think we're going to get started, so thank you very much to everybody for coming. I'm quite excited today to give you an insight into the state of production machine learning in 2019. This talk is going to be a high-level overview of the ecosystem, and it's going to dive into three key areas — the ones I personally focus on the most.

To tell you a bit more about myself: I am currently the Chief Scientist at the Institute for Ethical AI and Machine Learning, and also Engineering Director at an open-source startup called Seldon Technologies, based in London. With the Institute I focus primarily on creating standards as well as open-source frameworks that ensure people have the right tools and infrastructure to align with the ethical principles and industry standards that are coming out. It basically asks the question: what is the infrastructure required so that reality matches expectation? If there's a regulation like GDPR that demands a right to explainability, it's really questioning what that means at an infrastructure level and what would be required to even enforce it. Day to day, I lead the machine learning engineering department at Seldon. Seldon is an open-source machine learning orchestration library: you would use Seldon if you want to deploy models in Kubernetes and manage hundreds or thousands of models in production, and some of the examples I'm going to dive into actually use some of our open-source tools. You can find the slides, as well as everything we're using, at the link in the top right corner — the link is going to stay there, so don't rush to take a picture.

So let's get started. Data science projects, small or large, tend to boil down to two steps: the first is model development, the second is model serving. In the first, the standard steps are getting some data, cleaning the data based on some knowledge, defining features to transform the data, then selecting a set of models with hyperparameters and, with your scoring metrics, iterating many, many times until you're happy. Once you're happy with the results of the model you've built, you persist that model and move to the next step, which is serving it in production. That's when unseen data passes through the model and you get predictions and inference on that new data. This is a very big simplification, but we're going to use it throughout the talk.

However, as your data science requirements grow, you face new issues. It's not just as simple as keeping track of the features and the different algorithms you use at every single stage. You have increasing complexity in the flow of your data: perhaps you had a few cron jobs running the models you pushed to production, and now that you have quite a few, you end up in cron job hell — I don't know who uses that colour palette for their terminal, but each data scientist has their own set of tools. Some love TensorFlow, some love Spark, you name it, and good luck trying to take them away, not just because they really like them but also because some are more useful for certain jobs than others. So you're going to have a lot of different things to put into production. Serving models also becomes increasingly harder: you have multiple stages that each have their own complexities — building models, hyperparameter tuning — each of those becomes one big theme on its own. And when stuff goes wrong, it's hard to trace back: if something goes bad in production, is it because of the data engineering piece, the data scientist, or the software engineer? You always have that Spider-Man meme of everyone pointing fingers at each other.

So what we boil down to is that as your technical functions grow, so should your infrastructure, and this is what we refer to today as machine learning operations, or just production machine learning. It's that layer that involves model and data versioning and orchestration — and really it's not just those two things. The reason this is challenging is that we are now seeing an intersection of multiple roles — software engineers, data scientists and DevOps engineers — which are condensing into this role of machine learning engineer. The definition of this role in itself is quite complex, because it requires expertise across all of those areas, and you see that when you look at a job description: these AI startups are hiring for a PhD with ten years of experience in software development, maybe three years of McKinsey-style consulting experience, for the salary of an intern. That's basically what you see a lot of the time.
The reason why it is challenging is that we're now seeing things like data science at scale, and the requirements you would normally follow in the data science world also need to apply, to a certain extent, in the software engineering and DevOps world. When I say it's challenging, it's because it actually breaks down into a lot of concepts, and we've broken the ecosystem down into an open-source "awesome production machine learning" list, which we would love for you to contribute to if you see a tool that is missing — it's one of the most extensive lists specifically focused on production machine learning tools. Just the explainability piece alone has an insane number of open-source libraries.

The ones we're going to dive into today — not saying the rest are not as important, but these are the ones I myself work on most on a day-to-day basis — are orchestration, explainability and reproducibility. For each of these principles we're going to cover the conceptual definition of what they mean, together with a hands-on example showcasing ways you can address the challenge, as well as a few shout-outs to other libraries that are available for you to check out.

So, to get started: model orchestration. This is basically training and serving models at scale, and it's a challenging problem because, in a very conceptual manner, you are dealing with an operating-system challenge at scale. You need to allocate resources as well as computational hardware requirements; for example, if you have a model that requires a GPU, then you need to make sure the model executes where a GPU is available. It is genuinely hard, so it's important to be aware that this complexity involves not just the skill set of the data scientist but may also require sysadmin and infrastructure expertise to tackle.

It also gets hard because having something in production that deals with real-world problems dives into other areas. You have this already ambiguous role of machine learning engineering, and it's intersecting with the roles of industry domain expertise as well as policy and regulation to create centralized industry standards. That introduces the ambiguity of how you achieve compliance and governance with the models you deploy in production. That's the very high level, but for some of the DevOps engineers the question may be about standardization of metrics: if you're in a large organization you may have to abide by certain SLAs, and with microservices those SLAs are quite standard — uptime, maybe latency — but when it comes to machine learning you may have metrics like accuracy to abide by, and things you need to be aware of like model divergence. Of course you could write the code required for every single one of your deployments, but to a certain extent it is necessary to standardize and abstract these concepts at an infrastructure level, and that's what we're going to dive into to some degree today.

And it's not only metrics. As with any microservice or web app you run in production, it's also logs and errors. If you have an error with a machine learning model, the error may not just be a Python exception: it may be an error because the new training data was biased towards a specific class — a class imbalance, with more examples in one class than the other — which can lead to failures that are not Python exceptions. You may not get notified because something raised, but you may still see things failing because of it. It's also about how you standardize what comes in and out of the models, and how you track this.
And, for example, if you have images coming into a model, you can't just open your logging dashboard and look at a binary dump of the data, so it's really about understanding what to log in those cases.

When you deal with machine learning in production you also dive into complex deployment strategies. You may imagine just putting a text classifier in production, but perhaps you want to reuse components, or maybe you want a more complex computational graph with routing based on conditional cases, or multi-armed bandit optimizers with different models at the end, or things like explanations — we'll dive into that, but explanations are a big topic in the machine learning space and you may want them in production so that your domain experts can make sense of what's currently deployed. Again, you could do all of this custom for every single thing, but the reason you wouldn't want to is that if you have manual work for every single model, you end up with each data scientist being able to maintain a maximum of, say, ten models in production at any one time. If you want to deploy more models, you end up having to hire more staff — and you want to avoid that linear growth of your technical resources with your internal staff.

This is where the concept of GitOps comes in: you use your GitHub repo, or your version control system, as your single source of truth, and whatever gets updated there is reflected in what you have in production. This may not be limited to the code of your application; it may also extend to the configuration your cluster is currently following. In this case we're going to show an example where we start with a very simple model. We take a very common dataset that you've probably used in a tutorial, the income classification dataset: people's details — number of working hours, working class, et cetera — and we train a machine learning model to predict whether that person earns more or less than 50k. In this example we'll assume we're using this model for approving someone's loan: if it predicts more than 50k the loan is approved, otherwise it's rejected. I don't recommend anyone do this in production — it's just an example. We're going to wrap this Python model, deploy it, and see how we can get standardized metrics, standardized logging, and so on.

All of these examples are open source and available at the link, so you can go and try them yourself; within this hour we're only going to cover them from a high-level perspective. In this first part of the example we're going to create a Python model, wrap it, and deploy it in a Kubernetes cluster: it's containerized with Docker and exposes its internal functionality through a RESTful API. The way we do it is we set up our environment, which basically requires a Kubernetes cluster running — I'm not going to trust the internet to help us with that today, so I already have everything set up in tabs, just in case. We download the dataset, which contains applications of people and whether they get approved or rejected, and we do a train/test split as you normally would. If we have a look at the data, we have a normalized dataset where you have the age of the people first, then the rest of the features, and we can print the feature names to see the order in which we have them: age, working class, education, et cetera.

The first thing we're going to do is use scikit-learn — just to get a bit of an understanding, who here has used scikit-learn? Let's see a show of hands. Perfect, awesome. What we're doing here is building a pipeline: we standard-scale our numeric data points, create a one-hot encoding of our categorical data points, and transform the data with that. Now that we've fitted our preprocessor, we train a random forest classifier on the pre-processed data so that it predicts whether a person's loan would be approved or rejected. Once we've trained the model we can use the test dataset to see how it performs: in terms of accuracy it gets about 85%, and you can see precision, recall and so on. So now we have a trained model: clf is our random forest classifier, and preprocessor is our pipeline with the standard scaler and the one-hot vectorizer. What we do next is take this model and containerize it. For that we first dump the two artefacts we've created — the preprocessor and the classifier — into a folder, and we can check the contents to confirm they're there. Once we have those two trained artefacts, we create a wrapper, and this wrapper just has a predict function, which will be exposed through a RESTful API: whatever input comes in, we pass it through the preprocessor and then through the classifier, and we return the prediction.
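To make that training step concrete, here is a minimal, self-contained sketch. It uses a tiny synthetic stand-in for the income dataset rather than the actual data from the talk, and all column names are purely illustrative:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the income/loan dataset used in the talk
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours_per_week": rng.integers(10, 60, n),
    "workclass": rng.choice(["private", "public", "self_employed"], n),
    "marital_status": rng.choice(["married", "separated", "single"], n),
})
# Proxy label standing in for "earns more than 50k"
y = (X["age"] * X["hours_per_week"] + rng.normal(0, 200, n) > 1500).astype(int)

num_cols = ["age", "hours_per_week"]
cat_cols = ["workclass", "marital_status"]

# Preprocessor: standard-scale numeric columns, one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

clf = RandomForestClassifier(n_estimators=100, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(preprocessor.fit_transform(X_train), y_train)
print("test accuracy:", clf.score(preprocessor.transform(X_test), y_test))

# Persist both artefacts so a serving wrapper can load them later
joblib.dump(preprocessor, "preprocessor.joblib")
joblib.dump(clf, "classifier.joblib")
```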
Right, so this is very simple: we load the models, run whatever is passed through this predict function, and return the predictions. This wrapper is basically the interface we require in order to containerize it. For the containerization we just need to define any dependencies — in this case just scikit-learn; we don't actually need the image library here — then define the name of our file and run the s2i CLI tool, which takes our standard builder image and wraps this model file, exposing it through a RESTful API and a gRPC API. Just to get an understanding of the room, who here has used Docker before? Perfect, great. So here you just have a Docker image called loan-classifier 0.1; when you run it, the entry command runs a Flask API that exposes the predict endpoint, and whatever you send to that predict endpoint is passed through your wrapper. Once we have that, we specify our Kubernetes definition file, which just says that the container we're going to run is this loan classifier, and our computational graph in this case has a single element, the loan classifier. That's all you need: once you define that and it's built, you can deploy it.
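A minimal sketch of such a wrapper, in the style of the Seldon Core Python wrapper described above. The class and file names are illustrative, and the exact contract expected by the s2i builder should be checked against the Seldon documentation:

```python
import joblib
import pandas as pd


class LoanClassifier:
    """Thin serving wrapper: loads the persisted artefacts and exposes predict().

    When wrapped with the standard builder image, this predict method is what
    gets exposed over REST/gRPC by the generated microservice.
    """

    def __init__(self):
        # Load the artefacts dumped at training time
        self.preprocessor = joblib.load("preprocessor.joblib")
        self.clf = joblib.load("classifier.joblib")

    def predict(self, X, features_names=None):
        # X arrives as a 2-D array of feature rows; rebuild a DataFrame if the
        # preprocessor was fitted on named columns, then return class probabilities
        if features_names:
            X = pd.DataFrame(X, columns=features_names)
        return self.clf.predict_proba(self.preprocessor.transform(X))
```

The s2i step the talk mentions essentially points the standard builder image at a file like this and produces the Docker image that serves it.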
Here you can see it being created in the local Kubernetes cluster — I think it's still downloading the image, which is not great — but basically what you then see is that this model is deployed in our Kubernetes cluster and listening for requests, as if it were a microservice. As with any other RESTful endpoint, we can interact with it, in this case with curl: we send it one instance to perform an inference, and the response is an array with the positive and negative labels — in this case it predicted the negative label. So what we've done is wrap a model with a very simple, thin-layer wrapper and put it in production. The wrapper also exposes a metrics endpoint; for those who have used Prometheus or Grafana in the past, you can hook Prometheus up to this metrics endpoint and get some metrics out of the box. Let's see if I can show it here.
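For reference, the curl call just described amounts to something like the following in Python. The URL here is hypothetical — it depends on your ingress, namespace and deployment name — and the payload follows Seldon's ndarray convention; check the deployment's own docs for the exact path:

```python
import requests

# Hypothetical endpoint: assumes the deployment is reachable locally, e.g. via
# a port-forward; the real URL depends on ingress, namespace and deployment name.
URL = "http://localhost:8003/seldon/default/loan-classifier/api/v0.1/predictions"

# One applicant encoded in the same feature order used at training time
payload = {"data": {"ndarray": [[27, 40, "private", "separated"]]}}

response = requests.post(URL, json=payload)
print(response.json())  # e.g. probabilities for the rejected/approved classes
```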
This is our income classifier that we have deployed, and out of the box — this is a Grafana dashboard — you get all of the requests per second, the latency for that specific container, et cetera. We'll dive a bit more into some of the metrics in a moment. You also get the logs: this is just the output of the container, collected with a Fluentd server and stored in an Elasticsearch database. For those who have used Kibana in the past, this is basically querying Elasticsearch for the logs, and for tabular data this is what we expose out of the box.
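The same metrics endpoint can also carry custom metrics. A sketch of the metrics() hook supported by the Seldon Python wrapper follows — the exact schema may differ between versions, and the counter name here is illustrative:

```python
import joblib


class LoanClassifier:
    """Same wrapper as before, extended with a custom-metrics hook."""

    def __init__(self):
        self.preprocessor = joblib.load("preprocessor.joblib")
        self.clf = joblib.load("classifier.joblib")

    def predict(self, X, features_names=None):
        return self.clf.predict_proba(self.preprocessor.transform(X))

    def metrics(self):
        # Returned alongside each request; the wrapper surfaces these on its
        # metrics endpoint so Prometheus can scrape them and Grafana can chart them.
        return [
            {"type": "COUNTER", "key": "loan_requests_total", "value": 1},
        ]
```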
That is an initial overview of the orchestration piece. The benefit of containerizing your models is obvious in terms of making them available for business consumption, but the core thing is the push towards standardization: if you had a hundred models in production, you would be able to interact with them as if they were microservices. We have just covered a very simple example, but what this really allows you to do is leverage the GitOps structure I was talking about earlier. Just to see — who here is familiar with PyTorch, and with PyTorch Hub? OK. PyTorch Hub is a new initiative from PyTorch where they encourage people to share trained models like BERT or VGG: you can submit your models to a Git repo, and that gives you a central, standardized interface to already-trained models. You define any model — in this case it's ResNet — and you say how to load it and where the trained binary is located. It's an initiative from PyTorch Hub, and what we have been able to do is create an integration with PyTorch Hub where, any time you point a new deployment configuration at a repo, a very thin-layer wrapper just downloads that model, because the code to load it is standardized by the deployment itself. To be more specific, the way we do it is with a wrapper that takes the repo and the model name as input parameters, which you can pass through the config files, and when it loads it downloads the model from PyTorch Hub. So you get the ability to dynamically publish any BERT- or VGG-like model — anyone who has tried using BERT or one of those state-of-the-art models knows the pain of setting them up — so there is a lot of benefit in standardizing not only the way to define them but also the way to deploy them.
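For reference, this is roughly what consuming a PyTorch Hub entry looks like on the client side — which is the kind of call a thin serving wrapper like the one described can make at load time. The repo and model names are just the standard torchvision example, not the integration from the talk:

```python
import torch

# List the models a hub repo publishes via its hubconf.py
print(torch.hub.list("pytorch/vision"))

# Download and instantiate a pretrained model directly from the repo.
# (Newer torchvision releases prefer a `weights=` argument over `pretrained=`.)
model = torch.hub.load("pytorch/vision", "resnet18", pretrained=True)
model.eval()
```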
Again, you can jump in and try these examples yourself. That is basically a high-level overview of the orchestration part. Before we jump into the explainability piece, some other libraries to watch: one of them is MLeap Serving — their approach is a single server that loads standardized serializations of models, so if anyone is familiar with the ONNX-style serializable definition of models, you'd be able to have a single server that loads your trained binaries and exposes them through an API. Another one to watch is DeepDetect, which unifies a lot of Python-based models behind a standardized API. These are two of a large number of libraries to check out; I'd definitely advise you to have a look at the entire list, it's quite extensive.

So, the second piece: explainability. Explainability tackles the problem of black-box and white-box model situations where you have a trained model and you want to understand why the model predicted whatever it predicted. The way we tackle it requires the people involved to go beyond the algorithms, because this is not just an algorithmic challenge: it takes a lot of domain expertise into account. The way we emphasize this is that interpretability does not equal explainability — you may be able to interpret something, but that doesn't mean you understand it. In terms of the English definitions of those words that conceptual distinction isn't really there, but we push that way of thinking about it because it's not just the data scientist addressing these challenges; it may also require the DevOps or software engineer, and the domain expert, to understand how the model is behaving. We actually did a three-and-a-half-hour tutorial at the O'Reilly AI conference — each of these things could be covered in an insane amount of detail, but for the sake of simplicity today we'll do a high-level pass over the standard process we often suggest following. It extends the existing data science workflow we showed previously and adds three new steps — not really new, but three steps that are explicitly outlined for explainability: data analysis, model evaluation and production monitoring, the last being the one we're going to dive into today. In terms of data assessment, you would want to explore things like class imbalances, whether you're using protected features, and correlations within the data — perhaps removing a data point doesn't mean you're removing a hundred percent of the input it contributes — as well as data representability: how do you make sure your training data is as close as possible to your production data? That's a very well-known problem. The second step is model evaluation: asking what techniques you can use to evaluate your models — things like feature importance, whether you're using black-box or white-box techniques, local or global methods, and whether you can bring domain knowledge into your models.
This is important because what your models are doing is learning hidden patterns in your data, but if you can give those patterns upfront — as features, or as combinations of your initial features that leverage some of the domain expertise — then you can have much simpler models doing the processing at the end. One of the use cases we had was automation of document analysis in NLP, where we were able to leverage a lot of the domain expertise of lawyers by asking meta-learning questions like "how do you know this answer is correct?" or "what is the process you go through to find it?" Things like that allow you to build smarter algorithms, not just in the machine learning models but in the features as well. And then the most important step is production monitoring: how do you take the constraints you introduced in your experimentation and enforce them in production? If you think precision is the most important metric, and you should not exceed a certain rate of false positives or false negatives, then you need something in production that allows you to enforce and monitor that — evaluation of metrics, manual human review (not forgetting that you can leverage humans in the loop; with machine learning you definitely can).

The cool thing is that with the push towards the Kubernetes world, we're able to turn these deployment strategies — things like explainers — into design patterns. Instead of just having a machine learning model in production, you can have deployment strategies where another model is deployed whose responsibility is to explain, and in a way reverse-engineer, your initial model. This may get a little bit "Inception", but it's a pattern that has been seen to be quite effective and that a lot of organizations are starting to adopt, which we named the explainer pattern — not very original, but this is what we're going to do now. We already have our model deployed in production, predicting whether someone's loan should be approved or rejected, and, assuming this is a black-box model, we're now going to deploy an explainer that explains why our first model behaves the way it does, using the same example we were leveraging.

So we have our initial model in production and we can reach it through this URL. What we're going to do now is leverage an explainability library — there are actually many, but this is one that we maintain, called Alibi. It offers three main approaches to black-box model explanations. The first one is anchors: anchors answer the question, of the features you sent to your model for inference, which ones influenced the prediction the most? The way it does this is by going through the features, replacing a feature with a neutral value, and seeing which one affects the output the most. That's anchors, and that's what we're going to use. Another very interesting one is counterfactuals, which are conceptually the opposite of anchors: they ask what the minimum changes are that you can make to the input so that the prediction becomes different from what it was. If you were approving someone's loan, the question would be: what changes to this input would make the loan be rejected? With MNIST, for example, you can ask what minimum changes would make a four no longer a four — or, more interestingly, go from one class to another: what is the minimum change to this four to make it a nine?

So we're going to run anchors on our dataset. Here we're just using our Seldon client to get a prediction — we're literally sending a request, and the response is the same as with curl. Then we create an explainer using Alibi and the AnchorTabular explainer. For this we take the classifier we trained — that random forest predictor — expose its predict function, and feed it into our AnchorTabular, because it's going to interact with the model as if it were a black box: it only interacts with the inputs and outputs. You only need the training data when using tabular data, not when using text or images; the reason is that with tabular data you need to decide what neutral values to use for the replacements — for numeric datasets you take the minimum and maximum and use, say, the quartiles. That's the only reason you use the training data. So we fit it, then look at the input we're going to send — somebody of age 27, which we predicted as negative — and we ask for an explanation. It says that what makes this prediction what it is are the features marital status of "separated" and gender of "female"; that's the explanation for this instance. Where it starts to get interesting is that we can now use our local explainer on the model we already deployed: the predict function we had is swapped for a remote model, so we're actually sending the requests to the model currently running in our Kubernetes cluster, and when we request the explanation we get the same thing — the only difference is that we're now reaching the model in production. Then we follow the same steps: we containerize the explainer and put it in production. Again we create a wrapper; the wrapper has a predict function that takes the input, runs the explain step, and returns the explanation. So what we now have in production is an explainer — our loan classifier explainer — as well as our initial model. What's interesting is that you can send one of these components a request to do an inference, and send another request to explain that inference, by interacting with that model in production.
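As a concrete illustration of the anchors workflow just described, here is a minimal sketch using Alibi's AnchorTabular on a small, all-numeric dataset instead of the loan model. Attribute names on the returned explanation object can vary slightly between Alibi releases:

```python
from alibi.explainers import AnchorTabular
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# A small stand-in model (the talk uses the income/loan classifier instead)
data = load_iris()
X, y = data.data, data.target
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# The explainer only needs a black-box predict function: inputs in, labels out.
# For a deployed model, this function could instead POST to its REST endpoint.
predict_fn = lambda x: clf.predict(x)

explainer = AnchorTabular(predict_fn, feature_names=data.feature_names)
explainer.fit(X, disc_perc=(25, 50, 75))   # quartiles used to perturb numeric features

explanation = explainer.explain(X[0], threshold=0.95)
print(explanation.anchor)      # the feature conditions that "anchor" the prediction
print(explanation.precision)   # (older Alibi releases return a dict with a 'names' key)
```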
We can visualize it here. If you remember our income classifier, if we have a look at the logs, these are all the predictions that have gone through the model as requests. What we can do now is take one of these and send a request for the explainer to explain what's going on — and look, this is exactly the same thing you just saw, only flashy, shiny and colourful. It says that for that explanation you still have marital status "separated" influencing the prediction by this much, gender "female" by this much, capital gain by this much, and you can also see which predictions are similar or different. In essence you're getting the same insights as if you were using it locally, but you're also getting those standardized metrics: the explainer also has its metrics exposed, its logs exposed, et cetera, so you get that benefit. That's basically the example for the explainers.

Now we're going to go one level deeper, but before that I want to give a few libraries to watch in the model explanation world. These are ELI5 ("explain like I'm five"), a very cool project that covers a lot of different techniques; SHAP, which you've probably come across if you're in this space or have looked at model explanations; and XAI, one that we focus specifically on data with — techniques for class imbalance, et cetera. And again, as I mentioned, there are tons: with black-box model explanations you can dive into so many different libraries. It's a very exciting field, so I do recommend having a look.
Now for the last part: reproducibility. Reproducibility answers the question of how you keep the state of your model with the full lineage of data as well as components, and it really breaks down into the abstraction of its constituent steps. For every single part of your machine learning pipeline you have a piece of code, configuration and input data, and for each of those you may want to freeze that as an atomic step — perhaps because you want to debug something in production, or, for compliance, have audit trails of what happened, when it happened and what you had in there. The reason it's hard is that the challenge is not only in an individual step; it extends to your entire pipeline. Each of the reusable components in your pipeline may require that level of standardization — you saw it with the configuration in the previous example, where we had a graph definition with multiple components, which are Docker containers, containerized pieces of your atomic steps. One thing is being able to keep those atomic steps; another is keeping an understanding of the metadata of the artifacts within each of those steps, because metadata management is hard, and now we're getting to the point where it's not only metadata management but metadata management for machine learning at scale. It's doable; it just requires, in some areas, a new way of thinking.

What we're going to dive into here is the point we haven't covered yet: we've talked about models that are already trained, but not about the process of training models. We're currently contributors to a project called Kubeflow — I'm not sure if you've heard of it — which focuses on training and experimentation of models on Kubernetes and allows you to build reusable components. What we're going to cover in this last example is a reusable NLP pipeline in Kubeflow. More specifically, let me open it — it's this example, for which I have the Jupyter notebook; you can try it yourselves. We're going to create a pipeline out of individual components. If you've ever done NLP tasks — we're going to do, let's call it sentiment analysis — you'll find the usual steps: cleaning the text, tokenizing it, vectorizing it, and then running it through a logistic regression classifier. The first step downloads the data; we're using the Reddit hate speech dataset — from r/science, all the comments that were deleted by the mods have been compiled. What we have here are these components, and we want to create this computational graph in production using them as separate entities. The reason you want that is that maybe you want to reuse your spaCy tokenizer across other projects, and perhaps keep your own ready-made feature store where you just pick and choose different pieces — that ultimate drag-and-drop data science world. From a high-level perspective, this example consists of five repeats of wrapping models — in this case just wrapping scripts — in the same process we used previously. For example, the clean-text step is again just a wrapper, called Transformer, with a predict function that takes the text as a NumPy array — or, in this case, the actual TF-IDF vectorizer — runs the vectorization, and returns the vectorized output. In terms of the interface to it, it's just a CLI. Once we have these components we can define our pipeline and upload it to Kubeflow, which then looks like this: all of the steps with all the dependencies, the only difference being that a volume is attached to each component to pass the data from one container to the next. The interesting thing is that you can create experiments through the front end: you choose which parameters to expose — here I can change the number of TF-IDF features, et cetera — then just run the pipeline, see your experiments, and see which ones have run.
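A rough sketch of what such a pipeline definition can look like follows. It uses the older Kubeflow Pipelines v1 "ContainerOp" style that matches the 2019-era SDK (newer kfp releases use a different component API), and all image names, paths and arguments are illustrative rather than the ones from the talk's notebook:

```python
import kfp
import kfp.dsl as dsl


@dsl.pipeline(name="nlp-pipeline", description="Clean, tokenize, vectorize, classify")
def nlp_pipeline(tfidf_max_features: int = 10000):
    # Shared volume so each containerized step can pass data to the next
    vop = dsl.VolumeOp(name="workspace", resource_name="nlp-pvc", size="1Gi")

    clean = dsl.ContainerOp(
        name="clean-text",
        image="myorg/clean-text:0.1",            # illustrative image names
        arguments=["--in", "/mnt/raw.csv", "--out", "/mnt/clean.csv"],
        pvolumes={"/mnt": vop.volume},
    )
    tokenize = dsl.ContainerOp(
        name="spacy-tokenizer",
        image="myorg/spacy-tokenizer:0.1",
        arguments=["--in", "/mnt/clean.csv", "--out", "/mnt/tokens.csv"],
        pvolumes={"/mnt": clean.pvolume},
    )
    vectorize = dsl.ContainerOp(
        name="tfidf-vectorizer",
        image="myorg/tfidf:0.1",
        arguments=["--in", "/mnt/tokens.csv", "--out", "/mnt/features.npz",
                   "--max-features", tfidf_max_features],
        pvolumes={"/mnt": tokenize.pvolume},
    )
    dsl.ContainerOp(
        name="lr-classifier",
        image="myorg/lr-train:0.1",
        arguments=["--in", "/mnt/features.npz", "--out", "/mnt/model.joblib"],
        pvolumes={"/mnt": vectorize.pvolume},
    )


if __name__ == "__main__":
    # Compile to a pipeline package that can be uploaded through the Kubeflow UI
    kfp.compiler.Compiler().compile(nlp_pipeline, "nlp_pipeline.yaml")
```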
For each of the steps you can see the input and output as you print them from each component; here we can see the text coming in and the tokens coming out on the other side. The last step is a deploy, which puts it back into production, listening for requests. For this specific demo you can see the deployed model here — it's an NLP Kubeflow pipeline — and you can see live requests going through each of the components: the clean text, the spaCy tokenizer, the vectorizer, and so on. What we're sending it live — and this is actually quite funny — is all the tweets related to Brexit. Do you know what Brexit is? Yeah. So it's doing hate speech classification, and the funny thing is that it doesn't matter which side you're on, there's a lot of hate. Here we have these nice-looking logs, but as I mentioned you can also jump into Kibana and see the raw tweets — I don't want to read them out loud because some of them are not very appropriate. So now we have this production Brexit classifier that can be trained with different datasets and exchanged automatically through this step.

The objective here is to show the complexities of this reproducibility piece and how different tools are trying to tackle it. This dives more into the experimentation and training part, and I haven't even touched on the complexity of tracking metrics as you run experiments — I ran ten iterations of the model, I want to know which performed better, how do I keep track of my metrics as well as the models I used? Each of the things I've covered has so many dimensions from which to tackle it, and we have talks online where we spend an hour or an hour and a half on just one of them — today was more of a high-level overview. Other libraries to watch: Data Version Control (DVC), which is basically a Git-like CLI that lets you run the usual commit/push workflows but across those three components of code, configuration and data; MLflow from Databricks, which focuses on experiment tracking and for which we have some integration examples; and Pachyderm, which dives into full compliance. As you can see, this ecosystem is very broad, but at the same time super interesting.

So I'm going to wrap up and jump into questions, in case anyone has questions on this or any other libraries, but before that, a few closing words. We covered three of the key areas I've been focusing on: orchestration, explainability and reproducibility. As I mentioned, the content is insanely broad; things I haven't talked about, which are also insanely interesting, include adversarial robustness — as you saw, some of our explainability techniques explain through adversarial-style attacks, sort of — so it's interesting to see how much overlap there is across these areas, and not only overlap but different levels at which some fit into the other categories. Privacy is another super interesting one we haven't covered, which dives into privacy-preserving machine learning, and there's also storage, serialization, function-as-a-service, et cetera. With that, I've given a high-level overview of the state of production machine learning in 2019 — it wasn't exhaustive, even if it feels like it was. If there are questions, I'm happy to cover them now, or later at the pub. Thank you very much.
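On the experiment-tracking point mentioned above, this is roughly what logging a run with MLflow looks like — a generic sketch, not the specific integration referenced in the talk:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each run records parameters, metrics and the model artefact for later comparison
with mlflow.start_run():
    n_estimators = 100
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    clf.fit(X_train, y_train)

    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("test_accuracy", clf.score(X_test, y_test))
    mlflow.sklearn.log_model(clf, "model")
```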
[Applause]

Thank you very much for your talk. I'm actually chairing this session, so do we have questions? Please come to the microphones.

Hi, excellent talk, thank you. I was mostly inspired by this explainability idea and have two questions about it. First, let's assume we have a lot of features, and they produce a huge space of variants. It seems that when I try to explain this black box I need to iterate over all these features and all variants of these features, so there seems to be a performance issue — how can that be solved? That's the first question. And the second: some models themselves contain information about feature importance, like the random forest here. Have you compared the results from this explainer with the internal results of the model itself?

OK, those are two really good questions. The first one is: you have a lot of features, what's the computational complexity around that and how do you deal with it? The second is about comparing internal feature importance to black-box model explanations. Let's dive first into the computational challenges. That is a hundred percent correct, and with anchors as a technique we are conscious that, in order to explain black-box models as a whole, it often becomes quite expensive. The way we've been able to tackle it is by separating the way you request explanations from the way you request predictions: for explanations you may not want something real-time for every single prediction that goes through; instead you use them to dive deeper into one or a few of the inference predictions. Perhaps if something went wrong you can use explanations to debug how it performed, or if the threshold you set for accuracy was 90%, you only request explanations for the cases that fall below it when you assess them. That's one side. Interestingly enough, this week our data science team published a paper that proposes a way to deal with the computational challenges for counterfactuals specifically, and for contrastive explanations, using the concept of prototypes — using neural networks to reduce the dimensionality of the features themselves. That paper is on arXiv and you can check it out, but there's a lot of research in that space to make this more feasible without sacrificing the power of the explanations. Unfortunately there's no silver bullet, so I do acknowledge it's a challenge, but that's also why there's a benefit in leveraging white-box model explanations in situations where you can — which falls into your second question: you can leverage some of the internal structures of the models, like random forests or neural networks, using for example the weights of the networks, to explain much more easily. It's also worth mentioning that some of the explainers are themselves optimization problems — for example, we use gradient descent to find some of the explanations for the counterfactuals.
Now, for the second piece — leveraging the internal importances and seeing how the explainer performs against them — we actually haven't done benchmarks of that comparison, but it's definitely something we would be interested in. If you're interested, alibi is open source, so we would love a pull request or an issue against our documentation on that. It's a really good question.

Chair: Thank you. Do we have more questions? Please go ahead.

Question: I have two questions. First: what are your views on setting up a feedback loop — a pipeline for feedback — so that after your model has gone into production, at the end of the day you get a result saying for these records you had correct predictions and for these records you had wrong predictions? How do you go about retraining, or incrementally retraining, your model after it has gone out into production?

Answer: That is an excellent question, and it was one of the key things I discussed in what I guess they called a three-hour workshop — I call it a three-hour rant, because I was trying to push how important that piece is. Unfortunately, again, there is no silver bullet: you can't just deploy a model and get that feedback loop out of the box, because you don't always have data that is relabelled in production. A specific example: if you're automating the routing of support tickets, the tickets will be resolved at some point, so you are getting data that is effectively labelled in real time, and you could get that feedback in real time. Other times, labelling data is so expensive that you don't have that benefit, but you may still want the feedback loop, and then you may need to establish it manually. That would mean, say, every week, month or year, evaluating the performance of the model on a random, class-balanced set of data that is labelled by hand, and comparing what the model predicted with what it should have been. So that feedback loop should definitely be in place; how it is implemented differs from use case to use case. There is also another kind of feedback which is not about labelled performance but about real-time performance metrics: as I mentioned in the orchestration piece, you may have three different models and want to optimize the routing between them in real time, and that is another type of feedback. In the API, in the SDK that we build, we actually have an endpoint called feedback that allows you to send this kind of information back. The word feedback can mean many things, but on those two specific ones, those would be my thoughts.
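The manual loop described here — periodically scoring the production model against a hand-labelled, class-balanced sample — might be sketched like this; the sampling helper and the model object are hypothetical stand-ins introduced purely for illustration.

```python
# Minimal sketch of a periodic (e.g. monthly) evaluation job, assuming you can
# draw a class-balanced random sample of recent production inputs and have it
# labelled by hand. `production_model` and `fetch_balanced_sample` are
# hypothetical stand-ins, not part of any specific library.
from sklearn.metrics import accuracy_score, classification_report

def evaluate_production_model(production_model, fetch_balanced_sample):
    X_sample, y_hand_labelled = fetch_balanced_sample()      # hand-labelled ground truth
    y_predicted = production_model.predict(X_sample)         # what the model would say

    accuracy = accuracy_score(y_hand_labelled, y_predicted)
    report = classification_report(y_hand_labelled, y_predicted)  # per-class view, not just accuracy
    return accuracy, report

# A drop in accuracy between two evaluation rounds is the signal that
# retraining, or at least an investigation, is needed.
```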
Question (follow-up): Just one thing more. You mentioned production monitoring, and you said a data scientist has to maintain a certain number of models in production. What would actually trigger a manual action on a particular model — what are the KPIs? Prediction accuracy is one of them, but what would actually tell you that something is wrong and the data scientist needs to go and re-evaluate that model from the ground up?

Answer: I think it's not as explicit as a single trigger. The manual time isn't only spent when things go wrong; it actually starts from the moment the data scientist says "my model is ready, I want to put it in production for business consumption". From that moment, the data scientist has to think: maybe I need to expose a RESTful API, so he or she needs to write the code to wrap the model in a Flask server; then the endpoints need to be exposed, and those endpoints end up quite custom, not standardized across all the models that other data scientists have put in production; he or she needs to assess how it's performing; if something goes wrong, the data scientist needs to jump in and work out why; if it needs to be retrained, the data scientist needs to retrain it. So it's a lot of little things that require not just manual input but continuous thinking, because the responsibility for the model, even after it's ready, still falls on the data scientist. What we're pushing towards is that once a model is done, it should become similar to microservices to a certain extent: as a software engineer you still have to jump in and debug sometimes, but once the model is ready it largely becomes a sysadmin or DevOps challenge. Then you can have hundreds of models under the same metrics, rather than individual people each assessing their own things in production. You have the same thing in software engineering when you deploy microservices: you want to avoid that and standardize it.

Chair: Thank you. Do we have more questions for our speaker today?

Question: Yes — do you have a generic component to ensure that the confidence levels outputted by the models are calibrated in one way or another?

Answer: We don't have a standardized metric per se, but you are able to expose custom metrics. What is standardized is the way these metrics are collected: they are exposed through a metrics endpoint and collected by Prometheus, and then consumed by Grafana, where it's very easy to set thresholds and get notified. So it is possible to set thresholds on, say, a standardized accuracy metric, but then again, when you say ninety percent accuracy, that may vary from use case to use case, and accuracy is often not the most relevant measure anyway, because sometimes a false positive has more impact than a false negative. So what we try to standardize is how the metrics come out and how they can be evaluated, as opposed to which metrics should be evaluated — if that makes sense.
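A minimal sketch of what that setup can look like — a Flask wrapper that serves predictions and exposes custom metrics on a Prometheus-scrapeable endpoint — is shown below. The metric names, routes and the stand-in model are illustrative assumptions, not the actual implementation discussed in the talk.

```python
# Minimal sketch, assuming Flask, prometheus_client and scikit-learn are installed.
from flask import Flask, jsonify, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in for the real production model.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

app = Flask(__name__)
PREDICTIONS_TOTAL = Counter("predictions_total", "Number of predictions served")
PREDICTION_CONFIDENCE = Histogram("prediction_confidence", "Confidence of served predictions")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    proba = model.predict_proba([features])[0]
    confidence = float(proba.max())
    PREDICTIONS_TOTAL.inc()
    PREDICTION_CONFIDENCE.observe(confidence)
    return jsonify({"prediction": int(proba.argmax()), "confidence": confidence})

@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint; Grafana dashboards and alert
    # thresholds are then configured on top of the collected series.
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
```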
Question (follow-up): My question was more about, for instance, video classification, as in the example you gave. The model can output that it's 80% confident something is negative, but maybe that isn't borne out: if you take a hundred predictions and bin them by confidence level, you could see that the fraction of negatives in each bin doesn't actually reflect the confidence levels the model outputs, and depending on the model you use you might have different calibration issues. I was wondering whether calibration is something generic that you could put in your pipeline, whether it's something requested by users, how to leverage calibration — or maybe it's not addressed yet?

Answer: No, I think calibration is definitely one of the important things. We do have some open source work that exposes not only things like the multi-armed bandits but also techniques such as outlier detection that you can use. We don't have a generic calibration piece, but that's not because there's no demand — it's just because we don't have enough hands. We would love that: again, it's open source, so you can open an issue, and if it gets enough thumbs up we will definitely prioritize it. We also have a bunch of examples, and we'd love to have another Jupyter notebook example showcasing how you would do that. It's a very good point and a very interesting area in this space.

Chair: Thank you. We may have time for one last question... okay, if we don't have any further questions, let's have a very warm applause for Alejandro.

[Applause]
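The binning check the questioner describes — grouping predictions by confidence and comparing the mean predicted confidence in each bin with the observed fraction of positives — is essentially a reliability diagram, and scikit-learn ships the pieces for it. The sketch below uses a synthetic dataset and model purely for illustration.

```python
# Minimal sketch of checking (and one way of fixing) calibration with scikit-learn.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)

# Reliability diagram data: per confidence bin, the observed fraction of
# positives vs. the mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_test, raw.predict_proba(X_test)[:, 1], n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))

# One generic remedy: wrap the model in a calibrator (isotonic regression here)
# fitted via cross-validation, and serve the calibrated probabilities instead.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)
```

Whether a check like this lives inside the serving pipeline or as an offline notebook step is exactly the kind of design choice the question raises.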