Productionizing your ML code seamlessly

Video in TIB AV-Portal: Productionizing your ML code seamlessly

Formal Metadata

Title
Productionizing your ML code seamlessly
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date
2018
Language
English

Content Metadata

Abstract
Data science and Machine Learning are hot topics right now for Software Engineers and beyond. And there are a lot of python tools that allow you to hack together a notebook to quickly get insight on your data, or train a model to predict, or classify. Or you might have inherited some data wrangling and modeling notebook code from someone else, like the resident data scientist. The code works on test data, when you run the cells in the right order (skipping cell 22), and you believe that the insight gained from this work would be a valuable game changer. But now how do you take this experimental code into production, and keep it up-to-date with a regular retraining schedule? And what do you need to do after that, to ensure that it remains reliable and brings value in the long term? These will be the questions this talk will answer, focusing on 2 main themes: 1. What does running an ML model in production involve? 2. How to improve your development workflow to make the path to production easier? This talk will draw examples from real projects at Yelp, like migrating a pandas/sklearn classification project into production with pyspark, while aiming to give advice that is not dependent on specific frameworks, or tools, and is useful for listeners from all backgrounds.
[Applause] Hi, thank you for being here. I'm Loris, as was just said, and we're here to talk together about how to productionize your ML code seamlessly. As was quickly mentioned, I work for Yelp — and let's see if this works... yes, awesome.
Yelp is all about connecting people with great local businesses — that's its core mission. You have probably used the site,
especially here in the UK. If you search for what you need, you will find restaurants, your local plumber, a moving company to help you move your code seamlessly into production — anything that is needed. Just to give you
an idea of the scale at which Yelp works: we have a lot of reviews, a very high number of unique visitors per month, quite a big team, lots of services, lots of code — which means a lot of people working on data, so we have learned quite a few things about making our lives a bit easier when putting models into production. What's on the menu for today? First, presenting what it actually means to put a model into production — probably the least self-explanatory sentence ever made — and then I will give some tips and tricks that will actually make your life easier. Before we go into the depths of the subject, I want to take a moment so we all agree on what I'm talking about and where all of this fits. I think every ML project in Python starts in a notebook; if it didn't, probably something weird was happening. You started writing this notebook because someone gave you a data set and a question, and you needed to answer it — and that is, in a way, the core of it: you don't do machine learning just to do machine learning. You are trying to predict a desirable behaviour, you are trying to recommend, you are trying to detect — you have an objective when you are doing machine learning, and your notebook does something. Maybe your notebook is simple: your features are already all nice, you just apply a few transformations, train your model, do some feature analysis and check that your model didn't train in too crazy a manner. Or maybe you end up with pages and pages of SQL queries performing feature extraction, a complicated model, and everything stitched together somehow. It doesn't really matter, in a way, because what brought you to the stage where you think you can bring this to production is that, at the end, you had a result — something that made you think: yes, there is a problem here and I can crack it. But you cracked it once, and now you want to crack it on a regular basis: you want to retrain regularly, make sure that every day your model is not too bad, and your model produces something you want to use, so you need to deliver this something — the predictions — to the final users. And finally: hey, you wanted to achieve something right at the beginning — now that everything is built, is it still happening? So right now you are at the first step, and I will show you the probably
over-shown schematic from "Hidden Technical Debt in Machine Learning Systems", which Google presented at NIPS two or three years ago. Right now you are here, in this small black box — which I highlighted in red because it's really hard to see — and you need to interface yourself with all the rest of what your live systems are.
So, what does running an ML model in production involve? Well, it's kind of like putting any other piece of code into production — in a way there is not so much difference. You are going to interface yourself with the surrounding infrastructure, and the goal is to go from something that runs under your benevolent supervision, skipping the few cells which don't actually run correctly in your notebook, to something that runs every day, tells you when something is wrong, and keeps working probably even after you have stopped looking at it. Great. So, as you might start to guess, I am
not going to talk too much about tooling — that's really not the point of this presentation; this conference was full of great people showing you awesome tools to do everything. I'm more going to
focus on what it actually means: how you decompose the problem of putting things into production into a series of steps and questions that you should be asking yourself, and answering, to get there. Probably the answer to many of these questions is "use Airflow", but that's another topic. Great, so for
the sake of argument, and since I'm talking in generalities, let's all agree on a simplified view of what the pipeline is. First you have data sources — these can be many different things. You perform some sampling on them, maybe, or not. You extract some features that you have defined — probably your notebook told you that this set of features was working really well and that you should definitely use it. You train a model, you evaluate it, everything went well, you have a model. Rinse and repeat for prediction, and then you load the predictions you have obtained into your product. Fair enough — but why is this useful? Well, your notebook probably performs exactly all of these operations already, but what this tells you is which pieces should go together: which pieces should be one function or one Python module and be tested together, because they make sense together. A minimal sketch of this decomposition follows.
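A minimal, hypothetical sketch of that decomposition (the data, feature names and function signatures are invented for illustration and are not Yelp's actual code): each box of the simplified pipeline becomes one small function that can be imported and tested on its own instead of living in a notebook cell.

    # Hypothetical decomposition of the offline pipeline into testable units.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score


    def load_source_data() -> pd.DataFrame:
        # Stand-in for "read from S3 / Redshift / MySQL / Postgres ...".
        return pd.DataFrame({
            "n_reviews": [0, 3, 12, 1, 40, 7],
            "has_photo": [0, 1, 1, 0, 1, 0],
            "label":     [0, 0, 1, 0, 1, 1],
        })


    def extract_features(rows: pd.DataFrame) -> pd.DataFrame:
        # One feature-extraction function, shared by training and prediction.
        out = rows[["has_photo"]].copy()
        out["log_reviews"] = np.log1p(rows["n_reviews"])
        return out


    def train(features: pd.DataFrame, labels: pd.Series) -> LogisticRegression:
        return LogisticRegression().fit(features, labels)


    def evaluate(model, features, labels) -> dict:
        # Return metrics; the caller decides whether the model is good enough.
        return {"auc": roc_auc_score(labels, model.predict_proba(features)[:, 1])}


    def run_training_pipeline() -> dict:
        rows = load_source_data()
        features, labels = extract_features(rows), rows["label"]
        model = train(features, labels)
        return evaluate(model, features, labels)


    if __name__ == "__main__":
        print(run_training_pipeline())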
Cool. Let's focus on specific parts of this pipeline. The first part I want to look at is the data sources, even though I'm not going to talk too much about them: it can be S3 logs, S3, anything — Redshift, MySQL, Postgres, something I didn't think about when writing the slides. They are likely to be changing regularly — you have new data coming in on a regular basis — and how your system ingests the data is simply omitted for the purposes of this talk; I won't go into detail. You might also have noticed, especially if you have worked with models before, that this corresponds to a specific type of setup where everything happens offline: you have all the data you need at all times, offline training, offline predictions. Things with online prediction or online training would not be that different — it's mostly a matter of constraints on the data sources — so I will stay in the simple case. First things first: you need to update your model on a regular basis, and the question is what "a regular basis" means. That's up to you, in a way: how often does your data change enough that your model needs to change too? The next point is what happens when things go wrong: how do you rerun your pipeline? What happens if your data, or some of the data, is missing — do you backfill the old data? Maybe it's not worth it and you could just reuse an old model, or any other strategy you can think of. The other one is scale: how many predictions, what size is my training input, how long should this take? This is what allows you to figure out how to dimension the infrastructure that all of this runs on. We are talking about models, and we are talking about failure: very often when people write code and they think "failure", they think "I got a traceback from Python", but that is not the only way a model training can fail, and that's why you have an evaluation step, which I want to dig a bit deeper into. Some questions you could be asking yourself: does the evaluation metric I'm using actually reflect the problem I'm trying to solve? We all use things like log loss or the area under the ROC curve, and the question — to tie it back to what I was saying — is: does it solve your problem? Maybe, maybe not; think about it. Some functions have very nice mathematical properties but might not represent what you are actually trying to move. Last part: when you evaluate your model, look at which features are used, because this is where you can see whether there is a feedback loop. Your model doesn't just run on its own any more: it is generating predictions, which means it is probably affecting how your data is generated. If, over time, you see that your model relies on just one feature, maybe that one feature is just your model reproducing what it was predicting before. So beware — this is a time when you could say that the training has actually failed, because the model doesn't behave the way it should (a sketch of such an evaluation gate follows).
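To make the "a training run can fail without a Python traceback" idea concrete, here is a hedged sketch of an evaluation gate; the AUC threshold, the importance-concentration check and the function name are assumptions made up for the example, not a prescribed recipe.

    # Hypothetical evaluation gate: the training run is treated as failed when
    # the metric is below a threshold, or when one feature dominates the
    # importances, which can hint at a feedback loop. Thresholds are illustrative.
    def evaluation_gate(metrics: dict, feature_importances: dict,
                        min_auc: float = 0.75,
                        max_single_importance: float = 0.9) -> None:
        # Gate 1: the headline metric has to clear a minimum bar.
        if metrics["auc"] < min_auc:
            raise RuntimeError(f"Model rejected: AUC {metrics['auc']:.3f} < {min_auc}")

        # Gate 2: no single feature should dominate - that pattern can mean the
        # model is mostly reproducing its own earlier predictions.
        total = sum(feature_importances.values())
        if total > 0:
            top_feature, top_value = max(feature_importances.items(), key=lambda kv: kv[1])
            if top_value / total > max_single_importance:
                raise RuntimeError(
                    f"Model rejected: feature '{top_feature}' carries "
                    f"{top_value / total:.0%} of the importance - possible feedback loop"
                )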
Now let's look at the prediction side of the schema. The questions are exactly the same: what happens when things fail, how often does this run, and how many predictions should I be generating every day? The last thing, which is a bit different, is how the predictions are used in your product — and that's what I want to push on now. I have said nothing about the order in which all these things should be built; I just said "that's the whole problem, deal with it". But this is probably what you want to start with: how are the predictions used in your product? Because you have predictions — you had them once — and if you can already start using them, just to test, you can see whether you are actually successful or not. Which brings us to the last point: how are you measuring success? How do you say "I did my job, project over, thank you very much"? First, you need to track the business metric you are trying to move: go back to your initial problem and be sure that you can actually tell whether your model is doing something, and test it — confront it with reality. You might not get it right from the get-go, in which case you should test new versions against old versions, and against the status quo. And measuring success is not always easy — you could say "yeah, I'm just going to do this", and... well, story time. I work for a team called Biz Growth, and our main objective is to get people to create a special kind of account which is tied to their business, so they can manage it. One good way we found to do that is to show people a little pop-up — but if we show it to everyone, it doesn't really work. So we thought: we will start predicting which people are actually likely to be business owners, and thus likely to create this special business account, and the prediction decides whether we show them the pop-up or not. And that's what we did: we built the whole model, we predicted whether someone was potentially an owner, we showed them the pop-up, and most of the time — like 94% of the time — they would create the account immediately afterwards. It was great, we had good numbers, because we were measuring our success by whether they clicked the little button on the pop-up we were showing them and then created the account. But the total number of accounts created didn't go up at all. So we looked again, and what our model was actually doing was predicting what would happen whether we did something about it or not. We checked this by creating a holdout set of people we could have shown the little pop-up to, but didn't, just to test that our model was behaving correctly — and they were creating almost as many accounts as the people we were showing it to, just a bit fewer. So we were still happy, it was still working, still worth investing in — but it's really easy to get this wrong, and it's really worth spending some time thinking it through (a sketch of that kind of holdout follows).
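A hedged sketch of the kind of holdout described in the story: keep a deterministic slice of users for whom the prediction is never acted on, so that account-creation rates with and without the pop-up can be compared. The 10% split, the hashing scheme and the probability threshold are assumptions, not Yelp's actual setup.

    # Assign each user to a stable holdout group so the model's predictions are
    # never acted on for them; comparing outcomes between groups measures the
    # real lift of showing the pop-up, not just the model's ability to predict.
    import hashlib

    HOLDOUT_FRACTION = 0.10  # illustrative


    def in_holdout(user_id: str) -> bool:
        digest = hashlib.sha256(user_id.encode()).hexdigest()
        return int(digest, 16) % 100 < HOLDOUT_FRACTION * 100


    def should_show_popup(user_id: str, predicted_owner_probability: float) -> bool:
        if in_holdout(user_id):
            return False  # never intervene; this group is the baseline
        return predicted_owner_probability > 0.8  # illustrative threshold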
With that story I have already transitioned smoothly into the tips and tricks. Let's start with general advice. You might feel like you have seen this before in talks about putting a service into production — and for good reason: as I was saying, ML code is code. Use containers — Docker, Kubernetes, whatever you want — containers are great, and virtual environments are even more awesome. Spend some time persisting your work, whether with version control or otherwise: persist everything you do, don't lose the logs, keep anything you produce — you might want to be able to look back at it in two weeks or two months, maybe not in a year, so you might want to attach a TTL to it. The last two points are maybe a bit less common. First, use production technology from the get-go. Story time again: one of our data scientists made a lot of effort trying to figure out how search and lots of other pages were related. It was all done in Redshift, and it only worked in Redshift because Redshift didn't have the whole data set — and then I had to translate a thousand lines of SQL queries into Spark, which took two or three months. That is lost time for everyone, and Spark is probably even easier to use than SQL. Production technologies are widely available, and even if there is an extra cost, it actually saves development time most of the time. The last part: if you are working in a company that is doing ML models, there are probably already a lot of things happening — you probably already have software that runs jobs on a regular basis — so don't reinvent the wheel. Cool, now let's dig into the separate parts. My schema was not that great, because it might have led you to believe that the two feature-extraction boxes were different things: they are not. You should not have two pieces of code for feature extraction — apart from the label handling, of course. This code should be unit tested; you can use things like Hypothesis to generate a bunch of random data or to test all of your edge cases (see the sketch after this paragraph), and don't write it as SQL — that makes people's lives hard. Write everything as code that can be tested, and test it. Now, on the training side, I'm going back to evaluation again — I might have said this already, but just to put emphasis on it: perform feature-importance analysis and keep the results somewhere, so you can tell when your model goes off the rails. Any piece of code you wrote during your research phase to check that your data set was right, anything that gave you confidence that your model was good — keep it, implement it properly, and run it regularly; this is what makes sure the assumptions you made about the data set stay true over time. Also, a classic: have a small sanity-check test suite with just a small set of data, just to be sure everything still runs while you're developing, so you don't break production with a push. That is bad.
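As an illustration of the Hypothesis suggestion, a small property-based test for the shared feature-extraction function; the module name, column names and invariants are made up for the example.

    # Property-based test: whatever rows Hypothesis generates, feature
    # extraction must return one feature row per input row and never produce
    # NaNs or infinities.
    import numpy as np
    import pandas as pd
    from hypothesis import given, strategies as st

    from features import extract_features  # hypothetical module holding the shared function


    @given(
        n_reviews=st.lists(st.integers(min_value=0, max_value=10_000), min_size=1, max_size=50),
        has_photo=st.booleans(),
    )
    def test_extract_features_is_total(n_reviews, has_photo):
        rows = pd.DataFrame({
            "n_reviews": n_reviews,
            "has_photo": [int(has_photo)] * len(n_reviews),
        })
        features = extract_features(rows)
        assert len(features) == len(rows)
        assert not features.isna().any().any()
        assert np.isfinite(features.to_numpy()).all()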
Now to the "all the things" advice. The first one is: log all the things. You want to know what happened with your pipeline — it is not running under your supervision any more, so get it to log everything: the sampling you did, how the classes were selected. For feature extraction, log everything that happens — how many features were extracted, maybe some small statistics about the features — so that when you have a problem you can just look at your logs instead of recalculating everything. For model training, log everything that happens. Evaluation, we just talked about it: log feature importances, these kinds of things. Log, log, log. And in your product, log when you are actually using your predictions and what use is made of them. Every time you are doing something, you want logs, you want to know about it: this is how you measure things. The second one is: version all the things — that is how you keep track of change. For feature extraction, you might add or remove features; write different functions, maybe with different names. Also persist the features you extracted before the model training, in case the model training fails; same thing if you have a file written with the date, the features extracted, and which version of your feature-extraction code produced it — it is much easier to go back and understand what happened. Your model should have a version: it can be several things — just the git commit, or semantic versioning, or both, that's great. You change from logistic regression to XGBoost, bump the version; you change the hyperparameters, bump the version. There is nothing more frustrating than knowing that a model was generated and not being able to find out how it was done. Plus, if you have a version and you have logging, it is really easy to know which predictions were generated by what, especially once they are used, which makes it much easier to evaluate success and run experiments. This is basic data traceability (a minimal sketch of what this can look like follows).
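A hedged sketch of what "log everything and version everything" can look like in practice; the metadata fields, file layout and version string are illustrative, not a specific Yelp convention, and the git call assumes the pipeline runs from a checked-out repository.

    # Persist the model together with enough metadata to trace any prediction
    # back to the code and data that produced it. Field names are illustrative.
    import json
    import logging
    import pickle
    import subprocess
    from datetime import datetime, timezone

    log = logging.getLogger("training_pipeline")

    MODEL_VERSION = "2.3.0"  # bumped by hand when the algorithm or hyperparameters change


    def save_model(model, metrics: dict, feature_importances: dict, out_dir: str) -> str:
        git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        base = f"{out_dir}/model_v{MODEL_VERSION}_{stamp}"

        with open(base + ".pkl", "wb") as f:
            pickle.dump(model, f)
        with open(base + ".json", "w") as f:
            json.dump({
                "model_version": MODEL_VERSION,
                "git_sha": git_sha,
                "trained_at": stamp,
                "metrics": metrics,
                "feature_importances": feature_importances,
            }, f, indent=2)

        log.info("trained model %s (git %s): %s", MODEL_VERSION, git_sha[:8], metrics)
        return base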
So now, with all of this, we can think about some general ideas on how to monitor the pipeline as a whole.
Keeping track of the number of predictions generated — and of predictions used — is a good way to do it: it tells you which part of the system could be broken when something is off. Keep track of timings, so you can see which part of your code starts to become slower, and why — that's also very important: "it has been running for a year with no problem and slowly crept up" is very different from "it was running really well and suddenly the time it takes has doubled". Alert on errors in your pipeline code — it can be as simple as "send me an email at this address" if you don't have an alerting system set up (otherwise set one up, it's really practical). And, as the last line of defence if anything else fails, alert on the metric you are trying to move (see the sketch below): you had a purpose at the beginning, and all this ML, all this production, all this effort doesn't live in a void — it serves that purpose — so alert when the purpose is not being accomplished. Maybe everything else looks fine, but if this doesn't, there is probably a problem worth looking into. Last but not least: write runbooks. How is this system supposed to work, how did you think about it initially — and then, every time something you didn't think about actually happens (because reality is what it is), add it there, so you can remember how you solved the problem and handle failures easily, especially when it breaks at 4:00 in the morning. Now, in case this was a bit boring and you fell asleep — I'm really sorry — if you only need to remember three things, it would be these. Design for change: the main difference when you put systems into production is that the code is probably going to outlive you at the company; things just change, so adopt that mindset. Machine-learning code is code: there are thirty good years of best practices in software engineering — use them; it seems different, but it's not, and most of the good ideas are already there. And last: verify any assumption you make, because, again, things change and evolve, so you need to be sure that any assumption you have made is still verified by your systems.
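A minimal sketch of that last line of defence: daily checks on prediction counts, runtime and the business metric, with a plain e-mail standing in for whatever alerting system is available. All thresholds, addresses and the local mail relay are assumptions for illustration.

    # Daily sanity checks on the pipeline as a whole: prediction counts,
    # runtime, and the business metric itself. alert() stands in for a real
    # alerting system (email, pager, ...); thresholds are illustrative.
    import smtplib
    from email.message import EmailMessage


    def alert(subject: str, body: str) -> None:
        msg = EmailMessage()
        msg["Subject"], msg["From"], msg["To"] = subject, "pipeline@example.com", "oncall@example.com"
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
            smtp.send_message(msg)


    def check_pipeline_health(n_predictions: int, runtime_seconds: float,
                              accounts_created_today: int,
                              accounts_created_baseline: int) -> None:
        if n_predictions == 0:
            alert("No predictions generated", "The daily batch produced 0 predictions.")
        if runtime_seconds > 2 * 3600:
            alert("Pipeline slow", f"Run took {runtime_seconds / 3600:.1f}h, usually well under 2h.")
        if accounts_created_today < 0.8 * accounts_created_baseline:
            alert("Business metric dropped",
                  f"{accounts_created_today} accounts created vs baseline {accounts_created_baseline}.")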
We are hiring: if any of this was interesting for you and you want to join us at Yelp, we have offices in Hamburg, London and San Francisco — we have a stand here with the details.
End of advertisement — I thank you all very much, and I will now take questions.

Thank you for this great talk. Are there questions? Let's begin here.

Thank you for the talk. You said you weren't going to talk about tooling, but do you still have any advice? Some parts of software engineering still apply to data science, but we have some specific problems, like testing or input validation, and also versioning everything — you can't just put all your data in Git, so that's also a problem.

My advice is that we use S3 quite extensively, because storage is actually really, really cheap, and just throwing everything at S3 and figuring it out later actually works. It sounds really bad, but it works.

But to keep track of which model was trained on which data, do you just keep IDs everywhere, or do you have some other way?

I'm not sure I understand the question.

When you get a model that was trained on a given extract of data, how do you keep track that this model produced this result from this data?

Right now we have semantic versioning, so every time we change something about a model which is in production we bump its version. All the models, when they are written, are written to file names that contain the version they were generated with, and sometimes the git commit of the code that was running when this was done. Most of the time, combining the two, you can actually know what happened, and just reset your repo to an earlier version if you need to reload it or remember what was done.

Thank you very much. Thanks for the talk — any comments about regression testing after you deploy models? This is something I have faced in teams I've worked in, because regression testing is part of software development, and in machine learning the deployments are kind of non-traditional: when models change, outputs change even though inputs don't, and then QA goes "ah, everything has changed".

I don't really have a great answer, but I would say — if I can go back here — if you look regularly at the metrics you are trying to move and at what your model is supposed to do, you will know when there is a regression, because the numbers you are trying to grow are not going to move in the direction they used to. It's not regression in the same sense as with code; it's more that you look for regressions in the patterns of things. I would admit this is more problematic than with code, because it can also be things like "hey, today all the metrics are going down, what's happening, it's horrible" — "oh, today is a holiday, so actually everything is fine." What we are doing right now is that we never deploy a new model immediately: we do progressive releases. You have the current model version, which serves the bulk of the traffic; we release the new version to a small part of the traffic, and we check that everything is fine before actually switching over. That's how we solved it, and it works.
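A hedged sketch of that progressive release: a deterministic hash routes a small, stable fraction of traffic to the new model version, so metrics for the two versions can be compared before switching over. The 5% fraction and the version strings are assumptions.

    # Canary routing between model versions: a deterministic hash keeps each
    # user on the same version, so metrics for "old" and "new" can be compared.
    import hashlib

    CANARY_FRACTION = 0.05  # illustrative


    def model_version_for(user_id: str, old_version: str = "2.3.0",
                          new_version: str = "2.4.0") -> str:
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return new_version if bucket < CANARY_FRACTION * 100 else old_version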
Thank you for the interesting talk. One comment on measuring things: there is a great blog entry by Uber that talks about how they do it, actually using machine learning on the metrics. And a question on the "log everything" kind of approach: that goes very much against the way, for example, Jupyter notebooks make us write code. How do you square that, and what is your advice?

Yeah, I might have glossed over it a bit: I'm not really a fan. There are a bunch of things that allow you to productionize a Jupyter notebook directly, and I think there are several problems with that, around code quality: the structure of a Jupyter notebook is very linear and code isn't, and it's hard to write tests — I would be very interested in seeing actual, real, nice tests written in a Jupyter notebook; maybe it exists, I've not seen it. So it's a completely different beast. In a certain sense you are going to check in your code for real, and it will stop being an experimental thing: it becomes a git repo that runs and is deployed. Maybe there is a way to bridge that gap; at the same time I'm not sure it's a good idea, because a Jupyter notebook is really, really good at doing what it does, whereas checking in code so that it doesn't change too much, isn't that easy to change, and lets you track all of the changes, is what you need for production. Does that answer the question? So: we use Jupyter notebooks for everything that happens before we put things in production, and then you start copy-pasting the code from the Jupyter notebook into an actual repository, you start writing tests, you start going back to the data scientist — "so, you did these two things here, but the two intervals are not exactly the same" — you ask questions to try to improve things, you start doing code reviews, and you make sure everything works in a regular fashion. So it's really two separate workflows.

Thank you — that was brilliant, and I think great advice for people who are developing the models. Do you have any advice for application software engineers who need to integrate a model into a product or a feature, or perhaps advice around collaboration between a data scientist coming up with the model and a software engineer implementing it?

Yeah, I have some advice, in a way, and it's more organizational. What I think is important is that everyone is on the same page and the data scientists are using roughly the same tools you are going to use in production, rather than having their own little world which works but which you are never able to translate. The other part is: once your data scientist has spent four months working on the model, and an engineer — that would be me — comes along, takes it, and does a lot of things with it to put it into production, I think it's very important to keep that person in the loop, because they have a sense of ownership of it. So: keeping people in the loop, and making sure things stay aligned. That was a very vague answer, I know — I hope it was understandable.

OK, two more questions, I think.

Thanks. You mentioned S3 — do you also use, for example, S3 to trigger the models in a Lambda, or some serverless service like that?

I don't. I know some people are playing with it, but I have no experience on the subject.
Thanks. One last question, if anyone... OK then, thanks. It may be a little bit off topic, because you said you wouldn't be talking about the tools themselves, but can you guide us on what to look for, and just give some names to check out?

Yes. I have a particular liking for Spark, to be honest, because it handles data sets of all sizes, it's very easy to run locally, and you can write unit tests with it and just pretend that your cluster — your test instance — is gigantic, when actually you are using one CPU and everything still works. I've had very good results with it, and lots of algorithms and models — XGBoost, for example — are compatible with Spark. I would always say to keep two guns: one good old tool which is well tested — and Spark is kind of old and not trendy any more, which means it's stable and actually works in production — and then trials of new things, to see whether you can leverage them and what the problems with them are. Thank you.

Thank you very much again. [Applause]
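As an aside on that last answer, a sketch of the "pretend your test cluster is gigantic" pattern: a pytest fixture that runs Spark locally on one CPU so the same transformation code can be exercised in CI. It assumes pyspark and a local Java runtime are available; the fixture and test are illustrative.

    # Unit-testing PySpark code locally: local[1] runs the whole "cluster" on
    # one CPU, so the same transformation code can be exercised in CI.
    import pytest
    from pyspark.sql import SparkSession


    @pytest.fixture(scope="session")
    def spark():
        session = (SparkSession.builder
                   .master("local[1]")
                   .appName("feature-extraction-tests")
                   .getOrCreate())
        yield session
        session.stop()


    def test_sampling_keeps_schema(spark):
        rows = spark.createDataFrame([(0, 0), (3, 1), (12, 1)], ["n_reviews", "has_photo"])
        sampled = rows.sample(fraction=0.5, seed=42)
        assert sampled.columns == ["n_reviews", "has_photo"]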