Using Django, Docker, and Scikit-learn to Bootstrap Your Machine Learning Project


Formal Metadata

Using Django, Docker, and Scikit-learn to Bootstrap Your Machine Learning Project
Title of Series
Part Number
Number of Parts
Mesa, Lorena
Confreaks, LLC
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DjangoCon US
Release Date

Content Metadata

Subject Area
Reproducible results can be the bane of a data engineer or data scientist's existence. Perhaps a data scientist prototyped a model some months ago, tabled the project, only to return to it today. Only now do they notice the inaccuracies in, or outright lack of, documentation of the feature engineering process. No one wins in that scenario. In this talk we'll walk through how you can use Django to spin up a Docker container to handle the feature engineering required for a machine learning project and spit out a pickled model. From the version-controlled Docker container we can version our models, store them as needed, and use scikit-learn to generate predictions moving forward. Django will allow us to easily bootstrap a machine learning project, removing the downtime required to set up a project, and permit us to move quickly to having a model ready for exploration and ultimately production. Machine learning done a bit easier? Yes please!
OK, so thank you so much for joining, especially after lunch. I hope you're all really excited, because we're basically going to look at every single technology in the title. I tried not to make the title too long, but we've got quite a few buzzwords in there, which is fantastic. So just to recap: this talk is titled Using Django, Docker, and Scikit-learn to Bootstrap Your Machine Learning Project. Something I do want to point out is that all these slides are available at the link in the bottom left-hand corner; I'll also be sharing them on Twitter, along with the code. I won't be doing any live coding today, but I do have a repository, and I'd love for people to use things, break things, and tell me what I need to add. That would be really fantastic.
I want to start off with a little story about something that happened to me recently. Many of us use some kind of communication tool like Slack at work. On August 1st I got this message from a co-worker of mine, asking things like: "Hey, I've got questions. The interface on this model is weird. What happened? What's going on?" And I was like, wait, hold up, there are a few models and there have been a lot of changes; which model version are you talking about? I don't understand. I think that speaks to the problem space I'm currently in, which is the tooling around data science. This just happened to me on August 1st, so the struggle is real; I'm still in this problem space, thinking it through quite a lot, and I'd love to hear your thoughts afterwards. So, August 1st, Matthew pings me, and I think: OK, maybe I can go and piece together what this model is. I jump on GitHub and look at where the data scientists track their work. I go into one of the data scientists' project repositories, and there are like ten Jupyter notebooks in there, with a lot of code and a lot of things I don't understand. OK, I have no idea which of these notebooks produced the model that is on the service you're using. So hold on, let me go look: maybe I can piece together the story from the pickled version of the model, which we store in S3. Looking in S3, you'll notice one model dated June 12, one dated July 28, and something called "latest," which is vague at best. You'll also notice the size of the two models is exactly the same, even though five or six weeks passed between the two being dumped, and I'm not exactly sure which one "latest" is.
That is pretty much my struggle every day. I'm somewhere in this place where I'm not exactly sure which model is the one in production, whether this is the correct version, whether this is the right interface. Where are we with data science, and what are our best practices? I'm going to talk a little bit about a few things today.
Before I get started, I wanted to do a quick introduction to who I am. Hi, I'm Lorena Mesa. Something a little bit about me: I actually come out of the world of being a political scientist. A few years ago I realized I could do SQL and Python and that I liked this kind of work, so I did an immersive bootcamp program in Chicago, and I've been with the Sprout Social engineering team ever since May 5th, 2014. It's been really exciting. In my time there I've been on the platform team, working on two different parts of the platform, and for the last year and some change I've been on a brand-new data science team. I'll talk a little bit about the anatomy of a data science team later, but what I do want to point out is that our data science team is newer and we are growing quickly, which is kind of where this talk comes from. Outside of work I help run PyLadies Chicago, and I serve on the Python Software Foundation Board of Directors. I think there are a few of us here at DjangoCon, so if you have any questions or want to know more about the PSF, this is my call to you: come speak with us. We're super happy to get to know who you are.
In regards to today's talk, we're going to go through five items. First, we'll get some common language around what machine learning is. Then we'll talk about the anatomy of a data science team; at DjangoCon we get people from a broad variety of disciplines, so I want to make sure we talk a little bit about data science and how my team functions. Then we'll talk at a high level about the engineering of a machine learning problem. We'll then move into thinking about machine learning engineering using some of the latest and greatest buzzwords, like Docker, Django, scikit-learn, and some other goodies. And then we'll leave off with some open questions: what could be next for the project I'm working on, what are some ideas for those of you who work on data science teams, and what could we use in the ecosystem for data science infrastructure and tooling.
Great. So: what is machine learning?
I always get a little bit of a kick out of what people think machine learning is, like, "Hi, I'm Johnny 5." If you don't know, Johnny 5 is from a really ridiculous movie from when I was little, where you essentially see an intelligent robot that can do all the things. I think that's what some people imagine when they think of machine learning, especially if they don't work in an organization that has a machine learning practice already. For lack of a better word, I like to go back and find common jargon to ground myself and think through: when we talk about machine learning, what is it we're getting into? Rather than read this full wall of text, I just want to highlight the big things here. Machine learning is a field of computer science that grew out of pattern recognition, computational learning theory, artificial intelligence, and so on; I'm sure we've heard a lot of that, and I don't really care so much about it. What I really want to focus on is that machine learning explores the construction and study of algorithms that can learn from and make predictions on data. So the big things here: algorithms, which we all know; data, which we work with all the time; and making predictions on that data. Cool, we now have some groundwork of language to think about machine learning.
But let's think about it in maybe a little bit more of a traditional way, the way some of us with computer science degrees may have been exposed to machine learning. A definition I find very useful comes from Tom Mitchell, a computer science professor at Carnegie Mellon, who frames machine learning with these three pieces: a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. So when we talk about algorithms, they're doing some task, they're using data, and from that data we're trying to derive some insights so that we can go ahead and make predictions on new things that may be happening. Depending on the features you're building, there's always some task, some kind of thing we want to accomplish. And how do we know if we're doing it correctly? Well, we have some idea of a performance metric.
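As a throwaway sketch to make E, T, and P concrete (this is not real machine learning, just a toy "learner" that memorizes labeled words; the words and labels are invented for illustration):

```python
# Toy illustration of Mitchell's definition: a "learner" whose
# performance P (accuracy) on task T (predicting a label for a word)
# improves with experience E (the labeled examples seen so far).
training = [("pizza", True), ("marathon", True), ("spam", False),
            ("broke", True), ("lol", False), ("hungry", True)]
test = [("pizza", True), ("spam", False), ("hungry", True)]

def learn(experience):
    # Task T: look the word up in what we've seen; guess False otherwise.
    seen = dict(experience)
    return lambda word: seen.get(word, False)

def accuracy(model):
    # Performance measure P: fraction of test examples predicted correctly.
    return sum(model(w) == y for w, y in test) / len(test)

# Performance improves as experience E grows.
scores = [accuracy(learn(training[:n])) for n in (1, 3, 6)]
print(scores)
```

With one example of experience the learner only knows "pizza"; with all six it has seen every test word, so its measured performance goes up.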
Because I like to illustrate everything with examples, I think it helps to have a little toy problem to think through: what might be some ways we do machine learning in our day-to-day, and what might a simple project look like? Sprout Social, where I work, is largely a business-to-business social media management and analytics tool, so we work a lot with social media data, and text data is a really big thing for me. Text data is something we've probably all heard about: we all have things like the spam filter in our inboxes marking whether something is spam or not. Let's take that example a bit further and think about a different problem. Something else we could frame as a machine learning problem could be predicting altruism with a naive Bayes classifier.
I'm guessing, and hoping, many of us like that idea, especially because of the word "free" associated with it, and that makes me really excited. So what would you all do to get free pizza? Don't scream it out loud, just brainstorm up here with me. Would you maybe run a marathon? Perhaps, because maybe the pizza tastes so much better after you run a marathon. Or would you, I don't know, present in front of DjangoCon for 45 minutes? I don't know what your requirements are for getting free pizza.
This is actually a really fun little toy example that someone put into the world, and they actually released data for us, so we can start working on measuring and predicting altruism, with the idea that if someone gives you pizza, that's an altruistic act. Where does this example come from? On Reddit there's a subreddit called Random Acts of Pizza, and essentially what this subreddit does is invite people to come and make a request. As you'll notice here, there is some structure for how a request happens, which is pretty nice, because in a world of unstructured data, having some structure to your data is great, especially when you're using machine learning. In line item 2 we have: "Write your request post and submit it. Remember to start the title with [REQUEST]." So essentially this is a subreddit where people can make requests for free pizza, and then the community will upvote or downvote them, and based on the rules of how this works, when a request gets enough votes someone might buy a pizza for the requester. So if you're hungry tonight, you can do this tonight.
Here are some examples of free-pizza request goodness. I really got a kick out of the first line: "I actually have money for the pizza, but all I have is a $50 bill, and the delivery boys don't accept anything larger than a $20 bill." That was actually the text of someone's request. Another request someone made was: "I've got the Torah in one hand, and in the other a pizza pie." I guess they were just trying to get people excited, like, I'm ready, almost there. As you can see, these requests can be all over the place; they can be kind of silly, and some of them are actually a little heartbreaking to read.
In this toy example, the data science competition website Kaggle has actually already cleaned this dataset and made it easily downloadable. It's probably very hard to see, but on the right-hand side is a snippet of some JSON blobs that represent the training data for this problem. You've got a lot of features in there, including the request text (some examples of which I just gave you), the number of upvotes, user information about who requested it and when it was requested, things like that. That's an example of your training data. Essentially what this problem says is: given these 5,671 requests, of which 994 are labeled as true and the rest as false, write a machine learning model that can predict, when a new request comes in, whether the requester will be successful or not. As a machine learning problem, this is an example of a classifier. Maybe you want a way to hack your way to all the free food in the world; maybe this is your way of bringing machine learning to your application. You have some text data, you've got some historically labeled examples of things that were successful or not, and you can use that when constructing a model to help you make future predictions.
To reframe this in the language we saw earlier from Tom Mitchell: we have a task, which is classifying a piece of data; essentially the question is, is a pizza request successful, or, put differently, is this an act of altruism? Our experience is going to be the labeled training data, essentially a file where each record has the request data and a boolean representing true or false, whether that request was successful. And our performance measurement is: did we predict the label correctly? That's pretty much the performance measurement, because we're working with labeled data: you hold out some of your training data, and since you already have a label, you can check, hey, did I actually successfully predict this thing or not? So in an example machine learning problem, maybe you have some kind of classifier problem like this: you have your task, you have your experience, and you've got some notion of a performance measurement.
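As a minimal sketch of what that labeled experience looks like on disk, here's some Python in the shape of the Kaggle training file. The field names (`request_text`, `requester_received_pizza`) are my assumption of the dataset's schema; the real file has many more fields per request.

```python
import json

# A couple of records in the shape of the Kaggle "Random Acts of
# Pizza" training file (field names assumed; the real file carries
# many more attributes per request, such as vote counts and timestamps).
raw = """[
  {"request_text": "Request: broke until Friday, a pizza would save my week",
   "requester_received_pizza": true},
  {"request_text": "Request: I just really want free pizza",
   "requester_received_pizza": false}
]"""

requests = json.loads(raw)
received = sum(1 for r in requests if r["requester_received_pizza"])
print(f"{received} of {len(requests)} requests were successful")
```

The boolean label on each record is exactly the "experience" Mitchell's definition calls for: historical outcomes we can both learn from and score predictions against.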
What next? OK, so we've talked a little bit about what machine learning may look like for you. As an engineer, there are decisions to think about in the data you may be working with: it could be text data, it could be other things about the pizza request. Maybe you're looking at whether something is incorrectly capitalized in the request; maybe there are other characteristics of the data you want to look at beyond just the words that are in it.
On a data science team, something to think about is that you've got a few kinds of folks working on these problems; it's not just developers. I'm borrowing here from IBM's UX personas as applied to engineering, which sketch out the differences between the app developer, the data scientist, and the data engineer. The data scientist is someone who's going to be doing a lot of the feature engineering; they're the ones trained in the different statistical methods and algorithms. Your data engineers are probably building the pipeline and the infrastructure around it, making this thing happen. And the app developers are perhaps the ones actually taking the model, putting it into the application, and making it so a user can use this feature. So in that world, let's talk a little bit about my data science team.
You may have seen this Venn diagram of data science before, with the idea that someone who has all the skill sets, math and statistics, subject-area expertise, and computer science, is the unicorn that is allegedly the data scientist. I can tell you that, at least in the context of my team, that is a total fallacy. We've got a team that's been growing pretty quickly; like I said, it's about a year and some change old. We've got four data scientists: someone with a natural language processing PhD who has spent a long time in computer science working with NLP, folks who come from predictive analytics and economics, and a person who came out of a postdoc in chemistry. And then there is me, the lone software engineer on the team. I came from the platform engineering side of things, and I had done a little data analysis and analytics in my time in the professional world. We do also have some designated infrastructure support, but largely that is still being figured out.
So why is that a problem? Great, we've got people with the title of data scientist, and we have some infrastructure. Why do I care about that?
Well, I think this can give you an idea. This is from the keynote Jake VanderPlas gave this year at PyCon in Portland, where he talks a lot about the variety of tools in the scientific Python stack. You've got scikit-learn, you've got a lot of different libraries you can use, you have tools like pandas, a lot of visualization and plotting tools, and for the actual work of developing a model and executing it you've got things like Jupyter notebooks. This is a big world that people can pick up and select tools from.
That's partly why Python is the place where people are going to do a lot of modeling work. The thing I like to think about, and why I get excited by data science in Python, is that it really is a place where we can mesh a lot of things together. In the scientific community there's a lot of work, for example in the astronomy field, using Python to do analytics and visualization. Again borrowing the framing: Python acts as a glue and plays well with other languages; we've got the batteries-included idea, so we don't have to write proprietary code before working with data right from the get-go; and it's simple and dynamic, which makes it well suited to scientists.
That's all fine and dandy, but again, I'm the lone software engineer on my team, alongside people who come from a variety of academic disciplines, and the tools they use can vary strongly depending on the types of machine learning problems we're working on. So I keep thinking about what the model is for my team, or for me as a person supporting a data science team. To modify Professor Ralph Johnson's quote, "before software can be reusable it must first be usable": before machine learning can be usable, it must be reusable. We have a lot of different expertise, a lot of open source tools, and a lot of different kinds of problems people are working on, and I think that can get a little overwhelming. That's the thing I want to highlight.
So, back to this idea of machine learning: one thing we have to think about is what the problems are and what answering them looks like. Take that example of predicting whether a request gets a free pizza: we might have text data there, but depending on the machine learning problem you're working on, you might have a variety of other data sources you're integrating. This component of shaping the data, selecting it and getting it into the format you need, can be quite expensive; we call this feature engineering. The representation on the left-hand side is a broadly simplistic picture of a pipeline for a supervised learning approach. You can see we have data in our production databases; we have logs; maybe we have some metrics capturing how people are using the application; maybe we've also got some proprietary data that we paid for. There can be quite a series of things we're taking, integrating, and merging into a new format. This work of getting the data, merging it together, and then doing feature extraction and preprocessing is going to be a lot of the work of our data science teams. Once we have the data in the right format and we've selected the features, we can go ahead and apply a learning algorithm. From there we get a model, and using that model we can make predictions on future pieces of data; in the example of our classifier, saying hey, that request will be successful, or no, it won't. But what I'm describing here knits a lot of things together; there are going to be different areas of expertise, and depending on who is doing what work, we need to think a little more critically about that.
And so this gets into one really big question. The way that application developers use the data we're collecting, and our proprietary data, versus how the data scientists themselves read that data, can be quite different, and we've actually got an entirely different kind of infrastructure we need to start developing and building. The question then ultimately becomes: where is the handoff between data science and production? What does that look like? What's the social contract? Because the way that I'm collecting and shipping data, the kind of data pipelines we're laying down, matter to the application developers responsible for user-facing features. So let's break this down a little bit.
OK, so here is the kind of flow I'm going to be talking about. First, we get and shape the data. Second, we train the model on the data. Third, we pickle the model, with something like Python's joblib. And fourth, we go ahead and use it to make predictions. In this example, using Python's ecosystem, you might have some kind of script, quite simplified, like what we see on the left-hand side. In the Python scientific stack we have scikit-learn; why scikit-learn and not something like TensorFlow is a conversation for another day. So you might have something like this, saying: OK, we've got the pizza example, we have data that we've been collecting, and I want to be able to apply a model and make predictions on it. With scikit-learn we have all those algorithms built in, and one variant of naive Bayes is the multinomial naive Bayes. The flow in the Python code may read something to the extent of: split the data into the training set and the test set, then go ahead and fit that multinomial model, or whatever model you're using, on the training data. Essentially, fitting on the training data builds a representation of whatever features you've selected. If you're working with words, perhaps you're turning text into numerical vectors, saying hey, here are the words that appear most often in requests that were successful in winning a pizza, here's the histogram of counts, and here's the vectorized format of the word counts for the ones that weren't successful. That's one example of a feature; there might be others, and you can represent them in a numeric format, but essentially when we're fitting against historical data there's a process like that happening underneath. Once it's fitted on that historical data, you can say, hey, I've now got a model that's ready to be used, and dump the model in a pickled format.

Again, that shaping of the data, that prep work: I'm not doing a deep dive on scikit-learn pipelines in this talk, but if you're curious there is a fantastic talk I recommend from PyData Chicago in August 2016, about 40 minutes long, that goes through the different transformations you can use; I have it linked in my slides. That being said, you have your data, you're going to have to do some kind of munging to get it together, and you can use transformers from scikit-learn to get it the way you want. To think back to what might be an example of a transformation: perhaps you have a bag of words with a lot of what we call stop words, words that don't provide a lot of context. Maybe the frequency of those words throws off, or introduces bias into, your model, so that it overestimates how likely a request is to be successful. So a transformer might clean your bag of words by removing stop words. You might also do something like stemming, which treats words like "shopping" and "shopped" as the same word. You can use built-in transformers like that, or write your own custom ones, and you can chain them together with scikit-learn's Pipeline. So when we're back in this flow: we've got training examples, we've got the forming of the data into the format it needs, and now we go ahead and apply that learning algorithm. What's pretty nifty about scikit-learn is that you've got a variety of things built in.
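To make that split-transform-fit sequence concrete, here's a minimal, hedged sketch with scikit-learn. The toy corpus and labels are invented stand-ins for the pizza-request data, and the real feature engineering would be far richer than a plain bag of words:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the pizza-request corpus: request text plus a
# True/False label for whether the requester got a pizza.
texts = [
    "request broke student needs pizza tonight please help",
    "request lost my job this week a pizza would mean a lot",
    "request just want free pizza lol",
    "request pizza pizza pizza give me",
    "request single parent payday is friday can repay with karma",
    "request bored and hungry someone send food",
] * 10  # repeat so the split has enough samples to work with
labels = [True, True, False, False, True, False] * 10

# Hold out a test set so performance is measured on unseen requests.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# Chain the feature engineering (bag of words with English stop
# words removed) and the learning algorithm into one estimator.
model = Pipeline([
    ("vectorize", CountVectorizer(stop_words="english")),
    ("classify", MultinomialNB()),
])
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

The Pipeline is the piece that makes the transformations reproducible: stop-word removal and vectorization travel with the model instead of living in a notebook cell somewhere.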
There's a lot of ease of use that lets you start getting to work pretty quickly, so I'd encourage you to go explore it; there are also metrics built into scikit-learn, which we'll see in a little bit. For the context of the machine learning I'm talking about, the kind I work with, this is the flow: supervised learning, with previous, historical data formatted in some way. We use scikit-learn to apply transformations and get the format we need, fit the model on it, take that model from scikit-learn, and then dump it using joblib into a pickled format, and that thing is ready to go.
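The "dump it with joblib" step could look like this; the file name is a hypothetical stand-in.

```python
# Sketch of pickling a fitted model with joblib and loading it back.
import tempfile
from pathlib import Path

import joblib
from sklearn.naive_bayes import MultinomialNB

# A toy fitted model standing in for the real pizza-request classifier.
model = MultinomialNB().fit([[3, 0], [0, 2], [4, 1]], [1, 0, 1])

# Dump the fitted model to disk, then restore it and reuse it for predictions.
path = Path(tempfile.mkdtemp()) / "predicting_altruism.pkl"
joblib.dump(model, path)
restored = joblib.load(path)
```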
Cool. OK, so I was talking a little bit earlier about reusability. We've got the flow, and we've talked a little about what the code may look like if you're using Python to do machine learning, explicitly following the supervised Naive Bayes example. So: does reproducibility actually matter? As the engineer here, I'd say it does, and let me show you why.
I really think social media data is just fantastic for this. Say you have to develop a machine learning model that classifies the sentiment of a tweet: is it positive, a warm good feeling, or negative? You have to slice and dice, separating negative from positive. Well, here's the thing: emoji are part of text data. Take the "face with tears of joy" emoji. People use it so differently: some folks read it as someone who's sneezing or crying; other people use it as a very positive, exciting thing. The example tweet here says "I'm laughing so hard," tears of joy, tears of joy, lots more tears of joy, hashtag accidentally dating Justin Bieber, with a shirt that says "single, taken, sleeping: Justin Bieber." If I asked you whether that's a positive or a negative tweet, I'm very curious what you would say. I would say not so positive; I'm not a huge Justin Bieber fan, so to me all those tears of joy read like utter devastation, not a positive tweet. Another really fun one I've come across is the purple heart emoji. I'm super sad that Prince is no longer with us, but purple is the color of Prince, and if you look at tweets with purple hearts around the anniversary of Prince's passing, which I believe was in April, the one-year anniversary this year, you start seeing a lot of purple hearts. Look at this tweet: Pantone has announced a new purple hue named in honor of Prince's famous love symbol. To me that's "oh, that's so cool, remembering Prince"; other people might be watching Purple Rain and be just utterly destroyed. So we have all these examples of inconsistency, and that's exactly why reproducibility does matter.
If I have a machine learning model that's predicting the sentiment of something, ideally we'd want it to be consistent, right? I would hope so. But there's a lot of context involved, and this is where things get a little tricky. And I'm not even going to start on hashtag LOLSob, because that one is just too silly.
So this raises the idea of data and data governance: who owns the data, who's doing what, and how are we going to get consistent results? In the pipeline way of thinking, I want a system of black boxes: a black box that shapes the data and does things, a black box that fits a model, and a black box I can call to make a prediction. So as an engineer, what I care about is developing tools that will get me this reproducibility, and that's where I'm going to start.
Enter Docker, which is fantastic; I also just really like Docker's imagery, it's very exciting. On containers: there's a very good talk later today on Kubernetes that goes through what Kubernetes is, which is just fantastic. But let's look into containers and talk a little bit about them.
So, Docker. Most of you have probably used it, but a refresher is good if you haven't. You can think of a Docker image as basically a big executable tarball with an explicit format. For me as an engineer, that might mean: the code that actually does the thing, any libraries I need to help the code do the thing, and any system tools I need. Cool. Well, here's a fun gotcha in data science.
That tarball might also include the data, because we don't want those weird inconsistencies around the purple hearts, the LOLSobs, or whatever the latest and greatest thing is that someone has introduced into the world.
So the data is probably also going to be a big part of it when we talk about supervised learning problems. When we think of Docker, that tarball can also lock down the training data, so that when we introduce new versions of models we can retrain on the same thing, because that consistency is good. And again, why use Docker? I just have to steal from Kelsey Hightower, who eloquently said that the first rule of pipelines is that you don't use the system-installed version of Python. I'm sure many of us have struggled with that; having run Django Girls quite a few times, we tell everyone: do not use the Python that's installed on your machine, it will not solve all your problems. So, as we know, setting up projects can be difficult even before you introduce all the complications of machine learning, like the data we're using and the correct order of the transformations we have to apply; these things only get more complicated in machine learning problems. Docker gives us an interface, and what's really cool is that it lets us make a container that basically takes a snapshot of how the code works at a given moment; then we can put it somewhere and check it out later. We do this with the Dockerfile. Essentially, the Dockerfile lays out steps, and what's really nifty is that if you keep the same procedure in your Dockerfile time after time, you won't need to re-download everything, because Docker caches layers; only when you make changes to your Dockerfile does it have to go and re-download things, which makes setup a little easier. So first we write the Dockerfile, and the second step is to build the Docker image, which might look something like: docker build, tagging the model as predicting-altruism with a tag of latest.
Then if I wanted to run that Docker container, from the command line it might look like: docker run, detached, mapping port 8888 to 8888, and mounting the data directory as a volume.
So here's an example Dockerfile; let's walk through what it's doing top to bottom. This is an example of what I might hand a data scientist so they can go ahead and start working on their stuff, so that I can say: OK, now save things so I can check them out once you give me the thumbs up that you've got a model ready. From top to bottom, what it does is: use the Python 3 image, add some users, mount the data directory volume onto my Docker image, install the requirements, and then there's this entrypoint, which basically stands up a Jupyter notebook. That's what I provide; pretty straightforward.
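A hypothetical reconstruction of that Dockerfile, following the steps the speaker lists; the user name, paths, and ports are illustrative assumptions, not the speaker's actual file.

```dockerfile
# Start from a Python 3 base image
FROM python:3

# Add a non-root user for the data scientist to work as
RUN useradd --create-home scientist
WORKDIR /home/scientist

# Declare the mount point for the training-data directory
VOLUME /home/scientist/data

# Install the Python requirements
COPY requirements.txt .
RUN pip install -r requirements.txt

USER scientist

# Entrypoint: stand up a Jupyter notebook
EXPOSE 8888
ENTRYPOINT ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888"]
```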
So with Dockerfiles we can easily start controlling when changes happen, because you can build a snapshot of what the code, the data, all of it looked like at a point in time. I'm a big fan of using the mountable data directories: essentially, that takes a data directory on your local system and mounts it into the image. Like I said, we want reproducible results, so if I have a data scientist who found a really cool dataset and wants to work on it, but wants to use the same model, well, they can go ahead and build a new image and bring that data in, saying: this is the name of the training data I used.
So, thinking about how I can get my data scientists to work: essentially, what I would ask them to do is use this Dockerfile. We'll have a mounted data directory that includes their Jupyter notebook with their exploratory code. Again, it's the same process: use scikit-learn, pick Naive Bayes or whatever classifier they're using, transform the data using whatever transformers the data scientist selects, pickle the model, and build it into the Docker image. Wherever that model ultimately winds up, at least I know I can check out the Docker image and I've got the training code and the notebook that go with it, ideally with some naming conventions that explicitly tell me when it was last touched, who owned it, things like that. I'm basically creating abstractions for them to work in, where they control the modeling side and I can check out the thing they're working on and continue on my merry way. Now, if I'm using a Docker container, what I can do in Python is use the docker module, which allows us to create an image. So if I were to stand up an endpoint in a Django API, maybe the URL looks something like: create image, name of model, go build the image. That endpoint goes through the procedure of reading the Dockerfile, building the image, and stuffing it somewhere, and what I want to return back is basically a URL for where that Docker image lives. For context, there's a demo of this in my repository; it's linked in the additional slides, so you can check it out. This is essentially where Django comes in: we can now stand up an endpoint and let Docker do the heavy lifting.
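A sketch of what such an endpoint might sit on top of, using the docker-py SDK. The tag convention, function names, and date are illustrative assumptions, not the speaker's actual code; the build helper needs a running Docker daemon.

```python
# Hypothetical helpers behind a Django "create image" endpoint.
import datetime


def model_image_tag(model_name: str, owner: str, when=None) -> str:
    """A mutually agreed naming convention: model name, owner, date touched."""
    when = when or datetime.date(2017, 8, 16)  # illustrative fixed date
    return f"{model_name}:{owner}-{when:%Y%m%d}"


def build_model_image(model_name: str, owner: str, build_dir: str = "."):
    """Read the Dockerfile in build_dir, build and tag an image, return the tag.

    Requires the docker-py package and a running Docker daemon.
    """
    import docker  # imported lazily so the tag helper above works without it

    client = docker.from_env()
    tag = model_image_tag(model_name, owner)
    # images.build returns an (Image, build-log generator) tuple
    image, _logs = client.images.build(path=build_dir, tag=tag)
    return tag
```

A Django view could then wrap something like `build_model_image("predicting-altruism", "lorena")` and return the resulting tag, or the registry URL where the image was pushed, in a JsonResponse.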
We abstract away what needs to be done and let the data scientist say: OK, I'll hit this endpoint, it's called predicting-altruism, and I'm going to ask it to build a new snapshot of the work I'm working on. In the Jupyter notebook they might have something like this example here of the clean-text tokenizer, which is what I was talking about before: breaking up words into a bag of words, and maybe applying whatever other transformers they need. Essentially, when we build an image from the notebook they're working in, we're able to pickle the model into the Docker image and have it live somewhere.
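A tiny clean-text tokenizer along the lines of what the speaker describes in the notebook could look like this; the regex and the stop-word list here are stand-ins, not the speaker's actual implementation.

```python
# Minimal clean-text tokenizer sketch: lowercase, keep word characters,
# drop a few stop words, and count the result into a bag of words.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "and", "of"}  # illustrative subset


def clean_text_tokens(text: str) -> list:
    """Lowercase the text, split into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def bag_of_words(text: str) -> Counter:
    """Count token frequencies: the numeric 'histogram' of the request."""
    return Counter(clean_text_tokens(text))
```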
So essentially, what we can do with docker-py is build this Docker workflow where the data scientist works in their Jupyter notebook, and when they're ready to save it, they post to a Django endpoint that manages things and creates an image. That image gets tagged with a mutually agreeable model naming convention, such that when Matthew asks me, "hey, I need to know about this model," I can say: OK, the last one that was pushed to Docker is here, and here's the URL you can use to go check it out. With that we get a structured way to share those processes between people.
The big thing I haven't talked about yet, the elephant in the room, is that we don't just want to orchestrate the building of these things; we want to understand how models do over time, which means we want to store analytics as well. So something my team is working on is using the Django admin to start bringing in some of the analytics we get from scikit-learn. On the right-hand side you'll see a chart of false positives versus false negatives, and we're seeing if we can lift some of those metrics out, so that when we train against that historical training set we have consistency: model variant one, here are its false positive and false negative results; model variant two, here are its results; such that when we ask "what's the best model?" we can surface those in a Django admin view. Some other things to think about: as I mentioned, there's a talk today on Kubernetes. Kubernetes is a great way to stand up an image and stand behind it, taking that container technology to the next step and letting the application developers spin things up, so if you're curious, I'll again make a plug for that talk. And Django is a great framework in which to do this: you can build upon great, robust libraries that are out there, like the Django admin, and use some of the other visualization tools. There are other web frameworks you might want to consider too; I have used others as well, and I think it really depends on what your team needs, but Django is a really good place to start given its robust, well-supported integration with a lot of the tools we've talked about.
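The false-positive / false-negative counts the speaker surfaces per model variant could be lifted out of scikit-learn like this, ready to store on a Django model; the labels and predictions here are made-up.

```python
# Sketch of extracting per-model false-positive / false-negative metrics.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up historical labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # made-up predictions from one model variant

# For binary labels, ravel() of the 2x2 matrix yields TN, FP, FN, TP in order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {"true_neg": int(tn), "false_pos": int(fp),
           "false_neg": int(fn), "true_pos": int(tp)}
```

A dictionary like `metrics` could then be saved alongside the model's image tag, so each variant's numbers can be compared in a Django admin view.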
So if you're curious about what comes next, there are a lot of places to go. I really enjoyed Rob Story's talk on Python and the JVM, because that's another scenario: maybe we prototype a model in Python, but we want to make use of the JVM. Is there something we can do within this flow where we just transform that pickled model into a different format? There's some discussion out there on that. If you want to learn more about scikit-learn pipelines, the link to that talk is in my slides. And last but not least, I have two repositories: one that lets you work with the mountable data volumes, with a ready, full-fledged predicting-altruism model that you can use and play with to your heart's content, and I also have the Django wrapper set up, which is pretty great.
All that being said, I don't know what you see in this last bit of clip art, but I'll say I hope we see lots of free pizza. So thank you so much. [Host:] Thank you so much, Lorena. Let's keep that energy up for the questions.