Bridging the Gap: from Data Science to Production
Formal Metadata
Title: Bridging the Gap: from Data Science to Production
Series: EuroPython 2018 (52 / 132)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
DOI: 10.5446/44953
Language: English
Transcript: English (auto-generated)
00:01
Yeah, thanks a lot, and welcome to my talk, Bridging the Gap from Data Science to Production. After that introduction, a few words about myself: I'm a data scientist at InnoVEX. I have a mathematical background, which is why I really like mathematical modeling; I think that's important for being a data scientist. I've done a few projects on recommendation systems, which is a really nice and interesting topic, and of course
00:26
I'm always interested in bringing things into production, meaning that I don't just want to build some proof of concept, some nice model, once and then kind of forget about it. I really want to see the value that you only gain if you put things into production.
00:44
And of course I'm a big fan of the Python data stack. Just a few words about the company I work for, InnoVEX, which gives me the opportunity to speak at cool conferences like EuroPython: InnoVEX is an IT project house with a focus on digital transformation,
01:04
and we offer everything around that, from operations to application development, big data and, of course, data science, and we have offices all over Germany. So, on to the actual topic: data science to production.
01:21
Who of you has already worked on a data science project where, in the end, you had some really cool proof of concept, but it was never really put into production? Okay, so it seems like this really is a big topic; a lot of people are talking about it, and it's also a source of frustration.
01:42
Data scientists get frustrated after a while if they see one proof of concept after the other that never really moves to production, and the business side gets frustrated too: they may have hired a huge team of data scientists that do cool things, but in the end they can never say, okay, our data scientists
02:02
did this and now we have increased our revenue by, say, 10%. And this is exactly why one should care about moving things to production. This topic is definitely not an easy one; it has many different facets, and throughout the talk I'm going to touch on many of them.
02:23
One of the important things is the actual use case, the data product or model you are building. We're going to look at this from a really high-level perspective: the data product you want to build in a company. If you look at it that way, you can basically say, okay, it's quite simple:
02:44
you have some data somewhere, you have your model, which is basically doing some transformations, and in the end you have some results, be they predictions or decisions or whatever. That is the really high-level perspective, but from these
03:01
three components we can already classify what our use case is, and that is what we have to keep in mind later if we want to put things into production. Take the data, for instance: is it coming from a relational database or some NoSQL database?
03:20
Is it coming from a distributed file system, or do you have to deal with stream-based data, so that your model really needs to consume data as a stream the whole time? This is an important question, and depending on your use case you have to clarify how you're going to do this in production and what the response-time and frequency
03:42
requirements are: are you dealing with batch data, or does your model need to react in near real time, in real time, or in a stream-based fashion? Then there's the model itself, and when I'm talking about the model I'm not only talking about a machine learning algorithm. A lot of people say 'my model'
04:00
and actually mean the artificial neural network or the random forest, but the model includes everything from the point where you get your raw data to the point where you give back some kind of result. So it also includes the preprocessing: how you do cleansing and imputation, how you scale your data,
04:22
and all the feature engineering you do, like constructing new derived features such as an exponential moving average. This is all part of the model, because if you do it on your laptop in some proof of concept, you later have to put it into production as well, and you need to think about that early so that you don't end up re-implementing everything.
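A minimal sketch of that idea with scikit-learn (toy data and an illustrative estimator, not taken from the talk): bundling the preprocessing steps and the estimator into one Pipeline object means the exact same transformations run in the proof of concept and later in production.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real raw features.
rng = np.random.RandomState(0)
X_train, y_train = rng.randn(100, 5), rng.randint(0, 2, 100)

# The "model" is the whole chain: imputation, scaling and the estimator.
# Shipping this one object avoids re-implementing the preprocessing later.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X_train, y_train)
print(model.predict(X_train[:5]))
```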
04:43
So the last part is the results: what do you do with your results? In a proof of concept your results are maybe some CSV file, and you make some nice plots and show them to a product manager.
05:04
But in production you need to care: do I put them into another database, and are the consumers of my predictions or decisions reading from that database, so that the database becomes your interface? Or is it again some distributed file system?
05:22
Are you writing results back into a stream? Or, in a real-time use case, as is quite often the case for recommendation systems, do you have to provide some kind of REST API so that people can ask in real time for recommendations given a user's preferences?
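For the real-time delivery case, a minimal sketch of such a REST interface with Flask (endpoint, artifact file name and payload format are made up for illustration):

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
# The versioned artifact produced by the training job (name is illustrative).
model = joblib.load("model-1.0.0.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Caller sends raw feature rows: {"features": [[...], [...], ...]}
    features = np.array(request.get_json()["features"])
    return jsonify({"predictions": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```

A consumer would then POST feature rows to /predict and get the predictions back as JSON.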
05:43
Looking back now at the whole picture: we have our data, we have a model and we have results, and in the end everything needs to be in production, so we care about deploying this model. I've already said a lot about that: we need interfaces. We are in control of the model, and
06:06
we need to define how we access the data and how we return the results in the end, and most of the time there are many other teams
06:22
who are in control of those, so it's important to talk to them, to communicate and to define interfaces. To give some more characteristics of the use case: we already talked about the delivery, so depending on your use case you may need a web
06:44
service, a stream or a database. Also important is the problem class: decide early on whether you want to do classification, regression or recommendation, and whether you need explainability or not, because this will later determine what kind of libraries you can use, so it's important to think about it early. Then there are volume and velocity:
07:06
these will later tell you what kind of scalability requirements your model needs to meet. Then inference and prediction: is it enough to do the inference once a day in a batch fashion, or does it again have to be real time or streaming?
07:23
All of this will later determine how you are going to put things into production. Additionally, there are technical boundary conditions. Maybe you are working in a company with a huge Java stack, and many companies have one, and they might say: okay,
07:42
in the end we can only roll out a Java model in a scalable way. That is a technical condition, and you should think about it early on, because if you then do everything in pure Python, you will be bound to just providing a proof of concept, because your code will not be able to be moved to production.
08:02
Other questions are: is it going to be an on-premise solution, or maybe run in the cloud? One important thing is that there's no one-size-fits-all solution for this. There are providers offering some holy grail, 'use our framework and everything will work',
08:24
but for me this is just not true. You really have to evaluate your use case beforehand and then decide on a case-by-case basis. So the takeaways and learnings from this high-level
08:41
perspective on your data use case, your data product, are: state the requirements of your use case early on; think about how to move things into production before you actually start some kind of proof of concept; identify and check your data sources, which means don't just work from a one-time data dump that someone gave you on a USB stick,
09:04
but rather think about where the data is coming from and how you could later access it in a productive way. Then define interfaces with other departments, meaning that if there's a dedicated team managing the databases and
09:23
the people who are filling those databases, you agree with them on how the data should be formatted and so on. This will be important for production, because if someone later changes, say, the schema of a database, everything could fail. Another
09:41
piece of good advice is to test the whole data flow early on with some kind of dummy model or heuristic, meaning that you run the whole process, from reading the data through a simple transformation to writing the results back into a database or a stream, so that you test it technically early on and directly see where things could go wrong.
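A minimal sketch of such a dummy end-to-end run (connection string, table and column names are hypothetical): read from the real source, apply a trivial heuristic, and write the result back over the interface production will use.

```python
import pandas as pd
import sqlalchemy

# Hypothetical connection string, just to exercise the real interfaces.
engine = sqlalchemy.create_engine("postgresql://analytics-db/shop")

# 1. Read from the productive source instead of a one-off CSV dump.
orders = pd.read_sql("SELECT user_id, amount FROM orders", engine)

# 2. Dummy "model": tomorrow's demand is simply the historical mean per user.
forecast = (orders.groupby("user_id", as_index=False)["amount"]
            .mean()
            .rename(columns={"amount": "predicted_amount"}))

# 3. Write the result back over the same interface production will use.
forecast.to_sql("demand_forecast", engine, if_exists="replace", index=False)
```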
10:06
So that was the part about the use-case aspect. Another big and important thing I see in the topic of data science to production is quality assurance.
10:24
Especially for data scientists it's quite often the case that they program in notebooks and so on, and code is more like a one-shot kind of deal, but actually
10:41
building a data product is an iterative process. And this insight isn't new: there's a method that was invented by IBM and is more than 20 years old, CRISP-DM, the cross-industry standard process for data mining, and already there they said that if you do data mining it's going to be an iterative process.
11:05
You're going to grab your data, you're going to prepare your data, you come up with a model, you evaluate, you get more insights about the data, and this goes on and on and on. The same actually goes for any kind of data product.
11:20
If you keep in mind that it's going to be iterative, quality is of course going to be an important aspect, and quality in different areas. For instance, if you program something, even if you just start with a kind of proof of concept, make your code clean.
11:42
What I see quite often is that people use JupyterLab or Jupyter notebooks and just put everything into one huge notebook, and if they have a similar task they just copy things over and so on. This is not really good, clean coding,
12:02
and here we can actually learn a lot from the clean-coding principles that Java developers usually follow, like software design patterns, the SOLID principles and especially the Clean Code Developer. Who knows the Clean Code Developer website? Okay, not so many hands; that's actually what I
12:22
thought. Clean code is really important in the end if you want to move things into production, because other people are going to read your code and you will have to make adjustments and so on. There are many good resources, and even, or especially, as a Python developer you should care about this.
12:45
Practical thing one should care about and do is continuous integration so that you Continuously if you're if your team works on something that you and continuously integrate your code into a master code that you have unit tests that continuously test your code that you think about versioning about
13:04
Packaging about putting your packages your artifact on an artifact store that you automize this and embrace some kind of development process this and This is actually quite easy to be done, I mean there's an open source tool Jenkins
13:22
I mean, I guess most of you know Jenkins who knows Jenkins who's who knows and who's actually using Jenkins Okay, that's good. So What I always do when I when I start a project directly Implement a really simple continuous integration process because it will help you so much later on and
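As a small illustration of the unit tests such a CI job would run (the feature function is a made-up example, not from the talk), testing becomes straightforward once code lives in a package instead of a notebook cell:

```python
# features.py -- lives in the packaged project, not in a notebook cell
import pandas as pd

def add_rolling_mean(df: pd.DataFrame, column: str, window: int = 3) -> pd.DataFrame:
    """Derive a rolling-mean feature from `column`."""
    out = df.copy()
    out[f"{column}_rolling_mean"] = out[column].rolling(window, min_periods=1).mean()
    return out

# test_features.py -- picked up automatically by pytest in the CI job
def test_add_rolling_mean():
    df = pd.DataFrame({"sales": [1.0, 2.0, 3.0, 4.0]})
    result = add_rolling_mean(df, "sales", window=2)
    assert list(result["sales_rolling_mean"]) == [1.0, 1.5, 2.5, 3.5]
    assert len(result) == len(df)
```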
13:46
Another thing is monitoring. If you build any kind of data product, you are of course interested in improving some key performance indicator; for recommendations it could be that we have to improve our
14:04
click-through rate, our conversions and so on. If you do that, you of course need to monitor things the whole time: how was it before you implemented your cool new algorithm, how was it before you tuned something or retrained?
14:21
So it's important to really monitor your KPIs. It's also important to monitor the whole setup, for example how many requests you had if you provide your recommendations or predictions as a REST service, to see whether you are approaching some kind of limit.
14:41
Also check the total number of predictions, because maybe something went wrong in the data ingestion and now you're not predicting enough, and check the runtimes and so on. Monitoring gives you sight: not having any kind of monitoring is like flying an airplane blindfolded, and this is also something that Google says.
15:04
There's an open-source book by Google, the Site Reliability Engineering guide, and it has this nice hierarchy where they say that for any kind of product the most important and fundamental thing they ask for at Google is to have
15:20
monitoring in place. And I've seen so many times that people start some data science project and no one actually cares about monitoring. Especially important for data science and data products is also monitoring how good the quality of your model is.
15:43
You normally have your metrics and of course you check them, but you can also do this in a live setting with the so-called response distribution analysis. Take a classification task: let's say you're classifying whether a picture is a cat or a dog. If you
16:01
just make a histogram over all the responses, where values around zero mean cat and values around one mean dog, you would directly see that model A is a working model while model B is a rather confused model that is not really sure about what it's outputting. Having a simple thing like this in place tells you directly whether the model you just deployed is nonsense and you have to replace it or fall back to another model.
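A minimal sketch of such a response distribution check (the thresholds and the alarm rule are illustrative, not from the talk):

```python
import numpy as np
import matplotlib.pyplot as plt

def response_distribution(scores, bins=20):
    """Histogram of raw classifier scores in [0, 1].

    A healthy binary classifier piles its mass near 0 and 1;
    a confused one produces a broad hump around 0.5.
    """
    scores = np.asarray(scores)
    plt.hist(scores, bins=bins, range=(0.0, 1.0))
    plt.xlabel("model response, e.g. P(dog)")
    plt.ylabel("count")
    plt.show()
    # Crude automatic alarm: too much probability mass in the middle.
    if ((scores > 0.4) & (scores < 0.6)).mean() > 0.5:
        print("WARNING: model looks confused -- consider a rollback")

# e.g. scores logged by the prediction service over the last hour
response_distribution(np.random.beta(0.5, 0.5, size=1000))
```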
16:24
It's definitely better to see this yourself before another department, or maybe a customer, calls to tell you that
16:42
whatever you just deployed is not predicting anything meaningful. Another thing regarding monitoring: think about A/B tests. If you go to production, you will care about those iterative processes, you will start implementing new features,
17:02
you will make improvements to your model, so it becomes really important to keep track of how much you improved with respect to the current baseline. And you can't always show this in an offline test; you also have to show it in online metrics, and in the end the business unit, the product owner, the stakeholder
17:24
will care about the KPIs, because that is what he or she is going to report to their superiors. A nice additional advantage of A/B tests is that you can, for instance, also
17:41
use hyperparameter optimization with the help of multi-armed bandits. The technical requirement for A/B tests is of course that you have versioning in place, and I'm really keen on versioning: version your models, version your artifacts, provide proper Python packages, because those versions
18:02
need to be linked to the test groups when you do an A/B test, so that you can see this was version 1.0 and this was version 1.1 with a cool new feature. You also need to be able to deploy several models in production at the same time, because you're going to have at least two
18:22
groups, and you need to be able to track the results right up to the point where they face the customer or whatever consumer you have. In the case of recommendations, for instance, you need to track that this recommendation from model A was shown to this user in group A, so all of this tracking has to be in place.
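One common way to meet these requirements, sketched here (the package names and assignment rule are illustrative, not prescribed in the talk), is to assign users to groups deterministically from their ID and to log the model version next to every prediction:

```python
import hashlib

MODELS = {"A": "recommender==1.0.0", "B": "recommender==1.1.0"}  # versioned packages

def assign_group(user_id: str, share_b: float = 0.5) -> str:
    """Deterministic assignment: the same user always hits the same model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000
    return "B" if bucket < share_b * 1000 else "A"

def log_prediction(user_id: str, prediction) -> dict:
    group = assign_group(user_id)
    # Persisting group and model version with every prediction is what lets
    # you trace the online KPIs (clicks, conversions) back to a concrete version.
    record = {"user": user_id, "group": group,
              "model": MODELS[group], "prediction": prediction}
    print(record)
    return record
```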
18:41
If you're using TensorFlow, I can recommend TensorFlow Serving, which we used in one project; it's an open-source tool by Google that does a lot of this organization and management of different models
19:03
for you. So those were some quality assurance aspects. Another big topic in the field of data science to production is organizational, or cultural, problems, and again it's nothing really new. Look at the problems that
19:26
ordinary developers and operations usually have: if you have a team of pure developers and a team of pure operations people, then the developers say, okay, our responsibility is
19:43
to code, to test, to make releases; of course they use version control and, in the best case, continuous integration and so on, and when they are happy with something they make a release and throw it over the wall of confusion. And the operations team goes: yeah, thank you, now we've got a package we don't understand
20:05
what's in, but we have to package it, deploy it and handle the whole lifecycle; of course there's going to be some configuration management to do, and we have to care about security and monitoring. If you keep this completely split up then, as people already
20:22
realized years ago, this is not the way to develop software quickly and efficiently, and with data science and data products this kind of thinking hurts even more. It's especially dangerous for data products and teams, and you
20:42
are seriously going to have a problem with your speed and time to market if, as a data scientist, you just think: how do I get things into production? I don't care, it's not my job. That is definitely the wrong way of thinking. The better way is to have a team that thinks 'let's build a great data product' and not
21:02
'okay, I made a great model'; it's just a different way of thinking. For the world of software engineering there's this big movement: how many of you know DevOps, have heard of a DevOps culture? Okay, so few. The idea is to overcome this wall of confusion, to do continuous delivery,
21:25
which is continuous integration taken one step further, so that at any point in time, if you decide to, you can also deploy and deliver your software, and to have heterogeneous teams of developers and
21:40
operations people working together. On the data science side we can apply the same thing. From my experience, pure teams of data scientists don't get anything into production, because they simply lack the knowledge of how to deploy and
22:05
how to do all those things you need to do to get something into production. So the learning is that you have to have heterogeneous teams of software engineers, data scientists, data engineers and operations
22:21
people, and if they all work together, they also start sharing their knowledge, they can work on a single product together and see it as their responsibility to get that product into production. As a rule of thumb, for a single data scientist
22:42
you need two to three data engineers who help to do all the things around the model, so you actually don't need that many data scientists, and right now it's even harder to find good data engineers, at least on the German market, than to find good data scientists.
23:03
Optionally, it's also a good thing to have a product manager embedded directly in the team, and if your data product is in any way related to, for instance, again the recommendation topic, a customer-facing
23:24
user interface, it's also good to have the user interface or UX expert directly in your team, because how you show things to your customer will also dramatically influence the results.
23:40
So it's good to have this close by and not in another team that maybe makes completely different decisions without telling you about them. A company that actually does a lot of this kind of organization is Spotify; they are really advanced when it comes to this. They have fully autonomous teams
24:03
for every feature; they call them vertical teams with end-to-end responsibility. Really from the design, from where the data comes, to how it is shown in the Spotify application or on the Spotify website, they are completely responsible, and this allows them to iterate really fast and, especially, to have less
24:29
politics. I've added a link here so you can read about it later; it's really interesting, and there are also a lot of talks on the web about how Spotify
24:42
organizes their teams around this. So that was the organizational, or rather cultural, aspect of data science to production, but we also have a language aspect, or as I would call it, a two-language problem. As
25:04
I've said before, in industry many people use Java, and the reasons for this are quite obvious: many people argue that having a strongly typed language is safer because the compiler already finds a lot of edge cases and so on, and
25:26
the language has a strong emphasis on robustness and edge cases. It has also been an industry standard for many years, and people know how to deploy it. So in many companies you will find that,
25:43
if there's a separate operations team, they will say: only Java things will get into production in the end, so I don't care what you do as a data scientist, but it's going to be Java in the end. And then there's the other side, the other world, where as a data scientist you're more the science
26:01
person, and of course you like Python or R, you like the dynamic nature of the language, and you put a stronger emphasis on cool methods and cool results and maybe not on robustness, and you are happy as long as it runs on your machine. So there are just two sides, and
26:23
of course there are many ways to resolve this problem. I'm going to present several ways I've seen it done in projects and discuss them. One option is simply to select one language to rule them all. I've
26:41
once been in a project where it was said that it's got to be Java in the end, so we directly started doing everything in Java, and I've heard that Netflix, for instance, do everything for their recommenders directly in Java. The
27:02
upside is that having a single language later reduces the complexity of your deployment; most companies know what to do with Java, you can package everything into a nice JAR and run it in some application server and so on. The huge downside is of course that you're completely
27:25
abandoning one ecosystem; in the case of Java it would be the Python ecosystem, so you don't have scikit-learn, you don't have pandas and so on, and you have to re-implement a lot. But it's a solution that some companies choose. Another option is to say: okay, Python is the winner, how
27:46
about putting everything into production in Python? This is especially nice if you are a data scientist, because you can keep using your favorite programming language. From my experience it's especially useful for the batch prediction use cases
28:05
in the categorization I've shown before: if you're doing some kind of prediction that you only have to do once a day, something like we did at Blue Yonder, where you predict the demand for the next two weeks
28:22
or so. If you have 24 hours to do one batch prediction, that's a perfect use case for Python. If you need some kind of web-service delivery, Python is a general-purpose language and you have many nice libraries like Flask to build a small REST
28:44
service. With Python you can also always scale horizontally: if someone argues that Python is maybe not fast enough compared to Java, you can always scale horizontally during prediction. For training,
29:01
what I like most is to have one big bare-metal node with many cores and a huge amount of RAM where you can train your model. The good thing with Python is that you are also not bound only to the Python ecosystem: you can tap into the Hadoop world as well, for instance by using PySpark
29:23
and PyHive. Those libraries of course have some limitations compared to the Java libraries, but nowadays, with Spark 2.3, you can use a lot of things from Python.
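A minimal PySpark sketch of that hand-over (the table name is made up; the Arrow setting assumes Spark 2.3 or later):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("feature-export")
         # Arrow-based transfer between Spark and pandas (available since Spark 2.3)
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

# "events" is a hypothetical warehouse table; pull a slice down to pandas
# and continue with the usual pandas / scikit-learn tooling.
events = spark.table("events").where("event_date >= '2018-07-01'")
sample = events.limit(100000).toPandas()
```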
29:44
If you then later want to deploy something, it's good to think about isolated containers and maybe use Docker, just to have all the dependencies and so on packaged into one thing, because for Python nothing like a JAR file exists where you have everything packaged.
30:00
Another solution to the problem, which I think is the worst-case scenario, is that you let a team of data scientists do something in Python or R and then some poor person has to rewrite everything in Java. This is something that once happened to me: I wrote a lot of Python
30:21
and then we sat together and converted it to Java, because only Java was allowed in production. It's really a lot of effort and it's slow: as we said, building a data product is an iterative process, so if you later decide on new features, then of course
30:42
you implement them first in Python and then someone moves them over to Java. It takes forever and it causes a lot of bugs, and if you see a bug in production it's always hard to find out whether the bug is maybe in the Java code
31:00
or whether it was already in the Python code, so whether it is a mistake by design. The upside is that everyone gets what they want, but I would never argue in favor of this solution to the two-language problem.
31:22
Another thing, which I've never really tried out in production, is to say: let's just use exchangeable formats. There are many around, like PMML or ONNX and so on. They are great in theory, but if you experiment a little with them, and we only tested this once and never put it into production, you find
31:44
that they have quite limited functionality. You have no guarantee that if you build your model in Python, save it in some exchangeable format and then read it in Java, for instance, it really does the same thing; you have to trust those two implementations.
32:03
It's just like with HTML: a website never renders exactly the same way in two different browsers, so why would it work for those exchangeable formats? And another downside: they often don't include the preprocessing and feature generation.
32:26
This is what I said before: when I'm talking about the model, it's not only the machine learning algorithm, it's also all the imputations and all the things you did beforehand, and those exchangeable formats need to be able to specify this, otherwise you're
32:42
re-implementing things again. Another solution for the language problem is using frameworks. We've used TensorFlow, especially for some recommendation tasks, and it's really nice in the sense that you use Python to train your model and save it in a binary, protobuf-based format, and then this binary blob can be read in and served by Java.
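A minimal sketch of that hand-over with the TensorFlow 2.x SavedModel API (the network is just a placeholder; the talk itself predates TensorFlow 2):

```python
import tensorflow as tf

# Placeholder network -- stands in for the actual trained recommender.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# ... model.fit(...) with the real training data would go here ...

# SavedModel: a language-neutral, protobuf-based format. The trailing "1" is
# the model version that TensorFlow Serving uses to manage several models.
tf.saved_model.save(model, "export/recommender/1")
```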
33:03
That is a really nice setup. There are other frameworks, of course; H2O is quite common and we've also done something with it,
33:22
but there we had a bit of a problem in that it doesn't allow so much preprocessing: you have the basic machine learning algorithms in there, but not all of the preprocessing. There it's also the case that you use Python to build your model and then save everything into a MOJO file,
33:40
as it's called, which can then be run later in production. If you opt for this solution, which I think can be a valid one depending on your use case, as I've said many times, you should of course always keep in mind that you are paying with flexibility: if you decide on a framework, you will only ever be able to do what the framework
34:05
provides, which can be fine, but it is also a limitation. So we have basically seen different ways, different possibilities, different doors for overcoming this two-language problem: there's re-implementation,
34:26
just re-implementing everything in Java, there's using a framework, and there's deciding on a single language. From my experience, re-implementation is definitely not an option, so don't do it; I've been there and it does not work well.
34:43
Frameworks, of course, are a valid solution: if you use TensorFlow or H2O, they can really help you get things into production much more easily and overcome the two-language problem. And if you decide on a single language, okay, I'm a bit biased here, I would definitely choose
35:01
Python and not let data scientists program in Java, or even Scala, because that is really frustrating. So, we've talked about the language problem; now a little bit more about deployment and some general
35:26
advice and good practices. For deployment, as I said before, there's no one-size-fits-all; it heavily depends on your use case and on the use-case evaluation you have done before. Of course, there are
35:46
software engineering principles that you should always apply, like, as I said before, continuous integration and continuous delivery; I can't say it often enough, just do it. Also think about how big your machine learning code actually is
36:04
compared to all the other things. There's a nice paper by Sculley et al. from 2015, already a few years old, showing where the technical debt in machine learning systems actually lies, and we see that in the middle, in your machine learning code, there's not much technical debt, but
36:24
everything around it just doesn't get enough focus, and a lot of those boxes are actually related to deployment: your configuration, your process management, your machine resource management, your serving infrastructure and especially your monitoring are all things you need to care about, and
36:45
this doesn't get enough attention in really many projects. That Sculley paper was a kind of survey, and it's good to keep it in mind. So, general principles again: version your things, package them, have processes and quality management in place, and
37:07
it also helps to keep the development and production environments as similar as possible. If you program something on a Mac and then move everything onto a Linux system, you can already run into problems there, even if it's Python.
37:25
Automate as much as possible, again continuous integration and continuous delivery, which also avoids human errors, and think about controllable environments, for instance by using Docker or at least having conda environments or other environments
37:44
in which you can pin versions down. Google also thinks a lot about this, and they have a nice blog post about best practices for machine learning engineering. I'm not going to go through all those
38:01
different rules; many have already been mentioned, like design and implement metrics and so on. Most of them are actually engineering concerns: if you want to bring things into production, in the end it's not about your cool data science model, it's really a lot of engineering problems you have to overcome.
38:27
Just as a practical tip on how easy it is to do continuous integration, there's also a blog post linked; you will see the slides later.
38:42
If you use Jenkins and, let's say, a devpi artifact store to store your built packages, it's just two jobs: you have one Jenkins job that clones the repo, builds the package and pushes it to some unstable index, and then you have another Jenkins job that
39:03
clones the repository again, installs the package, runs the unit tests and then, depending on the results of the unit tests, pushes it into some testing or stable index, so that other people can use the new version.
39:24
Speaking about packaging: a really cool tool for this, and really easy to use, like a five-second thing, is PyScaffold. It provides you with easy and sane Python packages; it gives you a kind of template,
39:40
a scaffold, for a typical Python project. It provides you with versioning for every commit, so you basically just use Git tags for the version and it enumerates the commits, so you have unique versions out of the box. It integrates really well with Git, has pre-commit support,
40:03
you have a declarative way of defining all the configuration for your package with the help of setup.cfg, it follows community standards, and you can even extend it with your own extensions.
40:21
As the last slide, a short recap of what we learned. The key learnings for data science to production really are: there's no one-size-fits-all solution; evaluate your use case and then think early on about how you can bring things into production;
40:42
think about quality, because quality assurance is really important; try to establish a DevOps culture and team responsibility for the whole data product, not just for some fancy data science model; think about how you overcome the two-language problem that you might have as a Python developer;
41:03
and embrace processes and automate as much as possible. The key thing really is: production is not an afterthought, so think early on about how you can later move things into production. With this I want to close my talk; thank you for your patience and your attention.
41:30
Thank you very much for a very interesting talk, with many interesting and important things we have to do when we develop software. Any questions? Yes.
41:43
Thanks for the talk, it's really great to see someone putting effort into sharing those insights. I've got a question on the monitoring part of your talk: how would you put a process in place to monitor the performance of the model, whether it's making suitable recommendations or
42:03
predictions, afterwards? Because I think you mentioned a technique whereby you can visually see if the model is confused, but what about when we don't really know what sort of input the model is going to get? How can we later see whether we can improve the model based on errors it might have made, and put a process around that?
42:23
Yes, so I would divide the monitoring into several parts. Of course, you need to have some monitoring for the incoming data. This is really important, because then you can easily spot all the errors which are just due to the fact that you got new outliers or maybe missing values somewhere, so you should have monitoring in place
42:47
for the incoming data: does it still look like last week? You can define alarms on this, like: suddenly we have not seven categories in this feature but ten, or the share of missing values went up from ten percent to fifty percent, and so on.
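A minimal sketch of such incoming-data alarms (column handling and thresholds are illustrative, not from the talk):

```python
import pandas as pd

def incoming_data_alarms(today: pd.DataFrame, last_week: pd.DataFrame, column: str):
    """Very simple drift alarms for one feature column."""
    alarms = []
    na_now, na_ref = today[column].isna().mean(), last_week[column].isna().mean()
    if na_now > na_ref + 0.10:                      # e.g. NA share jumps 10% -> 50%
        alarms.append(f"NA share of {column}: {na_ref:.0%} -> {na_now:.0%}")
    unseen = set(today[column].dropna()) - set(last_week[column].dropna())
    if unseen:                                       # e.g. 10 categories instead of 7
        alarms.append(f"unseen categories in {column}: {sorted(unseen)}")
    return alarms
```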
43:02
So that is the early alarming on what goes into your model, and then you have the monitoring after your model, on the results of your model, where you can check simple things
43:22
like: how many predictions did I make, is the number of predictions still as high as last week, depending on your use case and what you are predicting. Then there's what I showed before on that one slide: that you really check each result and do this histogram, this response distribution analysis of your model; this
43:44
can really help. And of course, when you iterate and make a new model, you will have some offline metrics; save those as well and put the version number next to them, so that you can see that maybe it looked good offline
44:06
but then the online KPI metrics went down. For the offline metrics you can automate a lot and check accuracy or recall or whatever you want to check.
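A minimal sketch of keeping those offline metrics next to the model version (file name and metric values are purely illustrative):

```python
import datetime
import json

record = {
    "model_version": "1.1.0",                                  # illustrative version
    "trained_at": datetime.datetime.utcnow().isoformat(),
    "offline_metrics": {"precision": 0.81, "recall": 0.74},    # illustrative numbers
}

# One line per training run; later join this against the online KPI dashboard.
with open("model_metrics.jsonl", "a") as fh:
    fh.write(json.dumps(record) + "\n")
```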
44:20
At the same time you have to look at the KPIs, which might be the click-through rate. So there are many aspects, and it really depends on the use case, but I would say: inputs, outputs, then the model quality, and technical things, like whether your model is maybe getting slower and you're running into a lot of timeouts, and
44:45
all those things. Hey, so I wanted to ask you about the DevOps culture you mentioned:
45:00
whether you have experienced that before, and what problems you found integrating the different skill sets so that they work as one system. So, in one project it was the case that at first we were only data scientists, and we had all those problems; then
45:24
a decision was made that we would have heterogeneous teams, and then we were doing more of a DevOps culture. Of course, at first it's a little bit of 'okay, why do we now work together?', and people react differently to this, and then there's also this
45:42
struggle sometimes when, let's say, the software engineer comes and asks you about your model, and some people get defensive, like: hey, I'm the data scientist, why are you asking me why I'm doing this in my model, I'm the expert.
46:00
For some people this can be quite hard at first, but you have to overcome it: you need to communicate and you need to think, okay, this person has another background, but they have every right to know what is going on in the model. So it at least starts with a little struggle,
46:21
I would say, but then it calms down, and in my experience it's definitely better in the end than it was before. It also depends on what kind of people are in your team; if you have some completely introverted data scientists, it could be hard for them.
46:43
Okay, last question. Hi, so I think the choice of language is definitely a big issue in my company. Basically we have a very heavy Java legacy process that serves two things: one part is a big
47:01
pipeline, where one stage does one thing and then feeds the data to the next, and both are in Java. But now we want to plug in Python computations, so the way we are trying it is to keep the backbone in Java and, for the individual nodes, wrap a Python script, basically Java wrapping around Python and then firing up the Python process.
47:25
The data and the caching are certainly the problem, so we would like to explore something like Apache Arrow in the near future. Do you have any experience with Java firing up Python processes and sharing a cache, that sort of thing?
47:43
I actually also tried that. I once had the idea: well, I'll just build some JAR where I put in all my Python code and then run it, and I had extreme problems getting this to run with any kind of library like NumPy and so on. There's Py4J, you can do things like this, and for
48:04
really simple Python applications it works, but only really simple ones. It's a hack, and if you then have anything like NumPy, which is C also wrapped in this, you have those conversion costs. But then again,
48:21
I'm no expert in those Java-to-Python things at a really low software level. I know Arrow, and it's in Spark 2.3 and things get a lot faster, but I've never programmed with Arrow directly.
48:41
I would actually be careful with doing things like this: wrapping your Python things in Java sounds like a huge hack to me. I would rather go for establishing interfaces. If you have a pipeline, then depending on your runtime requirements, if you can
49:03
use a database as a kind of interface, where the data is stored, and you grab it from Python, do your calculation and save the result back, then it could work, depending on how fast it needs to be in the end. But I would rather define some clear interfaces and not do any kind of black magic with Python inside Java.
49:25
Thanks a lot