
# Testable ML Data Science

#### Automated media analysis

## The TIB|AV-Portal uses these automatic video analyses:

**Scene recognition** (**Shot Boundary Detection**) segments the video based on visual features. A visual table of contents generated from this gives a quick overview of the video's content and provides targeted access.

**Text recognition** (**Intelligent Character Recognition**) captures, indexes, and makes written language (for example, text on slides) searchable.

**Speech recognition** (**Speech to Text**) transcribes the spoken language in the video into a searchable transcript.

**Image recognition** (**Visual Concept Detection**) indexes the footage with subject-specific and cross-disciplinary visual concepts (for example, landscape, facade detail, technical drawing, computer animation, or lecture).

**Keyword assignment** (**Named Entity Recognition**) describes the individual video segments with semantically linked subject terms. Synonyms and narrower terms of a search query can thereby be matched automatically, which broadens the result set.


Speech transcript

00:05

Thank you for the kind introduction. I'm presenting today how to use scikit-learn's interfaces for writing maintainable and testable code for machine learning. This talk will not really focus on model development or on the

00:30

best algorithm; it will just show you a way to structure your code so that you can test it and use it in a reliable way. As an introduction, for those of you who might not know it: scikit-learn is probably the most well-known machine learning package for Python, and it's really a great package that has all batteries included, with a really nice interface.

01:02

But the general problem that I'm talking about in this talk is that of supervised machine learning. Just imagine a problem: on the left side we have a table with data. There is the season,

01:25

that is spring, summer, fall and winter, and we have a binary variable indicating whether a day is a holiday or not. Each row is a data point, and each column is what we call a feature. On the right-hand side we have some variable that we call the target. It is closely associated with the features, and the target is the variable that we would like to predict from our features. So the features are known data, and the target is the data that we want to estimate from a given table on the left. In order to do this, we need one dataset where we have matching features and target data; we can use this to train a model, and then we have a model that predicts.
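The table described above can be sketched as a small dataset; the column names, encodings and values here are illustrative, not taken from the talk's actual slides.

```python
import numpy as np

# Feature table: each row is a data point (a day), each column a feature.
# Columns: season (0=spring, 1=summer, 2=fall, 3=winter) and is_holiday (0/1).
X = np.array([
    [0, 0],
    [1, 1],
    [2, 0],
    [3, 1],
])

# Target: the variable we want to predict from the features,
# e.g. some measured quantity for each day.
y = np.array([12.0, 30.5, 8.2, -3.1])

# One row of features corresponds to one target value.
assert X.shape[0] == y.shape[0]
```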

02:21

So the interface is as follows. We have a class that represents our learning algorithm. It has a method fit that gets a feature matrix named X and a target array called y, and that trains the model, so the model learns about the correlations between the features and the target. Then we have a method predict that can be called on the trained estimator, and that gives us an estimate of y for the given model and the given features X. This is the basic problem of machine learning. There are algorithms to solve it, but I'm not going to talk about those algorithms; I would rather focus in this talk on how to prepare the feature data X, and how to do that in a way that is testable, reliable and readable to software developers and data scientists.

03:29

I'm sure you want to see what this looks like in a short code snippet, and it's actually quite succinct. In this example we generate some datasets, X_train, X_test, y_train and y_test, then we create a support vector regressor; that is our learning algorithm that I take off the shelf from scikit-learn. We fit it on the training data, and we predict on the test dataset. In the end we can obtain a metric and check how good our prediction is, based on our input features X_test. So this is a trained model in scikit-learn, and it's very simple, very easy.
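The snippet the talk refers to looks roughly like the following; the dataset here is synthetic and the hyperparameters are defaults, so treat it as a sketch of the fit/predict interface rather than the exact slide code (module paths are those of current scikit-learn).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Generate a small synthetic regression dataset.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Off-the-shelf learning algorithm: a support vector regressor.
svr = SVR()
svr.fit(X_train, y_train)       # learn from matching features and targets
y_pred = svr.predict(X_test)    # estimate y for unseen features

# Check how good the prediction is with a metric.
print(mean_squared_error(y_test, y_pred))
```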

04:32

The big question now is how we can best prepare the input data for the estimator. That table that I showed you might come from an SQL database or from other inputs; it sometimes has to be prepared for the model so that we get a good prediction.

04:47

You can think of this preparation a bit like an assembly line in a factory: there are certain steps that are executed to prepare the data, and you have to cut the pieces into the right shape so that the algorithm can work with them. One typical

05:17

preparation that we need for a lot of machine learning algorithms is normalization, which adjusts the scaling. Imagine that your data has very high numbers and very low numbers, but the algorithm would really like values that are nicely distributed around 0, with a standard deviation of 1 if you want. Such a scaling can be phrased in Python code: X is an array, we take the mean over all columns and subtract it from our array X, so we subtract the mean of each column from each column; then we calculate the standard deviation and divide by it. Now each column should be distributed around a mean of 0 with a standard deviation of approximately 1.

Here is a small sample of this. You can see above an input array X with two columns; let's first focus on the rightmost column, a feature variable with values from 32 to 1831. Of course, in reality we would have huge arrays, but for the example a very small one is sufficient. We apply the scaling, and in the end that column has values that are centered around 0 and very close to it.

Now let me add another problem that we have in data processing: missing values. Just imagine, as in the weather example from the first slide, that the sensor that measured the temperature was broken on Monday. You don't have the temperature for that day, but you would still like an estimate for that day. In such cases we have strategies for filling in missing values. One strategy is simply to replace the NaN (not-a-number) value with the mean of that feature variable, so you could take the mean of historic temperature data to fill the missing temperature slot.

If we apply our algorithm, subtracting the mean and dividing by the standard deviation, to data with missing values, we just get a data error from our code, because the NaN values break the computation of the mean. So I have prepared the code a bit more. The reason our code could fail is that taking the mean of a column that contains NaN values numerically just yields NaN. So here I replaced the numpy mean function by numpy's nanmean, which yields a proper mean value even with NaN values in our array X. Then we subtract the mean and divide by the standard deviation as we did before, and at the end we execute the function nan_to_num, which replaces all NaN values by 0; and in our rescaled data, 0 is the mean, so we have effectively replaced the NaN values with the mean.

This new code transforms the data, and it actually seems to work pretty well. With the same data example, in the result both columns are distributed around 0 with a small standard deviation. So this is an example of some data preprocessing that you would apply to your data before you feed it into the estimator. This small example actually has a few properties that are very interesting.
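Rewritten from the narration, the NaN-aware scaling code looks roughly like this; the example values are mine, not the slide's.

```python
import numpy as np

X = np.array([
    [1.0,     32.0],
    [2.0,    950.0],
    [np.nan, 1831.0],
])

# nanmean/nanstd ignore NaN entries, where plain mean/std would return NaN.
mean = np.nanmean(X, axis=0)
std = np.nanstd(X, axis=0)

X_scaled = (X - mean) / std

# After scaling, 0 is the mean of each column, so replacing the remaining
# NaN entries with 0 amounts to mean imputation.
X_scaled = np.nan_to_num(X_scaled)
print(X_scaled)
```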

10:39

If we go back to the code: we actually transform an array X, taking the standard deviations and the means of all columns, before we call the estimator's fit. But what about the next call, when we call the estimator's predict? There is also an array that we feed into predict, and we have to process that data in the same way as we transformed the data that went into fit, because our estimator has learned about the shapes and correlations of the data that we gave it in fit. So the data has to have the same distribution, the same shape, as the data that it saw during fit.
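The point about reusing the training-time statistics can be made concrete; this numpy sketch uses made-up numbers.

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_new = np.array([[2.5], [10.0]])   # data arriving at predict time

# Statistics are learned once, from the training data only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ... and reused for any later data, so both arrays live on the same scale.
X_train_scaled = (X_train - mean) / std
X_new_scaled = (X_new - mean) / std

# Recomputing mean/std on X_new instead would put it on a different scale
# than the one the estimator learned during fit.
print(X_new_scaled)
```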

11:30

So how can we do

11:33

that? How can we make sure that the data is transformed in the same way? Scikit-learn has a concept for this, and that's the transformer concept. A transformer

11:46

is an object that has

11:50

this notion of fit and transform. We can train it with the method fit, and we can transform data with the method transform; there is also a shortcut defined, fit_transform, which does both at the same time. What's important is that transform returns a modified version of our feature matrix X, given a matrix X, as you can also see on the slide. So now we can actually rephrase our

12:28

code, the scaling and the NaN replacement, as such a transformer. I called this class, somewhat reluctantly, the NaN-guessing scaler, because it does a small replacement of the NaN values and then scales the data. I implemented a method fit that computes the means and the standard deviations of the columns and saves them as attributes of the object itself. Then it has a method transform that does the actual transformation: it subtracts the mean, divides by the standard deviation, and replaces NaN values with zeros, where 0 is the mean of our transformed data. Using the fit/transform pattern, we can fit our NaN-guessing scaler with our training data and then transform the data that we actually want to use for predict in the very same way. Another opportunity here is that, since we now have a nicely defined interface, we can actually start testing.
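Reconstructed from the description (the class name comes from the talk, but the exact slide code is not in the transcript; real scikit-learn transformers would usually also inherit from BaseEstimator and TransformerMixin), the transformer could look like this:

```python
import numpy as np


class NaNGuessingScaler:
    """Scales columns to zero mean / unit variance and maps NaN to the mean.

    Note: as discussed later in the talk, computing the std before the NaN
    replacement is actually a bug that a unit test uncovers.
    """

    def fit(self, X, y=None):
        # Learn the column statistics, ignoring NaN entries.
        self.mean_ = np.nanmean(X, axis=0)
        self.std_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        # Apply the statistics learned during fit.
        X_scaled = (X - self.mean_) / self.std_
        # 0 is the mean of the scaled data, so this is mean imputation.
        return np.nan_to_num(X_scaled)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


scaler = NaNGuessingScaler()
X_train = np.array([[1.0], [2.0], [3.0], [np.nan]])
X_t = scaler.fit_transform(X_train)
```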

13:51

I wrote a little test for our class. You remember our example array: I create a NaN-guessing scaler, I invoke fit_transform to obtain the transformed matrix, and then I start testing assumptions that I have about the outcome of this transformation. And then, surprise: this test actually finds that our implementation was wrong. Because if I calculate the standard deviation for each column,

14:33

I expect the standard deviation of each column to be 1, but I realize that the standard deviation is not 1. That has a very simple reason. If you look back at the code, I calculate the standard deviation of the input sample before I replace the NaN values with the mean. In this example, the standard deviation of the input sample is wider than the actual distribution of the data after replacing the NaN values with the mean, because the mean is in the center: we map NaN values to the center of the data, and that makes the distribution narrower. If we want to fix this code, we have to think about this transform method, and the solution is actually to split it into two transformation steps.
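The failing assumption can be demonstrated directly: after the naive scale-then-impute, the column standard deviation comes out below 1 rather than equal to 1. The array here is illustrative, not the slide's.

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [np.nan]])

# Statistics computed on the raw data, *before* NaN replacement.
mean = np.nanmean(X, axis=0)
std = np.nanstd(X, axis=0)

X_t = np.nan_to_num((X - mean) / std)

# Mapping NaN to the mean (0 after scaling) pulls values toward the
# center, so the resulting distribution is narrower than intended.
column_std = X_t.std(axis=0)
print(column_std)          # noticeably smaller than 1
assert (column_std < 1.0).all()
```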

15:50

First, we want one transformation step that replaces NaN values with the mean, and then we want a second transformation step that does the actual scaling of the data. So we want two

16:08

transformations, and scikit-learn has a nice way

16:13

to do this: it offers ways to compose several transformers, several transformations. In this case we use a

16:22

building block called a pipeline. A pipeline is a sequential thing, a chain of transformers. During fit, when we are training and learning from the feature matrix X, we first use transformer one and invoke its fit_transform to obtain transformed data; then we take our second transformer and apply its fit_transform to the result of the first transformation; and finally we obtain a dataset that was transformed by several steps. A pipeline can have an arbitrary number of transformers. During predict, when we have already learned the properties of the data, like the mean and the standard deviation in our example, we can just invoke transform and get a transformed X. In scikit-learn we can build pipelines pretty easily: there is a make_pipeline function that takes transformer objects and returns a pipeline object. The pipeline object itself is a transformer, which means it has fit and transform methods, and we can just use it instead of our NaN-guessing scaler. So we could go back and

18:00

rewrite this into two classes, one doing the scaling and one doing the NaN replacement.

18:10

But the question is: maybe someone has solved this already? Indeed, Python has batteries included, and scikit-learn has batteries included too. So we can actually

18:25

also use two transformers from the scikit-learn library. One of these transformers is called Imputer; it imputes missing values, so here a NaN would be replaced by the mean. And then we have a StandardScaler that takes the data, in this example represented by the red distribution, and rescales it to a dataset distributed around 0. These two transformers can then be joined by a

19:04

pipeline. As you can see, we just put together the building blocks that we already have: we use make_pipeline here and pass an Imputer instance and a StandardScaler instance. Then, if we fit_transform our example array, we can actually make sure that our assumption holds true, that we get a standard deviation of 1; here we also check the means, and there are some other tests. We have wrapped the data preprocessing in those scikit-learn transformers, and we have done this in a way where we can individually test each building block. And if these were not present in scikit-learn, we could just write them ourselves, and the tests would be very easy.
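Put together with scikit-learn's own building blocks, this looks like the following; note that the Imputer of 2015-era scikit-learn is called SimpleImputer in current releases, which is what this sketch uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [np.nan]])

# Step 1 replaces NaN with the column mean, step 2 scales to
# zero mean / unit variance, in the correct order.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
)

X_t = pipeline.fit_transform(X)

# Now the assumption from the unit test actually holds.
assert np.allclose(X_t.mean(axis=0), 0.0)
assert np.allclose(X_t.std(axis=0), 1.0)
```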

20:06

Yeah, I think this is the biggest gain that we can get from this. So if you leave this talk and want to take something away from it: if you want to write maintainable data science code, if you want to avoid spaghetti code in numeric code, try to find

20:25

ways to separate different concerns, different purposes in your code, into independent, composable units that you can test individually, combine, and then test as a combined model. That's a really good way to structure your numerical code. At the beginning I showed you an

20:54

example of a machine learning problem where we just used a machine learning algorithm, a scikit-learn estimator that we fitted and predicted with. Now I have extended this example with a pipeline that does the preprocessing: with make_pipeline we use the Imputer, we use a StandardScaler, and we can also add the estimator to this pipeline. Now our object contains our whole algorithmic pipeline: it contains the preprocessing of the data, it contains the machine learning code, and it also contains all the fitted parameters and coefficients that are present in our model. So we could easily serialize this estimator object, using pickle or another serialization library, store it to disk or send it across the network, and then load it again, restore it and make predictions from it.

So, to summarize what scikit-learn and these interfaces can do for you and how you should use them: we found it really beneficial to use the interfaces that scikit-learn provides if you want to write data science code. You can use the fit/transform interface for transformers, and you can write your own transformers if you don't find the ones you need in a library. If you write your own transformers, try to separate concerns, to separate responsibilities: scaling your data has nothing to do with correcting NaN values, so don't put them into the same transformer; write two, and compose your transformation out of the two for your modeling. If you keep your transformer classes small, they will be a lot easier to test, and if a test fails you will find the issue a lot faster. And use features like serialization, because then you can actually quality-control your estimators: you can store them and look at them again in the future.

In the short time available I was not able to tell you everything about the composition and testing things you can do with scikit-learn, but I want to give you an outlook on what else you could look at if you want to get into this topic. There are tons of other transformers and meta-transformers that compose in scikit-learn, for example FeatureUnion, where you can combine different options for feature generation. Estimators are also composable in scikit-learn: there are cross-validation building blocks, like grid search, that take estimators and extend their functionality so that the predictions are cross-validated according to statistical methods. So, I'm at the end of my talk.
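The extended example from the end of the talk, sketched with current scikit-learn names (SimpleImputer instead of the old Imputer) and synthetic data:

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel()
X[::10] = np.nan          # punch some holes into the features

# One object holding preprocessing, model and all fitted parameters.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      StandardScaler(),
                      SVR())
model.fit(X, y)

# The whole fitted pipeline can be serialized, stored or shipped ...
blob = pickle.dumps(model)

# ... and restored later to make predictions again.
restored = pickle.loads(blob)
print(restored.predict(np.array([[0.5]])))
```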

24:38

Thank you for your attention; I'm happy to take questions if you like. And if you also want

24:46

to discuss my talk, you can come up to me afterwards.

Question: How are the tests you described meant to be run? With a library, like unittest?

Answer: Basically we use unit testing frameworks; there is unittest for Python, but I would prefer pytest as the test runner, and we structure the tests like unit tests in other situations. Even the most basic form of testing numeric code is not fundamentally different from testing other code: it's code, and we test it. You have to think of inputs and outputs and structure your code accordingly, but in most cases you don't have to do too much work to get there. We have some tools to generate data, and more tests that go in the direction of integration tests, but in general we just use the Python tools that non-data-scientists also use. Other questions?

Question: Do you also apply the transformations to the data you predict on, once you have trained on the training data?

Answer: If I understood the question correctly, you are asking whether we also apply the transformations to the data that I pass to predict. Yes, we do; that was the purpose of splitting the transformer into those two methods. Let me put the slide up again: the whole purpose of splitting fit and transform is that we can repeat the transformation with transform without changing the estimated parameters, the mean and the standard deviation. If we executed fit again, we would not get the kind of data into our algorithm that the algorithm expects.

Question: How do you validate model performance over time? In some of our applications we have data going back four years and models built on it, for instance Bayesian models whose underlying probabilities of the data are changing, and we want to validate against previous versions of the datasets whether the models are overfitting or underfitting, and make sure that assumptions about missing values, or about new data we didn't have before, still hold across dataset versions.

Answer: You are asking how we actually test the stability of our machine learning models. This is done with cross-validation: for a sample dataset we have reference scores, and if the reference scores get worse, the tests fail, basically. If that happens, one has to look into why things are getting worse. There is not really a better way than using cross-validation methods. It's more of a monitoring thing; this talk was about testing the code, whereas the question was rather about testing the quality of the model. I think these are two different concerns, and they are complementary.

Question: When you do this, do you work in an IPython notebook or in separate scripts?

Answer: Personally I am not using IPython notebooks that much. I just write tests in test files, execute pytest on them, and use continuous integration and all the tooling around unit testing. The notebook environment is really great for exploring things, but it's not an environment for test-driven development; there is no test runner in it. And I personally think that if effort was put into thinking about a test anyway, then once I put it into a unit test and check it into my repository, it runs continuously over and over again. So I really prefer this over extensive use of IPython notebooks; I use them if I want to quickly explore something, and that's it.

Question: Not a question, more a comment: your talk was about the testing, and this is really great, but with these small modular units you also gain reusability, because you can change the model or apply it to different problems, reusing parts of your pipeline.

Answer: Yes, exactly. Any other questions? OK, thank you very much.


### Metadata

#### Formal metadata

| | |
|---|---|
| Title | Testable ML Data Science |
| Subtitle | How to make numeric code testable using Scikit-Learn's interfaces. |
| Alternative title | Using Scikit-Learn's interface for turning Spaghetti Data Science into Maintainable Software |
| Series title | EuroPython 2015 |
| Part | 37 |
| Number of parts | 173 |
| Author | Peters, Holger |
| License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use, adapt, copy, distribute and publicly make available the work or content for any legal, non-commercial purpose, in unchanged or adapted form, provided that you credit the author/rights holder in the manner specified by them and pass on the work or content, including in adapted form, only under the terms of this license. |
| DOI | 10.5446/20215 |
| Publisher | EuroPython |
| Publication year | 2015 |
| Language | English |
| Production place | Bilbao, Euskadi, Spain |

#### Content metadata

| | |
|---|---|
| Subject area | Computer science |
| Abstract | Holger Peters - Using Scikit-Learn's interface for turning Spaghetti Data Science into Maintainable Software. Finding a good structure for number-crunching code can be a problem; this especially applies to routines preceding the core algorithms: transformations such as data processing and cleanup, as well as feature construction. With such code, the programmer faces the problem that their code easily turns into a sequence of highly interdependent operations, which are hard to separate. It can be challenging to test, maintain and reuse such "Data Science Spaghetti code". Scikit-Learn offers a simple yet powerful interface for data science algorithms: the estimator and composite classes (called meta-estimators). By example, I show how clever usage of meta-estimators can encapsulate elaborate machine learning models into a maintainable tree of objects that is both handy to use and simple to test. Looking at examples, I will show how this approach simplifies model development, testing and validation, and how it brings together best practices from software engineering as well as data science. Knowledge of Scikit-Learn is handy but not necessary to follow this talk. |
| Keywords | EuroPython Conference, EP 2015, EuroPython 2015 |