
Testable ML Data Science


Formal Metadata

Title
Testable ML Data Science
Subtitle
How to make numeric code testable using Scikit-Learn's interfaces.
Alternative Title
Using Scikit-Learn's interface for turning Spaghetti Data Science into Maintainable Software
Series Title
Part
37
Number of Parts
173
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute and make the work or its content publicly available for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and distribute the work or its content, including in adapted form, only under the terms of this license.
Identifiers
Publisher
Publication Year
Language
Production Place
Bilbao, Euskadi, Spain

Content Metadata

Subject Area
Genre
Abstract
Holger Peters - Using Scikit-Learn's interface for turning Spaghetti Data Science into Maintainable Software. Finding a good structure for number-crunching code can be a problem; this especially applies to routines preceding the core algorithms: transformations such as data processing and cleanup, as well as feature construction. With such code, the programmer faces the problem that their code easily turns into a sequence of highly interdependent operations, which are hard to separate. It can be challenging to test, maintain and reuse such "Data Science Spaghetti code". Scikit-Learn offers a simple yet powerful interface for data science algorithms: the estimator and composite classes (called meta-estimators). By example, I show how clever usage of meta-estimators can encapsulate elaborate machine learning models into a maintainable tree of objects that is both handy to use and simple to test. Looking at examples, I will show how this approach simplifies model development, testing and validation, and how it brings together best practices from software engineering as well as data science. _Knowledge of Scikit-Learn is handy but not necessary to follow this talk._
Keywords
Transcript: English (automatically generated)
Thank you for the kind introduction. Hello, I'm Holger Peters from Blue Yonder, and I'm presenting to you today how to use scikit-learn's good interfaces for writing maintainable and testable
machine learning code. So this talk will not really focus on the best model development or the best algorithm. It'll just show you a way to structure your code so that you can test it and use it reliably in production.
For some of you who might not know scikit-learn, scikit-learn is probably the most well-known machine learning package for Python, and it's really a great package. It comes with batteries included, and this is its interface.
All right, the problem I'm talking about in this talk is that of supervised machine learning, so just imagine a problem. On the left side here, we have a table with data.
There's a season column, that's spring, summer, fall and winter. We have a binary variable encoding whether a day is a holiday or not. Each row is a data point and each column is what we call a feature. On the right-hand side, we have some variable
that we'll call a target. It is closely associated with the features, and the target is a variable that we would like to predict from our features. So features are known data, targets are the data that we want to estimate from a given table on the left.
In order to do this, we actually have one data set with matching features and target data, and we can use this to train a model and then have a model that predicts.
So the interface is as follows. We have a class that represents a machine learning algorithm. It has a method fit that gets features named x and a target array called y. And that trains the model so the model learns
about the correlations between features and targets. And then we have a method predict that can be called upon the trained estimator and that gives us an estimate y for the given model and the given features x.
And this is the basic problem of machine learning. There are algorithms to solve this and I'm not gonna talk about these algorithms. I'd rather like to focus in this talk on how to prepare the data, the feature data x, and how to do it in a way that is both testable,
reliable and readable to software developers and data scientists. I'm sure you want to see what this looks like in a short code snippet, and this is actually quite succinct. So in this example here, we generate some data sets,
x_train, x_test and y_train, y_test. Then we create a support vector regressor, that is some algorithm that I take off the shelf
from scikit-learn. We fit the training data and we predict on the test data set, and in the end we can obtain a metric, we can test how good our prediction is based on our input features x_test.
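As an illustration, here is a minimal sketch of a snippet like the one described; the synthetic data, the train_test_split helper and the r2_score metric are assumptions for this sketch, not the exact code from the slides.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from sklearn.metrics import r2_score

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(200, 2))        # feature matrix
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)   # target with a little noise

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    svr = SVR()                      # an off-the-shelf algorithm from scikit-learn
    svr.fit(X_train, y_train)        # train on the training set
    y_pred = svr.predict(X_test)     # predict on the held-out test set
    print(r2_score(y_test, y_pred))  # a metric: how good is the prediction?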
And so this is a trained model in scikit-learn. It's very simple, very easy, and the big question now is how can we best prepare the input data for our estimator, because that table that I showed you might come from an SQL database or from other inputs.
It sometimes has to be prepared for the model so we get a good prediction. And you can think of this preparation a bit like the preparation that is done in a factory. So there are certain steps that are executed
to prepare this data and you have to cut pieces into the right shape so that the algorithm can work with them. One typical preparation that we have for a lot of machine learning algorithms
is that of scaling to a normal distribution. So imagine that your data has very high numbers and very low numbers.
But your algorithm really would like to have values that are nicely distributed around zero with a standard deviation of one. And such a scaling can be easily phrased in Python code. So X is an array and we just take the mean
over all columns and subtract it from our array X, so we subtract the mean of each column from each column, and then, based on this, we calculate the standard deviation and divide by the standard deviation, so now each column should be distributed
around a mean of zero now with a standard deviation of approximately one. And I've prepared a small sample for this so you can see above an input array X and it has two columns.
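A minimal sketch of the scaling just described; the right-hand column (32, 80, 31) is taken from the talk's example, while the left-hand column values are made up for illustration.

    import numpy as np

    X = np.array([[1.0, 32.0],
                  [2.0, 80.0],
                  [3.0, 31.0]])

    X = X - X.mean(axis=0)   # subtract each column's mean from that column
    X = X / X.std(axis=0)    # divide each column by its standard deviation
    print(X)                 # each column is now centered around 0 with std ~1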
Let's first focus on the rightmost column. That would be a feature variable with values 32, 80 and 31. Of course, in reality, we would have huge arrays but for the example, a very small one is sufficient. And then we apply our scaling and in the end, that column now has values
that are centered around zero and are very close to zero. And now I introduce another problem that we have in data processing. We have a missing value. Just imagine, in the first slide I showed you an example where we have weather data.
Just imagine that the thermometer that measured the temperature was broken on a day, so you don't have a value here. But you would still like an estimation for that day, and in such cases, we have ways
and strategies to fill in this data with values. So one strategy is just to replace this not-a-number value with the mean of this feature variable.
So you could take the mean of temperatures of historic data to replace such a missing temperature slot. And if you apply our algorithm, subtracting the mean and dividing by the standard deviation, what you'll get in this example
is a data error from our code, because not-a-number values will just break the mean. So I've prepared code that does a bit more than our code before. So before, we just subtracted the mean
and divided by the standard deviation. And now we would like to replace not-a-number values by the mean. And the reason our code failed before was that taking the mean of a column that contains not-a-number values numerically
just yields not-a-number. So here I replaced our NumPy mean function by the function numpy.nanmean, which will yield a proper value for the mean even with NaN values in our array x. Then we can subtract the mean again as we did before
and divide by the standard deviation. And in the end, we'll execute the function numpy.nan_to_num, which will replace all not-a-number values by zero. And in our rescaled data, zero is the mean of the data. So we have replaced not-a-number values by the mean.
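A sketch of the NaN-aware variant described here; it assumes np.nanstd for the standard deviation so that the missing value does not propagate (the slide itself is not reproduced in this transcript).

    import numpy as np

    # the same example as before, but with a missing value in the left column
    X = np.array([[1.0,    32.0],
                  [np.nan, 80.0],
                  [3.0,    31.0]])

    X = X - np.nanmean(X, axis=0)  # nanmean ignores NaNs when computing the mean
    X = X / np.nanstd(X, axis=0)   # note: the std is estimated before NaNs are filled
    X = np.nan_to_num(X)           # replace NaNs by 0, the mean of the rescaled data
    print(X)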
And so how does this new code transform our data? And it actually seems to work pretty well. The same data example with the new code has a resulting array where both columns
are distributed around zero with a small standard deviation. And so this is an example of some data processing that you would apply maybe to your data before you feed it into the estimator.
And yeah, this is a very simple example, and this small example actually has a few properties that are very interesting. So, if we go back to our example, we actually transform our array x
and take the standard deviations of all columns and the mean of all columns before we call estimator.fit. But what about the next call, when we call estimator.predict? There's also an array x that is fed into it,
and we have to process the data that goes into predict in the same way as we transformed the data that went into fit. Why is this? Because our estimator has learned about the shapes and correlations of the data that we gave it in fit.
So the data has to look, has to have the same distributions, the same shape as the data that it saw during fit. And how can we do this? How can we make sure that the data has been transformed in the same way? And scikit-learn has a concept for this
and that's the transformer concept. A transformer is an object that has this notion of a fit and a transform step. So we can fit this transformer, we can train it with the method fit,
and we can transform data with the method transform, and there's a shortcut defined in scikit-learn, fit_transform, that does both at the same time. What's important about this is that transform returns a modified version of our feature matrix x given a matrix x, and during the fit
it can also see a y. And so now we can actually rephrase our code that did the scaling and not-a-number replacement in terms of such a transformer. So I wrote a little class, it's called the not-a-number-guessing scaler,
because it guesses replacement values for not-a-numbers and it scales the data. I implemented a method fit that has the mean calculation, as you can see, and it saves the means and the standard deviations of the columns as attributes of the object itself.
And then it has a method transform, and transform does the actual transformation. It subtracts the mean, it divides by the standard deviation, and it replaces not-a-number values by zeros, because zero is the mean of our transformed data.
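A reconstruction of such a transformer from the description above; the class name NaNGuessingScaler and the use of BaseEstimator/TransformerMixin (which provides fit_transform) are assumptions of this sketch, and the subtle flaw discussed next, computing the standard deviation before the NaNs are replaced, is kept on purpose.

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class NaNGuessingScaler(BaseEstimator, TransformerMixin):
        """Replace NaNs by the column mean, then scale to zero mean and unit std."""

        def fit(self, X, y=None):
            self.means_ = np.nanmean(X, axis=0)  # per-column means, ignoring NaNs
            self.stds_ = np.nanstd(X, axis=0)    # per-column stds, ignoring NaNs
            return self

        def transform(self, X):
            X = (X - self.means_) / self.stds_   # scale with the learned parameters
            return np.nan_to_num(X)              # NaN -> 0, the mean of the scaled data

    # fit on training data, then transform the data used for predict the same way
    X_train = np.array([[1.0, 32.0], [np.nan, 80.0], [3.0, 31.0]])
    X_new = np.array([[2.0, 40.0]])
    scaler = NaNGuessingScaler()
    X_train_t = scaler.fit_transform(X_train)
    X_new_t = scaler.transform(X_new)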
And using this pattern we can fit our not-a-number-guessing transformer with our training data and then transform the data that we actually would like to use for predict. We can transform it in the very same way. And another opportunity here is,
since we have a nicely defined interface, we can actually start testing it. And I wrote a little test for our class. I think you remember our example array; I create a not-a-number-guessing scaler,
I invoke fit_transform to obtain a transformed data matrix, and then I start testing assumptions that I have about the outcome of this transformation. And now,
this test actually finds an issue. Our implementation was wrong, because if I calculate the standard deviation for each column and I expect the standard deviation for each column to be one, I realize that the standard deviation is not one
and that has a very simple reason. If we look back at the code, I calculate the standard deviation of the input sample before I replace not a number values with the mean.
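A sketch of such a test (pytest style, with assumed tolerances); run against the NaNGuessingScaler sketch above, the second assertion fails for the column containing the missing value, exactly as described.

    import numpy as np

    def test_nan_guessing_scaler():
        X = np.array([[1.0,    32.0],
                      [np.nan, 80.0],
                      [3.0,    31.0]])
        scaler = NaNGuessingScaler()
        X_t = scaler.fit_transform(X)

        # the column means of the transformed data should be (close to) zero
        np.testing.assert_allclose(X_t.mean(axis=0), 0.0, atol=1e-12)
        # this fails: the std of the first column is smaller than one, because
        # the std was estimated before the NaN was mapped onto the mean
        np.testing.assert_allclose(X_t.std(axis=0), 1.0)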
So in this example, the standard deviation of the input sample is wider than the actual spread of the data after replacing not-a-number values with the mean,
because the mean is in the center and we map not-a-number values also to the center of the data, and that makes the distribution kind of narrower. So in a way, if we want to fix this code, we have to think about this transform method
and the solution is actually to make two transformation steps. First, we want one transformation step that replaces not-a-number values with the mean, and then we want a second transformation step that does the actual scaling of the data.
So we want two transformations, and Scikit-Learn has a nice way to do this. It offers ways to compose several transformers, several transformations. In this case, we use a building block,
and I apologize for the low contrast, we use building blocks that are called pipelines. A pipeline is sequential, it's like a chain of transformers. And so during fit, when we are training
and learning from a feature matrix X, we use a first transformer, transformer one, and invoke fit_transform to obtain a transformed version of the data, and then we take our second transformer,
also apply fit_transform with the result of the first transformation, and finally we will obtain a transformed data set that was transformed by several steps. It can have an arbitrary number of transformers. In predict, when we have already learned the properties of the data, like in our example
the mean and the standard deviation, we can just invoke transform and get a transformed X in the end from the pipeline. In Scikit-Learn, we can build them pretty easily. There's a make_pipeline function; we pass it transformer objects and it returns a pipeline object,
and a pipeline object itself is a transformer, that means that it has the fit and the transform method, and we can just use it instead of our not-a-number-guessing scaler that I just presented. So we could go back and rewrite this class
into two classes, one doing the scaling and one doing the not-a-number replacement. Or, the question is, maybe someone has actually solved this for us already, and indeed Python has batteries included and Scikit-Learn has batteries included,
so we can actually also use two transformers from Scikit-Learn's library. One of these transformers is called the imputer, because it imputes missing values, and so here not-a-number would be replaced by the mean,
and then we have the standard scaler that scales the data, in this example represented by the red distribution, to a data set that is distributed around zero, and these two transformers can be joined by a pipeline.
So here you can see this. We just put together the building blocks that we already have. We saw make_pipeline, we use make_pipeline here and pass it an imputer instance and a standard scaler instance,
and then if we fit_transform our example array, we can actually make sure that our assumption holds true, that we would like to have a standard deviation of one. We could here also check for the means and perform other tests.
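A sketch of this composition; the talk used scikit-learn's Imputer, which in current scikit-learn releases is called SimpleImputer, so that name is assumed here.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer        # called Imputer in 2015-era scikit-learn
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0,    32.0],
                  [np.nan, 80.0],
                  [3.0,    31.0]])

    pipe = make_pipeline(SimpleImputer(strategy="mean"),  # NaN -> column mean first
                         StandardScaler())                # then scale to mean 0, std 1
    X_t = pipe.fit_transform(X)

    np.testing.assert_allclose(X_t.mean(axis=0), 0.0, atol=1e-12)
    np.testing.assert_allclose(X_t.std(axis=0), 1.0)      # now the assumption holds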
We have wrapped the data processing with those Scikit-Learn transformers and we've done this in a way where we can individually test each building block so assume that these were not present in Scikit-Learn,
we could just write them ourselves and the tests would be fairly easy and yeah, I think that this is the biggest gain that we can have from this so if you're leaving this talk and you want to take something away from it, if you want to write maintainable software,
if you want to avoid spaghetti code in your numeric code, try to find ways to separate different concerns, different purposes in your code into independent composable units that you can then combine. You can test them individually, you can combine them, and then you can make a test for the combined model,
and that's really a good way to structure your numeric algorithms. So in the beginning, I showed you an example of a machine learning problem where we just used a machine learning algorithm
with a Scikit-Learn estimator that we fitted and predicted with. Now I've extended this example with a pipeline that does the pre-processing: make_pipeline, we use the imputer, we use a standard scaler, and we can also add our estimator to this pipeline, and now our object s does contain
our whole algorithmic pipeline. It does contain the pre-processing of the data and it does contain the machine learning code. And it also contains all the fitted and estimated parameters and coefficients that are present in our model.
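A sketch of this extended example, including the serialization mentioned next; the synthetic data, the SVR estimator and SimpleImputer (the modern name for Imputer) are assumptions of this sketch.

    import pickle
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    rng = np.random.RandomState(0)
    X_train = rng.uniform(-3, 3, size=(100, 2))
    X_train[::10, 0] = np.nan                      # sprinkle in some missing values
    y_train = np.sin(np.nan_to_num(X_train[:, 0]))

    # pre-processing and the estimator live in one composite object
    s = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), SVR())
    s.fit(X_train, y_train)

    # the fitted pipeline, including imputation means, scaling parameters and
    # SVR coefficients, can be pickled, stored or shipped, and restored later
    restored = pickle.loads(pickle.dumps(s))
    print(restored.predict(X_train[:5]))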
So we could easily serialize this estimator object using pickle or another serialization library and store it to disk or send it across the world into a different network and then we could load it again, restore it
and make predictions from it. And so to summarize what Scikit-Learn and these interfaces can do for you and how you should use them, we found that it's really beneficial to use these interfaces that Scikit-Learn provides for you.
If you want to write pre-processing code and you can use the fit and transform interface of the transformers, use them. Write your own transformers if you don't find the ones that you need in a library.
If you write your own transformers, try to separate concerns, separate responsibilities. Estimating or scaling your data has nothing to do with correcting not-a-number values. So don't put them into the same transformer, just write two and compose a new transformer
out of the two for your model in the end. If you keep your transformers and your classes small, they're a lot easier to test. And if tests fail, you will find the issue a lot faster
if they are simple. And use the features like serialization because you can actually quality control your estimators. You can store them, you can look at them again in the future, it's really handy. And in this short time, I was not able
to tell you everything about the compositional and the testing things that you can do with Scikit-Learn. So I just wanted to give you an outlook on what else you could look at if you want to get into this topic. There are tons of other transformers and meta-transformers to compose in Scikit-Learn
that you can take a look at. For example, a feature union, where you can combine different transformers for feature generation. And also, estimators are composable in Scikit-Learn. So there's a cross-validation building block,
the grid search in Scikit-Learn that actually takes estimators and extends their functionality so their predictions are cross-validated according to statistical methods. So I'm at the end of my talk. I thank you for your attention. I'm happy to take questions if you like.
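A minimal sketch of those two building blocks, FeatureUnion (via make_union) and GridSearchCV; the particular transformers and the parameter grid are made up for illustration.

    import numpy as np
    from sklearn.pipeline import make_pipeline, make_union
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(100, 2))
    y = np.sin(X[:, 0])

    # FeatureUnion: combine several transformers for feature generation
    features = make_union(StandardScaler(), PolynomialFeatures(degree=2))
    pipe = make_pipeline(SimpleImputer(strategy="mean"), features, SVR())

    # GridSearchCV wraps an estimator and cross-validates candidate parameters
    search = GridSearchCV(pipe, param_grid={"svr__C": [0.1, 1.0, 10.0]}, cv=3)
    search.fit(X, y)
    print(search.best_params_)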
And if you want to chat with me or talk with me, you can come up to me anytime. Hi.
Hi. Could you please describe your testing environment? Do you use the standard library, like unittest, stuff like that, too?
Well, basically we use unit testing frameworks like unittest or pytest. I personally prefer pytest as a test runner, and we structure our tests like we would structure unit tests in other situations. So in the most basic form, testing numeric code
is not fundamentally different from testing other code. It's code, it has to be tested. You have to think of inputs and outputs, and you have to structure your code in a way that, in most cases, you don't have to do too much work to get a test running.
And so, yeah, we have some tools to generate data and to get more tests that are more going into the direction of integration tests. But in general, we just use the Python tools
that non-data scientists also use. Other questions? To the data, do you also apply the transformations
once you have done all the training? Yes, that is, so if I understood the question correctly, the question was whether we also apply the transformations
to the test data. So you are talking about the data that I passed to predict, right, in the first example? Not the one that you used for the test. So, sorry, here, you're talking about...
Yeah, exactly. Yes, we do, this is the purpose of splitting the transformer into those two methods. So I'll just pull up the slide again. The whole purpose of splitting fit and transform here
is that we can repeat this transformation in transform without having to change the values of those estimated parameters, mean and standard deviation. If we were to execute the code in fit again, then we would not get the same kind of data
into our algorithm that the algorithm expects. Any other?
How do you track your model performance over time? So in some of our applications, we have data going for years,
and we have models that are built up. And then, for instance, for that model, the assumptions, the underlying probabilities of the data, so we're using mostly Bayesian models, the underlying probabilities are changing, and we want to revalidate to see, on previous data sets or versions of data sets, how the model is either overfitting or underfitting, depending on what we have.
So are you doing anything across versions of data sets to make sure that your assumptions aren't missing stuff or adding new stuff that you didn't have before? Okay, so you're asking how we actually test the stability of our machine learning models?
Well, this is done with cross-validation methods, and we have, yeah, we have cross-validation methods for sample data sets. We have reference scores,
and if the reference scores are getting worse in the future, then tests fail, basically. And then, if that happens, one has to look into why things are getting worse. There's not really a better way
than using cross-validation methods. Yeah, it's more of a monitoring thing. So this talk was more about actually testing the code, whereas your question was rather about testing the quality of the model.
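A sketch of the kind of reference-score check described here; the reference value, the synthetic data and the model are made up for illustration.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    REFERENCE_SCORE = 0.8  # score recorded from an earlier, accepted run

    def test_model_quality_has_not_degraded():
        rng = np.random.RandomState(0)
        X = rng.uniform(-3, 3, size=(200, 1))
        y = np.sin(X[:, 0])
        model = make_pipeline(SimpleImputer(), StandardScaler(), SVR())
        score = cross_val_score(model, X, y, cv=5).mean()
        # fail, and prompt a closer look, if the cross-validated score
        # drops below the recorded reference
        assert score >= REFERENCE_SCORE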
So I think these are two different concerns. Yes, they're complementary, yeah, definitely.
So I just got curious, when you do this, what do you work in? I mean, do you work in an IPython notebook, or do you do it as separate scripts, or what do you do for this?
Yeah, I'm personally not using IPython notebooks that much. I just use, I write tests in test files and execute my test runner on them, and then use continuous integration and all the tooling that is around unit testing.
Yeah, I personally, well, the IPython notebook is an environment that is really great for exploring things, but it's not an environment for test-driven development, and so there's no test runner
in the IPython notebook. And I personally think, for all the effort that I put into thinking about some test assertion that I could type into an IPython notebook, if I put it into a unit test and check it into my repository instead, it's run continuously over and over again, so I really prefer this
over extensive use of IPython notebooks. I do use it if I want to quickly explore something. This is just an add-on, so no question. Your talk was about the testing stuff, and this is really great with these modules,
let's say, or small units, but of course, it's also important to have reusability, because then you can really change a model or apply it to different problems reusing parts of your pipeline.
Any other questions? Okay, thank you. Thank you very much.