
Extending Scikit-Learn with your own Regressor


Formal Metadata

Title
Extending Scikit-Learn with your own Regressor
Title of Series
Part Number
64
Number of Parts
119
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Place
Berlin

Content Metadata

Subject Area
Genre
Abstract
Florian Wilhelm - Extending Scikit-Learn with your own Regressor We show how to write your own robust linear estimator within the Scikit-Learn framework, using as an example the Theil-Sen estimator, known as "the most popular nonparametric technique for estimating a linear trend". ----- Scikit-Learn is a well-known and popular framework for machine learning that is used by Data Scientists all over the world. We show in a practical way how you can add your own estimator following the interfaces of Scikit-Learn. First we give a small introduction to the design of Scikit-Learn and its inner workings. Then we show how easily Scikit-Learn can be extended by creating your own estimator. In order to demonstrate this, we extend Scikit-Learn with the popular and robust Theil-Sen estimator that is currently not in Scikit-Learn. We also motivate this estimator by outlining some of its superior properties compared to the ordinary least squares method (LinearRegression in Scikit-Learn).
Keywords
Transcript: English (auto-generated)
Okay. Hello. Now we're going to learn about extending scikit-learn with your own regressor from Florian Wilhelm.
Thank you. Hello, everybody. In my talk about extending scikit-learn with your own regressor, I'll first give a short introduction to scikit-learn, which most of you probably know, and then I'll talk about
an estimator which is not yet included in scikit-learn, a robust estimator called Theil-Sen. With this as an example, I'll show you how you can implement your own estimator in scikit-learn, how to extend scikit-learn, and then I'll talk a little bit about what you need to consider
if you want to contribute your own estimator to scikit-learn, and about my own experiences contributing to scikit-learn. So first of all, what is scikit-learn? Scikit-learn is a machine learning library: whenever you have some kind of data and you want to extract insight from it, you can use scikit-learn.
It's a simple, efficient tool for data mining and data analysis, so it's really easy to use, which makes it accessible to everyone, and you can apply it to all kinds of problems. I took these marketing sentences right from the web page, but they're really true: it's
extremely simple, so if you haven't used it, you should definitely look into scikit-learn. It's built on NumPy, SciPy, and Matplotlib, three famous libraries used all over the Python ecosystem, and what is really good, it's open source but
still commercially usable, since it's BSD licensed. So if you don't want to contribute everything you build with it back to scikit-learn, you can still use it, which makes it well suited for commercial applications. Okay, so this picture can also be found on the scikit-learn website, and I like it because
it gives a nice overview of the things you can do with scikit-learn, the basic areas of application. You can do classification; a typical example would be handwritten digits, where you want to classify whether a digit is a one or a seven, for instance.
Then you have everything related to clustering: if you're just looking for patterns in the data without having labels or real targets, so for unsupervised learning, you can use clustering. It also supports dimensionality reduction techniques, so when you have too many features and you
want to avoid overfitting, for instance, you have a lot of tools to do PCA and so on. And of course there is the whole regression part, where you want to find the relationship between a target variable and some features, and this is what we're going to talk about.
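To give an idea of how uniform the interface is, here is a minimal sketch of the usual fit/predict pattern with LinearRegression on made-up toy data (the numbers are just for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one feature per column
    y = np.array([2.1, 3.9, 6.2, 8.1])           # the target

    model = LinearRegression()                   # every estimator works the same way
    model.fit(X, y)                              # learn the coefficients
    print(model.predict([[5.0]]))                # roughly [10.]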
But before we start, a little refresher from school. You've learned about this: the least-squares method is called LinearRegression in scikit-learn, and I want to briefly explain how it works, because Theil-Sen is a kind of extension of this regressor. We have
independent variables, x1 to xp, which in scikit-learn-speak are called features, and we have a dependent variable, the so-called target y. Now we want to build a model: we want to use the features to somehow predict the value of y, and a really
simple approach is just a linear model, a linear combination of the features X and the coefficients w that tries to explain the target variable y.
In order to find the w's, you minimize a functional, the least-squares objective: you minimize the squared distances. In the typical one-dimensional case shown in the picture, the blue dots are your data and there is a single feature on
the x-axis; the red line is the one that minimizes the squared distances to all the dots. This works really well if you have perfect data, because there is an implicit assumption that the errors are normally distributed.
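Written out, the model and the least-squares objective described here are the standard ones (X is the n-by-p feature matrix, w the coefficient vector, b the intercept):

    \hat{y} = Xw + b,
    \qquad
    (\hat{w}, \hat{b}) = \arg\min_{w,\,b} \; \sum_{i=1}^{n} \bigl( y_i - x_i^{\top} w - b \bigr)^{2}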
But in practice, and in many projects that I've worked on, the data you get, maybe from customers, is less than perfect: you have a lot of outliers, you have
corrupted data because of measurement errors, or because someone put in a wrong value somewhere, and then quite often your data looks like this in one dimension. You directly see on the right-hand side that there are some values that don't really
fit the dense line on the left side. What you would do in this case is maybe just remove those dots by looking at the plot and deciding, okay, I don't want to take these into my fit. But what do you do if you are in a 10-dimensional
space or an n-dimensional space? Then you can't tell which points are outliers just by looking at a plot like this, and you would need some complicated preprocessing to eliminate those outliers. So what happens if you just apply
ordinary least squares? You get, of course, a completely wrong result: you would not expect the line to go like this, you would rather want the line to go through the dense cluster of points on the left side. So this is something
I think you really need to consider whenever you look at new data: whether there are outliers in it, and whether you need something robust. Theil-Sen, as a natural generalization of the least-squares method, is an algorithm that looks at all possible
pairs of your sample points and calculates a list of slopes, and once you have the whole list of slopes, you take the median. The median is what
makes the method really robust, because the median doesn't care about individual values, it only cares about the ranks, the order of those values. I think this is easily shown and understood with an example, so here again is our plot with the outliers:
We take two points, the two red dots here, we calculate the slope of the line connecting them and add it to the list shown close to the x-axis; the slope is 3.1 in this case. Now we just go on with all possible pairs of points, and this time it's
3.1 again. Then we are not so lucky anymore: we have an outlier paired with a point we would not consider an outlier, so the resulting slope is off, and you see the list stays sorted. We go on, another pair, and bad luck again, this time even two outliers,
and we could go on and on, but already here we see that if we look at the center of the sorted list of slopes, the median, the center is correct: it's 3.0. And 3.0 is the slope of the line we would expect, the way
the line should go through the dense band of our sample points. So the whole principle is that you take the median over all pairs, and because of the median the outliers effectively don't influence the result anymore.
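As a rough sketch of the two-dimensional case just described (not the scikit-learn implementation, just a few lines of NumPy to make the pairwise-slopes-plus-median idea concrete):

    import numpy as np
    from itertools import combinations

    def theil_sen_line(x, y):
        # slope of the line through every pair of points, then the median of all slopes
        slopes = [(y[j] - y[i]) / (x[j] - x[i])
                  for i, j in combinations(range(len(x)), 2)
                  if x[i] != x[j]]
        slope = np.median(slopes)
        intercept = np.median(y - slope * x)   # median residual as a robust intercept
        return slope, intercept

    x = np.arange(20.0)
    y = 3.0 * x + 1.0
    y[-4:] += 40.0                             # corrupt a few points
    print(theil_sen_line(x, y))                # slope stays close to 3.0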
So this is the two-dimensional case, just one feature and a target variable. Of course the method can be extended to n-dimensional space, because in most cases, if you use scikit-learn, you will have a lot of features and not only one. In
an n-dimensional space, and I've given the citation to the relevant paper here, you don't have slopes anymore: the slopes become hyperplanes, and the list of slopes becomes a list of vectors. But you basically do the same thing: you
sample n plus one points in the n-dimensional space, fit a hyperplane through them, and put the coefficient vector of that hyperplane into the list. Then it becomes a little tricky, because you need to decide what the median of a list of vectors is, and that median
can be, for instance, the spatial median. The spatial median is just this: if you see the list of vectors as points in an n-dimensional space, you look for the one point for which the sum of the distances to all other points is minimized; this is the so-called Fermat-Weber problem.
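A minimal sketch of the spatial median, assuming a plain Weiszfeld-style iteration (the variant mentioned later in the talk is a modified version, but the idea is the same):

    import numpy as np

    def spatial_median(points, n_iter=100, tol=1e-8):
        """Point minimizing the sum of Euclidean distances to all rows of `points`."""
        median = points.mean(axis=0)              # start at the centroid
        for _ in range(n_iter):
            dist = np.linalg.norm(points - median, axis=1)
            dist = np.maximum(dist, tol)          # guard against division by zero
            weights = 1.0 / dist
            new = (weights[:, None] * points).sum(axis=0) / weights.sum()
            if np.linalg.norm(new - median) < tol:
                break
            median = new
        return median

    vectors = np.array([[3.0, 1.0], [3.1, 0.9], [2.9, 1.1], [9.0, -5.0]])
    print(spatial_median(vectors))                # stays near (3, 1) despite the outlier row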
But apart from that, it basically works exactly like in the two-dimensional case. Okay, then again the comparison between ordinary least squares and Theil-Sen: if you really run this over all combinations of points, it finds
the perfect line. Okay, so that's the motivation for Theil-Sen. On one project I had to deal with corrupted data and outliers that I couldn't really remove by hand, and then I asked myself, okay, how would I implement this estimator
inside scikit-learn? The good thing about scikit-learn is that it has a lot of good documentation; I think scikit-learn is used so often because the documentation is just so good. If you look up how to write your own regressor, you directly get a manual,
and it says that if you want to write your own regressor you have to provide four methods: set_params and get_params, which are of course for setting and getting the parameters of your estimator. These methods are more or less only used internally;
they are used, for instance, if you do cross-validation or if you use some other kind of meta-estimator, to set and get the parameters of your estimator, but you need to implement them for your own estimator. And of course you need a fit and a predict method.
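For instance, with any existing estimator you can see those two methods at work; meta-estimators like GridSearchCV and the clone helper rely on exactly this (a small illustration, not specific to Theil-Sen):

    from sklearn.base import clone
    from sklearn.linear_model import LinearRegression

    est = LinearRegression()
    print(est.get_params())               # dict of constructor parameters
    est.set_params(fit_intercept=False)   # this is what grid search calls internally
    fresh = clone(est)                    # clone() rebuilds the estimator from get_params()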
The BaseEstimator class inside scikit-learn already gives you an implementation of set_params and get_params, so you can just inherit from it, and since Theil-Sen is a linear model we can also directly inherit from LinearModel,
which additionally gives us the predict method, because in the linear case, as we have seen before with the formulas, predicting from a design matrix X is just a matrix-vector product: you take X times the weights w we calculated before. So if
we inherit as shown on the right side, if we let our Theil-Sen estimator inherit from LinearModel, we already get set_params, get_params and predict. Additionally we have so-called mixins in scikit-learn. The principle of mixins is that you have
some reusable code that only works as part of something larger, and you can combine different mixins inside a class; in Python, mixins are done with the help of multiple inheritance. There are a lot of mixins: classifier, regressor, cluster and
transformer mixins. In our case, since we are writing a regressor, we of course also inherit the RegressorMixin, which gives us additional functionality, for instance a score function. But that's already about it. So, looking at the source code: TheilSen, as I said before, just inherits from LinearModel and RegressorMixin
to get set_params, get_params and predict. We override the __init__ function; I've abbreviated it here, but of course you declare all the different parameters you have
in your __init__ function, like whether you want to fit the intercept or not, there are about ten different parameters, also whether you want to work only on a subset of your sample points, and whether you want to do this sub-sampling with the
help of some random state, and so on. The more interesting part is the fit function: it takes X, the design matrix or feature matrix, and y, the target, as usual in scikit-learn. Here I check the random state with the help of check_random_state, in case we really do some
sub-sampling, that is, work on a sub-population of X because we don't want to consider all combinations, and we also validate the arrays X and y. check_arrays and check_random_state are two functions that live in scikit-learn utils, and if you write your own estimators, you should have
a look at scikit-learn utils for all the developer tools that help you a lot with those repetitive things: checking an array, is it a float, is it in a dense format, and is the random state given as a number that should be used as a seed, or is it a RandomState object itself that should just be passed on.
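In the scikit-learn version from the talk the array helper was called check_arrays; in current releases the equivalents are check_array and check_X_y, but the idea is the same (a small sketch):

    import numpy as np
    from sklearn.utils import check_random_state
    from sklearn.utils.validation import check_X_y

    X_raw = [[1.0], [2.0], [3.0], [4.0]]
    y_raw = [1.0, 2.0, 3.0, 4.0]
    X, y = check_X_y(X_raw, y_raw)        # validated, dense float arrays

    rng = check_random_state(42)          # int seed -> RandomState; a RandomState is passed through
    subset = rng.choice(len(X), size=2, replace=False)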
So much for the developer tools inside scikit-learn. Then comes the actual algorithm. I don't want to go into too much detail here; as I said before, it's basically quite simple, it's just technical, because you need to create all
those different combinations of sample points in n-dimensional space, and you also need to make sure you don't do too much work, so you cap it at some maximum number of sub-samples you want to consider. I also did the parallelization with the help of joblib,
which is included inside scikit-learn; scikit-learn comes with some external packages that are bundled directly, like six and joblib. Okay, and then in this green-highlighted Theil-Sen algorithm part I calculate the coefficients; of
course, the source code is online so you can check it out. The coefficients then need to be stored for the predict function to work: they are stored in self.intercept_
and self.coef_, so that the predict method, which uses those arrays, works. And at the end we of course return self, which allows us to chain methods together, so that we can call fit and then .predict directly, for instance.
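Putting those pieces together, here is a deliberately simplified sketch of what such an estimator can look like. It is not the real Theil-Sen code: it only handles one feature, skips the spatial median and the joblib parallelization, and it inherits BaseEstimator plus RegressorMixin and spells out predict explicitly instead of inheriting LinearModel, which nowadays lives in a private module:

    import numpy as np
    from itertools import combinations
    from sklearn.base import BaseEstimator, RegressorMixin
    from sklearn.utils import check_random_state
    from sklearn.utils.validation import check_array, check_X_y

    class SimpleTheilSen(BaseEstimator, RegressorMixin):
        """Toy one-feature Theil-Sen regressor following the scikit-learn interface."""

        def __init__(self, max_pairs=10_000, random_state=None):
            self.max_pairs = max_pairs          # __init__ only stores parameters unchanged
            self.random_state = random_state

        def fit(self, X, y):
            X, y = check_X_y(X, y)
            rng = check_random_state(self.random_state)
            x = X[:, 0]                          # toy case: a single feature
            pairs = list(combinations(range(len(x)), 2))
            if len(pairs) > self.max_pairs:      # optional sub-sampling of the pairs
                keep = rng.choice(len(pairs), size=self.max_pairs, replace=False)
                pairs = [pairs[k] for k in keep]
            slopes = [(y[j] - y[i]) / (x[j] - x[i]) for i, j in pairs if x[i] != x[j]]
            slope = np.median(slopes)
            self.coef_ = np.array([slope])       # learned attributes end with an underscore
            self.intercept_ = np.median(y - slope * x)
            return self                          # enables est.fit(X, y).predict(X_new)

        def predict(self, X):
            X = check_array(X)
            return X @ self.coef_ + self.intercept_   # what LinearModel.predict provides for free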
So after having programmed this, I was really happy that it worked so well: without being a scikit-learn developer or anything, I could easily take my Theil-Sen prototype and put it inside this framework so that it can be used with things like cross-validation and so on.
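For example, assuming the SimpleTheilSen sketch from above, the standard tooling works out of the box because the interface is respected (data invented for illustration):

    import numpy as np
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.uniform(0.0, 10.0, size=(100, 1))
    y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)
    y[::10] += 30.0                                          # sprinkle in some outliers

    scores = cross_val_score(SimpleTheilSen(), X, y, cv=5)   # R^2 via RegressorMixin.score
    print(scores)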
And I thought, okay, why not just give this back to scikit-learn? So I got the okay from my boss and figured out what I would need to do to really contribute it. Contributing to scikit-learn is also
well documented, and they have really high quality standards. What you need to do if you want to contribute something: your code should of course be unit tested, with at least 90% coverage, but ideally 100%, to make really sure your method works. Then documentation is really
important; looking back, I think the documentation took me way longer than actually writing the code, because you need to find good examples, you need to explain your method a little, you need to document all your parameters, and so on. You should also state what the
complexity of your algorithm is, the space and runtime complexity, and as I said before, you may need to draw some figures, maybe comparing your method to an already
implemented method in scikit-learn. And if you got the idea for the method from some paper, you should of course reference that paper or papers. Then there are of course coding guidelines: as usual, PEP 8 and Pyflakes are used in scikit-learn, and they really help a lot to find
quite obvious problems, and it's good that this can be checked automatically. As I said before, you should make sure that you use the scikit-learn utils so that you don't re-implement stuff that
is already there. Another big hurdle for me was that I had to make sure my code runs on Python 2.6, 2.7, 3.4 and so on, and this can be done with the help of six, which you have
surely heard of and which is also bundled with scikit-learn, and the parallelization with the help of joblib. Okay, so that's about the requirements for contributing, and then I thought, okay, why not just contribute this. So a little bit about my experiences: my first pull
request started on March 8th, and it was my first real pull request in the open-source world. The scikit-learn community is really great, so there were a lot of improvements
thanks to really good remarks: with the help of the scikit-learn maintainers the performance was increased by a factor of 5, or even 10, which was a really huge improvement, and I also learned about some coding guidelines I still had wrong at the time. So this is really
good; showing your code to other people always gets you good feedback. Then there was also a discussion about Theil-Sen being more of a statistical regressor and not really machine learning, whether it would be better included in statsmodels, and whether RANSAC,
the random sample consensus method, is maybe almost always better than Theil-Sen. RANSAC is included in scikit-learn 0.15, and at the time it was scikit-learn 0.14, so it was not included yet, and I didn't even know it
existed. So during that time I learned about a new method, which was really cool. If you want to follow up on this pull request: Theil-Sen is currently still not included, so I'm still working on it, and if you want to read
the discussion, it was a really interesting one. I can only recommend to everyone: if you want to contribute to an open-source project, it's always a good idea, because along the way you really learn a lot about how to
improve things, what the common standards are, and so on. Okay, so that's about it for my talk. A little marketing slide: Blue Yonder, the company I work for, is hiring; maybe you've seen us just outside at our booth, and we will be here until Sunday, so even throughout
PyData. So feel free to come and talk to us. Okay, thanks a lot. Any questions? [Audience question] So the question was whether there are more traditional
techniques, like ridge regression. Yes, ridge regression is included, but it really depends.
I mean, what ridge regression, and techniques like lasso, which is another one, address is more the problem of overfitting when you have too many features. So you have, let's say, 100 features but only
1,000 samples, and this is really prone to overfitting, and then you give it to lasso, or ridge, or another one is ARD, and it kind of says, okay, I'll throw out feature number five, or shrink its weight; it reduces the model, so it's more of a model-reduction thing.
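A quick, purely illustrative sketch of that model-reduction behaviour with lasso on synthetic data (the numbers are made up; only the sparsity effect matters):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # far more features than samples -> ordinary least squares would overfit badly
    X, y = make_regression(n_samples=50, n_features=200, n_informative=5,
                           noise=1.0, random_state=0)

    lasso = Lasso(alpha=1.0).fit(X, y)
    print((lasso.coef_ != 0).sum())     # only a handful of the 200 coefficients survive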
The thing with the outliers is different, because you can have outliers inside a single feature, so I think it's a good idea to also include more robust estimators in scikit-learn. As of now,
RANSAC is included, and this is an algorithm coming more from computer vision, so it's more heuristic; it's not that complicated, it tries to select the right points and checks whether other samples can be added to the consensus set, and so on.
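For comparison, RANSAC is available as RANSACRegressor in scikit-learn; a small sketch on synthetic data with a few corrupted points (the data here is invented for illustration):

    import numpy as np
    from sklearn.linear_model import RANSACRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(0.0, 10.0, size=(100, 1))
    y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)
    y[:10] += 50.0                              # corrupt ten samples

    ransac = RANSACRegressor(random_state=0)    # default base estimator is a linear model
    ransac.fit(X, y)
    print(ransac.estimator_.coef_)              # close to 3.0 despite the corrupted points
    print(ransac.inlier_mask_.sum())            # size of the final consensus set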
So I think the scikit-learn developers are now really looking for more robust methods in addition to what they already have. Some more questions? [Audience question] So the question was whether Theil-Sen can be parallelized, and yes, it can,
and I did parallelize it. The thing is that drawing those different combinations of points can of course be done perfectly in parallel, calculating the hyperplanes
can be done in parallel, and writing back into one large array can be done in parallel, so this is what I did with the help of joblib, which is included in scikit-learn, and it works really well. Only the last step, where you need to find this one single spatial median, is different: the algorithm there is based on a reweighted least squares, I think it's called the
modified Weiszfeld method, and that is iterative and can't be parallelized. But the first part, of course, is easily parallelized.
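A rough sketch of how that embarrassingly parallel part can look with joblib (not the actual scikit-learn code; the function and data here are made up to show the pattern):

    import numpy as np
    from itertools import combinations
    from joblib import Parallel, delayed

    def fit_hyperplane(X, y, idx):
        # least-squares fit through one small subset of n_features + 1 points
        rows = list(idx)
        A = np.hstack([X[rows], np.ones((len(rows), 1))])
        coef, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
        return coef

    rng = np.random.RandomState(0)
    X = rng.uniform(0.0, 10.0, size=(30, 2))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=30)

    subsets = list(combinations(range(len(X)), X.shape[1] + 1))[:500]
    coefs = Parallel(n_jobs=2)(delayed(fit_hyperplane)(X, y, s) for s in subsets)
    # the spatial median over `coefs` is then the single sequential, iterative step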