Extending Scikit-Learn with your own Regressor
Automated media analysis
The TIB AV-Portal uses these automated video analyses:
Scene recognition — shot boundary detection segments the video based on image features. A visual table of contents generated from this gives a quick overview of the video's content and provides targeted access.
Text recognition — intelligent character recognition captures, indexes and makes written language (for example, text on slides) searchable.
Speech recognition — speech-to-text records the spoken language in the video as a searchable transcript.
Image recognition — visual concept detection indexes the footage with subject-specific and interdisciplinary visual concepts (for example, landscape, façade detail, technical drawing, computer animation or lecture).
Keyword tagging — named entity recognition describes the individual video segments with semantically linked subject terms. Synonyms or narrower terms of entered search terms can thereby be included in the search automatically, which broadens the result set.
Speech transcript
00:17
OK, and now we're going to learn about extending scikit-learn with your own regressor.
00:26
Welcome to my talk, "Extending scikit-learn with your own regressor". First I'll give a short introduction to scikit-learn, which maybe most of you know already. Then I'll talk about an estimator which is not yet included in scikit-learn, the robust Theil-Sen estimator, and with this as an example I'll show you how you can implement your own estimator in scikit-learn, how to extend scikit-learn. Then a little bit about what you need to consider if you want to contribute your own estimator to scikit-learn, and a little bit about my own experiences contributing. So first, what is scikit-learn? Scikit-learn is a machine learning library, so whenever you have some kind of data and you want to extract some insight from this data, scikit-learn is a good candidate.
01:22
Scikit-learn describes itself as a simple and efficient tool for data mining and data analysis. It really is simple to use, which makes it accessible to everyone, and you can apply it to all kinds of problems. I took these marketing sentences right from the web page, but it really is extremely simple, so if you haven't used it yet, you should definitely look into scikit-learn. It's built on NumPy, SciPy and matplotlib, three famous libraries which are used all over the scientific Python ecosystem. And what's really good: it's open source but still commercially usable, since it's BSD-licensed. So if you don't want to contribute everything you build on it back to scikit-learn, you can still use it, which makes it really suitable for commercial applications.
02:23
OK, so there's this picture, which can also be found on the scikit-learn website. I like it because it gives a nice overview of all the things you can do with scikit-learn, the basic areas of application. You can do classification — a good example would be classifying handwritten digits, deciding whether a digit is a one or a seven. Then you have everything related to clustering: if you're just looking for patterns in the data without having any labels, any real targets — unsupervised learning — you can use clustering. It also supports dimensionality reduction techniques, so when you have too many features and want to avoid overfitting, for instance, you have a lot of tools like PCA and so on. And of course there's the full regression part, if you want to find the relationship of a target variable to some features, and that's what we're going to talk about. But before we start, first a little refresher on something we all learned in school: the least squares method.
03:31
The least squares method is called LinearRegression in scikit-learn, and I'll briefly explain how it works, because Theil-Sen is a kind of extension of it. We have independent variables x_1 to x_p — in scikit-learn these are called features — and we have a dependent variable, the so-called target y. Now we want to build a model: we want to use the features to somehow predict the value y. A really simple approach is just a linear model: you have a linear combination of the x and the coefficients w, and you try to explain your target variable y with the features x. In order to find the w, you minimize the functional given here — this is the least squares part, you're minimizing the squared distances. In the typical one-dimensional case, it's this picture here: the blue dots are your data, in one dimension the x axis is the feature, and the line minimizes the squared distances to all the dots. This works really well if you have perfect data, because there's an internal assumption that the error is normally distributed. But in practice, in many, many projects that I've worked on, the data you get, maybe from customers, is less than perfect: you have outliers, you have corrupted data because of measurement errors, or because someone entered wrong values somewhere. Quite often your data then looks like this in one dimension: you directly see, on the right-hand side, some values that don't really fit the really dense line on the left side. So what do we do in this case? We could just remove those dots by looking at this plot and deciding: OK, I don't want to take these into my fit. But what do you do in a ten-dimensional space, in an n-dimensional space? You can't see which points are outliers just by looking at a plot like this; you would need some complicated preprocessing to eliminate those outliers. So what happens if you just apply ordinary least squares? You get, of course, completely wrong results: you would not expect the line to go like this, you would rather want the line to go through the dense cloud on the left side. So this is something you really need to consider whenever you look at new data: are there outliers in this data? And if so, you want to come up with something robust. Theil-Sen, a natural generalization of the least squares method, is such a robust algorithm.
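To make the outlier problem concrete, here is a minimal sketch with made-up synthetic data (all numbers invented for illustration), showing how just a few corrupted points drag the ordinary least squares slope away from the true value:

```python
import numpy as np

# Synthetic illustration: a clean linear trend with true slope 3,
# plus a handful of gross outliers, fitted with ordinary least squares.
rng = np.random.RandomState(42)

x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.3, size=x.size)  # true slope is 3
X = np.column_stack([x, np.ones_like(x)])               # design matrix [x, 1]

w_clean, *_ = np.linalg.lstsq(X, y, rcond=None)         # (slope, intercept)

y_bad = y.copy()
y_bad[-5:] += 40.0                                      # corrupt five points
w_bad, *_ = np.linalg.lstsq(X, y_bad, rcond=None)

print(w_clean[0])  # close to 3
print(w_bad[0])    # dragged far above 3 by just five outliers
```

This is exactly the effect the plot in the talk shows: the fitted line tilts toward the outliers instead of following the dense cloud.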
07:04
Theil-Sen looks at all possible pairs of sample points and calculates a list of slopes, and in the end, when you have this list of floats, you take the median. The median is what makes the method really robust, because the median doesn't care about any single value — it only cares about the rank, the order of those values. I think this is easily shown and understood with an example. Here again is our plot with the outliers. We take two points, the two red dots here, calculate the slope of the line connecting those two dots, and add it to the list: the slope is 3.1 in this case. We just go on with all possible pairs, and this time it's 3.1 again. Now we are not so lucky anymore: we have one outlier connected to a point we consider not to be an outlier, so the slope is off — and you see that the list is kept sorted as we go along. Another pair, and this time even two outliers. We could go on and on, but already here we see that if you look at the center of the sorted list of slopes, the median, the center is correct: it's 3.0. And 3.0 is the slope of the line we would expect, the line through the dense cloud of our sample points. So the whole principle is this: take the median, and the outliers are not really considered anymore. This was the case for a two-dimensional problem, just one feature and a target variable. Of course the method can be extended to n-dimensional space, because in most cases with scikit-learn you will have a lot of features, not only one. I've given here a citation to the paper for this. In n-dimensional space you don't have single slopes anymore: the slopes become hyperplanes, and the list of slopes becomes a list of vectors. But you basically do the same thing: you sample n+1 points in n-dimensional space, which determine a hyperplane, and you put the coefficient vector of that hyperplane into the list. Then it becomes a little tricky, because you need to decide what the median of a list of vectors is. One choice is the spatial median: if you see the list of vectors as points in n-dimensional space, you try to find the one point such that the sum of all distances to all other points is minimized. This is the so-called Fermat-Weber problem, but basically the method works exactly like it does here. OK, then again the comparison with ordinary least squares: if you do this with Theil-Sen, it finds the perfect line.
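For the one-feature case, the slope-list-plus-median idea he walks through can be sketched in a few lines (toy data; the real implementation is more careful about ties and performance):

```python
import itertools

import numpy as np

def theil_sen_slope(x, y):
    # Slope of the line through every pair of points, then the median:
    # slopes contaminated by outliers land at the ends of the sorted
    # list and never influence the middle.
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in itertools.combinations(range(len(x)), 2)
              if x[j] != x[i]]
    return np.median(slopes)

x = np.arange(6.0)
y = 3.0 * x + 1.0
y[5] = 40.0  # one gross outlier

print(theil_sen_slope(x, y))  # 3.0 -- the outlier is ignored
```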
10:44
OK, so that was the motivation for Theil-Sen: in one project I had to deal with corrupted data and outliers that could not really be removed by hand. So I asked myself: OK, how would I implement this estimator inside scikit-learn? The good thing about scikit-learn is that you have a lot of good documentation.
11:16
I think one reason scikit-learn is used so often is that the documentation is just so good. If you look up how to roll your own regressor, you directly get a manual. If you want to write your own regressor, you have to provide four functions: set_params and get_params — these of course set and get the parameters of your estimator. Those methods are more or less used only internally, for instance if you do cross-validation or if you use some other kind of meta-estimator: those functions are then used to set and get the parameters of your estimator, but you need to implement them for your own estimator. And of course you need a fit and a predict method. The BaseEstimator class inside scikit-learn already gives you an implementation of set_params and get_params, so you can just inherit from it. And since Theil-Sen is a linear model, we can also directly inherit from LinearModel, which additionally gives you the predict method, because in the linear case, as we saw before, predicting for a design matrix X is just a matrix-vector product: you take X times the weights w we calculated before. So if we inherit as shown on the right side — if we let our Theil-Sen estimator inherit from LinearModel — we already get set_params, get_params and predict. Additionally, there are so-called mixins in scikit-learn. The principle of a mixin is that you have some reusable code that can only work as part of something larger, and you can combine different mixins inside a class; in Python, mixins are done with the help of multiple inheritance. Scikit-learn has a lot of mixins, for classifiers, regressors, clusterers and transformers. In our case, since we're writing a regressor, we of course also inherit from RegressorMixin, which gives us additional functionality, for instance a score function. That's already about it — so let's look at the source code.
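As a sketch of that estimator contract, a toy version might look as follows. Note this is a hypothetical illustration, not the actual Theil-Sen code from the talk, and it inherits from the public BaseEstimator and RegressorMixin rather than the internal LinearModel class he mentions:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import check_random_state
from sklearn.utils.validation import check_array, check_X_y

class MedianSlopeRegressor(BaseEstimator, RegressorMixin):
    """Toy single-feature median-of-slopes regressor (illustration only)."""

    def __init__(self, random_state=None):
        # __init__ only stores parameters; the inherited get_params and
        # set_params rely on exactly this convention.
        self.random_state = random_state

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # check_random_state accepts None, an int seed or a RandomState
        # object; a fuller version would use rng to subsample the pairs.
        rng = check_random_state(self.random_state)
        x = X[:, 0]
        n = x.shape[0]
        slopes = [(y[j] - y[i]) / (x[j] - x[i])
                  for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
        self.coef_ = np.array([np.median(slopes)])
        self.intercept_ = np.median(y - self.coef_[0] * x)
        return self  # returning self allows fit(...).predict(...) chaining

    def predict(self, X):
        X = check_array(X)
        return X @ self.coef_ + self.intercept_
```

Because get_params/set_params come from BaseEstimator and score from RegressorMixin, such an estimator can immediately be used with utilities like cross-validation.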
13:46
As I said before, we just inherit from LinearModel and RegressorMixin to get set_params, get_params, predict and score, and we write the __init__ function. I've abbreviated it here; of course you store all the different kinds of parameters in the __init__ function — whether the intercept should be fitted, for instance. In my case there are about ten different parameters, for example if you want to work only on a subset of the sample points, or do that subsampling with the help of some random state, and so on. The more interesting part is the fit function. X is the design matrix, the feature matrix, and y the target, as usual in scikit-learn. Here I check the random state with the help of check_random_state — the random state drives the subsampling, the subpopulation of X, if you don't want to consider every combination — and we also check the arrays X and y. check_arrays and check_random_state are two functions from the scikit-learn utils, and if you write your own estimators, you should have a look at those utils, the developer tools: they help you a lot with repetitive things like checking that arrays are floats in a dense format, or whether the random state is a number that should be used as a seed or already a RandomState object that should just be passed on. So much for the developer tools inside scikit-learn. Then comes the actual algorithm. I won't go into too much detail about it; as I said before, it's basically quite simple, it's just technical, because you need to create all those different combinations of sample points in n-dimensional space, and you also need to make sure you don't do too much work, so depending on some maximum number of subpopulation samples you might want to subsample. There's also parallelization with the help of joblib, which is included inside scikit-learn — scikit-learn comes with some external packages which are directly included, like six and joblib. Then, in the part highlighted in green here, I calculate the coefficients — the source code is online, so you can check it out. The coefficients need to be stored for the predict function to work: they're stored in self.intercept_ and self.coef_, and the predict method uses those arrays. In the end, of course, fit returns self, which allows us to chain different methods together: we can call fit and then directly predict, for instance. After having programmed this, I was really happy that it worked so well: without being a scikit-learn developer, I could really easily take my Theil-Sen prototype and put it inside this framework, so that it can be used with things like cross-validation and so on. So I thought, OK, why not just give this back to scikit-learn? I got the OK from my boss, and then looked at what we needed to do to really contribute it. Contributing to scikit-learn is also well documented.
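The "embarrassingly parallel over combinations" structure he describes could be sketched with joblib roughly like this (the chunking scheme and function names are invented for illustration):

```python
import itertools

import numpy as np
from joblib import Parallel, delayed

def _chunk_slopes(x, y, pairs):
    # Worker: slopes for one chunk of index pairs -- no shared state needed.
    return [(y[j] - y[i]) / (x[j] - x[i]) for i, j in pairs]

def parallel_theil_sen_slope(x, y, n_jobs=2, chunk_size=100):
    pairs = [(i, j) for i, j in itertools.combinations(range(len(x)), 2)
             if x[j] != x[i]]
    chunks = [pairs[k:k + chunk_size] for k in range(0, len(pairs), chunk_size)]
    # Each chunk is processed independently; only the final median is serial.
    per_chunk = Parallel(n_jobs=n_jobs)(
        delayed(_chunk_slopes)(x, y, c) for c in chunks)
    return np.median(list(itertools.chain.from_iterable(per_chunk)))
```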
17:47
They have really high quality standards. So what do you need to do if you want to contribute something? Your code, of course, should be unit tested — at least 90%, but preferably 100% — to make sure your method works. Then, documentation is really important. Looking back, I think the documentation took me way longer than actually writing the code, because you need to find good examples, you need to explain your method a little, you need to define all your parameters in docstrings, and so on. You should also state what the complexity of your algorithm is, the spatial and the runtime complexity. You need to draw some figures, maybe comparing your method to an already implemented method in scikit-learn, and if you got the idea for the method from some paper, of course make a reference to this paper or these papers. Then, coding guidelines: as usual, PEP 8, and pyflakes is used inside scikit-learn; they really help a lot to find quite obvious problems, and it's good that this can be checked automatically. You should also make sure that you use the scikit-learn utils, so that you don't reimplement stuff that is already there. Another big barrier for me was that I had to make sure that my code runs on Python 2.6, 2.7, 3.4 and so on, and this can be done with the help of six, which you've maybe heard of, and which is also included in scikit-learn, as is the parallelization with the help of joblib. OK, so that's about the requirements for a contribution. Now about my experiences. My first pull request started in March, and it was my first pull request in the open-source world. The community of scikit-learn is really great: there were a lot of improvements due to really good remarks, and with the help of the scikit-learn maintainers the performance was increased by a factor of 5, or even 10 — a really huge improvement — and they also pointed out coding guidelines I still got wrong at the time. Showing your code to other people always gets you good feedback. There was also a discussion about Theil-Sen being more of a statistical regression method than really machine learning, and whether it might be better to start small; and there was a pointer to RANSAC, the random sample consensus method, which can often do a similar job. RANSAC is included in 0.15, but at that time scikit-learn was at 0.14, so it was not included yet — I didn't even know it existed. So during that time I learned about new methods, which was really cool. If you want to follow up on this pull request — currently Theil-Sen is still not included in scikit-learn, and I'm still working on it — the discussion was really interesting, and I can only recommend this to everyone: if you want to contribute to an open-source project, it's always a good idea, because along the way you really learn a lot about how to improve things, what the coding standards are, and so on. OK, that's about it for my talk. A little marketing slide: the company I work for is hiring — maybe you've seen our job offers — and we will be here until Sunday, even through to PyData, so come and talk to us. OK.
22:30
Thanks a lot.
22:32
Thank you. Are there any questions? [Question from the audience, partly inaudible.] So the question was about other efficient techniques, like ridge regression. Ridge regression is included, yes, but it really depends — what ridge regression does addresses a different problem. Techniques like ridge, or lasso, which is another one, are for when you have too many features: the problem there is more that you want to avoid overfitting. Say you have 100 features but only 1,000 samples — this is really prone to overfitting, and if you give it to lasso or ridge, the idea is that it kind of says, OK, I throw out feature number five; so it's more of a model reduction thing. The thing with outliers is different, because you can have those outliers inside a single feature, and so I think it's a good idea to also include some more robust estimators in scikit-learn. As of now there is RANSAC, which is an algorithm coming more from computer vision. It's not that complicated: it tries to select the right points and checks whether it can add other samples to this consensus set, and so on. So I think the scikit-learn developers are really now looking for more robust things in addition to what they already have. More questions? [Question from the audience, partly inaudible.] The question was whether Theil-Sen can be parallelized, and yes, it can be and is parallelized. The thing is that taking all those different combinations of possible points can be done perfectly in parallel, calculating the hyperplanes can be done in parallel, and writing back to some large arrays can be done in parallel. This is what I did with the help of joblib, which is included in scikit-learn, and it works really well. Only the last step, where you need to find this one single spatial median — the algorithm for that is based on iteratively reweighted least squares, it's called the modified Weiszfeld method, and it's inherently iterative and can't be parallelized. But the first part, of course, can be and usually is parallelized.
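The spatial median he mentions can be sketched with the plain Weiszfeld fixed-point iteration (his implementation uses a modified variant; the tolerances and iteration count here are arbitrary):

```python
import numpy as np

def spatial_median(points, n_iter=200, tol=1e-10):
    # Weiszfeld's fixed-point iteration for the geometric (spatial) median:
    # the point minimizing the sum of Euclidean distances to all points.
    p = points.mean(axis=0)  # start at the centroid
    for _ in range(n_iter):
        d = np.linalg.norm(points - p, axis=1)
        d = np.maximum(d, tol)  # avoid division by zero at data points
        w = 1.0 / d
        p_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(p_new - p) < tol:
            break
        p = p_new
    return p
```

Like the 1-D median, the spatial median barely moves when a far-away outlier is added, whereas the centroid is dragged toward it — which is why it serves as the robust "median of hyperplane vectors" in n dimensions.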
Formal metadata
Title: Extending Scikit-Learn with your own Regressor
Series title: EuroPython 2014
Part: 64
Number of parts: 120
Author: Wilhelm, Florian
License: CC Attribution 3.0 Unported: You may use, adapt, copy, distribute and transmit the work or content in changed or unchanged form for any legal purpose, as long as the author/rights holder is named in the manner specified by them.
DOI: 10.5446/19972
Publisher: EuroPython
Publication year: 2014
Language: English
Production location: Berlin
Content metadata
Subject area: Computer Science
Abstract: Florian Wilhelm — Extending Scikit-Learn with your own Regressor. We show how to write your own robust linear estimator within the Scikit-Learn framework, using as an example the Theil-Sen estimator, known as "the most popular nonparametric technique for estimating a linear trend". Scikit-Learn is a well-known and popular framework for machine learning that is used by data scientists all over the world. We show in a practical way how you can add your own estimator following the interfaces of Scikit-Learn. First we give a small introduction to the design of Scikit-Learn and its inner workings. Then we show how easily Scikit-Learn can be extended by creating your own estimator. In order to demonstrate this, we extend Scikit-Learn with the popular and robust Theil-Sen estimator that is currently not in Scikit-Learn. We also motivate this estimator by outlining some of its superior properties compared to the ordinary least squares method (LinearRegression in Scikit-Learn).
Keywords: EuroPython Conference, EP 2014, EuroPython 2014