Scikit-learn to "learn them all"
Formal metadata
Part: 49 of 119
License: CC Attribution 3.0 Unported: You may use, adapt, copy, distribute, and publicly transmit the work or its contents, in unchanged or adapted form, for any legal purpose, as long as you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/20046 (DOI)
Production place: Berlin
Transcript: English (automatically generated)
00:15
So the talk today is about scikit-learn, or, in other words, why I think scikit-learn is so cool.
00:22
First of all, I would like to ask you three questions. Not "what's your favorite color", actually. The first: how many of you already know what machine learning is? Oh, great. Okay, perfect. The second one is: have you ever used scikit-learn?
00:44
Okay, and the third one is: how many of you also attended the great training on scikit-learn yesterday? Okay, it's just two. Brief questions, okay. So what does machine learning actually mean?
01:02
There are many definitions of machine learning. One of them is: machine learning teaches machines how to carry out tasks by themselves. This is a very trivial, very simple definition, and it really is that simple; the complexity comes with the details. It's a very general definition, but just to give you the intuition
01:24
behind machine learning: at a glance, machine learning is about algorithms that are able to analyze, to crunch the data, and in particular to learn from the data. Under the hood, it basically exploits statistical approaches.
01:41
That's why "statistics" is such a huge word in this cloud. Machine learning is closely related to data analysis techniques. There are many buzzwords around machine learning; you may have heard about data analysis, data mining, big data, and
02:01
data science. Data science actually is the study of the generalizable extraction of knowledge from data, and machine learning is related to data science. I tried to depict the combo with this Venn diagram: machine learning is in the middle, data science exploits machine learning, and machine learning is a
02:24
fundamental part of the data science pipeline. But what actually is the relation of data mining, and data analysis in general, with machine learning? Machine learning is about making predictions. So instead of only
02:43
analyzing the data we have, machine learning is also able to generalize from this data. The idea is: we have a bunch of data, and we may want to crunch this data to run statistical analyses on it, and that's it.
03:00
That is also called data mining, for instance. Machine learning is a bit different, because machine learning performs this analysis, but the goal is slightly different: the goal is to analyze this data and generalize, to learn from this data a general model for future data, for data that are
03:23
as yet unseen at this time. So the idea is: a pattern exists in the data. We cannot pin this pattern down manually, but we have data on it, so we may learn it from the data. In other words, this kind of learning is also known as learning by examples.
03:42
Machine learning comes in two different settings. There is the supervised setting. This is the general pipeline of a machine learning algorithm: you have all the data in the upper left corner, and you translate the data into feature vectors.
04:01
This is the common pre-processing step. Then you feed those feature vectors to your machine learning algorithm, and the supervised learning setting supplies also the labels, which are the set of expected results on this data. Then we
04:21
generate the model from feature vectors and labels, and we generalize: we get a model to predict on future data, in the bottom left corner of the figure. A classical example of supervised learning is classification. You have two different groups of data in this case, and you want to find a
04:43
general rule to separate this data. So in this case you find a function that separates the data, and for future data you will be able to tell which is the class. In this case it's a binary classification,
05:00
so you have two classes, and in the future, when you get new data, you will be able to predict which class is associated with that data. Another example is clustering. In this case, the setting is called unsupervised learning. The processing pipeline is this one: you have the same old processing,
05:24
but what you miss is the label part. That's why it is called unsupervised: you have no supervision on the data, no label to predict. As for clustering, the problem is: take a bunch of data and try to cluster it, in other words to
05:42
separate the data into different groups. So you have a bunch of data and you want to identify the groups inside this data. That was just a brief introduction. So what about Python? Python and data science are closely related nowadays. Actually, Python is
06:02
getting more and more packages for computational science. According to this graph, Python is an established technology for this kind of computation; it sits almost in the upper right corner, and
06:21
actually it's replacing and substituting other technologies such as R or Matlab. One of the advantages of Python is that it provides a single programming language across different applications, and it has a very huge set of libraries to exploit. This is
06:46
the reason why Python is nowadays the language of choice for data science, almost the language of choice, and it is displacing R and Matlab. By the way, there will also be a PyData
07:01
conference at the end of the week; it starts on Friday, so if you can, please come. As for data science in Python: Matlab can be easily substituted by technologies such as Python, NumPy, SciPy, and Matplotlib for plotting,
07:21
but there are many other possibilities nowadays, especially for plotting. R can be easily substituted with pandas, which is a great package. In the Python ecosystem we also have
07:42
efficient Python distributions that have been compiled for this kind of computation, such as Anaconda or Enthought Canopy, and we also have projects like Cython. Cython is a very great project that allows you to boost the computation of your Python code. The packages for machine learning in Python are manifold, actually.
08:04
I'll try to describe the set of well-known packages for machine learning code, and I would like to make some considerations on why scikit-learn is a very great one.
08:23
We have Spark machine learning (MLlib), PyML, the Natural Language Toolkit (NLTK), the Shogun machine learning toolbox (this morning there was a talk about it), scikit-learn of course, PyBrain, and MLPy. And there is a guy who
08:41
set up a list of these on GitHub, where everybody can add his or her contribution to the list, in order to spread the knowledge about the packages available in different languages, and Python is very well represented.
09:02
So we have Spark MLlib. Spark MLlib is actually implemented in Scala, not Python; there is a Python wrapper, which is called PySpark, but the machine learning library itself is at a very early stage.
09:21
Shogun is written in C++ and it offers a lot of interfaces, one of which is in Python. The other packages are Python powered, so let's focus on those. The Natural Language Toolkit is implemented in pure Python,
09:43
so no NumPy or SciPy involved, while the other packages are implemented on top of NumPy and SciPy, so their code is rather more efficient for large-scale computations. NLTK supports Python 2, and its Python 3 support is in an alpha stage.
10:01
PyML supports Python 2; whether it supports Python 3 is not so clear. PyBrain supports only Python 2, and these two guys, scikit-learn and MLPy, support both Python 2 and Python 3. What about the purpose of these packages? NLTK is for natural language processing.
10:23
It embeds some algorithms for machine learning, but it is not meant to be used as a complete machine learning environment; it is mostly about text analysis and natural language processing in general.
10:40
PyML is mostly focused on supervised learning, in particular on the SVM technique, which is support vector machines; it doesn't have many algorithms beyond those related to supervised learning. PyBrain is for neural networks, which are another
11:02
set of techniques in the machine learning ecosystem. The other two guys there are somewhat general purpose: scikit-learn and MLPy contain algorithms for supervised and unsupervised learning, plus some other, slightly different machine learning settings.
11:23
So from here on we will not consider PyML and PyBrain anymore, and we end up with these three libraries written in Python for our machine learning code. So why choose scikit-learn?
11:45
Ben Lorica, a big data guy, recommends scikit-learn for six reasons. The first one is the commitment to documentation and usability: scikit-learn has brilliant documentation, and
12:03
it is very, very useful for newcomers and for people without any background in machine learning. The second reason is that the models are chosen and implemented by a dedicated team of experts, and the set of models supported by the library covers most machine learning tasks.
12:25
Python and PyData improve the support for data science tools and data science problems. I don't know if you know Kaggle: Kaggle is a site where you may
12:42
enter data science competitions, and scikit-learn is one of the most used packages for this kind of competition. Another reason is the focus: scikit-learn is a machine learning library, and its goal is to provide a set of common algorithms to Python users through a consistent interface.
13:02
These are two of the features that I like the most; I will be more precise about this in a few slides. And finally, last but by no means least, scikit-learn scales to most data problems. Scalability is another feature that scikit-learn supports out of the box.
13:26
If you want to install scikit-learn, you only have to run a few pip commands: you need to install NumPy, SciPy, and Matplotlib (IPython is actually not needed, it's just for convenience), and then you install scikit-learn.
13:43
NumPy and SciPy in particular are required because scikit-learn is built on top of them. Anyway, if you install a Python distribution such as Anaconda, scikit-learn is already provided out of the box.
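The exact install commands from the slide are not captured in the transcript; they amount to something like this (package names as published on PyPI):

```
pip install numpy scipy matplotlib ipython   # IPython is optional, just for convenience
pip install scikit-learn
```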
14:01
The design philosophy of scikit-learn is one of the greatest features of this package, in my opinion. It includes all the batteries necessary for general-purpose machine learning code: it supports features and
14:21
functionalities for data and datasets, feature selection and feature extraction algorithms, machine learning algorithms in different settings (classification, regression, clustering, and so on), and finally evaluation functions for cross-validation, confusion matrices, and the like. We will see some examples in the next slides. The
14:42
algorithm selection philosophy for this package is: keep the core as light as possible, and include only the well-known and largely used machine learning algorithms. So the focus here is to be as general purpose as possible, in order to reach a broad audience of users.
15:05
At a glance, this is a great picture depicting all the features provided by scikit-learn. This figure has been taken from the documentation; it is a sort of map you may follow
15:24
that allows you to choose the particular machine learning technique you want to use in your machine learning code. There are some clusters in this picture: there is regression over there, classification, clustering, and dimensionality reduction, and you may follow the
15:43
paths over there to decide which setting is most suited for your problem. The API of scikit-learn is very intuitive and mostly consistent across every machine learning technique.
16:01
There are four different objects: the estimator, the predictor, the transformer, and the model. These interfaces are implemented by almost all the machine learning algorithms included in the library. For instance, let's make an example. The API for the estimator is
16:25
the method fit: an estimator is an object that fits a model based on some training data and is capable of inferring some properties of new data. For example, take the algorithm called KNN, or KNeighborsClassifier:
16:45
the KNN algorithm is a classifier, so it is for classification problems, hence supervised learning, and it has the fit method. But unsupervised learning
17:01
algorithms, such as k-means, are estimators as well: k-means implements the fit method too, and for feature selection it is much the same. Then the predictor: the predictor provides the predict and predict_proba methods.
17:21
Finally, the transformer is about the transform method, and sometimes there is also the fit_transform method, which applies the fit and then the transformation of the data. The transformer is used to make transformations of the data,
17:40
to bring the data into a form that can be processed by the algorithms. The last one is the model: the general model you create in your machine learning code, which exists for both supervised and unsupervised algorithms.
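A minimal sketch of these four interfaces on toy data (the classes shown are real scikit-learn estimators; the data is made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: 6 samples, 2 features, binary labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])

# Estimator: anything exposing fit(), supervised or not.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # supervised
km = KMeans(n_clusters=2).fit(X)                     # unsupervised

# Predictor: adds predict() and, for KNN, predict_proba().
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))         # -> [0 1]

# Transformer: adds transform() and the fit_transform() shortcut.
X_scaled = StandardScaler().fit_transform(X)
```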
18:02
Another great feature of scikit-learn is pipelines, because scikit-learn provides a great way to create pipeline processing. You may chain different processing steps,
18:23
just out of the box. You may apply SelectKBest, which is a feature selection step; then, after the feature selection, you may apply PCA, which is an algorithm for dimensionality reduction; and then you may apply logistic regression, which is a
18:44
classifier. So you can assemble a processing pipeline very easily; then you call the fit method on the pipeline, and then predict. The only constraint is that the last step of the pipeline must be a class that implements the predict method, so a predictor.
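The pipeline from the slide, reconstructed as a runnable sketch (the hyperparameter values here are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Feature selection -> dimensionality reduction -> classifier.
# Every intermediate step is a transformer; the last step is a predictor.
pipe = Pipeline([
    ("select", SelectKBest(k=3)),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression()),
])
pipe.fit(iris.data, iris.target)   # fits each step in turn
print(pipe.predict(iris.data[:5]))
```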
19:06
So far so good, right? Let's see some examples of scikit-learn in action, starting with a very
19:21
introductory one. The first thing to consider is the data representation. scikit-learn is based on NumPy and SciPy, as you know, so all the data are usually represented as matrices and vectors. In machine learning, by definition, we have the X matrix over there, which is usually
19:43
written with a capital letter because it is a matrix: a matrix of n rows and d columns. Here n is the number of samples we have in our dataset, and d is the number of features,
20:00
that is, the number of relevant pieces of information we have about each data point. So the training data come in this flavor; under the hood they can also be held as scipy.sparse matrices, which, if I'm not mistaken, should be the CSR implementation,
20:24
compressed sparse row. And finally we have the labels, because we know the target value for each of these data points.
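In code, the convention looks like this (the numbers are illustrative, iris-like rows):

```python
import numpy as np
from scipy import sparse

# X: n samples as rows, d features as columns; y: one target per sample.
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.3, 3.3, 6.0, 2.5]])
y = np.array([0, 0, 2])
print(X.shape)   # (3, 4): n = 3 samples, d = 4 features

# Large sparse data can also be passed as a SciPy CSR matrix.
X_sparse = sparse.csr_matrix(X)
```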
20:40
The problem we are going to consider is about the iris dataset: we want to design an algorithm that is able to automatically recognize iris species. We have three different species of iris: Iris versicolor on the left, Iris virginica here, and Iris setosa here.
21:02
The features we're going to consider are four: the length of the sepal, the width of the sepal, the length of the petal, and the width of the petal. So every sample in this dataset comes as a vector of four features.
21:24
scikit-learn has a great package to handle datasets. This particular dataset is very well known in many fields and is already embedded in the scikit-learn library, so you only need to
21:43
import the datasets package and call the function load_iris. The iris object is a Bunch object that contains different keys: the target names, the data, the target, a description of the dataset, and the feature names.
22:01
The description is a verbose description of the dataset; the feature names are the four features I already mentioned on the previous slide; the target names are the targets we expect on this dataset, namely setosa, versicolor, and virginica, the three iris species we want to predict. Then we have the data:
22:24
iris.data comes as a NumPy array, and the shape of this matrix is 150 rows times four columns, the four features, and
22:41
the targets are 150, because we have a target value for each sample in the dataset. So n, the number of samples, is 150 in this case, and d, the number of features, is four. That's it.
23:01
The targets are the expected results: each is a value that ranges from zero to two, corresponding to the three classes we want to predict.
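A sketch of that loading step:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.keys())         # target_names, data, target, DESCR, feature_names
print(iris.data.shape)     # (150, 4): n = 150 samples, d = 4 features
print(iris.target.shape)   # (150,)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.target[:5])     # integer labels in {0, 1, 2}
```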
23:24
We may try to treat this as a classification problem, and we want to exploit the KNN algorithm. The idea of the KNN classifier is pretty simple: for example, if we consider a K equal to six, given a new data point, we train our model with the training data and we predict the class of the new point based on
23:44
the classes of its six nearest neighbors. In this case it should be virginica, the red dot. Very simple. In scikit-learn it's a few lines of code: we import the dataset and we call the KNeighbors-
24:03
Classifier algorithm; in this case we select n_neighbors equal to one, then we call the fit method and we train our model. This is what we get if we plot the data: these are called the decision boundaries of the classifier. And if you want to know, for a new flower,
24:23
which species of iris has a three-by-five-centimeter sepal and a four-by-two-centimeter petal, right, let's check: iris.target_names of
24:40
knn.predict, because KNN is a classifier, so it may fit the data and also predict after the training. And it says: okay, it's virginica. So far so good, right?
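The few lines of code from the slide, reconstructed:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

# New flower: sepal 3 x 5 cm, petal 4 x 2 cm.
pred = knn.predict([[3.0, 5.0, 4.0, 2.0]])
print(iris.target_names[pred])   # -> ['virginica']
```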
25:01
Then, instead of facing this problem as classification, we may also face it in an unsupervised setting, as a clustering problem. In this case we are going to use the k-means algorithm. The idea of k-means is pretty simple: we want to build clusters of objects such that each object is assigned to the cluster with the nearest center. And
25:28
that's it. In scikit-learn it's very simple: we have KMeans, and we specify the number of clusters we want; in this case we want three clusters, because we're going to predict three different
25:43
species of iris. This is the ground truth, the values we expected, and this is what we got after calling k-means. As you may already have noticed, the interface for the two algorithms is exactly the same, even if the machine learning settings are completely different: the former case was supervised, this latter case is unsupervised.
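And the clustering counterpart, as a sketch; note that the fit() call takes no labels this time:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
km = KMeans(n_clusters=3)   # three clusters, one per expected species
km.fit(iris.data)           # unsupervised: no target passed

# Cluster ids are arbitrary, so they need not match iris.target's numbering.
print(km.labels_[::10])
```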
26:06
So, classification versus clustering. Finally, a few slides to conclude. Another great battery included in scikit-learn, and I don't know how many other
26:21
machine learning libraries in Python are as complete in terms of batteries, is model evaluation. Model evaluation is necessary: how do we know whether our prediction model is any good? We apply model validation techniques. We may
26:44
simply try to verify that every prediction corresponds to the actual target, but this is almost meaningless, because we would be verifying on the very data we trained on. This kind of evaluation is very poor, because
27:04
it is based only on the training data: we are just checking whether we are able to fit the data, not testing whether the final model is able to generalize.
27:24
And generalization is the key feature of this kind of technique. Don't cling too closely to the training data, because you will end up with a problem which is called overfitting; you need to generalize, to be robust to noise, and to be able to predict even new data that are not
27:44
identical to the training data. One technique commonly used in machine learning is the so-called confusion matrix. scikit-learn, in the metrics package, provides different kinds of metrics to evaluate your
28:01
performance; in this case we're going to use the confusion matrix. The confusion matrix is very simple: it is a square matrix whose rows and columns correspond to the classes you want to predict,
28:23
so you have all the possible matchings of expected versus predicted classes. If all the data lie in the cells on the diagonal, you predicted all the classes perfectly. Is that clear? Okay, right. Thank you.
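A sketch with the metrics package (evaluating on the training data here, which, as noted above, is the weak form of evaluation):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)
y_pred = knn.predict(iris.data)

# Rows = true classes, columns = predicted classes.
# A perfect prediction puts everything on the diagonal.
print(confusion_matrix(iris.target, y_pred))
```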
28:40
A technique that will be very well known to those of you already aware of machine learning is cross-validation. Cross-validation is a model validation technique for assessing how the results of a statistical analysis of the data generalize to independent datasets, not only to the dataset used for training.
29:02
scikit-learn already provides all the features to handle this kind of thing, so scikit-learn requires us to write very little code: just the few lines necessary to import the functions already provided in the library.
29:22
Otherwise we would be required to implement this kind of function over and over, every time, in our Python code. So this is very useful, even for lazy programmers like me.
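For instance, a full cross-validation is one import and one call. A sketch (in recent releases the helper lives in sklearn.model_selection; at the time of the talk it was sklearn.cross_validation):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
scores = cross_val_score(KNeighborsClassifier(), iris.data, iris.target, cv=5)
print(scores, scores.mean())   # per-fold accuracies and their average
```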
29:42
In this case we exploit train_test_split. The idea of the validation here is to split the training data into two different sets, the training set and the test set: we fit on the training set and we predict on the test set.
30:03
In this case we will see that there are some errors coming from this prediction. This is a more honest way to evaluate our prediction model.
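A sketch of that split (same module caveat as above):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on data the model never saw
```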
30:22
So, the last couple of things. Large scale out of the box: another great battery included in scikit-learn is the support for large-scale computation, already out of the box. You may combine scikit-learn code with any library you want to use for
30:41
multiprocessing, parallel computation, or distributed computation. But if you want to exploit the features already provided for this kind of thing, many techniques in the library accept a parameter called n_jobs. If you set this parameter to a value different from one, which is the default, it
31:07
performs the computation on the different CPUs you have in your machine; if you put the value minus one here, it is going to exploit all the CPUs of your single machine. And
31:25
this applies to different settings and different kinds of machine learning application: you may apply multiprocessing to clustering (the k-means example we made a few slides ago), to cross-validation, for instance, or to grid search. Grid search is another
31:43
great feature included in scikit-learn: it is able to identify the best parameters for a predictor, the ones that maximize the cross-validation score. So we want to get the best parameters for our model,
32:03
the ones that maximize the cross-validation score, so that the model generalizes best. Just to give the intuition.
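A grid search sketch with n_jobs=-1 (GridSearchCV is in sklearn.model_selection in recent releases, sklearn.grid_search in older ones; the parameter grid here is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
    n_jobs=-1,   # fan the folds out over all available CPUs via joblib
)
grid.fit(iris.data, iris.target)
print(grid.best_params_, grid.best_score_)
```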
32:21
This is possible thanks to the joblib library, which works in the background: under the hood, the n_jobs parameter corresponds to a call to joblib. joblib is well documented as well, so you may read its documentation for any additional details. And last, but by no means least, scikit-learn meets other
32:40
libraries: scikit-learn can be integrated with NLTK, that is, the Natural Language Toolkit, and with scikit-image, just to make a couple of examples. In detail, scikit-learn meets the Natural Language Toolkit by design: NLTK includes an additional module, nltk.classify.scikitlearn, which is actually a wrapper in the NLTK library
33:07
that translates the API of scikit-learn into the API used in NLTK. So if you have code on NLTK and you want to apply a classifier exploiting the scikit-learn library, you may import the
33:26
classifier from scikit-learn, then use the SklearnClassifier class from the NLTK package over there, and wrap the interface of this classifier around the scikit-learn one,
33:40
in this case LinearSVC, which stands for linear support vector classifier. And then you may also include this kind of thing in a scikit-learn processing pipeline.
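A minimal sketch of the wrapper (NLTK feature dicts in, scikit-learn classifier underneath; the toy features are made up):

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

# NLTK represents each sample as a dict of features plus a label;
# the wrapper converts them to scikit-learn's matrix format internally.
train = [({"nice": True, "awful": False}, "pos"),
         ({"nice": False, "awful": True}, "neg")]

classif = SklearnClassifier(LinearSVC()).train(train)
print(classif.classify({"nice": True, "awful": False}))   # -> 'pos'
```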
34:01
So, in conclusion: scikit-learn is not the only machine learning library available in Python, but it is powerful and, in my opinion, easy to use, with very efficient implementations. It is based on NumPy, SciPy, and Cython under the hood, and it is highly integrated, for example with NLTK or scikit-image, just to make an example. So I really hope that you're looking forward to using it, and
34:24
thanks a lot for your kind attention. Thank you, thank you Valerio. We have six minutes left for your questions. Please raise your hand and I'll come by with a microphone.
34:49
Well, thanks for the talk. I have two short questions. Does scikit-learn provide any online learning methods? Yes, yes. Actually, this is a point I wasn't able to include in the slides: online learning is already provided, and
35:04
there are many classifiers and techniques that offer a method called partial_fit, so you have this method to feed the model one bunch of data at a time; the interface has been extended with a partial_fit method,
35:23
so some techniques allow for online learning. Another very great usage of this partial_fit is the case of so-called out-of-core learning. In the out-of-core learning setting, your data are too big to fit in memory,
35:45
so you provide the data one bunch at a time, because they're too big to fit in memory: you call the partial_fit method to train, in the case of a classifier, to fit your model one bunch at a time.
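A sketch of the idea with SGDClassifier, one of the estimators that implements partial_fit (the batches here are synthetic stand-ins for chunks read from disk):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
all_classes = np.array([0, 1])   # every class must be declared up front

rng = np.random.RandomState(0)
for _ in range(10):              # pretend each batch is streamed from disk
    X_batch = rng.rand(100, 5)
    y_batch = (X_batch[:, 0] > 0.5).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

print(clf.predict(rng.rand(3, 5)))
```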
36:03
Okay, thanks. Second quick question: is there any support for missing values or missing labels, apart from just deleting them? In the case of online learning? No, just in general, for any machine learning: missing labels or missing data.
36:20
What do you mean? So, like, if you have a feature vector that just misses a value at the third component. Actually, I don't know. Okay, actually, I don't know. Yeah, thank you. [From the audience:] So, we have a very simple imputer that is going to impute by
36:45
median or mean along the different directions. So if you have very few missing data, it is going to work well. If you have a lot, then you might want to look at matrix completion methods, which we do not have; we had a Google Summer of Code project on this last year. They didn't finish; we welcome contributions, of course.
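A sketch of that imputer (in current releases it is sklearn.impute.SimpleImputer; at the time of the talk it was sklearn.preprocessing.Imputer):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],   # missing value in the second column
              [7.0, 8.0, 9.0]])

imp = SimpleImputer(strategy="mean")   # "median" also works
print(imp.fit_transform(X))            # NaN replaced by the column mean, 5.0
```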
37:00
Thank you. Hello, hi. I have some experience with scikit-learn, actually, and I'm a mathematician, and I had no
37:21
idea about all the stuff under the hood, and I didn't want to dive too deep into the whole algorithms and mathematics and such. The biggest problem for me was to figure out: what am I doing wrong?
37:40
So if you've got some kind of big dataset with features, labeled, supervised learning: what would you advise to someone who doesn't know how it works inside? Which steps, or which small, easy solutions, should I consider to improve the results of the
38:05
classification? Thanks. Yeah, actually, machine learning is about finding the right model with the right parameters, so there are many steps you may want to apply when training the different algorithms.
38:21
In general, you apply data normalization steps. First of all, the first step I suggest is pre-processing of the data: you analyze the data, you run some statistical tests on it, some pre-processing, some visualization of your data, in order to know
38:42
what kind of data you're dealing with. So this is the first step. The second one is: try the simplest model you may want to apply, and then improve it
39:04
one step at a time. Once you find the right model you want to use, you are then required to find the best settings for that model. In that case you might end up using the grid search method, for instance, which is provided out of the box, just to
39:29
find the best combination of parameters that maximizes the cross-validation score, for instance. And of course it's training on the job, right? So you
39:43
may find the right model for your predictions, or you may find the worst model, and then you start over again and look for different models. Hope this helps. Yes, thanks again, Valerio. I think he's going to give a talk at PyData as well, on Saturday, isn't it? Yep, on Saturday. So if you attend PyData, don't miss that talk as well. And yeah, thanks again.
40:06
Thank you very much.