Scikit-learn (2/2)
Formal Metadata
Title: Scikit-learn (2/2)
Number of Parts: 43
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/38207 (DOI)
Production Place: Erlangen, Germany
Transcript: English (auto-generated)
00:05
OK, so before the break, we introduced cross-validation. And so we used the cross_val_score function to score a specific model on a specific data set
00:22
using five-fold cross-validation. And you see that you can plug in other cross-validation strategies from Scikit-Learn here, if you need. As the output, you get the scores, the validation scores, on each of the folds. And you can compute the average score
00:41
to compare two models, for instance. Sometimes, you might also want to compute the score of the model on the training set itself. It's a bit weird, because it's not going to reflect the ability of the model to generalize. If you really want to evaluate
01:01
the intrinsic quality of a machine learning model, you should always compare on the validation set. But sometimes, for debugging, or to understand the model better, it's also useful to do the same on the training set as well. So instead of using cross_val_score, you can use cross_validate, which is a more flexible version of cross_val_score
01:20
that will output a dictionary as a result. You give the same kind of arguments, but you can also pass return_train_score=True. And the output will be a dictionary with additional information, like the training time, the scoring time, how long it took to train the model,
01:42
and so on. And you can also get training scores. And you can compute several metrics at once, like the accuracy, the ROC AUC score, the precision and recall, and all of this at the same time. So here, I'm just using it to compute the ROC AUC
02:01
score on the validation folds, as we did previously. And we see that we get the same results. But I also do it on the training set. And here, you see that the score on the training set is larger than the score on the test set. But it's still not 100% accurate.
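A minimal sketch of such a cross_validate call, on synthetic stand-in data rather than the notebook's census set (names and data are assumptions, not the original notebook code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the census features used in the notebook.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

clf = DecisionTreeClassifier(max_depth=8, random_state=0)
results = cross_validate(clf, X, y, cv=5, scoring="roc_auc",
                         return_train_score=True)
print(results["test_score"].mean())   # validation ROC AUC, averaged over folds
print(results["train_score"].mean())  # training ROC AUC, usually higher
print(results["fit_time"], results["score_time"])
```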
02:23
So can someone explain why the score on the training set is larger than on the validation set? You have an idea? A bias, someone said, I think, and someone else said overfitting.
02:43
Actually, the difference between the two is the ability of the model to overfit the training data. If the model has a lot of flexibility, it will be able to memorize perfectly the training set, like a database will do, basically, to remember individual records.
03:02
And when it sees the same record at prediction times, it says, oh, I know this one, and I know the answer, because I remember it. And so if you remember exactly the training set, then you overfit, and you can have a larger accuracy on samples that look like the training set than on real new samples.
03:22
But it's still not perfect. And the fact that it's not 100% is because the model has been constrained to not be able to remember everything. And here on this decision tree, what is this constraint? Can you explain what prevents the model from perfectly remembering
03:44
the training set? The max depth, yeah, the maximum depth. When we build the decision tree, I will explain a bit more what it means. So where did I put my chalk? So assume that the model starts with the full data set,
04:03
and it computes some statistics, and it discovers that if you consider the continuous variable hours-per-week, if you have an hours-per-week value that is less than 50, say, then you
04:26
have all the samples in that leaf that are low income. They are all low income. And for all the other samples that have people who work more than 40 hours per week,
04:42
you can make another split. For instance, on education years, the education-num variable, for instance; let's say it's above 12 or something, I don't know. Then you split the data set again
05:02
into two different subsets. And then you can iterate and select subset of the data set until you reach leaves that are either low income or high income. And they are pure.
05:20
So if you don't constrain the decision tree, it will split, recursively split the training set until there is only one sample in each of the leaves at the end. And this can be very deep. But if you constrain the model to not split beyond a depth of 8, for instance, then the leaf will not be pure.
05:43
There will be a mix of high income and low income records in the leaves. And in that case, the model cannot perfectly remember individual records because they are not pure here. So this is a constraint to prevent the model from overfitting, constraining the depth of a decision tree.
06:01
It's a strategy, basically, to force the model to summarize the training set in a way that is useful for generalization. So as an exercise, actually, you don't need to insert a new cell.
06:22
You can just play with the existing cell and re-evaluate it by changing the maximum depth and observe what it does change here and try to explain that in the context of this. Actually, I already explained it later.
06:43
Just check what it does.
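A sketch of that exercise, looping over a few depths instead of re-running the cell by hand (synthetic stand-in data; the depth values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Re-evaluate the same kind of cell with different depths and compare scores.
for depth in [1, 2, 4, 8, 16, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    res = cross_validate(clf, X, y, cv=5, scoring="roc_auc",
                         return_train_score=True)
    print(f"max_depth={depth}: "
          f"train={res['train_score'].mean():.3f} "
          f"validation={res['test_score'].mean():.3f}")
```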
07:31
So when we do a cross-validation, we do the same operation several times. And we consider each of the folds as a test fold subsequently.
07:44
And so we remove that from the training set. A fold is basically you cut the training set into subsets. And it's a slice, basically.
08:03
It could be a sub-sample. It's not necessarily contiguous elements, but it's a sub-sample. Yes? Yes.
08:28
So the question is, for each of those classifiers, we start from scratch. And yes, those classifiers, when you train classifier one, then you discard it.
08:41
Before training classifier two. So classifier two will never see those records, because classifier one has seen the records in fold number two here. And if we want to evaluate classifier two's ability to generalize on new data, we need to make sure that it has never seen them in its training set.
09:05
How can it be? So the constraint depends on the nature of the statistical models. It's independent of the cross-validation strategy. The constraint here depends on the fact that we told the machine learning algorithm to not grow
09:24
trees that are beyond a depth of 8. So let me do it. So if I say max depth equal 1, for instance, there will be only one split in the trees.
09:41
And you can see that the performance is degraded. It's 60% accuracy, both on the training set and on the test set. So you see that the model has been too constrained. Now, it's not flexible enough to make good predictions. If we give it more flexibility by allowing
10:02
slightly deeper trees, you can see that the performance is improving a lot. But those two numbers on train and validation, they are still very close. And if I go around, for instance, 7 or 8, I get the maximum performance on the validation set.
10:24
So it's probably the best depth. If I try 9, for instance, it's the same. Actually, it's slightly bigger. But there is some randomness here because we are in the area of non-meaningful results.
10:44
But here, if it's too big, you can see that now the validation accuracy is starting to degrade a bit. And the training score, on the other hand, is still increasing. And if I don't put the constraint at all,
11:00
I can use None instead. And you can see that, that's weird, it should be 100% accuracy. I'm not sure why. Maybe there are duplicated samples with different labels, I don't know. I don't know exactly why it's not 100 percent. It's very close to 100% because we
11:22
didn't put the constraint, so the leaves could be pure. And basically, the model has the flexibility to completely remember the training set. But then in that case, it's making decisions that are too noisy, basically. So we say that the model is overfitting in this situation. We have a variance problem.
11:42
And whereas when we put a strong constraint, like max depth equal 1, it's basically biasing the estimator, the decision tree estimator. It's a bias on the decision tree estimator. And in this case, it's too strong of a bias. And the model is underfitting, we say.
12:02
Underfitting is when both the validation score and the training score are very close, and they are bad at the same time. Overfitting is when the two scores are, the training score can be very good, but the validation score is not as good.
12:20
And we need to find a balance, a bias-variance trade-off, between underfitting and overfitting. In some cases, it's possible that we are underfitting and overfitting at the same time, which means that the training score is not good, but the validation score is even worse, much worse. That happens on very difficult, very noisy data.
12:46
All right, so if we go on, we can see that we can do a visual representation of this behavior of the decision tree model by using what we call a validation curve. So there is a utility function in Scikit-Learn that will do this procedure several times.
13:04
And we use matplotlib to display the results. So it's basically doing what we did manually by executing the same cell several times with different values of the max depth parameter and recording the cross-validation score in green and the training score in red.
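A minimal sketch of that validation_curve utility, on synthetic stand-in data rather than the census set (the parameter range and plot styling are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
depths = np.arange(1, 16)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="roc_auc")

plt.plot(depths, train_scores.mean(axis=1), "r-o", label="training score")
plt.plot(depths, valid_scores.mean(axis=1), "g-o", label="cross-validation score")
plt.xlabel("max_depth")
plt.ylabel("ROC AUC")
plt.legend()
plt.show()
```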
13:23
And you can see that there is some optimal value of max depth, so around 10 or 9, 8, 9, 10. And when you go to 11 or 12, it's starting to degrade. And here, we see the overfitting behavior. And here, we see the underfitting behavior.
13:42
And here, we have a good trade-off between the two. So now that we have found the best parameter for max depth, how could we improve the model further? Did you have a suggestion?
14:06
Yeah, we could change the decision tree class. We can use a different kind of model, like a support vector machine. And what we are going to do, actually, is instead of using one single decision tree, we will assemble several decision trees that are trained on slightly randomized version
14:21
of the training set and average their prediction. And this is actually a very good way to be able to train deeper trees without overfitting, because the averaging at the end is removing this. So this is what we are going to do later.
14:42
Another thing that we could do is also try to preprocess the data in a better way, like get better features, maybe enrich the features with additional data. For instance, if we have the city of origin of the people, we could collect statistics of the cities and merge them as statistics of the people,
15:04
and use that as additional information to describe the records and get better classification this way. So there are always two ways to improve stuff. Either you can improve the data or you can improve the modeling itself. And most of the time, the data is most important. But because we are doing a tutorial on Scikit-Learn today,
15:22
we will see first how to build better models. Another thing to do to analyze the behavior of the model is to plot the impact of the training set size. So it's the same kind of utility.
15:41
It's called learning_curve instead of validation_curve. But this time, we will keep the max depth parameter fixed, say at None, and we will increase the number of training samples. So we will train five different models on subsets of the data.
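A minimal sketch of the learning_curve call being described, again on synthetic stand-in data (training-size fractions are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeClassifier(max_depth=None, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="roc_auc")

plt.plot(sizes, train_scores.mean(axis=1), "r-o", label="training score")
plt.plot(sizes, valid_scores.mean(axis=1), "g-o", label="cross-validation score")
plt.xlabel("number of training samples")
plt.ylabel("ROC AUC")
plt.legend()
plt.show()
```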
16:03
Here on the first point, we just have less than 2,000 samples selected randomly. And here, the last point is the full data set. So you see that if I don't constrain the model, there is a large overfitting behavior. And this overfitting is actually larger when the data set is smaller.
16:22
And you see that the validation score is increasing slowly with additional data. So when you have an overfitting issue, a very good thing to do is to try to collect more labeled samples. It's actually the best way to combat overfitting, even though in this case, the model is too bad.
16:40
But we can do the same with a model of depth 15. And you can see that even the model at the beginning on a small training set with a depth of 15, it has enough flexibility to perfectly memorize a small training set.
17:00
But when the training set is increasing, the training accuracy is decreasing. And so it's kind of a boundary on the validation accuracy. The validation accuracy is never going to be better than the training accuracy. And so you can see the trade-off and the impact of the training size this way. And ideally, so if we take max depth equal 8,
17:25
you can see that for the full training set here, we see that the cross-validation score and the training score are very close. But if we decrease the training set size, the gap is increasing very quickly.
17:41
And if we constrain the model too much like here, or maybe I can try even lower, you can see that it's constrained from the beginning. And the more data you add, you don't benefit from adding more data because the model is underfitting at the beginning.
18:00
And so you cannot benefit from more samples. All right, do you have questions on this? Or is it clear? Okay, so as we said previously, a good way to go beyond this
18:22
is to build ensembles of trees. So there are two kinds of ensembles of trees, two families. There is the random forest classifier and the gradient boosting classifier. So instead of training one tree, we'll train 30 trees, for instance. But we can do that in two different ways.
18:43
Either we train many trees independently of one another and average their predictions; that is the random forest classifier. And the other way is to train a first tree and then look at the errors, the prediction errors that it makes.
19:01
And you train a second tree that tries to predict the errors of the first tree and fix it basically. And then train another one that tries to fix the error of the past two trees together and so on. And so you train new trees sequentially that are trying to improve the errors of one another. And so this is what a gradient boosting classifier does, basically.
19:28
So for a random forest classifier, it's fine to use very deep trees. Whereas for a gradient boosting classifier, we tend to use shallower trees, like depth 3 to 8, for instance.
19:41
Whereas with a random forest classifier, depending on the size of the training set, you can either set max depth equal none or use a large depth, more than 10, for instance. So we will try the two of them with some parameters. Usually when you increase the number of estimators, the number of trees in the forest,
20:01
it's getting better and better until some point. But then it's also getting slower and slower because you have to do a lot more computation to train all the trees. So the rule of thumb is to increase the number of trees until you see no improvement, as long as you're patient enough to wait for the training.
20:22
In this case, for 30 trees on this small dataset, it's quite fast. And you see that we have improved the ROC score by 1.4%, 1.5%. And for gradient boosting classifier, here I use 100 trees and a maximum number of leaves
20:47
is constrained to be 5 at max. It's another way to constrain the depth. It's a different way to constrain the model. And you can see that the ROC score is even better. So it's actually very often the case that those two models,
21:03
they have approximately the same score, but gradient boosting tends to be slightly better. It's not always the case, but it's very often the case. So this is why people doing Kaggle competitions, they use XGBoost or gradient boosting or stuff like that.
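A rough sketch comparing the two ensemble families on synthetic stand-in data (30 trees for the forest and 100 shallow boosted trees, as in the talk; everything else is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=30, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100,
                                                    max_leaf_nodes=5,
                                                    random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```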
21:21
So now that we have a good model, we could tweak the parameters even further and we will show later with Tim how to do that in a more principled fashion. But assume that this is the best model that we could select. It's good to evaluate the model on the final test set and to compute some performance metric.
21:42
Yes? OK. So the question is, does the random forest implementation in scikit-learn do feature selection? Feature selection is just dropping features that are useless.
22:01
We will see later how we can ask the model which features are important to it. And then we could use that information to trim explicitly those features. But by default, it will just compute stuff and don't use features that are not too useful.
22:22
But it will try them anyway. So it's wasting computation to use too many features that are useless, and it's also a source of a bit of overfitting, even though for random forest it's not such an issue because by default it will not use them too much.
22:41
But it's good to trim the features that are useless in general to reduce overfitting and to increase computational speed. The random forest and gradient boosting have the same behavior in that respect. They can compute feature importances that you can use to trim the features manually if you want.
23:03
We will see that later. So now that I have found this model, basically, when I do cross_val_score here, I will train many copies, five copies, of the same model on different subsets of the data. But I will not modify the original model.
23:22
I will modify copies. And basically, I select the hyperparameters of the models, those values. So now that I'm happy with those values, I can fit this classifier on the full training set. So I'm no longer training on cross-validation subsets; I'm using the full training set.
23:42
And you see that it's taking 1.8 seconds in this case because I'm using the %%time cell magic here. And then I can compute the score on the real test set. And you can see that my final test
24:02
ROC AUC is actually matching the confidence interval of the cross-validation. So it's a good sanity check. My cross-validation procedure is actually representative of the true generalization ability of the model. And I can further analyze how my model is making errors
24:24
by using the classification report function of scikit-learn, which is basically computing precision, recall, and F1 score for all the classes. And in this case, the positive class is this one. So the precision of the positive class is 78%.
24:46
It means that out of all the records that have been predicted as positive by my model, 78% of those records are actual true positive high-income people.
25:01
And the recall is the ability of the model to find them. So out of all the high-income people that I should have found, how many were retrieved classified positive by the model? So you see that when you have highly imbalanced classes,
25:21
you have a trade-off between the two. And F1 score is an arbitrary way to combine them into a single score. I think it's more important to look at precision and recall individually. And you can also see the number of records in each class. So it's informative as well.
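A sketch of that final step: fit once on the full training set and report per-class precision and recall on the held-out test set (synthetic, imbalanced stand-in data; names are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the census set.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=5,
                                 random_state=0)
clf.fit(X_train, y_train)                 # fit once on the full training set
proba = clf.predict_proba(X_test)[:, 1]
print("test ROC AUC:", roc_auc_score(y_test, proba))
print(classification_report(y_test, clf.predict(X_test)))
```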
25:40
And there is another way to represent this, which is to do a precision-recall curve. And basically, we are trying to set different thresholds. By default, a classification model that has a predict_proba function that outputs probabilities of the target class will make a prediction of positive or negative
26:02
based on whether or not the output is more than 0.5 probability. But we can change that threshold. Instead of 0.5, we could use 0.2 or 0.8. And by setting to 0.8, our model will be more precise, but it will miss some positive samples. So the precision will increase, but the recall will decrease.
26:23
And we can, by changing this threshold, get a line of the behavior of the model. So it's the same model that has been trained once, but we just changed the threshold on the outcome of the predictions. And we see this curve. So a perfect model would have a flat curve that is always 1 for all the precision level.
26:42
We'll have a recall of 1 everywhere. But you never get this. So you get something like this in practice, and you have this trade-off. And you could select a threshold, for instance, of... Here you have the threshold value, so you could use that to find a dot here where you have a precision that is higher than 0.8,
27:01
for instance, and you want to find the best recall at that point, and 0.6, for instance. It depends on your application. Sometimes you want to have models that are very precise and you want the best recall, or sometimes you want a large recall and you have a procedure later to filter out stuff that is not interesting. So in medical applications, for instance,
27:20
you want high precision if you want... I know it actually depends. You can have a high recall to diagnose a disease and then use additional scanning procedures to refine the diagnosis, for instance.
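A rough sketch of computing the precision-recall curve and picking a threshold by hand, on synthetic stand-in data (the 0.8 precision target follows the talk; everything else is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, proba)
# Among operating points with precision >= 0.8, pick the one with best recall.
ok = precision[:-1] >= 0.8            # thresholds has one entry fewer
best = np.argmax(np.where(ok, recall[:-1], 0.0))
print("threshold:", thresholds[best])
print("precision:", precision[best], "recall:", recall[best])
```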
27:42
In that setting, you don't want to miss a possibly deadly disease. As for the threshold array, you can have a look at it. Actually, I'm skipping many points, every 100 points; this is the step size, basically. And you can see that the thresholds,
28:00
they are moving between 0 and 1, and you can have a look. We don't have a good way to select the threshold inside scikit-learn for a given precision. This is a tool that we should add in the future, but it's not there yet. But you can do it manually using those arrays. And to come back to the question about feature selection,
28:21
once you have trained a gradient boosting classifier or a random forest in scikit-learn, you have this attribute, feature_importances_, with a final underscore here. This underscore is a marker that says that this attribute has been extracted from the data.
28:40
It didn't exist before the call to fit. It has been added after the fit. So it's good for introspecting the model after it has been trained. And basically, it's a hat because it's an estimate. It's just that we don't put the hat on top of the variable name, so we put it at the end this way. This is the original justification.
29:01
So instead of having a look at the raw numbers, what we can do is use matplotlib and the column names from the original pandas data frame and visualize the heights of the bars. Those are the feature importances. And you see that the two most important features according to the gradient boosting classification model
29:20
is the capital gain and capital loss, then the age, then the education. And the marital status and stuff like that is not that important. And hours per week apparently is not that important once you know the others, basically. And what is good is that the race apparently
29:41
does not impact too much the income, given the others. And the education, the type of education is not informative when you know the number of years of education and so on. So actually, those two features, capital gain and capital loss, it's probably, I'm not sure,
30:01
but we need to check the description of the data set. But I think it's the amount of capital that was gained over the last year or lost over the last year. And basically, if you have a large gain or a large loss in capital, it means that you are quite rich in the beginning because you cannot lose money that you don't have. So at least you must have some high income in the past.
30:24
So they are quite informative. It would be interesting to discard those two values and then to retrain a model without them and see how it performs and see what are the important features.
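A minimal sketch of that bar plot, with generic column names standing in for the real pandas column names (everything below is illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
columns = [f"feature_{i}" for i in range(X.shape[1])]  # stand-ins for census columns

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=columns)
importances.sort_values().plot.barh()
plt.xlabel("feature importance")
plt.tight_layout()
plt.show()
```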
30:41
Yes? Because if you compute a single capital-change feature, basically the absolute value of the sum of the two or the difference of the two,
31:01
you lose information. You have more information this way. If you train a linear model, it might help it to pre-compute the sum because it's not gonna, even for a linear model, maybe it's useful to have more information. Yes?
31:22
So the question is, is the feature importance attribute unique to decision trees or decision-tree-based models? And yes, that's the case. Most scikit-learn models have no good default way to give you what is important as a feature, except for the linear models,
31:42
like logistic regression, for instance, or least squares or stuff like that. You have the coefficients and you can directly interpret the coefficients of the linear model. If they have a high magnitude, it means that they are important, assuming that you have your data scaled. So it's not necessarily easy to interpret.
32:01
And here you see the importance, but you don't see whether or not it's positively or negatively correlated to the outcome because it's a non-linear model. So actually you could have a variable. You could have a continuous variable as the input
32:26
and suppose that this is xi, some input variable, and this is the target variable, and assume this is a regression problem and not a classification problem, so it could be a continuous value. And the output could be related to an input like this,
32:43
which means that they are very correlated, but a big value of xi does not necessarily mean a big value of y. And a random forest is able to deal with these non-linear dependencies. So there is no good way to say that xi is positively correlated with y
33:05
if xi is positive, if it's there, and it's doing the opposite in that region. Whereas a linear model on this kind of feature would not be able to benefit from it.
33:25
Okay, I think we stop here for this notebook. And you can go to the end later if you want; it shows a way to build your own estimator to make it even better, based on a Facebook paper. So a couple of lines of Python code
33:41
to combine logistic regression and gradient boosting together, basically. So the next notebook is about hyperparameter optimization, and Tim is gonna introduce it.
34:03
So I clear the cells. I don't even know at what time she will stop.
34:30
Yeah, so living in Switzerland, I was thinking, okay, my train is going to be late because it's a German train, it's not a Swiss train and thing. And so I thought about that,
34:42
and then I did not think at all about what happens if it's my fault to be late. So I have no good excuse or joke or anything like this prepared. Besides being incredibly embarrassed. Okay, so now we will talk about hyperparameter optimization,
35:01
and the idea is, so far we already investigated what happens if you change the max depth of your decision tree. You also have tried single decision tree, gradient boosted trees, random forests. All these things are what I would call hyperparameters.
35:22
So the parameters that you adjust, basically all the parameters that you pass in the constructor to an estimator in scikit-learn, are hyperparameters because you have to choose them. You don't learn them from the data directly. And I would include in that the choice of which model you use, because that's something you decide
35:41
when you wake up in the morning, and then you need to try different ones. So you can either do that by hand, by changing the value and retraining the model and doing all that, or because that's something you do very frequently in scikit-learn, there's tools that automate a lot of this for you.
36:05
So the first few, so should I restart the kernel or you think it will work? Oh, we'll see. We'll see what happens. So the first few cells in the notebook
36:21
just repeat what we had in the previous one. So they load the data. And in this case, actually, you can see I just use get_dummies because I was too lazy to do the categorical encoding in some smarter way. Then we split the data.
36:42
And so because I find that when you discuss with people and you call it train-test-splits, and then you reuse the word train-and-test-split for your individual folds, people get super confused.
37:02
So I decided some time ago that at the beginning I will always split my data in something which I call the development set, which I use to develop my model. And then there's a second part called the evaluation set, which essentially you forget that it exists until you're really sure that you're done with your work.
37:23
And this is what Olivier called the test set. Okay, and then one more thing. We train a classifier and we see that we get back the same AUC score as before.
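A minimal sketch of that preparation step, using a tiny made-up frame instead of the real census file (column names and values below are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny inline frame standing in for the census data (values are made up).
df = pd.DataFrame({
    "age": [25, 38, 52, 44, 31, 60, 47, 29],
    "workclass": ["Private", "Private", "Self-emp", "Private",
                  "Gov", "Gov", "Private", "Self-emp"],
    "income": ["<=50K", "<=50K", ">50K", ">50K",
               "<=50K", ">50K", ">50K", "<=50K"],
})
y = (df["income"] == ">50K").astype(int)
X = pd.get_dummies(df.drop(columns=["income"]))   # one-hot encode categoricals

# "Development" set for building the model, "evaluation" set kept aside
# until the very end.
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.25,
                                                random_state=0)
```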
37:44
And now the question is, how would you go through trying these values in a more principled fashion than we've done so far? So in Scikit-Learn, there's something called GridSearchCV. And this does exactly what the name suggests.
38:02
You build a grid of hyperparameter values that you want to try. And this grid could be a 1D grid. So in this case, I just defined for the parameter called max_depth that I want to try the values 1, 2, 4, 8, 16, and 32. And then you pass them to GridSearchCV as a parameter.
38:24
You define what kind of scoring metric you want to use. And then GridSearchCV behaves just like any other estimator in Scikit-Learn, which is a trivial statement to make, but it's actually super useful that there's nothing special
38:44
you need to know about it. You instantiate it and you can fit on it. And then it goes off and now it tries all six of these values and does cross-validation for each of the values and all this stuff that normally you would see as a for loop
39:01
is wrapped up inside it. So then the question for you is, can you print out the values of max-depth that we used without looking in the, so the trivial solution to printing out these values
39:22
is to look at, to print out param_grid, that variable, right? So you can do that. But there is a different way of doing it. So you can look at the attributes of the grid search object, and it contains all sorts of information about what went on on the inside.
39:45
And that will tell you what values it's tried and you can also find the train and the test scores and all sorts of other information in there. So I'll give you some time to start poking around.
40:04
Basically, type grid_search. and then hit Tab for tab completion. I think you will find what you're looking for.
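A rough sketch of the grid search being described, on synthetic stand-in data (the max_depth values follow the talk; return_train_score=True is added here so the training scores discussed later are available):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# ~100 features to mimic the one-hot encoded census frame.
X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

param_grid = {"max_depth": [1, 2, 4, 8, 16, 32]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid, scoring="roc_auc", cv=5,
                           return_train_score=True)
grid_search.fit(X, y)

# The searched values and the averaged scores live in cv_results_.
print(grid_search.cv_results_["param_max_depth"])
print(grid_search.cv_results_["mean_test_score"])
```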
40:42
Okay, so the question is, for other things which are not Scikit-learn estimators, you would like to search the parameters? No. So the short answer is yes and at the end
41:03
of the notebook I will talk about one more. So it's not part of Scikit-learn, it's called Scikit-optimize. And that is something which you can use to find parameters on any kind of algorithm. But that's a completely different story in principle.
41:32
Yes. Yeah, because inside it does cross-validation. So you can see, I don't have a, I can't find it,
41:58
but there is a parameter called CV equals
42:01
which you can either give an integer to or a more sophisticated CV object. And they will use that to do the cross-validation on the inside. And the evaluation data you essentially leave to the side until the very end.
42:20
Has anyone managed to find the parameters? Yes? Or you want to ask a question? Okay, so yeah. So this is, the key attribute to look at
42:41
is this cv_results_ attribute. And it looks like a dictionary and it contains lots and lots of interesting stuff. So for all the parameters that you searched over, there will be an entry that starts with param_ and then the name of the parameter. And then you also have the mean training
43:04
and testing score for the different folds. So you can print them out and potentially make a nicer table than this. And then one thing that will help also towards later
43:23
is you can make a plot similar to the validation curve that we had in the previous notebook and plot what the testing and training score is. And I would just go and copy and paste the code for that because it will, printing out as a table
43:41
is a little bit hard to visualize what's going on. So I would go and copy and paste the code from the validation curve example previously and modify it to use the values from the grid search. That's the next exercise.
44:08
I will wait.
45:44
Okay, so if you look at what I did, this is essentially a copy and paste, but you now pass the name, well, you continue to pass the name of the parameter you want to make a plot for
46:02
and you pass the cv_results_ object. So then you can look up the scores for this parameter, because we know that if it is something that you passed through your grid search, there will be an entry called param_ plus the name of the parameter.
46:21
And then there used to also be some stuff that deals with averaging the training scores from the cross-validation, but GridSearchCV already does that for us. So you pull them out and then you make a scatter plot instead of a line. So these are the only differences from what we had before.
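A sketch of what such an adapted helper could look like; the name plot_grid_scores matches the function mentioned later in the talk, and it assumes the fitted grid_search object from the sketch above (with return_train_score=True):

```python
import matplotlib.pyplot as plt

def plot_grid_scores(cv_results, param_name):
    """Scatter the averaged train/test scores against one searched parameter."""
    values = cv_results["param_" + param_name].data.astype(float)
    plt.scatter(values, cv_results["mean_train_score"], c="r",
                label="training score")
    plt.scatter(values, cv_results["mean_test_score"], c="g",
                label="cross-validation score")
    plt.xlabel(param_name)
    plt.ylabel("ROC AUC")
    plt.legend()
    plt.show()

plot_grid_scores(grid_search.cv_results_, "max_depth")
```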
46:41
And if you do that, luckily enough, you get exactly the same curve as we had before. So the only thing that's missing is that the x-axis is not log scale anymore. And you can see at the beginning, the red dots and the green dots are on top of each other. And then as we increase the max depth,
47:02
they start spreading apart. And the training score keeps increasing until it gets to one, as you would expect for a super complicated model. But then you can see that it doesn't do very well on the testing data anymore.
47:20
So then the next thing to do is to extend the grid search to now also search over a parameter called max_features. And I would just try 3, 6, 12, 24, 48, and 96. And it should be a one-line job for you
47:43
because all you have to do now is, in your parameter grid, add a new entry and list the values that you want to try. And then you can re-run the grid. You have to re-instantiate the grid search object and then you can run it again and everything should work as before.
48:01
It will take a little bit longer to run because there's more values now to try, but otherwise it's as before.
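A self-contained sketch of that two-dimensional grid (synthetic stand-in data; six values per parameter, as in the talk):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

param_grid = {
    "max_depth": [1, 2, 4, 8, 16, 32],
    "max_features": [3, 6, 12, 24, 48, 96],
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid, scoring="roc_auc", cv=5,
                           return_train_score=True)
grid_search.fit(X, y)      # 36 combinations now, so it takes a bit longer
print(len(grid_search.cv_results_["params"]))   # -> 36
```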
48:28
Yeah, so the question is, does grid search exhaustively search the search space? And the answer is yes. So I forgot to mention that. That is why it's called grid search because it will make a grid of all possible combinations
48:42
and will try all of them. So then the next thing to do is, in, I don't know, half a screen.
49:01
And I'll give you an answer in a second but I'll let people type and then talk at them.
49:28
I think all of you should be there now and you could already see this took quite, took a little bit longer to run than the previous one because it tries a lot more values.
49:42
So yeah, all you need to do to add an extra dimension to your grid is add a new entry to your param_grid dictionary and list the values you want to try. And then everything else stays the same, to the point that we can reuse,
50:07
we'll get to that. We could reuse our plotting functions and so on to plot the results for individual parameters because the contents of the cv_results_ object stay the same.
50:23
Another thing to do is to try and pick the best set of parameters out of your cv_results_ object. And then for you to think about: should you pick the ones which have the best test or the best train result?
50:43
And then we wait one more screen to get to your question.
51:08
Yeah, so for 2D, you can, so the question was, or the statement was that it is more difficult to visualize, and that's true. So for 2D, you now need to make
51:20
some kind of contour plot. That's just about what we can still do and understand and yes, in 3D, 4D, 5D, it becomes super difficult to try and visualize. Okay, so how many parameter combinations did we just evaluate?
51:45
Somebody managed to find inside the grid search CV object the answer to that.
52:01
So there should be several attributes which contain results or an exhaustive listing of all the parameter combinations tried so you can take the length of that. Or for 2D, you can also do it in your head probably. Yeah, and then you can,
52:21
if you want to pick the top rank model, you can ask the results object again to tell you how they're ranked by the test score.
52:42
And then this, all of this picks out some of the information about the model that was trained and prints it out. So you can see then your best model has a score of 89.7 or 0.897
53:00
and these are the parameters that led to that kind of score. And then there's potentially something that you can learn from looking at what slightly lower ranked models, do they have completely different hyperparameters? Do they have similar hyperparameters? There's potentially something you can learn
53:22
about what the search space looks like. And yeah, for 2D, it's maybe still doable and in 3D, then it gets more complicated and yeah. But you can see, for example, the difference between the top model
53:44
and already the third ranked model is significant. So it makes sense to search for these hyperparameters and not just use the default values.
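A sketch of pulling out the ranking, the best parameters, and the number of evaluated combinations, assuming the fitted two-dimensional grid_search object from the sketches above:

```python
import pandas as pd

results = pd.DataFrame(grid_search.cv_results_)
columns = ["rank_test_score", "mean_test_score",
           "param_max_depth", "param_max_features"]
print(results.sort_values("rank_test_score")[columns].head())

print("number of combinations evaluated:", len(results))
print("best parameters:", grid_search.best_params_)
print("best cross-validation score:", grid_search.best_score_)
```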
54:02
So here now we can reuse our plot grid scores function and what you will see is that there's now lots, lots more points for each value of max depth. Does anybody have an idea of why that happens?
54:33
Yeah, so what happens is that for each value of max depth, you also try several different values of max features
54:42
but you're always evaluating it at the same value of max depth. Or if you made the plot the other way around, you would also see six values of max features and then the other dimension of this problem collapse onto this.
55:01
So this, together with your question about what to do if you have a very big grid where you don't want to evaluate every possible value: there is one answer to both these problems. Because now what we did here is spend an awful lot of time evaluating the model, but we still don't know anything really
55:22
about what the shape of this curve looks like in more detail than if we just did six points for max depth. But one way around this is then to do a random grid search. So now you specify the distribution of the hyperparameter,
55:43
which can be a random or uniform random number generator. And then you say how many values, or how often do I want to evaluate the model? And you pick a number that is consistent with how much time you have to wait.
56:01
And if you do that, so the name of the class to use is RandomizedSearchCV, because it will just pick points at random in your grid. And you have to also specify your grid slightly differently, you cannot,
56:20
or you can also just list explicitly values if you want to. But usually people give a distribution from which you can draw a random number. So in SciPy stats, there's lots of these distributions. And if you have, for example, if you have an idea that smaller values you want to sample more finely
56:41
or larger values you want to sample more finely than small ones, then you choose a different distribution to sample from for your hyper parameters. No, so I'll wait for it to run
57:04
and then I'll show you how we can find out. Because I'm not, okay, so we don't get to use
57:30
the built-in help in the notebook because it's running now. But luckily you can just type your, essentially you type your question into Google and you. So what did we want to find the answer to?
57:41
Max features, so there you go. So there's, in the way the decision tree is built, you could say that I only want to consider a subset of my variables when I'm trying to make a split in the tree. And then the question is how big should this subset be?
58:04
And this is what max features controls. So it's another way to influence how the tree is grown. So now I have to go back, yeah, so this is run by now.
58:26
And the important thing is that we gave a distribution for our hyperparameters. And we also had to now choose how many times we want to evaluate, or how many values we want to sample.
58:43
And this, essentially, you decide upfront how much time you have and then you set it. And if you now make the plot again, you can see you get a much better view of how things change as a function of max depth.
59:02
Because chances are that every point you pick at random from your parameter grid will have a different value for max depth. So now we get a very nice curve showing what happens as you increase this.
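A rough sketch of such a randomized search with scipy.stats distributions, on synthetic stand-in data (the distributions and n_iter below are illustrative choices, not the notebook's exact values):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

param_distributions = {
    "max_depth": randint(1, 33),         # integers drawn uniformly from [1, 32]
    "max_features": randint(1, 101),
    "min_samples_leaf": randint(1, 20),
}
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0), param_distributions,
    n_iter=50, scoring="roc_auc", cv=5, random_state=0,
    return_train_score=True)
random_search.fit(X, y)
print(random_search.best_params_)
```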
59:21
And this is particularly useful if one of the dimensions of your search space doesn't actually influence the performance of the classifier. So sometimes it happens that
59:40
one of these hyperparameters you can essentially set to any value that's legal and the performance of your classifier doesn't really change. The problem with this is that you usually don't know upfront which hyperparameter that's going to be. So if you
01:00:00
build a grid, then you spend an awful lot of time evaluating different values of this useless hyperparameter, and you don't increase the amount of knowledge you have about the shape of the distribution for the other hyperparameters. Whereas if you pick the points at random, then it doesn't matter so much that one of them is completely
01:00:22
useless, because you do get a new point on the dimensions where there is some variation. So you can also increase this now to three dimensions, and if you carefully pick the range for min_samples_leaf, so this is the minimum number of samples
01:00:42
you want to have in a leaf, and it again stops the splitting early — if you carefully pick that range, then you have an example of such a parameter: you can see that the performance for this one, at least within that
01:01:04
range, doesn't really vary very much. So that's an example of a parameter that, at least in this problem, doesn't really affect the solution. The other thing you do see, though, is that now that we've increased the number of dimensions, it's
01:01:22
not so clear anymore that a max_depth of 8, or 10, is really the cutoff point, and the gap between the red and the green points has become much smaller. That is because each point here has
01:01:41
potentially a completely different setting — two neighbouring points here can have completely different settings for max_features and min_samples_leaf — so it starts becoming more difficult to see what the behaviour is for this parameter alone. But you can just ask your random search object
01:02:10
what the best parameters are, and it will tell you: these are the ones with the highest test score. And then you can even keep using the GridSearchCV
01:02:20
or RandomizedSearchCV object as if it was a fitted version of the best classifier, which is why it's so nice that it behaves exactly like the classifier. Okay, so if random search is still not good enough for you and you want to do something even smarter, then there's Bayesian optimization, and
01:02:43
that uses a library called scikit-optimize, which you can also use for things that have nothing to do with scikit-learn if you want to. The idea is that neither grid search nor random search learns anything from the
01:03:01
points it has already evaluated. In grid search we decide all the values of the hyper-parameters that we're going to try up front, and then we just go through them, try all of them, and wait until we're done. In random search we also decide the distribution, and we decide how many values
01:03:22
we want to try, or how many values we can afford to try, but we learn nothing from the first few values that we try — we just keep picking them at random. So you can do something smarter than that, and this comes under the name of Bayesian optimization, or sequential model-based
01:03:40
optimization; these are the two names for this. The idea is very simple: you evaluate a few points, and then you fit a regression model that uses the values of the hyper-parameters as features and tries to predict what the score of your classifier is going to be. And then you say: well, now I have
01:04:03
a regression model, so I will go to where the maximum of my predicted score is and evaluate those hyper-parameters next. That's a super simplified version of what Bayesian optimization does.
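To make that loop concrete, here is a deliberately over-simplified sketch — my own illustration, not how scikit-optimize actually implements it; real tools use proper surrogate models and an acquisition function instead of the raw predicted maximum:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def simple_smbo(evaluate_score, candidates, n_init=5, n_iter=20, seed=0):
    """evaluate_score(params) -> cross-validated score; candidates: 2-D array of settings."""
    rng = np.random.default_rng(seed)
    tried = list(rng.choice(len(candidates), size=n_init, replace=False))
    scores = [evaluate_score(candidates[i]) for i in tried]
    for _ in range(n_iter):
        # Regression model: hyper-parameter values as features, observed score as target.
        surrogate = RandomForestRegressor(random_state=0).fit(candidates[tried], scores)
        predicted = surrogate.predict(candidates)
        predicted[tried] = -np.inf              # do not revisit points we already evaluated
        nxt = int(np.argmax(predicted))         # go where the predicted score is highest
        tried.append(nxt)
        scores.append(evaluate_score(candidates[nxt]))
    best = int(np.argmax(scores))
    return candidates[tried[best]], scores[best]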
01:04:53
Does that make some amount of sense to you? Okay, so the question is: what is the difference between calling predict on the random search and using the best estimator? So if you
01:05:07
call predict after fitting the random search, it will use the estimator stored in random_search.best_estimator_, so that attribute is a way to extract it or look at it. Let me check: max_features is 91 here and max_features is
01:05:45
91 there, max_depth is 18 and max_depth is 18, so for me it does seem to work now — this is what it should do, and the mismatch should not happen. Basically,
01:06:11
the best_estimator_ attribute is a way to go and look at the fitted model and poke around in it. For example, if you wanted to extract the feature
01:06:20
importances: I don't think the random search object has a feature_importances_ or feature-weights attribute, so for that you have to go to the actual estimator.
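As a small sketch, continuing from the earlier randomized-search example (so `search` is assumed to be that fitted RandomizedSearchCV):

print(search.best_params_)    # setting with the highest cross-validated test score
print(search.best_score_)     # the corresponding mean score

# The search object behaves like the refitted best classifier ...
y_pred = search.predict(X_dev[:5])

# ... but model-specific attributes, such as the tree's feature importances,
# live on the underlying estimator.
print(search.best_estimator_.feature_importances_)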
01:06:46
Sorry, yes — so the question was: what's the difference between the best_estimator_ attribute and using the random search directly as an estimator? The answer is that there should not be a difference, and we were just discussing that sometimes a mismatch seems to happen. Okay, so the next question is:
01:07:15
why use Bayesian optimization instead of gradient descent? The answer is:
01:07:21
take, for example, the value of max_features — what is the derivative of the ROC AUC with respect to the setting of max_features? I think nobody knows how to compute the gradient for this, which is why you cannot use an optimizer that
01:07:41
relies on the gradient — which would be great, because you would probably get there much more quickly. So instead you have to use an optimizer that doesn't need gradients. Then, usually, it is also
01:08:02
very expensive to train your model. And the third, or two-and-a-half-th, reason is that the answer you get is also noisy: because of the randomness in how the splits are done, you will get different scores for the same settings. So this then leads people to
01:08:23
doing Bayesian optimization. Instead you could, for example, also use genetic algorithms, but they tend to be very costly: they require a lot of function evaluations, which becomes very, very expensive if it takes, say, a day to train your model — or, I
01:08:45
don't know, for AlphaGo or something, weeks of GPU time. So the question is: how much risk is there of getting stuck in a local optimum? I don't know, I
01:09:06
don't know a good general answer to that. I mean, I will talk about Bayesian optimization in more detail on Thursday, but I think, as much as I
01:09:24
love it and think it's very clever, the received wisdom at the moment seems to be: if you can afford to rent twice as many computers for your problem, keep doing random search at twice the speed, or three times the
01:09:41
speed, or four times the speed, and you will probably beat something like Bayesian optimization — not in terms of how many CPU hours you use, but in how many human hours you have to wait. So the question is: can you improve
01:10:03
on random search with Bayesian optimization? The answer is yes, or probably, because what Bayesian optimization does to get started is pick some values at random to initialize everything, and only then does it fit a model and try to pick the next best point to go to — it can't
01:10:21
do that until it has a few evaluated values. Okay, so if you managed to install scikit-optimize without internet this morning, then you can import a class called BayesSearchCV from skopt, and it has an interface
01:10:48
which is almost exactly like GridSearchCV or RandomizedSearchCV. So again you provide your classifier and, in a slightly different notation, the
01:11:01
dimensions of your search space, and you also tell it how many points it should evaluate — for fairness maybe we should do 36 as well, but actually let's see if we also get there with 15 — and then you call fit on it with your development set.
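Roughly, the call looks like this — a sketch that assumes scikit-optimize is installed and reuses the stand-in data from the earlier sketch; the ranges are only examples:

from skopt import BayesSearchCV
from skopt.space import Integer
from sklearn.tree import DecisionTreeClassifier

opt = BayesSearchCV(
    DecisionTreeClassifier(random_state=0),
    # the search space, in skopt's notation rather than scipy distributions
    search_spaces={
        "max_depth": Integer(1, 20),
        "max_features": Integer(1, 20),
        "min_samples_leaf": Integer(1, 50),
    },
    n_iter=15,          # number of points to evaluate
    scoring="roc_auc",
    cv=5,
    random_state=0,
)
opt.fit(X_dev, y_dev)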
01:11:21
Now we actually have to wait a little bit, because it's not just picking points at random: it actually fits a model and optimizes some function to try and pick the next point, and so on. So it takes a little bit longer than just picking them at random. And you can see you
01:11:46
recover the same behaviour for max_depth as before — there's no magic here — but you can see there's a big hole here, for example, so there's potentially some intelligence in Bayesian optimization,
01:12:02
in that it doesn't try intermediate values; it jumps to a bigger one fairly quickly. And then you get a very similar score for your best setting, and again, if you want to, you can print out the best parameters and the best score, and it really should look like one of the GridSearchCV or
01:12:26
RandomizedSearchCV objects in scikit-learn. And this is the end of the hyper-parameter optimization part, I think. If we do some more categorical feature engineering later, you could come back to this, because
01:12:43
you can then build a pipeline which contains, as a choice, how to encode your categorical features, and optimize over that as well as over the settings for your classifier. Then you have, I don't
01:13:02
know, a 5-D space or something like that, and it really starts getting more complicated. I would say: if you have a small space, use GridSearchCV and just try all the values, because it's cheap; then, as it starts getting bigger, maybe start using random search; and if you have a
01:13:24
large number of dimensions, then maybe start investigating Bayesian optimization — but that also comes with some caveats, so it gets difficult the more dimensions you have. So the question is: can you parallelize
01:13:46
Bayesian optimization? The answer is: a little bit. The problem you're trying to solve is that you look at what you've evaluated so far to make a prediction of where to try next, and if you want to do that massively in parallel, you have no results to look at when you need to
01:14:03
make predictions of what to try next. So the benefit of doing this complicated Bayesian optimization stuff becomes smaller the more you want to parallelize. In scikit-optimize there are some things that can run maybe a factor of two or four in parallel, but much beyond that I
01:14:25
would not use it. And no, by default it is turned off, because you need to think about what strategy you want to use to deal with the fact that you're trying to make several predictions in parallel. For random search and grid search, yes,
01:14:55
you can just specify the n_jobs parameter and it will run in parallel.
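For example, a hypothetical variant of the earlier randomized-search sketch, spread over all CPU cores:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

parallel_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": randint(1, 20)},
    n_iter=36,
    cv=5,
    n_jobs=-1,   # evaluate the candidate settings in parallel on all available cores
)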
01:15:01
This is why I was saying that if your limit is human hours and you can afford to buy more computers, then random search is a very good strategy, because it's trivial to parallelize. Can you ask your question again? Okay, so: can you learn more about which
01:15:48
hyper-parameters are important using Bayesian optimization? Yes and no. In scikit-optimize we have a few functions, or methods, that allow
01:16:01
you to make plots, also for higher-dimensional things. For example, you can make partial dependence plots, and that's maybe a way to learn something about your search space.
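For example, something along these lines — hedged, since the exact plotting helpers differ a bit between scikit-optimize versions, and `opt` is assumed to be the fitted BayesSearchCV from the earlier sketch:

from skopt.plots import plot_objective

# Partial-dependence-style view of the surrogate model over the search space.
_ = plot_objective(opt.optimizer_results_[0])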
01:16:20
But I have to say that often, if you know nothing about the search space, interpreting the plots is hard work. It is very nice if you already know what it looks like — then they always make sense — but in a real-world setting where you don't know anything about it, it's still very hard work. I find it very hard, for example, to decide whether I got stuck in a local optimum or not,
01:16:42
which is the biggest question, and that is extremely hard work. So, I will
01:17:23
answer a slightly different question, which is: how do you pick what the best parameters are? The answer is that you pick the parameters which give you the best test score. Now, the trap that a lot of people fall into is to use
01:17:43
this number as their prediction of how well the model is going to perform on unseen data, and that is not the right thing to do, because what you could do is
01:18:01
try a bazillion hyper-parameter combinations, and you will find a model that performs best — and it is probably one where you are catching a lucky upward fluctuation. That's why you need to have your evaluation data set, which you can then use to evaluate the performance of your model
01:18:25
with the hyper-parameters that you've chosen, and that number you can use to make a prediction about how well it will generalize. Yes — and if not, it's a bug. Any more questions? Yes — so, yes, you use the test
01:19:07
score to find the best hyper-parameters. Yeah, so this is what I was
01:19:20
saying before: do we overfit? If you try enough hyper-parameter combinations, then yes, you will get a good test score purely due to the fact that you've tried so many of them, which is why it's so important to have your evaluation data set. You remember that at the very beginning of the notebook I split the data into our development data set and
01:19:45
an evaluation data set. Exactly — we only use the development set, so at the end of the notebook we still have some data left that we've never used, and this is the one you need to use to make a prediction of how well the model will generalize. If you try a huge number of hyper-parameter
01:20:04
combinations, you will see a very big gap between the test score that the random search reports and the score on your held-out data. And then the question is how do you
01:20:44
make sure that you're not evaluating too many different hyper-parameter combinations and essentially overfitting to those. I think that, I mean,
01:21:01
one thing you could look at is nested cross-validation — there's a nice example in the scikit-learn documentation — but essentially you will always have the problem that you need to keep some data secret and you can only look at it once, because otherwise you're now influenced
01:21:25
by what you saw — the performance on that data — and if you use it to refit your model, then you've lost your chance to make an unbiased estimate of your performance.
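A small sketch of that idea, loosely following the nested cross-validation example in the scikit-learn documentation (stand-in data, arbitrary small grid):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Inner loop: pick hyper-parameters by cross-validation.
inner_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, 16]},
    scoring="roc_auc",
    cv=3,
)

# Outer loop: estimate how well the whole selection procedure generalizes.
outer_scores = cross_val_score(inner_search, X, y, scoring="roc_auc", cv=5)
print(outer_scores.mean())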
01:22:06
So what is the difference between the training score, the test score, and the validation score? The training score is the score of your model on the data that you used to train the model which, as you all explained, is
01:22:29
data you can achieve perfect performance on just by memorizing the answers to the examples that you saw during training. Then
01:22:43
the test data is data that you kept aside and didn't use. Okay, so — and this is actually, I think, an important point — whoever you discuss this with, at any level of detail, you will mix up the words that you use,
01:23:04
because you have to use at least some of them twice. So what I always do is: I have a development set, which I use to do all of my development, and then there's a part of the data which you maybe call
01:23:22
the evaluation set, which you split off at the very beginning of your project and never, ever look at. You give it to your boss, or to somebody else who will not give you access to it. In Kaggle, for example, the private leaderboard is exactly this: you never get to see it
01:23:42
until the competition is over. And then, within your development data, you can split again into a training and a testing set if you want to, or you do cross-validation, where you split it into five chunks
01:24:02
and each part of the data set plays the role of the training data and the test data at some point. Then frequently people talk about having training and test data, and they need a word for the other data, and often they call this the validation data.
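In code, that splitting scheme looks roughly like this (a sketch with stand-in data and illustrative sizes):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# Split off the evaluation set at the very start and never touch it during development.
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)

# Within the development data: either a single train/test split ...
X_train, X_test, y_train, y_test = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

# ... or cross-validation, where each chunk takes a turn as the test data.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_dev):
    pass  # fit on X_dev[train_idx], score on X_dev[test_idx]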
01:24:24
Does that make some sense? Okay, so the question was: if you look at the
01:24:47
result of your model on your evaluation data, can you go back and change the model again? And the answer is no. Now, really, the Kaggle competition is over: you look at the evaluation set and
01:25:02
either you win money or you don't win money, but you don't get to make any changes ever again — you would have to collect new data. So the question is — to put
01:25:34
it correctly — whether it could have made your grid search better, because your data set would be larger, or because somebody invents a new
01:25:42
model that you didn't know of. The problem is: if you want to predict how well your model will work on data that it has never seen, then you need to have data that it has never seen. So there is really nothing you can do there; once you look at the result and use it to inform your decision
01:26:03
making and to change the model, it is as if the model has seen the data. Yeah — so, just another comment on this: another way to see
01:26:28
it is that typically the evaluation set is the future data that you're going to collect, and you don't have access to it at all. It's just that when you deploy your model in production, you may see that it's
01:26:43
not as good as you hoped. It's basically a way to check that you didn't make any mistake in your model selection procedure, that you didn't leak information or use information that you shouldn't have, and it's the only way to check that when you deploy your model you're not going to degrade the
01:27:06
performance of the production system. Actually, what people do in practice is use A/B testing: they deploy the model only for a fraction of the users — on the future data of that fraction of the users — to check that it's actually improving the performance of the system for
01:27:24
those users, one percent of the users for instance, and if they notice that it's actually not improving but degrading the performance, then they stop and say: we made a mistake. Just one more thing: there is a
01:27:44
sub-notebook that we won't have time to do today, because we're at the end, but just a couple of words about it: it's about categorical feature engineering. So in this tutorial we used
01:28:12
integer encoding, which is fine for our decision trees. We could have used dummy variables — and actually one team did use dummy variables, i.e. one-hot encoding — which is also fine, but maybe less efficient for decision trees.
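As a tiny illustration of the two encodings, with made-up data and pandas only:

import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Berlin", "Paris", "Erlangen"]})

# Integer encoding: a single column of category codes (fine for tree models).
integer_encoded = df["city"].astype("category").cat.codes

# Dummy / one-hot encoding: one 0-1 column per category.
one_hot_encoded = pd.get_dummies(df["city"], prefix="city")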
01:28:25
The sub-notebook dives deeper into what this means and how to do it manually with pandas, and how to wrap that as a scikit-learn transformer class that you can then use in a pipeline. Then there is an exercise to
01:28:41
do it using the pandas API, and there is another example of how to do the integer encoding with the pandas API as a scikit-learn estimator. And there are many, many more ways to do smart categorical variable encodings; I have put a reference here — it's a slide
01:29:06
deck that is very good on feature engineering. I don't remember the author of the slide deck — Van Veen, apparently from Brazil — and there
01:29:22
are many ideas there that are useful. There is also, in scikit-learn-contrib, a project called categorical-encoding, which has additional transformers for categorical variables. I think good pre-processing of categorical variables, and of variables in general, is very, very important, and it's not often treated in machine
01:29:42
learning tutorials, because it's not really machine learning in itself, but in practice this is what makes a data science project work. So I think you should invest some time in this. And I think we should stop here, because we're out of time — thank you very much.