Walking the Random Forest and boosting the trees
Formal Metadata
Title: Walking the Random Forest and boosting the trees
Title of Series: EuroPython 2018
Number of Parts: 132
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
Identifiers: 10.5446/44932 (DOI)
Language: English
Transcript: English (auto-generated)
00:01
All right. Cool. We're good. So, that's the only slide I have. Right, so thank you for coming. We'll do it really quickly. Just a quick introduction: I'm Kevin, I'm working at a company called Cambridge Spark. We do training in data science for individuals and for companies on site. That was it. Now the notebook. The link that you have
00:26
here is available as well on the... can you all read? Yeah. It's available as well on the EuroPython page. So, if you can't read it here, you can just copy-paste it from there. Today we're going to talk about ensemble models. Why do we care about ensemble models? They're
00:43
really popular. They've been used a lot and are still used a lot in industry. They provide pretty good performance. And they're quite convenient in the sense that they don't require as much data as deep learning in general. So, they can be a good alternative. And in fact, they're used a lot in machine learning competitions as well, on Kaggle, for example.
01:04
The goal of this talk is to try to build a bridge between the theory and the implementation of those models in Python. So, it's going to be very applied. Just quickly, the agenda: I'm going to start with an intuition of why ensemble models work. Then discuss
01:25
the main building block of the ensemble, which is usually a decision tree. Then talk about two techniques, bagging and boosting. And finally, talk about other libraries that you can use that are a bit more advanced. Right. So, quickly, first a definition of
01:45
what we mean by an ensemble model. It's simply combining multiple simple models, which we call weak learners, into a larger one, the ensemble. There are two popular techniques that you can use, bagging and boosting. And we'll discuss both of them
02:02
here. And we use a weak learner as a core building block; here it's going to be a decision tree, and it usually is a decision tree. Right. So, quickly, just some intuition. Let's say you want to know if you have a given disease, here called A. You're going
02:23
to see three doctors. All of them tell you that you have that disease. And let's say you have access to the file of past diagnoses, so you can calculate their accuracy on previous patients. And all of them had 75% accuracy. So,
02:42
the question here is: do you think that the probability of you having the disease, given that all the doctors told you that you have it, is higher than 75% or equal to 75%? In the notebook that you have on GitHub, I'm actually
03:03
simulating that scenario. So, you can play with it and run it if you're interested. Here I really don't have time to do it, so I'm just going to tell you the answer. There is not much suspense here: it's actually higher. So, by combining all those predictions together, we get a higher probability. The reason behind that, really simply,
03:24
is that the probability of all those doctors being wrong at the same time is going to be really low, since each of them is more likely to be right than wrong. So, you get better accuracy by doing that; a small sketch of this majority-vote effect is below.
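(The following is a minimal sketch of the scenario described above, not the code from the speaker's notebook: three independent doctors, each right 75% of the time, combined by majority vote.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 100_000
doctor_accuracy = 0.75

# Each of the three doctors is independently right with probability 0.75.
correct = rng.random((n_patients, 3)) < doctor_accuracy
# The majority vote is right whenever at least two of the three doctors are right.
majority_correct = correct.sum(axis=1) >= 2

print(majority_correct.mean())            # roughly 0.84 empirically
print(3 * 0.75**2 * 0.25 + 0.75**3)       # exact value: 0.84375
```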
03:41
There are two main assumptions for this to work. The first one: the models, here our doctors, have to make their predictions independently. If all the doctors made their predictions in exactly the same way, then consulting three doctors would give you exactly the same thing as consulting just one. The second assumption is that you need your model, or your doctor here, to have an
04:05
accuracy higher than 50%. You need them to be more right than wrong. The reason is that if you combine several diagnoses from people who are more wrong than right, you will just end up with something really wrong. Right. So, really quickly,
04:24
introducing the dataset that we're going to use here. So, it's data from Facebook, different posts that were posted on pages. We have features such as the category of the page, the number of comments on the page, and so on and so forth. And the last column in the
04:44
dataset (you can download the dataset at this link) is the target: the thing we are trying to predict is the number of comments the post will receive in the next hour. That's a regression problem. Here we want to stick to classification in order to
05:00
stay with our analogy of the doctors that we've seen above. So, we'll redefine the problem as: will this post be commented on in the next hour or not? So, basically, will we have zero comments or more than zero? So, let's just load the data really quickly, because we don't have that much time; I have ten minutes left. All right. So, we defined the features
05:23
really quickly. Here, that's just what I said: if we have more than zero comments, the target is true, otherwise false. So, here the data is quite balanced.
05:40
So, pretty much half of the dataset has comments. The other half doesn't have any comments. And let's create a training set and a test dataset; a minimal sketch of these steps is below.
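(A minimal sketch of loading the data, defining the binary target and splitting it; the file name, column layout and split ratio are assumptions, not the exact code from the notebook.)

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; the last column is assumed to be the number of
# comments the post received in the next hour, as described in the talk.
df = pd.read_csv("facebook_comments.csv")

X = df.iloc[:, :-1]            # feature columns
y = df.iloc[:, -1] > 0         # True if the post gets at least one comment

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```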
06:00
Right. So, the first building block here is the decision tree. Quickly, why are we using decision trees to build our ensembles? Well, they match the two conditions that we stated above. The first one is that we need more than 50% accuracy. Well, decision trees are actually pretty good at capturing complex relationships, including nonlinear relationships, in the data. So, we have that checked. The second reason why we use decision trees
06:21
is because they overfit easily. Usually that's seen as a bad thing, but here it's good, because it means that by perturbing the data a little bit we will be able to build decision trees that are quite different from one another. So, let's start here by building a decision tree that has a depth of two. So, a really simple one. I'm going
06:43
to fit it on my data and then plot it. By the way, the function that I'm using to plot it, I've defined it myself; it's at the bottom of the notebook. So, if you want to run it yourself, you need to run that first. But here you can see that really simple decision tree. I start by looking at feature 30, checking whether it's
07:06
smaller than 4.5. Then I continue that way, looking at another feature and so on, until I reach the bottom, where I'm saying that if a sample ends up here, we predict it will have comments; otherwise no comments, and similarly for the other leaves. A minimal sketch of fitting and plotting such a tree is below.
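(A minimal sketch, continuing from the data-loading sketch above. The speaker uses his own plotting helper from the notebook; scikit-learn's plot_tree is used here instead as an approximation.)

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# A deliberately shallow tree of depth two.
simple_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
simple_tree.fit(X_train, y_train)

# proportion=True shows, in each node, the fraction of the training data it holds.
plot_tree(simple_tree, proportion=True, filled=True)
plt.show()

print(simple_tree.score(X_test, y_test))   # around 0.81 in the talk
```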
07:24
Something important to note here is this samples value. It's actually the fraction of the data that is represented in that node. So, at the first node at the top, we are looking at 100% of the data. Then it's split in two. 63% of the data will be on
07:41
the left-hand side, and the remaining 37% on the other side. That will become important in a minute. Let's just look at the accuracy: 81%. Right. So, now we've built a really simple tree. We only have three rules. So, surely
08:03
our data is more complicated than that. So, we'll try to build a more complex tree by increasing the depth. Here we're going to use a depth of 10. So, let's plot it here. You see a really complex tree. Let's zoom in a little bit. And you see that here we don't even see the whole tree, because we can't plot the whole thing. But the
08:26
important part here is the samples value. You see that we have 0.4% of the data represented in this node, for example. So, surely that looks like we are overfitting a little bit. If we look at other ones, we see that, for instance, here we have
08:47
0.0% of the data in this node. So, we are really overfitting to some particular examples in our data. Surely that's not what we want. So, with scikit-learn there is a
09:01
min_samples_split parameter. And what we're doing is passing exactly the minimum fraction of the data that a node needs to contain before we create a new split. So, here I've used a really large number, 20%. And we'll see how that looks. So, it's creating another tree, with a max depth of 10 still. So, we can still model complex relationships.
09:23
But we are not overfitting to single examples. For example, here we have 30% of the data represented. We don't go down any further and don't create a new rule here; we stop and we predict that all those samples will
09:40
have comments. And where we have more data to back up our nodes, we actually keep dividing in two and going down and down. Here we don't plot all of it, because that would just be too much to visualize, but we're still going deeper. A minimal sketch of this constrained tree is below.
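(A minimal sketch under the same assumptions as the sketches above; a float min_samples_split is interpreted by scikit-learn as a fraction of the training samples.)

```python
from sklearn.tree import DecisionTreeClassifier

# Depth 10, but a node is only split further if it holds at least 20% of the data.
constrained_tree = DecisionTreeClassifier(
    max_depth=10,
    min_samples_split=0.2,
    random_state=0,
)
constrained_tree.fit(X_train, y_train)
print(constrained_tree.score(X_test, y_test))
```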
10:01
Right. So, with this first building block, we're able to build our first ensemble: random forests. As you can see in this diagram, it's pretty similar to what we were talking about with the doctors. Here we are training two different trees. The main difference is that, in order to make those trees uncorrelated, we are not going to
10:24
train them directly on the data, but we will introduce perturbations in the data. So, the first stage is the bootstrapping: we sample from our data with replacement in order to create a new dataset that contains only a subset of the observations, and it will have some observations
10:43
duplicated as well. Meaning that here our sample one and our sample two are both statistically similar to the data, but they are different from one another. The second step is the feature subsampling: instead of providing all the features to our trees when we're training them, we will only take a sample of
11:04
the features. So, we will have some trees that focus on some features and other trees that will hopefully focus on others. The last important thing to do when you're training a random forest is to make sure that your trees overfit a little bit.
11:21
Because if you keep your trees really simple, with a really big constraint on them, you might for example only create one node that will be exactly the same for all your sampled datasets. So, you're not going to be able to benefit from the advantage of ensemble models that we've seen before, which comes from having different models.
11:45
Right. So, the three points are summarized here. Let's try to apply it with scikit-learn now. We'll start with five trees. So, here those are the main parameters to pass to the scikit-learn implementation of the random forest.
12:02
The first row that you see here contains the parameters that constrain our trees. Those are the same ones as before: we choose a depth of 10 and we keep 20% for min_samples_split. So, as we've mentioned, it's not a good idea to do that. We'll see
12:20
how it works in a minute. Then n_estimators: we are going to use five trees. max_features is the feature subsampling; here we're not using it, and we're not doing the bootstrap either. So, we're expecting that to be quite bad. Let's see if that happens.
12:41
We'll calculate the accuracy. Right. 81%. With scikit-learn we can look at every single decision tree in our ensemble. So, let's do that. Here you see the list of all your decision trees within your ensemble. The nice thing is that, since we can access every single tree, we can also
13:00
access the accuracy of every single tree. So, by iterating over all my estimators, I can predict on the test dataset and then calculate the accuracy. If I do that, I see here the accuracy of my five trees. As you can see, they all have the exact same accuracy. A sketch of this whole setup is below.
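(A sketch of this deliberately naive setup, continuing from the sketches above; the random_state and exact argument values are assumptions.)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# "Wrong" settings: no feature subsampling, no bootstrap, heavily constrained
# trees -- every tree ends up seeing the same data in the same way.
rf_naive = RandomForestClassifier(
    n_estimators=5,
    max_depth=10,
    min_samples_split=0.2,
    max_features=None,     # use all features at every split
    bootstrap=False,       # train every tree on the full training set
    random_state=0,
)
rf_naive.fit(X_train, y_train)
print(accuracy_score(y_test, rf_naive.predict(X_test)))

# The fitted trees live in .estimators_, so each one can be scored individually
# (the individual trees were fitted on plain arrays, hence to_numpy()).
for tree in rf_naive.estimators_:
    print(accuracy_score(y_test, tree.predict(X_test.to_numpy())))
```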
13:20
In fact, we've built five trees that are exactly the same. And when we end up ensembling them, as we've seen before, we get the exact same accuracy as one single tree. So, basically, by not using the parameters of my random forest properly, I have not gained anything by ensembling. Right. So, now we'll try to do a bit better by using the feature subsampling
13:47
here. So, by setting max_features to 'auto', it's just going to pick the square root of the number of features every time it looks for a split. And we're setting bootstrap to true as well.
14:01
Here, for the moment, we are keeping the... okay, I'm skipping this one, I'm just going to go to the right one. Okay. So, this is the good one. Here I'm doing everything right. I'm letting my trees overfit a little by having a looser constraint on
14:21
min_samples_split: only 1% here. I'm also using more trees, 15. I'm using the feature subsampling and I'm using the bootstrapping. So, let's see how it does. 83%. That's a bit better. Let's look at every individual tree. And here we see that we managed to build trees that are
14:45
different from one another. We see that some of them have lower accuracy, some of them higher, and they compensate for each other, which is what we wanted. A sketch of this configuration is below.
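(A sketch of the "proper" configuration described above; newer scikit-learn versions spell the feature-subsampling option "sqrt" rather than "auto", which is what is used here.)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(
    n_estimators=15,
    max_depth=10,
    min_samples_split=0.01,   # looser constraint: only 1%, so trees can overfit a bit
    max_features="sqrt",      # feature subsampling (sqrt of the number of features)
    bootstrap=True,           # row sampling with replacement
    random_state=0,
)
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))

# The individual trees now differ, with accuracies spread around the ensemble's.
for tree in rf.estimators_:
    print(accuracy_score(y_test, tree.predict(X_test.to_numpy())))
```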
15:03
Right. So, I have five more minutes to talk about boosting. Boosting is different from random forests in that, instead of building all the trees in parallel, here we're going to build them sequentially. We start from the data and build our first tree; that's stage one here. Then we compute the residuals from it: we check what the difference is between what we were supposed to predict and what we actually predicted. Then we build a second tree
15:22
that is not trained on the original data, but on the residuals. Meaning that this tree will learn how to predict what the first tree got wrong, and that by combining the second tree with the previous one, we will compensate for the error. Obviously, this second tree itself is not perfect; it will make some error itself. So, we compute again
15:45
the residuals, train another tree on them, et cetera. So, we can end up training a lot of trees. As you can anticipate, this will overfit really easily, because at some point the error is going to get smaller and smaller and we'll start building trees that, instead of compensating for the error,
16:01
will overfit to what is essentially noise. So, it's quite important to make sure you properly tune the number of trees you're going to use here. And also, you don't want your first tree to overfit in this case, because if it already overfits, your second tree is only going to try to compensate for noise. A hand-rolled sketch of one boosting step is below.
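(A hand-rolled sketch of one boosting step, to make the residual idea concrete. It treats the 0/1 target from the earlier sketches as a regression problem and uses plain squared-error residuals; the scikit-learn classifier discussed next actually works on log-loss gradients, so this illustrates the idea rather than its internals.)

```python
from sklearn.tree import DecisionTreeRegressor

# Regression view of the 0/1 target from the earlier sketches.
y_num = y_train.astype(float)

# Stage 1: a first, deliberately shallow tree.
tree1 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_num)
residuals = y_num - tree1.predict(X_train)      # what the first tree got wrong

# Stage 2: a second tree trained on the residuals, not on the original target.
tree2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, residuals)

# Combined prediction: the correction is added, scaled by a learning rate.
learning_rate = 0.5
combined = tree1.predict(X_train) + learning_rate * tree2.predict(X_train)
```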
16:25
All right. Also, here we still have access to the bootstrapping (it's called subsample for gradient boosting) and to the feature subsampling, in order to make sure that every subsequent tree you build will focus on something different in the data
16:41
and try to correct it in a different way. So, we can benefit from the advantage of ensembles. Right. So, let's do it again. Again, I've tried to put the parameters for the trees in the first row, the same ones that we used before. I'm using five trees here, using 80% of the dataset for every new tree, and I'm using feature subsampling. And the learning
17:07
rate, which controls how we correct the error, is just the weight we apply to every new tree correcting the error of the previous one. So, if I have a large learning rate,
17:20
building just one tree after the first one will already correct a big amount of the error. So, it will overfit really quickly with only a few trees. Whereas if I keep the learning rate smaller, I will need to build more trees to be able to converge to the ideal solution. A sketch of this configuration is below.
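(A sketch with scikit-learn's GradientBoostingClassifier; the tree constraints and the exact learning-rate value are assumptions where the talk does not state them.)

```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=5,         # five trees, built sequentially
    max_depth=10,
    min_samples_split=0.2,
    subsample=0.8,          # 80% of the rows for every new tree
    max_features="sqrt",    # feature subsampling
    learning_rate=0.1,      # weight given to each new correcting tree
    random_state=0,
)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))
```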
17:40
But yeah. Right. So, wait. Where was I? Cool. So, I think I'm going to skip. Okay. So, I will just show this one. And then I will skip that because I just have two minutes left. All right. So, accuracy by default here, I've got 82%. Yeah. Just one thing.
18:03
Here it doesn't make any sense to look at the individual accuracy of the trees, because if you look at the tree at stage two, for example, it was never meant to be accurate on the original data. In fact, it has never seen the original data; it was only trained on
18:20
the residuals. So, it only makes sense in the context of the previous tree and the error that the previous tree had. So, the only thing we can do here is cut at one stage and look at how all the trees before that behave together. For example, if I cut here, I can look at how the first two trees do without the third one. So, let's do that
18:50
with scikit-learn. We have access to the staged prediction function that generates a prediction at every stage. We are able to iterate over that and calculate the accuracy score
19:03
for all of those predictions. If we look at it, we see that here we start with an accuracy of 54%, and then every new stage, every new tree that we add, improves the accuracy a little bit. So, we go from 54 to 77 and so on and so forth. Here you can see it's always increasing. A sketch using staged predictions is below.
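(A sketch using scikit-learn's staged_predict, which yields the ensemble's prediction after one tree, two trees, and so on; variable names continue from the sketch above.)

```python
from sklearn.metrics import accuracy_score

# Accuracy of the partial ensemble after each additional tree.
for stage, y_pred in enumerate(gb.staged_predict(X_test), start=1):
    print(stage, accuracy_score(y_test, y_pred))
```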
19:26
So, you might wonder whether we should build more trees and at what point we would stop seeing an improvement. So, we're going to do just that and try the same thing, but with more trees this time. We're using 15 trees. I've also changed the learning rate, so it takes
19:41
fewer trees to converge. Let's run that and get the accuracy score. So, I've got pretty much the same accuracy score here. But let's look at every stage to see how the error on the test set changes. So, here we see that it's improving with the first few
20:04
trees. But after a while, it just converges to 83% and then it starts going down again, to 82%. And it would probably go even lower after that, as we are overfitting to some noise in the training set. So, what that means is that we really need to make
20:24
sure that we stop adding trees at some stage and get the number of trees right, and also the learning rate. Right. So, just really quickly to finish, here I'm using the
20:42
scikit-learn implementation of gradient boosting. It's pretty good for demonstration purposes, but it has the issue that it can't run in parallel. So, usually people in production will use other libraries that are more optimized for speed. We have XGBoost, which is probably
21:01
the most popular one. The good thing with those libraries is that you can use them exactly the same way you would use the gradient boosting in scikit-learn. So, you can import the classifier the exact same way, as I'm doing here. And then you instantiate your classifier with the same parameters. You use .fit() and .predict()
21:23
the exact same way. So, you can keep the exact same pipeline, but just change the estimator that you're using. So, here I'm training an XGBoost classifier, a LightGBM one and finally a CatBoost one; a minimal sketch is below.
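(A minimal sketch of the drop-in usage; the parameter values are illustrative, and all three libraries have to be installed separately.)

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

y_train_int = y_train.astype(int)   # 0/1 labels, accepted by all three libraries

# Same .fit()/.predict() interface as scikit-learn, so the pipeline stays the same.
for Model in (XGBClassifier, LGBMClassifier, CatBoostClassifier):
    model = Model(n_estimators=15, learning_rate=0.1)
    model.fit(X_train, y_train_int)
    print(Model.__name__, accuracy_score(y_test.astype(int), model.predict(X_test)))
```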
21:41
Right. So, that's it. You've got access to the notebook, if you want to play a bit with it and try to replicate that. Yeah. That's it. Thank you very much. So, officially the time is over, even for the questions. So, this is a bit of overtime we're taking, given the problems we had. There's a 10-minute break. Well, now it's like
22:03
seven minutes until the next session which is the screen orientation. I mean, we have time for a couple of questions. So, if anyone has a question. Thank you for the talk. You mentioned
22:21
about the boosting trees, that the trees are trained to correct the error of the first tree, right? Yeah. But how does it work when you predict? Because you don't know the error of the first tree when you predict, right? You don't know what? So, when you predict on the test data, you don't know the...? Right. Yeah. So, the residuals are calculated on the training
22:44
set, right? So, you don't see, yeah, as you said, you don't see the test dataset at this stage. You're calculating the difference, so the error, on the training set that you have, and then you train your second tree on that data, right? Thanks. So, could you
23:09
recommend any publicly available dataset to play with and get started with random forests and decision trees? Well, you can use this one. That's a good one. I think a lot of people start with the
23:23
Titanic one that is on Kaggle, for example. It's pretty good because it has a mix of categorical data and numerical data, and it's a pretty simple problem to understand. Might be a good one. But the advantage of the dataset I'm using here is
23:44
that it has more data, more rows. So, you might be able to get more benefit with this data from libraries like XGBoost, which are more complex models. We have time for one last question. If anyone... no? So, thank you, Kevin.