00:01
Welcome everyone. My name is Olga Standard, and I work as a postdoc in
00:05
Ireland. I'm going to show you how machine learning can be applied in the sciences. The previous talk, if you were here, was a nice introduction to all kinds of ensemble methods, so here I'm going to show you one specific case, gradient boosting. OK, so here is the background of the problem: in the past 60 years we have observed a decline in the size of fish by about 4
00:28
centimeters on average. Think about a herring, which is about 20 centimeters long: 4 centimeters is a lot of reduction. So we would like to find out what the problem is, why it is happening, and
00:38
we're going to use machine learning to answer this question. Why is it a problem? Because herring is a very important species for consumption, and we know that if its size decreases it has consequences for stock production: it means there will be fewer fish in the future for us to consume. We don't know what
00:56
caused the decline, but we presume there is an interactive effect of various factors, such as sea surface
01:03
temperature, which may change;
01:06
other things happening out there, such as biomass, which may change;
01:11
fish biomass, which may change; or fishing pressure. OK, so to answer
01:19
this question I'm going to use data for the
01:23
past 60 years, from 1959 to
01:26
2012, and the data are spread throughout the year.
01:45
So I'm going to use these data, and the way the data have been
01:49
collected is that they have
01:52
been collected from commercial vessels, taking 50 to 100 samples at a time, and the total sample size is about 50 thousand individual fish. So imagine a dataset of 50 thousand rows. OK, so the study area: this is where the data come
02:12
from. It's called the Celtic Sea, just to the south of Ireland, and it is bounded by St George's Channel and the English Channel, so you can imagine where we are. That's about the study
02:22
area. The objective is to identify the important factors which underlie this problem, and to answer this
02:30
question I'm going to use gradient boosting regression trees, which is one of the ensemble algorithms available in scikit-learn. Why ensemble? Because we don't have
02:43
one tree but a collection of trees, and the final model is improved because we have a collection of interlinked trees. In this case, as opposed to other methods such as bagging or random forests, where the trees are independent, all trees are dependent, in the way that
03:06
the residuals, the unexplained part of the model, are passed as the input to the next tree. So we have a sequence of interconnected trees, which is a nice feature: it allows us to reduce variance and to reduce bias. The only
03:19
problem is that, because the trees are built sequentially, we cannot parallelize the algorithm, since they all depend on each other.
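The idea that each tree takes the unexplained part left by the previous trees as its input can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, assuming a squared-error loss (for which the negative gradient is simply the residual). The data, the helper names `fit_boosted_trees` and `predict_boosted`, and all parameter values are made up for this example and are not taken from the talk.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit trees one after another; each tree is trained on the current residuals."""
    baseline = float(np.mean(y))              # initial prediction: the mean of y
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction            # unexplained part so far
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Shrink each tree's contribution by the learning rate.
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees


def predict_boosted(baseline, trees, X, learning_rate=0.1):
    """Sum the baseline and the shrunken contributions of all trees."""
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction


# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

baseline, trees = fit_boosted_trees(X, y)
mse_mean = np.mean((y - baseline) ** 2)
mse_boosted = np.mean((y - predict_boosted(baseline, trees, X)) ** 2)
print(mse_mean, mse_boosted)
```

Because tree number m needs the predictions of trees 1 to m-1 before it can be fitted, the loop cannot be run in parallel across trees, which is the limitation mentioned in the talk.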
03:29
The advantages of gradient boosting regression trees are more or less the same as those of all ensemble methods. Just to mention a few: we can detect nonlinear feature interactions, because of the underlying feature selection which goes on in the algorithm. It is
03:49
resistant to the inclusion of irrelevant features, which means we can include as many variables as we like, and if they are irrelevant they won't be selected, so we don't care, which is nice. It is good at dealing
04:01
with data on different scales, so you don't have to standardize the data. You may wish to standardize, but you don't have to, because the algorithm is robust to it, whereas if you, for instance,
04:11
use a normal model, like linear regression, the model will explode. So in this case it is a big advantage. It is
04:18
also robust to outliers, so for any data points which do not fit the data, maybe because of a mistake or maybe because they are special cases, we don't care at all. It's more accurate, and we can use different loss functions, for instance least squares or others, which
04:32
are implemented for gradient boosting regression trees, which is nice. OK, disadvantages: it requires careful tuning, and it takes a lot of time to get the right model. It is slow to train but fast to predict. After I finish the first part of my talk I'll show you the implementation in a notebook. OK, so a little bit of equations. Here is the formal specification of the model. It is an additive model, so we have a sequence
05:00
of trees, and each tree is weighted, so we get an ensemble of trees which are all combined through this gamma weight you can see here, and each
05:12
individual tree is shown as this part of the equation. Then we build the additive model in a forward stagewise fashion, so it is built sequentially, and there is a parameter epsilon which controls what is known as the learning rate. We'll talk about the learning rate later; it allows us to control how fast we descend along the gradient. And finally,
05:37
at each stage the weak learner is chosen to minimize some
05:40
loss function. In my case I took least squares, because it's a natural choice, but it can be any other function which you can differentiate. This part of the model is evaluated by the negative gradient; I won't go into that, but it is a simple approach. OK, so the parameters which I finally selected: in my case I needed about 500 iterations and
06:06
a learning rate of about 0.05. These parameters are referred to as the regularization parameters, and they affect the degree of fit. They also affect each other, which is a bit complicated, because if I increase the number of iterations, let's say by a factor of 10, it doesn't mean that the learning rate decreases by a factor of 10; it's not proportional. That is the difficulty: you may increase the iterations, but the learning rate might have to decrease by a different proportion, and that's why it gets tricky.
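For reference, the additive model and the role of the two regularization parameters can be written out as follows. This is the standard textbook form of gradient boosting rather than a copy of the speaker's slide; the symbols $F_m$, $h_m$, $\gamma_m$, $M$ and $L$ are notation chosen here, with $\varepsilon$ standing for the learning rate and $M$ for the number of iterations (about 0.05 and 500 in the talk).

```latex
% Stagewise update: each new tree h_m is added with weight gamma_m,
% shrunk by the learning rate epsilon.
F_m(x) = F_{m-1}(x) + \varepsilon \, \gamma_m \, h_m(x), \qquad m = 1, \dots, M

% Final model after M iterations.
F_M(x) = F_0(x) + \varepsilon \sum_{m=1}^{M} \gamma_m \, h_m(x)

% Each tree h_m is fitted to the negative gradient of the loss L.
r_{im} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}

% For least squares, L(y, F) = \tfrac{1}{2}(y - F)^2, this is the ordinary residual.
r_{im} = y_i - F_{m-1}(x_i)
```

Written this way it is easier to see why the two parameters interact: a smaller $\varepsilon$ shrinks every step, so more steps $M$ are needed, but the required increase is not a fixed proportion.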
06:37
The next parameter is the maximum tree depth, which in my case is
06:41
6. For this particular algorithm it is known from theory and from different simulation studies that tree stumps, meaning trees with one split only, perform best, which is nice, so you don't need deep trees. But in some cases you may need to go from 4 to 6, at most 8. The depth
07:05
of 6 in my case means that my model can accommodate up to 5-way interactions; that is what it means.
07:14
The next one is subsample, in my case 75 percent. It is optional; if you specify it as less than one, it means that you get a stochastic model, so we introduce some randomness. It can be nice because it allows us to reduce
07:28
variance and reduce bias, and in practice I found that this gave a better result, therefore I introduced it. So basically my model is a stochastic gradient boosted regression tree model.
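The parameter values mentioned in the talk map directly onto the constructor of scikit-learn's `GradientBoostingRegressor`. The following is a minimal sketch, assuming that estimator is the one used; the herring data are not available here, so a synthetic dataset from `make_regression` stands in for them, and `random_state` is added only to make the run repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the real data (hypothetical; 15 features as in the talk).
X, y = make_regression(n_samples=1000, n_features=15, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,     # number of boosting iterations (trees)
    learning_rate=0.05,   # shrinkage applied to each tree
    max_depth=6,          # maximum depth of each individual tree
    subsample=0.75,       # fraction of rows per tree; < 1.0 makes it stochastic
    random_state=0,
    # The loss is left at its default, which is least squares for this estimator.
)
model.fit(X, y)
print(model.score(X, y))  # R-squared on the training data
```

Changing the loss only requires passing a different `loss` argument to the constructor; the accepted names depend on the installed scikit-learn version, so they are not spelled out here.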
07:40
The loss function is least squares. As I mentioned, it's a natural choice, nice to start with and easy to interpret, but it can be any
07:46
other loss function, and they are nicely implemented in scikit-learn, so it's very easy to change. OK, so to estimate our model, in this case I split my data in three parts. If I have enough time I'll show you how I also split it into two parts; I have results for that and they are very similar, which is nice, as it shows the robustness of my model. But in this case I used 50 percent of the data for training, 25 for testing and 25 for validation. There is no particular reason why; because I have 50 thousand rows I can
08:17
afford it. If you have less data you may choose cross-validation or some of the methods which are more specific to smaller datasets, but I have a big dataset. You can see here the mean squared error, about which I would
08:34
say I'm happy enough with my model, and I can see that after some iterations my model
08:42
flattens out, so there is no change in the error, which means that I have enough iterations. R-squared tells me the proportion of variance which is explained by the model, and for the training set it is slightly higher, which may indicate a bit of overfitting, but there is not a big gap between them, so I'm satisfied. The two curves follow each other very closely, which means my model is doing a good job. And if I reduce the variability in the data, I see that R-squared goes up, so this is
09:18
basically an effect of that. So, a little bit of results. If I plot here the length of
09:25
the fish on the x axis, you can see that it ranges from about 20 to 30 centimeters, and my model predicts fish from 22 to 28. So basically what it says is that on average it gives a correct value, but if you have extremes, smaller or larger, they won't be predicted correctly. OK, so that's the 50 percent of the R
09:49
squared, which is what this graph reflects. OK, and if you want to find out which variables play
09:55
a role in my model, and this is what I wanted to find out, this is the way it's performed: each variable is used to split the tree, and the most important one is used to split more often. If we count the times it's used, we can say, OK, that means it's more important. In this case I have a color
10:15
coding here. The first one is trend, which is basically months.
10:19
OK, so we know that it is something that contributed; I can see it has been used in 100 percent of
10:25
the cases. After that we have sea surface temperature, which
10:31
I'll show you in the next graph, how it's affected, but there is basically some relationship. Other things I have are
10:38
food availability, so what is available to eat, and then abundance of fish, so how large the population is, etc. The most important message to remember here is that trend is the important one, and after that we have the sea surface temperature and food.
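A relative importance ranking of this kind can be sketched with scikit-learn's `feature_importances_` attribute. Two assumptions are made here. First, scikit-learn's measure weights each split by how much it improves the fit rather than simply counting splits, so it is related to, but not identical with, the counting description above. Second, scaling the values so that the top feature reads 100 is a guess at how the plot in the talk was drawn. The feature names and data are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical feature names and synthetic data, for illustration only.
rng = np.random.default_rng(0)
names = ["trend", "sst", "food", "abundance"]
X = rng.normal(size=(1000, 4))
# By construction the first feature dominates, the second matters less,
# and the last two are irrelevant.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X, y)

importance = model.feature_importances_           # sums to 1 across features
relative = 100.0 * importance / importance.max()  # top feature scaled to 100
order = np.argsort(relative)[::-1]                # most important first
for i in order:
    print(f"{names[i]:10s} {relative[i]:6.1f}")
```

The irrelevant features end up near the bottom of the ranking, which illustrates the earlier point that including them does little harm.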
10:56
If you want to further visualize the variables, there are partial dependence plots. The first row here is the one-way partial dependence plots, where we basically plot each feature against our dependent variable, the length of the fish. For trend we can't really see any particular relationship here. There is a relationship, but it shows a high degree of interaction in the way
11:23
it's dependent. So we
11:27
don't really pick up any dependence here, but we do pick it up here. I highlighted these two areas with circles. You can see something at about 14 degrees: if the sea surface temperature is below 14 degrees there is a positive relationship, so fish get larger, they like the temperature going up, and
11:50
in this case, if it gets too warm, there is a
11:54
negative relationship. So it definitely
11:57
shows some kind of dependence between the length of the fish and the temperature. I don't want to talk about climate change here, because it's a very debatable issue, but you can imagine that with global warming, if the temperature goes up, that may have an effect on the fish, and on us eventually, because we consume fish.
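A one-way partial dependence curve can be computed by hand: fix one feature at a grid value for every row, average the model's predictions, and repeat over the grid. The sketch below does this on synthetic data. Note that the peak at 14 is built into the fake data on purpose so that the curve has a shape similar to the one described; it is not a reproduction of the speaker's result. The helper name `partial_dependence_1d` is made up for this example (scikit-learn also ships its own partial dependence tools in `sklearn.inspection`).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def partial_dependence_1d(model, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite the feature for all rows
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


# Synthetic data (hypothetical): the response peaks at 14 by construction.
rng = np.random.default_rng(0)
temp = rng.uniform(10, 18, size=1500)      # stand-in for sea surface temperature
other = rng.normal(size=1500)              # an unrelated second feature
y = 25.0 - 0.5 * np.abs(temp - 14.0) + 0.3 * other + rng.normal(scale=0.2, size=1500)
X = np.column_stack([temp, other])

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X, y)

grid = np.linspace(10.5, 17.5, 15)
pd_curve = partial_dependence_1d(model, X, 0, grid)
peak = grid[np.argmax(pd_curve)]
print(peak)
```

The same loop extends to a two-way plot by fixing two features at once over a two-dimensional grid.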
12:17
This is an interesting message, and the final
12:20
one here is one of the food sources, in this particular case phytoplankton, which is what the fish eats. Focus on this area. Why focus here and not here? Because most of my data are concentrated over here, as you can see from these little ticks, the deciles, so it's where the data are
12:39
concentrated. The line goes up here just because I
12:44
have some outliers, but I don't care, because I know my model is robust, so I just ignore the upper part. If I look at this part I don't see any dependence. I think it's just
12:54
because in this case it's not a limiting factor. Obviously if there were less food it would have an effect, but in the case of the Celtic Sea there is a lot of phytoplankton, so the fish is not dependent on it. OK, and then in the second row here we have two-way interaction plots, where we plot each feature against each
13:12
other, just to see if I can pick up any interaction between them.
13:17
OK, so we can see here basically the same story: at a sea surface temperature of about 14 degrees you see that something is happening. So what
13:25
this analysis tells me is: I know that these are the important features, but I can't really say why. The fact that trend is important tells me that I might need to use, maybe, time series modeling to find out in which way it depends, so I can't answer that question here. What machine learning can do is pick up the important features out of a bunch of other features in a big dataset, and that's as far as it goes, so there are limitations to how you can apply it. So, to conclude the
13:56
session: there are three important features which I detected, time trend, sea
14:01
surface temperature and food availability. Something is going on with temperature, clearly at about 14 degrees, and there is a high degree of interaction between these features. Remember that with this method we can't find a cause-effect relationship, but we do get the relative importance of the variables. So from a bunch of variables I picked the ones which are more important, and I can take them away with me for the next type of analysis.
14:26
OK, so this is the first part of my talk. I'm not
14:30
sure how much time I have. I would like to show you a little bit of how it has been implemented. It seems I have 5
14:39
minutes. So basically the first part is what I've shown you in my presentation: it's the three-way split of my dataset.
14:49
Let me zoom in a bit. Is it large enough? OK, so
14:54
I'm sure this is all familiar to you: I import the libraries, and I want it to be reproducible, because I work in science, so I need to set a seed, because I want to run it again and get the same results. I read in the data, and I see there are about 50 thousand rows and about 15 features in my case. I haven't discussed this, but I do check for multicollinearity, which means having features which are related, which are dependent. For a normal
15:21
regression, when I have one model only, it may for sure blow up your model, and you can't allow that in your model. For ensemble methods, for this particular algorithm, it doesn't matter, but you can still detect multicollinearity and discard the variables which are, you know,
15:39
correlated. So this is basically how I did it here: I construct a matrix of Pearson product-moment correlations, which is color-coded, and the idea is that I can find out which variables are collinear: the higher the multicollinearity, the more intense the color. There is no strict rule, but a coefficient above 0.8 may indicate multicollinearity. So I look here at whether these are red or not, and basically those variables I just drop from my model, and this one as well. OK, so I do a few more steps, and I do the three-way split, 50 percent, 25 and 25, for each part of the model.
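The two preparation steps described here, dropping strongly correlated features and making a 50/25/25 split, can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, the column names are placeholders, the 0.8 cutoff is applied to the absolute Pearson correlation, and of two correlated columns the later one is dropped, which is one simple convention among several. In the talk the choice was made by eye from a color-coded matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): "a_copy" is nearly collinear with "a".
rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
n = 1000
a = rng.normal(size=n)
b = rng.normal(size=n)
a_copy = a + rng.normal(scale=0.05, size=n)
names = ["a", "b", "a_copy"]
X = np.column_stack([a, b, a_copy])
y = 2 * a + b + rng.normal(scale=0.1, size=n)

# Matrix of absolute Pearson product-moment correlations between features.
corr = np.abs(np.corrcoef(X, rowvar=False))

# Keep a feature only if it is not strongly correlated with one already kept.
keep = []
for j in range(X.shape[1]):
    if all(corr[j, k] <= 0.8 for k in keep):
        keep.append(j)
kept_names = [names[j] for j in keep]
X = X[:, keep]

# Three-way split: 50% training, then the rest halved into test and validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0
)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)
print(kept_names, len(X_train), len(X_test), len(X_val))
```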
16:24
I fit my model. OK, so these are the final parameters, but it took me a few iterations, for sure, to be satisfied. And then, how I found out how many estimators I need here: the usual rule
16:40
is to set the learning rate as low as possible and the number of estimators, the number of trees, as high as possible, and if you do
16:47
that your model will run forever, but you should end up with something feasible, and you can start playing around by reducing it. The way I found it was to
16:58
apply a method which is called early stopping on the validation set; it comes a little bit
17:06
later on.
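The talk does not spell out how early stopping was implemented, so the following is only one common way to do it with scikit-learn: fit a generous number of trees once, then use `staged_predict` to evaluate the validation error after every iteration and pick the iteration where it is lowest. The data are synthetic and the parameter values are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical).
X, y = make_regression(n_samples=1200, n_features=10, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

# staged_predict yields the prediction after 1, 2, ..., n_estimators trees.
val_errors = [
    mean_squared_error(y_val, pred) for pred in model.staged_predict(X_val)
]
best_n = int(np.argmin(val_errors)) + 1   # +1 because stages are counted from 1
print(best_n, val_errors[best_n - 1])
```

Plotting `val_errors` against the iteration number gives the kind of error curve that flattens out once enough trees have been added.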
17:11
OK, so yes, train and predict: I just push a button. And here is the same graph again; you've seen it before. And I think
17:38
what's interesting is to show quickly the other part. So this is the early stopping which I mentioned earlier, that was part of when I do the
17:46
three-way split. Next is the two-way split, which is in my opinion more common. Here
17:53
the data are split in two: I only have train and test, I don't have a validation set, and to identify
18:00
the parameters for this part I used a grid search. You specify a range of parameters; you can specify it for all of them if you like, but I did it only for two regularization parameters, because those are the most difficult ones. So I specify here max depth, for which I know I had 6, so I go one to the right and one to the
18:17
left, and I know from theory it shouldn't be higher than 8, so I don't go there. And for the learning rate I had 0.05, and I want to include values around it and see how
18:28
it works. What happens is that you get a matrix of different combinations of parameters, and each time the model is fitted, run after run, and eventually the one which gives the highest accuracy is chosen, and it tells me which parameters are the best.
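A grid search over maximum depth and learning rate can be sketched with scikit-learn's `GridSearchCV`. The grid values below are illustrative placeholders, not the grid from the notebook, and the data are synthetic. For a regressor the default score that `GridSearchCV` maximizes is R-squared, which is assumed here to be what "accuracy" refers to.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical), split two ways: train and test only.
X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative grid: every combination of these values is fitted.
param_grid = {
    "max_depth": [4, 6],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, subsample=0.75, random_state=0),
    param_grid,
    cv=3,   # 3-fold cross-validation within the training set
)
search.fit(X_train, y_train)

print(search.best_params_)             # best combination found on the grid
print(search.score(X_test, y_test))    # R-squared of the refitted best model
```

The cost grows with the product of the grid sizes times the number of folds, which is one reason to restrict the search to the parameters that are hardest to set by hand.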
18:46
You can see the output, the best hyperparameters: it says that the learning rate should be about 0.1 instead of 0.05, and the max depth can be a bit more shallow, but it's very
18:57
close. If I fill in those parameters and keep all the other parameters
19:04
the same, what I
19:07
get is very similar results. Again, here you can see
19:10
it's about 50 percent for both the training and test data, which is good. OK, so again we see the same graph, which is good. It means
19:19
I have the same algorithm but applied two different types of data partitioning: one time I did a three-way split and used early stopping to find the number of iterations, and the other time I split into two parts and used grid search to find the best parameters. I changed the parameters, and still my model does, I think I have to finish, my model does give similar
19:43
results, which is good. OK, I think I'll stop here because it's not doing well. Thank you very much for your attention. If you have questions...
20:01
Yes? Did you compare
20:09
the results with, I don't know, random forest?
20:13
I ran some results with just a normal random forest, and the mean error there is slightly higher, but I didn't show you because I was a bit stressed with all this. I also did a normal least squares regression, and again it shows that that model does less well. So yes. OK. And do you
20:35
have the data on the local global yes it's some ideas that you can see it you
20:40
put my ears on them right so
20:42
it's a this thing to person number of
20:51
So when you said there is a link between the temperature and the fish, does that help you then to do more research, is that kind of the aim of this? Yes, basically this was
21:03
kind of the idea. We were very interested to find out, given that this is time series data for 60 years, which is very unique in science, such a long-term collection, which variables are more important, so that we can reduce our dataset to the more relevant ones, and then I could take them into some kind of time series analysis. There's no one
21:27
object and you have some sort of 4 quality of research efforts and those you that was blamed machines