
Forest species distribution modelling and spatial planning in R


Formal Metadata

Title
Forest species distribution modelling and spatial planning in R
Series Title
Number of Parts
24
Author
License
CC Attribution 3.0 Germany:
You may use, modify, and reproduce the work or content in unchanged or modified form for any legal purpose, and distribute and make it publicly available, provided that you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Release Year
Language

Content Metadata

Subject Area
Genre
Computer Animation
Transcript: English (automatically generated)
So what the talk will be is just giving you a little bit of context, of theoretical context at the beginning, so to have an idea of what we're talking about. And then we are going more into the data, the predictor variables that we're going to use.
We're going to analyze a little bit more the modeling framework that Tom presented this morning. And what he showed this morning was things like variable importance and accuracy assessments. Those kinds of things are products of the models that can help us understand what can
be useful to increase, for example, the accuracy of the maps that we're going to prepare. And what kind of tests do we use to understand if the map is accurate or not, if we can use it, and how to calculate this error. So what's the confidence interval when you're going to use these maps? So what's the species distribution modeling? In this case, we are using correlative species distribution
models, where basically the whole concept is to try to map the area where a specific species or taxon can be located in space and, in this case, also in time. So the geographical distribution of the niche of a specific species varies in both space and time. To do that, we need some training data, so basically the locations where this species is found in both space and time.
So it's like in a spatial temporal framework. And these, of course, in these cases, we are using just the presence of the species. So if the species is present there or not, for species distribution models, more things can be used. For example, the species richness, if you're going to model multiple species or the number of individuals that we have in a specific location.
And after you have all these points, you associate these points with the value of specific predictor variables in that space, in that specific geographic location. And then you fit a model that can be just a linear regression. So you just do an interpolation of the points.
And then after you have the models ready, you can predict in the region of interest. So the idea is to have either the presence or the probability of presence of a species in areas where you don't have observations, so where you don't have survey locations for the specific species. So as I said, in this case, we are talking about the probability of occurrence.
When we're going to produce this map, each pixel will have a value that goes from zero to 100, and this measures the probability of the species being in that specific pixel. In this case, we are going to use 30-meter pixels, so the spatial resolution of the map will be 30 meters.
And as you can see, the probability of occurrence of one species can be modeled as a function that uses predictor variables for the climate; the terrain, so things like, as we said here, the digital elevation model or the slope, aspect, and other kinds of derivatives that we can get from the digital elevation model;
and spectral reflectance, so basically satellite images. In this case, we are going to use just Landsat, because Landsat is a satellite mission that covers the whole world from the 80s until basically now, while, for example, higher-resolution maps that come from other satellite missions like Sentinel, which go down to 10 meters resolution, are more recent.
They just started from 2015. So we cannot really predict back in time. And in this case, we are going to use other distribution maps. So species that can be found in the same place where the species that we are trying to predict are found.
So basically, if you go in a field and you see 10 different species in the same square meter, that's the same thing that we are going to use. So distribution maps of other species. But this kind of information, biological information about either competition or positive interactions are usually very difficult to model.
So we don't have really much data that cover a continental scale in this case. So yeah, as Tom showed in this little comic in the previous lecture, modeling is easy. So you just take the data, you throw it to the model, and then you get an output.
And then you just continue to do the same thing until you get the result that you want. But it's not actually like that. So in this case, for species distribution models, you have to understand what kind of data you have. So if the training data, the points, are actually only surveying the presence of the species,
or if you have a complete survey of a specific area where you have, for example, a grid; and in this grid you have systematic sampling, and each point will tell you, okay, here we have the species, here we don't, so you have a complete idea of what's happening in the specific area. Or you use some specific modeling techniques to derive other points from the background area and say, okay,
these are climatic conditions where the species cannot be present because we know it from the literature, and so we create an artificial dataset. Then we have to understand if the data is clean. So we have to consider all the things that apply to data science and to biological data:
see if the distribution is skewed, if some covariates need rescaling, or if we have missing values, because some models cannot work if you don't have data for all the entries of an observation. Then you also have to understand the problem. As I said, we can model different things; from ecology, we know
that each species covers different areas in the geographical or environmental space. So you have something like the fundamental niche or the potential niche, which basically comprises all the locations in the environmental space where a species can be present. And then you have the realized niche, which instead is a smaller subset of this big environmental space.
Because, for example, you have cities in a specific area. So the original conditions, climatic conditions could be suitable for the species, but you have something else that altered the original conditions and consequently, you cannot have the presence of a specific taxon.
And then, based also on the problem that you're going to face, you have to choose the proper tool. So which algorithms do you decide to take into account for the problem? Also, which predictor variables, because you cannot take all the information that you have in the whole world and put it into a computer, or you'll just have a computer exploding instead of a model.
And of course, some models need to be fine tuned. So there are some specific values that you have to give to the model to get a response. So here is like an overview of what I was talking about. So the data set and the models are also strictly interdependent.
So depending on which model you are trying to use, you also have to consider which kind of dataset you have available for a specific species, because you may not have all the data available for very rare species or species that are, for example, endemic to a certain region. So you have to consider if you have a presence-only dataset,
and in this case you have simpler models, let's say, or a presence-absence dataset, so, as I said, locations where you have the known presence of a species or the known absence of the species. And in this case, you have true absences, where you really know, or pseudo-absences, where instead you have to assume that the species is not there. And there are multiple kinds of absence. An absence can be an environmental absence, which is the more obvious concept:
you have environmental conditions where a species cannot live, like the polar bear cannot live in the Sahara Desert, and species that are more common in the desert cannot live in the Arctic or the Antarctic, these kinds of things. Methodological absences are actually the more common data in these kinds of problems, because not all of the region of interest that you are
trying to study, which can be a country or a continent, has data or information about absences across the whole geographical space. So you have somewhat skewed or biased data; you don't have a complete representation, from the statistical point of view, of what's happening in the environment that you're trying to study.
And contingent absences instead are, as I was mentioning before, related to the realized niche: basically areas where the species could live, but where the species is not present due to numerous kinds of factors, like, for example, competition with other species.
So you have a predator, maybe that is very active in the area and the species basically can go to the brink of extinction or basically historical factors. So if you just change, for example, the land use from a forest to a grassland, of course,
it's going to be difficult to find the tree that you're interested in. Or the size of the suitable habitat: for example, the area is too small, and then these conditions, even if they are okay, are not enough for the species to survive in that specific area. And there are different techniques, I put some references here if anyone is interested, on how you can
derive these absences and artificially generate the points that I was mentioning. Data and covariates, yes. So we're going to analyze these datasets. As Tom mentioned this morning, it's a
very large dataset. We're not going to use all of it, because it's more than 6 million points, and you may not have a laptop that's powerful enough in terms of RAM to use the entire dataset. Also, the computation time, even on a high-performance computer, can reach two days. So we're not going to do that in this tutorial.
Here we're just going to use a small subset. And we are going to cite this, of course; like every dataset, the description, access, and download of the entire dataset are available on our website. So here now we can go to the markdown. I hope that everyone already downloaded the book that Tom mentioned this morning.
So this is how the document that you saw on the website looks. I think you're not seeing the screen. Let's see. There we go.
So this is how it looks in the R Markdown document: each chunk of code that you see in the book is actually a chunk of code that you're going to execute in RStudio up here.
In the first chunk of code, of course, we load this small dataset and we see which libraries are needed, for example mlr, or dplyr and data.table, which are two libraries that you're going to need to do operations on the dataset.
And this dataset has a very high number of columns. So in this case, we are just showing the first eight, which are the metadata related to these points: the ID of the points; the tile ID (basically, the problem is that when you work at continental scale at high resolution,
you have to use a tiling grid, a tiling system that allows you to, for example, predict only a specific area of interest, and that's what we're going to do today: we're not going to produce a map that covers the whole of Europe, but just a small area); the coordinates; and this atlas class, which basically is a column that just shows you if the point that we have in the dataset is a presence point,
so a one, or an absence point, so a zero. And in this case, of course, we use two kinds of absence and presence points.
So in this case, the true presence comes from the National Forest Inventory Data set. It's like a big ensemble of presence points available for Europe. And the true absence comes from land cover points. So points where we have cities, where we have croplands, while the pseudo absence comes instead from other tree species that may be present in a specific area.
And in this case, go back to the slides. So this is how the data set looks like if you open it in QGIS.
So for each point, you have a certain amount of metadata, and you see which species we have available. In this example, we are interested in this species, Fagus sylvatica. And then you have a lot of other different points that can be other species, or absence data because they are located in cities or areas that are not forested.
As I mentioned, the true absence data comes from this dataset, the LUCAS dataset, which basically covers different kinds of classes. While here you can see what I mean when I talk about true and pseudo-absences. So here in the first square, you can see the known distribution of this species in Europe.
But this is just like a static distribution. So we know from literature that roughly this is the area where the species can occur. And this is a very low resolution map. So we have more than 10 kilometers of uncertainty of finding this specific tree in the environment.
This is the distribution of true absences coming from the LUCAS points. And these are the other trees that we have available, which are basically used for this specific model only when they are outside of the known distribution of the species.
Something that you have to consider when you have these kinds of datasets, especially if they come from citizen science observations like iNaturalist or PlantNet, where basically people just go around, take a picture of a specific species, upload it, and it's then validated, is that these points may not be evenly distributed in geographical space.
You may have areas that are highly clustered, and usually these areas are next to roads or cities, while the more remote areas, of course, have fewer observations. To apply statistical techniques that can give you an unbiased idea of what's happening in the environment,
so to avoid the overrepresented areas getting a high probability and the areas where you don't have any observations getting a zero just because you don't have points there, you can basically harmonize the spatial configuration of these points.
So in this case, we used this reference; you can check it out later. It's a package for R called spThin, and it's a way of thinning points within a given distance. In this case, for our model, we use a two-kilometer grid,
but in the example from the paper it's 10 kilometers or more. You check the spatial distribution of these points and remove some of them, shown here in purple, if they are too close and clustered, and you basically just consider the red points for the model, which gives something more regular, more similar to the kind of observations you see here.
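As a rough illustration of this thinning step, here is a minimal sketch with the spThin package; the `occ` data frame, its column names, and the simulated coordinates are hypothetical, and this is not the exact code used in the tutorial.

```r
# Minimal sketch of distance-based spatial thinning with spThin
# (hypothetical occurrence table with made-up coordinates)
library(spThin)

occ <- data.frame(
  species = "Fagus sylvatica",
  lon     = runif(500, 5, 15),
  lat     = runif(500, 44, 50)
)

thinned <- thin(
  loc.data = occ,
  lat.col  = "lat", long.col = "lon", spec.col = "species",
  thin.par = 2,        # keep records at least 2 km apart
  reps     = 10,       # repeat the random thinning 10 times
  locs.thinned.list.return = TRUE,
  write.files = FALSE, write.log.file = FALSE
)

# keep the repetition that retains the most points
occ_thin <- thinned[[which.max(sapply(thinned, nrow))]]
```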
Speaking instead of the covariates, we are using 300 covariates. So as we said: climate, terrain, reflectance, and competition, that is, other species' distribution maps. We can check some of those here.
So for example, bioclim is like a dataset that includes bioclimatic variables. So they are long term measures of what's happening in terms of temperature, precipitation in the environment. So you just do an average of the last 30 years. In this case, we covered the period from the 80s until 2010.
So in this way, we have an idea of the long-term influence of the climate on these species. And here we have a product derived from the ERA5 climatic dataset, so it's a reanalysis; we are going to use this as a time series instead of a long-term average.
So for the previous dataset, you will have one value for every point; in this case, instead, you have a different value every year, and they are aggregated, of course, by month. So we have monthly values repeated over 20 years, so it's 240 values for each one of these variables.
These are also available on the data viewer that I showed you this morning, so if you want to check them out, you can download them from there. And for the reflectance, we use the Landsat GLAD ARD. Maybe you're not familiar with satellite images, but this data is available for free.
And there is a difference between what's available as the official dataset, or the Google Earth Engine version of this dataset, and the ARD data: basically, you have different satellites and different sensors across these 40 years of the satellite mission,
and not all of these values can be used for long time-series studies because they are not harmonized. This instead is a harmonized version of the dataset. And now we can start with the modeling, and we can go back to our computational notebook.
There we go. So we looked at the metadata already, and here we are starting with the modeling. So the modeling can be split in three different phases.
We can do feature selection. So if we have a lot of predictor variables that can be used for this model, we can just select the ones that are most important. And usually with classical models, what you would do is just to check linear correlation between these variables.
And then after you compute this correlation, normally using either Spearman or Pearson correlation, you apply a threshold, typically a value above 0.85 or 0.9, and you can choose the threshold that's most appropriate for you.
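To make that classical correlation filter concrete, here is a minimal sketch, assuming a hypothetical data frame `covs` that holds only numeric covariate columns.

```r
# Minimal sketch of a correlation-based covariate filter
library(caret)

cor_mat  <- cor(covs, method = "spearman", use = "pairwise.complete.obs")
too_high <- findCorrelation(cor_mat, cutoff = 0.9)  # indices of redundant covariates

covs_reduced <- if (length(too_high) > 0) covs[, -too_high] else covs
```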
But in this case, we are going to use as models mostly decision trees and nonparametric models, and using linear correlation with a non-linear model is not the right way to go. So what we're doing is actually using a specific Python library.
That's what we used originally; now this is also available in mlr3, as Tom mentioned this morning, this new library. So what we actually do is use a random forest model to select which subset of variables, so
not only how many, but also which ones, minimizes the error according to a specific performance metric. In this case, we're using log loss, so the lower the value, the better. And we are running different repetitions of this random forest classifier.
It's not a very complex model; in this case, we're just using 50 trees. And we take the average of the minima of this function across repetitions. So in this case, the minimum you can see here is at 330, so for this iteration it will be 330. Then you run it another five times,
take the average, and that will be the number of covariates that you're going to use for the model. In the case of the dataset that's available in the book, the feature selection has already been conducted; the number of features that you have in there is 238, I think, and it is already the minimum number of features that you need to use to get the highest accuracy in these maps.
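For readers who want to see the shape of this procedure, here is a rough, simplified sketch of importance-based backward elimination with a small random forest; it assumes a hypothetical data frame `train` with a 0/1 column `presence` and many numeric covariates, and it is not the tutorial's own feature selection function.

```r
# Rough sketch: drop covariates by importance, track out-of-bag log loss
library(ranger)

logloss <- function(y, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)
  -mean(y * log(p) + (1 - y) * log(1 - p), na.rm = TRUE)
}

y     <- factor(train$presence)
covs  <- setdiff(names(train), "presence")
sizes <- seq(from = length(covs), to = 10, by = -10)  # candidate numbers of covariates
loss  <- numeric(length(sizes))

for (i in seq_along(sizes)) {
  fit <- ranger(x = train[, covs, drop = FALSE], y = y,
                num.trees = 50, probability = TRUE, importance = "impurity")
  # out-of-bag predicted probability of presence (class "1")
  loss[i] <- logloss(as.numeric(y == "1"), fit$predictions[, "1"])
  if (i < length(sizes)) {
    # keep only the most important covariates for the next, smaller model
    imp  <- sort(fit$variable.importance, decreasing = TRUE)
    covs <- names(imp)[seq_len(sizes[i + 1])]
  }
}

best_n <- sizes[which.min(loss)]  # the real workflow averages this over 5 repetitions
```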
Another function that I put in the computational notebook that we are not going to run, but you can check the code later, it's for the hyperparameter optimization. So in this case, we are just using three learners.
So random forest, XGBoost, and a GLM; the GLM is a linear model, so we don't really need to fine-tune it. And what we do is actually train a different model for each combination of these parameters. So for example, if you just use the number of covariates that we have, which is 238, for random forest,
you actually run random forest with a certain amount of trees and a certain amount of covariates different times, like 10 times. So you run 10 different models, and then the one that gives you the highest accuracy will be selected as
the hyperparameter value of mtry for the final model. And for XGBoost, for example, with this combination of parameters, you see the minimum value and the maximum value, and each iteration just trains with 10, 11, 12, and so on,
with these fixed parameters, and then another iteration will try another combination. So in total here, it gives you 72 different XGBoost models. And then among these 72, the one that's going to be selected is the one with the highest accuracy, and the combination of parameters of that model will be used in the final model.
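A minimal sketch of this kind of grid search, using XGBoost's built-in cross-validation; the parameter ranges and the objects `X` (covariate matrix) and `y` (0/1 labels) are hypothetical, so the number of combinations differs from the 72 mentioned above.

```r
# Minimal sketch of a hyperparameter grid search with xgb.cv
library(xgboost)

grid   <- expand.grid(max_depth = c(4, 6, 8, 10), eta = c(0.1, 0.3, 0.5, 0.75))
dtrain <- xgb.DMatrix(data = X, label = y)

grid$logloss <- apply(grid, 1, function(p) {
  cv <- xgb.cv(params = list(objective = "binary:logistic", eval_metric = "logloss",
                             max_depth = p[["max_depth"]], eta = p[["eta"]]),
               data = dtrain, nrounds = 50, nfold = 5, verbose = 0)
  min(cv$evaluation_log$test_logloss_mean)
})

grid[which.min(grid$logloss), ]  # this combination goes into the final model
```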
And finally, then we have this other function that instead trains the final model, which is this one train spatial temporal ensemble model.
So in this case, what you do is train the model and you create an object, which is this tmd.ml, which is a list that contains all the models that we trained.
So for example, the first one is the random forest model. The second one is the XGBoost model, and the third one is the GLM. And the fourth one, if you remember the first slide, where we say that the probability of occurrence is a function of different covariates, is the formula with the response variable.
So in this case, the column where we have stored the presence or absence of the species, and then all the covariates that are used for the model. And here you can see in the learner model, all the parameters that have been used, which ones are actually the final combination.
So for example, here for XGBoost, we can see the final values for the parameter selected. So what we see is that the final model actually shows 10 for the max depth parameter, 0.75 for the eta parameter, and of course, all these parameters mean a specific thing for the model.
So you have to check and study each model that you're going to use. So after these models are set, you use the formula, the tuned models that I showed you before,
and then the dataset; of course, this is a subset of 5,000 points that we're using to make things faster. And with the blocking factor, you then have the final model, which is an ensemble model based on stacked generalization, as Tom mentioned this morning.
There are different ways of doing an ensemble; as we said, averaging is the simplest one, while stacked generalization just uses the output of every learner that you trained at the beginning. And every learner, as we said, will give you a probability value from 0 to 100. So at the end, you will have a dataset with three columns:
each row is one observation, and each column is a specific probability value for that point. The probabilities from these three models are then used as training data for a second model.
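To make the stacking idea concrete, here is a minimal, self-contained sketch, not the trainSPML function itself; the `train` data frame is hypothetical, and for brevity the base-learner probabilities are in-sample or out-of-bag, whereas a proper stack would use cross-validated predictions.

```r
# Minimal sketch of stacked generalization with three base learners
library(ranger); library(xgboost); library(glmnet)

y <- train$presence
X <- as.matrix(train[, setdiff(names(train), "presence")])

rf  <- ranger(x = X, y = factor(y), probability = TRUE, num.trees = 500)
xgb <- xgboost(data = xgb.DMatrix(X, label = y), nrounds = 50, verbose = 0,
               params = list(objective = "binary:logistic"))
lin <- cv.glmnet(X, y, family = "binomial", alpha = 1)

# one probability column per base learner
meta_X <- data.frame(
  p_rf  = rf$predictions[, "1"],                                        # out-of-bag
  p_xgb = predict(xgb, xgb.DMatrix(X)),
  p_glm = as.numeric(predict(lin, newx = X, type = "response", s = "lambda.min")),
  presence = y
)

# the meta-learner: a logistic regression on the three probability columns
meta <- glm(presence ~ p_rf + p_xgb + p_glm, data = meta_X, family = binomial())
summary(meta)  # coefficients and p-values, as discussed above
```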
In this case, for example, we see that among the coefficients, random forest has the highest coefficient value, but if we then check the p-values, the GLM is actually the most significant. And this is in the final model, but not in the accuracy assessment; that's something we are going to see later. Now, from the hyperparameter optimization part,
to give you an idea of the influence of doing this, the maps can be very different if you use models that are not fine-tuned. So in this case, you have just the default parameter values, and you see that, for example,
XGBoost is actually the one that suffers the most from not being fine-tuned, because in this case, all the values, as you can see from the scale, are around from 40 to 60. So they are really not big extremes. So every pixel is kind of predicted as having the same value. GLM is very accurate instead, and random forest also, more or less.
While if you do hyperparameter optimization, most of the models will, let's say, conclude that this area should actually be predicted differently. So you have a more specific prediction, something that matches your dataset better.
Of course, you have to be careful about not overfitting, as we discussed this morning. So basically, in this case, if you want to have the final map for this specific area, you take all the pixel values for these specific areas, and all these pixel values will be used to train the final model.
Then we talked about the variable importance. So the nonlinear model especially has an internal algorithm that will basically tell you
which among these features that you use to train the model are the most important to give you a specific value of accuracy. So in this case, the highest value possible. The thing is that, of course, this thing will just tell you which of these variables
will give you the most accurate predictions for this model. But that doesn't mean that you have, for example, causation. So let's say, for example, that the occurrence of these two species indicates the presence of the species you're trying to model,
Fagus sylvatica, the European beech. That doesn't really mean there is a biological meaning behind it; this is a correlation, but it's not a causation. The same thing could have happened if you had, for example, precipitation or temperature as the first covariate.
If you have 10 degrees in a specific area, that doesn't mean that if you increase or decrease the value of temperature, you will have biological meaning of what's happening there. You have to study it. So, for example, for the climate, we know that the species are basically influenced by the conditions of precipitation and temperature.
But for other covariates that have no specific meaning in terms of biological value, you may not know if the covariate has causation and not just correlation. And in this case, we're just looking at the variable importance from random forest. But things are actually more complicated than this.
So as we said, we have three models, and all of these models predict and use a different part of the feature space. That's why in this ensemble modeling, when you're going to choose which models are part of the final ensemble model, you have to choose models that are very different from each other.
So in this case, we are using two decision tree algorithms, Random Forest and XGBoost; Random Forest is itself based on an internal ensemble (bagging), while XGBoost is based on another technique called boosting. And on the other hand, you have a linear model, which is this LASSO regression.
So in this case, each model will figure out what's happening in the feature space in different ways. So you hope that these three models will capture different parts of the feature space and that the final model will take these conditions and will put them all together. That's why you always assume that the ensemble model would be better,
or at least as good as the best component model that you're going to use. So here, for example, in the computational notebook, we just have the Random Forest variable importance. But in the slides, you can see that I put the component,
the variable importance of all the component models. So this is, for example, the variable importance for Random Forest, as we saw also in the computational notebook. And this instead is for XGBoost, and this is for the GLM. As we can see, each algorithm takes into account different parts of the feature space.
So, for example, the GLM uses more of these spectral indices to correctly classify a specific area, while algorithms that are more similar, like Random Forest and gradient boosting, which are both based on decision trees,
mostly choose the same ones as the first and most important covariates, as you can see. So they kind of capture the same part of the feature space, while the GLM captures a part of the feature space that's usually ignored by the other two. So in this way, you have a better representation of what's happening in the environment.
And for the last part, the accuracy assessment: as I mentioned, it's a way of assessing how good your map is.
Usually, as I said, we already lack absence data for training the models, and you also don't have an independent dataset that you can use for the validation of your maps. So the more common thing in this case is to use cross-validation, so you split your dataset; usually, in this case, we use a five-fold cross-validation.
So you split the dataset into five parts, use four parts of the dataset to train the model and the last part to validate the model, and you iterate this operation five times,
so that each part of the split is used in turn for training and testing. And in this case, we also use spatial cross-validation: you take all the observations that are close in geographical space and, when you split the dataset, put them in the same fold.
This has already been demonstrated, in these kinds of situations, to be more accurate than using normal cross-validation, which will give you over-optimistic accuracy values. And what we do is basically create three new learners that are the same ones that we use for the final models.
We assign the hyperparameters from the final model we trained before. Then we create classification tasks. So this is how the training actually works: the things that you see here, the classification task,
or how you create the models, are already wrapped up in the previous function that I showed before. So for example, if you look at this trainSPML function that I mentioned before, it's actually a wrapping function for these three algorithms,
and it does everything I'm showing you now under the hood. So if you want to really study how this library works to train different models and handle different datasets, you can look at these functions. So what we do, as I said, is this spatial cross-validation; we do a resampling of the dataset with these three models.
And in this case, we use the log loss as an accuracy metric. We can use different performance metrics like the overall accuracy or the F1 score, the kappa. In this case, since we are in a probabilistic space, we prefer to use log loss because it usually gives you more realistic accuracy values.
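A minimal sketch of spatial cross-validation with log loss, shown here for a single learner; the spatial folds are built by k-means clustering of the coordinates, which is a simplification, and the column names `x`, `y`, and `presence` of the hypothetical `train` data frame are assumptions.

```r
# Minimal sketch of 5-fold spatial cross-validation with log loss
library(ranger)

logloss <- function(obs, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)
  -mean(obs * log(p) + (1 - obs) * log(1 - p))
}

set.seed(1)
folds <- kmeans(train[, c("x", "y")], centers = 5)$cluster  # 5 spatial blocks
covs  <- setdiff(names(train), c("x", "y", "presence"))

cv_loss <- sapply(1:5, function(k) {
  tr  <- train[folds != k, ]
  te  <- train[folds == k, ]
  fit <- ranger(x = tr[, covs], y = factor(tr$presence),
                probability = TRUE, num.trees = 500)
  p   <- predict(fit, data = te[, covs])$predictions[, "1"]
  logloss(as.numeric(te$presence), p)
})

mean(cv_loss)  # spatially cross-validated log loss for this learner
```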
So in this case, we did the resampling operation for both the base learners and for the ensemble model. And then we can check the final log loss values.
As I mentioned, the log loss is actually a loss function. So you have a value that's zero point something. You really have to see what you're trying to calculate. So let's use here the results from the computational notebook that's already online.
For example, you get values like 0.16 or 0.081. So if you don't really know these accuracy metrics, these numbers may mean nothing to you. So what we did is actually create a pseudo R-squared that uses the log loss of the model that you get, so these four values that you see here, and you divide this value by the baseline log loss,
which in this case is the log loss value you would get for a model with random assignment. So a model that basically just makes a 50-50 decision without using any covariates; you use this as a benchmark for your model.
If your model is worse than a guy just covering his eyes and saying this point is a one and this point is a zero, then your model is basically useless, so don't even try to use it. In this way, we standardize this into an R-squared-like value that's more intuitive and easier to understand:
a value that goes from zero to one, where the closer you are to zero, the worse your model is, and the closer you are to one, the better your model is. And in this case, we can see that these are the final pseudo R-squared values for the three component learners and for the final model.
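Written out, this pseudo R-squared is essentially one minus the ratio of the two log losses; the 50-50 baseline corresponds to a constant prediction of 0.5, whose log loss is -log(0.5), roughly 0.693.

```r
# Pseudo R-squared from the model log loss and a 50-50 baseline
pseudo_r2 <- function(logloss_model, logloss_random = -log(0.5)) {
  1 - logloss_model / logloss_random
}

pseudo_r2(0.16)  # closer to 1 means a better model
```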
So we saw before, from the model coefficients, that the GLM was the most important, but in terms of performance it's actually the worst, and XGBoost is actually the best of the three component learners. And the ensemble has slightly higher accuracy values compared to all three component learners.
So in this case, we have a proof that the ensemble model is better than the three component models. We know from the variable importance analysis that the ensemble model covers a larger part of the feature space, so you have fewer extrapolation problems. So in this case, it's a trade-off on your side.
You have to decide if the reduction in extrapolation and the increase in accuracy compared to, say, one of the base learners are enough to justify spending the computational time and the physical resources,
if you have to buy them, to actually train this final model, because as we mentioned, it's very computationally expensive. To give you an idea, if you're not going to use the files that are already in the book and you process everything yourself, this will take you maybe 15 minutes in total.
But if you use all the points that we saw before and all the covariates, and do the feature selection as well, then the computational time increases to as much as two days, and not on a laptop, but on a server with the 96 cores that you can see here and 500 gigabytes of RAM.
So you may not be able to do the whole processing; if you don't have the physical resources, that's something you have to consider. Then, going back to the slides, we can see what I was talking about regarding the accuracy metrics.
In this case, most of the papers that you're going to read about species distribution models use threshold-independent metrics, like the area under the curve, or threshold-dependent metrics, like the true skill statistic. Threshold-dependent and threshold-independent: what does that mean?
Basically, as we said, we have observations that are either zero or one, and the model classifies the final pixel as a zero or a one based on the predicted value. So threshold-dependent means that between zero and one, you use a threshold;
usually the default is 0.5. If the model classifies a specific observation with a value that's above 0.5, that's considered a one; otherwise it's considered a zero. And then you calculate the accuracy based on how many points the model classified
as ones and zeros compared to what they really are. So most of the time, if you have 90% accuracy, that means that among the zeros and ones, 90% of the points were accurately classified by the model and 10% were misclassified.
So they can be misclassified as ones or zeros. A threshold-independent metric, instead, means that you calculate the metric across all thresholds. So you calculate it with a threshold of 0.1 for being either zero or one, then 0.11, 0.12, until you have covered all the possible intervals,
and you get a final number, which is the area under this curve of correctly classified versus misclassified points. As I mentioned, in most of the papers this is the metric that is used, and you will see that most of the papers report very high values.
As I mentioned before, you have these random assignments as a reference: a random assignment gives 0.5, and our model in this case, for this species, has almost one, so it's almost a perfect model according to this metric. If we instead calculate the log-loss-based R-squared, which is calculated in a probabilistic space,
we actually see that it's not as perfect as this metric would like you to think. So you also need to be really careful about which accuracy metric you're using for your problem. You have to understand what's happening and why, for example, you have very over-optimistic values or not.
So in this case, you also have to be careful when you read the papers about what's happening in there. It's like, okay, they are reporting this problem, but are these metrics the correct ones for the problem they are trying to solve and for the modeling framework they used to actually solve it?
Or is it just a way of, I wouldn't say lying with statistics, but showing that they have good results? Let's see. Are there any questions, I think? So this is the... Can you show me where that spot is?
Sure. Next. There we go. So here you can see, as I said, the different accuracy metrics.
So for example, for this species, according to these accuracy metrics, the model is perfect, while it is actually the worst species among all of them. So what would you look at in this problem? But that's very tricky. Yeah, it's very tricky to understand which metrics you have to use.
Yeah, it's amazing. Completely opposite. Yeah, you may have, as I said... The same model and completely opposite conclusion. So which one do you use? Since in this case, we are modeling in the probabilistic space, and we have, as I said, pixel values that go from 0 to 100,
the log loss actually gives you an idea of how much a specific observation classified by the model differs from being a 0 or from being a 1. So they measure, even if you have, for example, an observation that's accurate, like the real observation is a 0 and the model classified is as 0.6 or 0.7,
this logarithmic loss measures the distance between the classified point and the true observation value. And then you have, statistically speaking, it's more complicated to explain, I can show you the formula and everything, but it's out of the scope of this lecture.
And in this way, you have a more representative metric of what is actually happening in your model. So I think that if you use probabilistic space instead of just 0-1 classification, it's better to use the log loss value.
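For reference, the formula the speaker alludes to can be written in a few lines; `y` are the 0/1 observations and `p` the predicted probabilities of presence.

```r
# Logarithmic loss for binary observations and predicted probabilities
logloss <- function(y, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)      # avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

logloss(c(0, 1), c(0.7, 0.9))  # the 0 predicted as 0.7 is what drives the loss up
```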
So I think we also have some conclusions in the notebook. So here we mention other key points that you have to consider when you do ensembling.
So as we said, the overfitting problem and the computational problem; but if you want to optimize your computing performance as well, there are a few things to consider: try to run operations in parallel,
so as to use all the cores available on your machine; try to do fine-tuning and hyperparameter optimization (you saw in the prediction slides that I showed you before the difference between a non-fine-tuned model and a fine-tuned model); and, in this case, use a tiling system. It's very difficult, if not impossible, without good computational infrastructure, and in some cases not even then,
to produce, for example, predictions at a continental scale at high resolution, because you don't have enough memory to process everything at once. So it's better to split the whole region; in this case, we use 30-kilometer tiles over all of Europe to run these predictions in parallel.
So you will predict only a specific portion of the whole continent multiple times in parallel. So in that way, then you will create a mosaic, so you take all the tiles together and you create the final map.
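A minimal sketch of this tile-and-mosaic pattern with terra and parallel; the file layout (one multi-band covariate GeoTIFF per tile), the fitted ranger model `model`, and the covariate names `covs` are hypothetical, and forked parallelism via mclapply works on Linux and macOS.

```r
# Minimal sketch of tiled, parallel prediction followed by a mosaic
library(terra)
library(parallel)

tiles <- list.files("covariate_tiles", pattern = "\\.tif$", full.names = TRUE)

predict_tile <- function(tile_path) {
  r <- rast(tile_path)
  names(r) <- covs                     # band order must match the training covariates
  out <- predict(r, model, na.rm = TRUE,
                 fun = function(m, d) predict(m, data = d)$predictions[, "1"])
  out_path <- file.path("predictions", basename(tile_path))
  writeRaster(out, out_path, overwrite = TRUE)
  out_path
}

done <- unlist(mclapply(tiles, predict_tile, mc.cores = 8))  # one tile per core

# mosaic all tiles into the final map as a lightweight virtual raster
final_map <- vrt(done, "probability_mosaic.vrt")
```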
And then, of course, some general recommendations on the whole workflow, as I mentioned before. Try to define the target of interest of your model: what's the problem that you're trying to solve? Whether you're trying to model presences, species richness, or species occurrence,
what could be the most important features that you have already to collect for your model? Then, of course, prepare a space-time overlay with a dataset that's like the one that I showed you here, where you have the metadata related to a specific occurrence point, and then all the features that are going to be used for modeling, and create a dataset.
So the matrix in this case is a classification matrix, and the problem that Tom showed you before is a regression matrix. And then, of course, try to reduce model complexity as much as possible without losing
accuracy or performance, depending on which metric you're using. Of course, also try to include something that is now getting increasing attention, also in the papers: whenever you produce a map, it's actually recommended to produce an uncertainty map,
so that you can actually inform users that your map has a specific value. So in a specific area, you have either 90 or 100 percent of the probability of occurrence of a specific species. But also try to let them know that your model may be wrong, and how much this model can be wrong in that specific place.
So you just don't try to say, okay, I have an R-squared of 0.85 with plus minus 10 percent. Plus minus 10 percent on the whole European scale in every pixel may be different. You may have, for example, problems of extrapolation, so where you don't have any information of what is actually there, where the model can be more than 50 percent incorrect.
So in that case, if you want to use information from that map in that specific area, you have to be careful on how to use this information. And then these maps can be used for multiple post-processing analysis, like change detection or trend analysis,
in this case, since we have the probability of occurrence, to see if the probability in a specific area is increasing or decreasing for some reason. For example, in this case, we can actually detect if the probability of presence of a specific species is decreasing because, for example, there was a wildfire,
so there are no more suitable conditions for that species in that specific moment in space-time for the species to be there. And that will change, of course, over time. Or, for example, you may have a felling, so people are just harvesting that specific forest: the specific forest is completely empty, the conditions are still suitable, but you
don't have a response from the species there, so you don't observe it. In that case, the model actually tells you, okay, there's a zero percent probability that the species you're trying to find is here. And this is more of a technical thing that I think was already covered this morning in the other track:
produce predictions as cloud-optimized GeoTIFFs; they are also faster to visualize, and with internal pyramids you won't have very slow rendering of the map, so you can zoom in and out of the map and analyze different areas very fast.
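A minimal sketch of writing such a file with terra, assuming a hypothetical SpatRaster `final_map`; the COG driver (GDAL 3.1 or newer) builds the internal overviews, i.e. the pyramids mentioned above.

```r
# Minimal sketch: write a prediction as a cloud-optimized GeoTIFF
library(terra)

writeRaster(final_map, "fagus_sylvatica_probability.tif",
            filetype  = "COG",
            gdal      = c("COMPRESS=DEFLATE", "BIGTIFF=YES"),
            overwrite = TRUE)
```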
And that's about it. So this is the whole lecture, and thank you for your attention. And if you have any questions or you want to go back to a specific point of the lecture, please let me know. Or if you want to check some specific part of the computational notebook, or you want some more information
or references on some of the techniques that I mentioned, or the accuracy metrics references, feel free to ask. We still have, I think, three minutes, four minutes.
The training set is available on Zenodo on our website. The maps are also available.
Yeah, it's in the slides already. So in this case, in this notebook, we are trying to just predict the realized distribution of a species, so where the species actually is.
But what we mentioned at the beginning is that there are different environmental niches that we can try to predict: either all the areas where the species could occur, or where the species actually is. And in this case, on the data viewer, you can find both.
Yeah, one second. I will just prepare the setup.
Using the compare tool, you can check the differences. So in this case, on the left we are modeling the potential distribution, so basically all the locations in Europe, shown on a probability scale: everything that's pinkish is below 10% and everything that is dark green is above 90%,
so it's basically certain that the species could be there. These are all the conditions in Europe where the species we are interested in, the European beech, could potentially be present if there were, for example, no cities, and no other species that would have a negative interaction with it,
for example by competing for resources, and where all the climatic conditions required for the species to be there are, of course, met. While on the right, we have the realized distribution, which is the thing that is modeled in the computational notebook, and which basically tells you all the areas where the species is present,
and where it's not present because of all those reasons: as we said, the presence of cities, croplands, unsuitable environmental conditions, pests, diseases, drought, all these kinds of things.
So you can actually check. These can be used, for example, as a baseline to find the areas where you could actually conduct forest restoration plans. So if you want to do forest plantations, for example, you can check all these differences between potential and realized distribution of multiple species.
And you can check, for specific areas, which ones could be planted there. And then, of course, it depends on what you're interested in: whether you want to have more of a production forest, or you want to increase the biodiversity of that specific area. Or, for example, close to the mountains there are things that are called protection
forests, which are used as barriers against avalanches, and there are specific species that are used for that; there are three or four species that are used in combination to create these kinds of natural structures that prevent avalanches from reaching villages and so on.
So this is how you can actually use these maps to plan ahead. In this case, we produce only one map of potential distribution, because the potential distribution is mostly driven by climate. So we don't use information like the satellite data; we don't actually look at the place from above and see if the species is there or not,
we just consider the environmental conditions. While for the realized distribution we do; we actually look at the specific area. And since the potential distribution is just something related to climate, we assume that in the 20 years that we are going to analyze,
so from 2000 to 2020, it will not change much. It will change, for example, 50 years from now, but not within the period of observation that we are going to consider. For the realized distribution, this is not the case, although, as Tom also mentioned, the
realized distribution doesn't change much from one year to the next or from one month to the next. So we actually produced six maps, each covering a different period. The ones in the tutorial are the last three of this timescale: from 2000 to 2014, then you can basically go forward in time and
see the four-year period from 2014 to 2018, and the last period from 2018 to 2020.
So if there are no more questions, I can also stop the recording. Yeah, sure. It's using random forest, so it's like, as you said, yeah, it's based on random forest.
The covariates that random forest chooses. I mean, we use a subset of the dataset, and we use a less complex random forest model than the one that we're using at the end.
As we said, it uses just 50 trees; it's a very simple model. And we also do spatial cross-validation and repeat everything five times, so it's actually 25 repetitions. We try to increase the number of repetitions to have an unbiased view of the feature space, let's say. But then, for the model that we're going to use afterwards,
we do hyperparameter optimization and all the other processes that I mentioned before. So what we use there is one of the simplest random forest models ever, just to do this selection; it's a way of analyzing what's happening in the feature space. So in that way,
you can use random forest, but you can also use other algorithms to do this feature selection. It's based on the variable importance: you check, for example, for 200 covariates, you have a specific accuracy value; then you decrease the number of covariates and remove the ones with the lowest variable importance values.
And you keep repeating until you reach the lowest number of covariates possible, which in this case we set to 10 as a threshold.
Then we can also look at the computational notebooks. So here we have some of the functions that I was mentioning before. So this is the function that we use for the feature selection.
So what we do, in this case, is first try to simplify the dataset as much as possible, because as I said, some algorithms like random forest cannot work with NAs. So basically, the row for a specific observation cannot have a cell with NA values;
you have to consider only the observations that are completely clean, and if you don't have the value for a specific covariate, you have to remove that entry, so you try to simplify as much as possible. Here you create the parameter search space for random forest: we select, for example, the number of covariates, and we set the minimum and the maximum.
In this case, of course, the maximum is 238 and the minimum is the square root of the total number of covariates divided by three. And then you create the same kind of search space for XGBoost, and then you create the classification task.
And after that, you train the whole model. So you can see there's a specific function that does the tuning of the parameters. And you can specify which performance metrics you want to use to minimize the error.
So in this case, we use the log loss, but multiple measures can be used. You specify the number of threads of your specific machine that you want to use for this; and as a way to generalize it as much as possible, instead of putting in a fixed value, we calculate it based on the machine that you are using. The number of trees is set, in this case, to the number of cores that we have available,
so it's just 96. That means that you are training 96 different classification trees, and if we have 96 cores,
each core will grow one of these trees. So this is a way of parallelizing as much as possible and using as many computational resources as possible without sacrificing, of course, too much time. And then the same is done for XGBoost; XGBoost already runs in parallel,
so you don't really need to select the number of trees and cores that you want to use. And the same is then done for the linear model, the generalized linear model; in this case, it's a penalized generalized linear model. So you have a specific parameter for the generalized linear model that can be either 0, 1, or a value in between,
and this is a way of penalizing the covariates that increase or decrease the accuracy of the model. With LASSO, so in this case a value of 1, all the covariates that
decrease the accuracy of the model by a certain amount will basically be excluded from the model by default. If instead you fine-tune this value between 0 and 1, so using 0.3 or 0.4, you try to include as many covariates as possible in your regression model; the closer you go towards 0, the more inclusive you are with the covariates used in your model.
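A minimal sketch of this penalty with glmnet, where this mixing parameter is called alpha; the covariate matrix `X` and the 0/1 vector `y` are hypothetical.

```r
# Minimal sketch: LASSO versus a more inclusive elastic-net penalty
library(glmnet)

lasso   <- cv.glmnet(X, y, family = "binomial", alpha = 1)    # drops weak covariates
elastic <- cv.glmnet(X, y, family = "binomial", alpha = 0.3)  # more inclusive

# how many covariates survive the penalty at the selected lambda
sum(coef(lasso,   s = "lambda.min") != 0)
sum(coef(elastic, s = "lambda.min") != 0)
```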
Then, as we said, we have this final object that is used as input for the final function, trainSPML. You basically take all three models
that you have available, plus the formula from the first part, and then you create a classification task with the final learner, which is a logistic regression model; so this is like a normal GLM with a binomial family and logit link function.
And in this way, you basically constrain all the predictions to an interval that goes from 0 to 1, or 0 to 100 in this case. Yeah, basically here, you recreate the models and you assign the parameters, which is something that I showed before in the accuracy assessment process.
And then if you check this object, which is the final model, you can see in this part,
learner model, supermodel, learner model.
So here, as I mentioned, you have a matrix that is used by the meta-learner, with the ID from the original dataset, what kind of point it was in the original dataset, so whether it was a presence or an absence point,
and what the three models predicted for that point in the feature space: ranger predicts 0.96, which means a 96% probability that this point is a 1, XGBoost goes up to almost 0.99, and it's almost the same for the GLM. All these three values are then
used by the logistic regression to train the final model and get the final value. And this matrix is then used, as I mentioned, in the accuracy assessment for the meta-learner here at the end. You can see that this thing that I just opened here
and I called from the console, it's the same thing that I have in here because we use a different syntax here. We use the dollar instead of using the double brackets, but it's the same thing. On the targets, it's going to be the same column.
And in here, we can see all the performance metrics that we can use. So for example, I mentioned several. In this case, I mentioned log loss, but there are several.
And those can be used for different problems. They can be used for probabilistic models, like in our case: for example, the area under the curve can be used for probabilistic models, but the overall accuracy cannot; it's just for hard classification problems that are not
in the probabilistic space. The F1 score is another one of those. And then you have more specific values like the true positives and true negatives, things that are related to the confusion matrix concept, I don't know if you are familiar with that.
So these are all for classification problems, then you have the ones for regression problems, and some of them are available for both. And there is one part we skipped in the tutorial just for the sake of clarity: for all these maps
that I showed you before and the tiles that are going to be produced, you can generate with the plotKML function something that can then be imported in Google Earth and can also be included in the historical version of Google Earth. So on Google Earth, you can add the historical version
of different locations on Earth, and based on the availability of pictures, you can see how that area has changed in space and time. In this case, we created different KML objects that are imported into Google Earth, and we can see, for example, how the distribution of this species is changing over time.
So basically, we can start from 2012 and see how it looks, and basically the same for 2016, et cetera. As I said, the dates shown in the maps, 2020, 2016, and 2012, stand for whole periods:
2012 means that this map goes from 2000 to 2014, as I showed you in the data viewer before; 2016 likewise goes from 2014 to 2018, et cetera. And then you can zoom in and even look at the terrain.
Let's see. Let's go to a specific area in here. And then you can look in real time if this species is present or not.
So in this case, we were predicting that this species was at almost 90% probability of occurrence in this area. Here, you can see some trees of this species, plus some other weeds and other trees behind. This is an area located in Switzerland;
it's the same one that we are showing in the map and in the tutorial, and it's basically just one of these little squares. As I said before, you have to use a tiling system; in this way, you can just predict one tile at a time. And in this case, these are all the points that we have for this species, which is, I think, in total,
more than 500,000. And then, of course, you have more points. But that tile is basically one area here in Switzerland. And it's 9,690.
So you can see that the conditions in the environment
using OpenStreetMap as a map are the same that we were looking at in Google Earth.
So this is the area in the mountains that we were looking at. And then, of course, there are several plugins that you can use. So in this case, you can actually
look even from QGIS at a specific point on the Google Street View. For example, I can click here, and I can then open it in Google Street View on the browser. There's another plugin that you have to specifically adapt to include and have it in QGIS here on the lower left
that you can use. It's not loading. Maybe take another area covered by the Street View.
We will see the light clouds in here. So this is, no, no, no, no, no.
Maybe here, we can see something.
Yeah, there it is. So if you check your map and then check the coverage of Street View, you can do the same from QGIS instead of opening it in Google Earth. It depends on what you want to do, whether you want to run an analysis or just look at a specific use case scenario. Any other questions on the computational notebook? Did you find anything that's maybe not really clear that you want to explore more? I have a general question.
I'm still a little confused about how you produce the map of the potential distribution of the different species. In this case? Did you use the same dataset, and how did you obtain the two different maps? Yeah, in this case you use the same dataset and the same modeling technique. But what is different for the potential distribution is that you do not include some specific covariates, so the feature space is much more reduced because of the conditions that we are trying to analyze. For example, the competition with other species, so these other 50 distribution maps that we were talking about before,
are not included in that model. The same goes for everything that has to do with human influence on the environment, for example surface reflectance datasets or anything else that comes from the satellites. In terms of temperature, in the datasets that I showed you before, the ERA5 Copernicus dataset, you have temperature observations that are either at surface level, so measured at the ground, or, as most recorded temperatures are, measured by weather stations at two meters above the ground. In that case, having a building or not in a specific area can influence the temperature value you are recording there. This is what creates, for example, the heat island effect: you normally have air temperature values that are higher in cities compared to what they would be if the area were covered with forests or grasslands, just because of human-related influence.
So in this case, for example, we just used the surface temperature, because it's recorded at the ground, for the potential distribution, while we used both the surface and the air temperature for the realized distribution, to see which one correlated more with the distribution of the species.
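So, conceptually, the only difference is which columns go into the model. A rough sketch, with a hypothetical `training_df` data frame and made-up column-name prefixes, could look like this:

```r
library(ranger)

# Hypothetical prefixes: "refl_" = satellite reflectance, "comp_" = other
# species' distribution maps, "tmp_air_" = 2 m air temperature
all_covs        <- setdiff(names(training_df), "presence")
human_or_biotic <- grepl("^(refl_|comp_|tmp_air_)", all_covs)

covs_realized  <- all_covs                       # realized distribution: everything
covs_potential <- all_covs[!human_or_biotic]     # potential: drop those covariates

fit_potential <- ranger(
  y = as.factor(training_df$presence),
  x = training_df[, covs_potential],
  probability = TRUE                             # probability-of-occurrence output
)
```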
How did you produce the absence data for this project? Yeah, for the absence data it really depends on the techniques that I was talking about before. Let me go back to the slides, because as we said, there are different assumptions that you can make in this case. Sure. For example, you have different strategies when you use presence and absence datasets. You can prepare potential distribution models that are very simple and use just presence data and geographical distance or Mahalanobis distance, so all kinds of distances in the feature space. Or, to have an unbiased distribution of the data, you can treat points that are not presence data as absences, doing just a random selection in the specific area of interest. For example, you have the whole Czech Republic in this case, you have the observations of a specific species, in this case the European beech, and all the points that are not presence data can be randomly sampled from the country and considered as pseudo-absence data.
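A minimal sketch of those two strategies, assuming `presence_pts` is an sf point layer of the occurrences, `country` the study-area polygon, and `env_presence` / `env_all` data frames of covariate values (all hypothetical names):

```r
library(sf)
set.seed(42)

## 1) Random pseudo-absences within the study area
pseudo_abs <- st_as_sf(st_sample(country, size = nrow(presence_pts)))
# optionally drop candidates that fall too close to a known presence
# (dist is in the units of the CRS, here assumed to be metres)
too_close  <- lengths(st_is_within_distance(pseudo_abs, presence_pts, dist = 1000)) > 0
pseudo_abs <- pseudo_abs[!too_close, ]

## 2) Environmental profiling with the Mahalanobis distance in feature space:
##    cells furthest from the presences' environmental conditions become
##    pseudo-absence candidates
mu <- colMeans(env_presence)
S  <- cov(env_presence)
d2 <- mahalanobis(env_all, center = mu, cov = S)
profile_abs_idx <- order(d2, decreasing = TRUE)[seq_len(nrow(env_presence))]
```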
But in this case, as we said, you can fall into these errors, because you may not know whether the species is really absent or present there due to survey problems; it's not like every species in the country has actually been surveyed. So in this case, as you mentioned, if you want to do potential distribution, it's better to use techniques that don't assume what's happening in the environment and just do a random selection. But there are other techniques where you do environmental profiling. I mean, if you have a certain location and there is a tree, then it's certain that the environmental conditions favor it. Yeah, it's different for the absences though: there we don't know whether it's the environmental conditions or something else. And that's also a methodological problem, because we are talking about correlative models, so you associate covariate values with a specific response. In this case, what you can model is the potential distribution
or the realized distribution, but you can never model what in ecology is defined as the fundamental niche. That whole ensemble of conditions where the species can be present can only be estimated with process-based models. So whenever you see a paper that uses correlative models and mentions the potential distribution of a species, you know that it's never going to represent the fundamental niche. That's one limitation of the approach. Another limitation is the inference. As I mentioned before, you may use different covariates and predict a map that's very accurate, and if you show this map to, say, a forester, the forester will tell you: okay, this map is precise because it shows where the species is and where it is not, and I know it from my expert knowledge. But that doesn't mean that the features you use can predict, for example, what's going to happen in the future; it will not give you the biological meaning of what's happening there. So in this case, these models and the variable importance can be used to increase the predictive performance as much as possible.
But without further analysis, you cannot say that these covariates are actually the ones that drive the phenomenon. Most of the studies that focus on climate change and how the potential distribution of a species will change over time can do so only because we already know that climate drives the distribution of a species. So if temperature and precipitation change according to a specific climatic scenario, and the geographical distribution of the species then moves higher in altitude and higher in latitude, that can be interpreted causally because we know from previous studies that these covariates drive the phenomenon. But of course, on a more localized scale, if you have more specific problems, like a specific pest that is slowly changing the conditions of an area, you cannot model that, because it falls into chaos theory and stochastic phenomena and is difficult to model in time; it's basically almost impossible to predict these very random and extreme events. We do know, for example, for climate, because we observed it over the last 30 years, that
climate change has increased the intensity and the frequency of these extreme events like wildfires or pests. So you can make a qualitative assumption of what can happen if these kinds of phenomena increase, but it's difficult to model, for example, how the conditions of an area and the realized distribution then change because of the influence of wildfires. And then, of course, if you want to use more specific techniques to generate these pseudo-absence points, the more specific and complex the technique is, the more assumptions you have to include, which of course will deviate from what you actually have in the data. So the best thing in every machine learning problem is to have better data. The modeling techniques can only go so far, but the better the data you have, the better the results you can get. That's why, for example, in this case we are using a dataset that, as I said, comes from national forest inventories, but it's a harmonized dataset that has more than a one-kilometer shift in the coordinates of the points.
So if you want to predict at high resolution, we use several post-processing techniques to try to get locations that are as accurate as possible. But the true locations of the points are not shared, because of privacy reasons on the countries' side and that kind of thing. And producing maps that are very accurate and very precise without having precise inputs like coordinates is very difficult. That's why we, like more and more scientists, have been making this point. Recently, a paper from the major forest research groups in Europe came out in Nature, actually asking the countries: please make this data open, because without proper data we cannot help you drive your decisions and make policies that actually reflect what's currently happening. And that's of course a very tricky problem. There are some studies that already used good forest inventory datasets, but in that case you have to sign non-disclosure agreements, so it goes more into policy than actually science. There are groups that are very jealous of the data they have in their possession and don't want to release it.
And it's difficult to do science that way, hence what we were mentioning this morning: open science, code, data, everything should be available. So everything that is produced here, including the subset we use, is on Zenodo, and everything is available in this bookdown document in the GitBook format. And when the paper is published, you can also check the scientific reasoning behind it. You can already download all the data that's available, both the maps that are on the Open Data Science viewer and the training datasets on Zenodo. So that's what we're trying to do: whenever we do something, document it thoroughly with code that everyone can download and use. And we try to do it in two directions. We produce the whole thing, all the datasets, all the maps, all the code, so that anyone with the same computing infrastructure can reproduce everything at once, as we did. And then we have a smaller subset, like the one that we use for this tutorial, so that everyone can practice and get a gist of what it feels like to work with these things. Maybe talk about the mlr3 and the mlr3 spatial packages? Yeah, sure. And you used it and tested it.
Yeah, I can show you that. Because we still just use mlr, right? Yeah, but that's not for scientific reasons; in our case it's more of a production issue. So if you want to work with smaller datasets and use specific techniques, et cetera, you can certainly do it without a problem. Let me show you. So this is the mlr3 book. Of course, mlr3 in terms of philosophy, as I mentioned, is a complete change of paradigm in the way the library and all the algorithms are structured. It's much more flexible: you can create pipelines. Let's see if they have some kind of visualization of how it works. You can plug in different components like feature selection or hyperparameter optimization, you can run different operations for every algorithm, and you can use control groups. In this case, for example, we use a very simple setup: as we said, we trained just three component models and then a final model, and that's it. But you could potentially do it for three, four, or five levels.
You can train ensembles of ensembles, but of course that increases complexity and makes it more difficult to quantify the uncertainty of the model. So it really depends on what you're trying to do, how, when, and where. There are also a lot of vignettes with examples available, so you can try it out and see what I'm talking about and how the pipelines really work. And, for example, something that I showed you here, accessing a specific object in R as if it were a list and inspecting every element, is not as simple in mlr3, because it's built as a whole system of classes and methods. You cannot just look at what's inside an object; you have to study the book beforehand. So if you have programming experience only in R, it will be a bit more difficult to get into it. If you come from other languages like Python, or if you are a computer scientist coming from C++, it will feel more user-friendly to access everything that's available in the package. But the structure and the concepts are mostly the same. What is new is the concept of pipelines, because it's more flexible than mlr, where you have functions that each do one specific thing. For example, building an ensemble of ensembles in mlr is difficult if not impossible; in mlr3, you can customize your pipeline as much as you want.
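Just to give a feel for it, here is a minimal mlr3pipelines stacking sketch on a built-in toy task, not on the forest data and not the tutorial's actual pipeline:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

task <- tsk("sonar")    # built-in binary classification task, used as a stand-in

# Level 0: two learners whose cross-validated predictions become new features
level0 <- gunion(list(
  po("learner_cv", lrn("classif.ranger", predict_type = "prob"), id = "rf"),
  po("learner_cv", lrn("classif.rpart",  predict_type = "prob"), id = "tree")
)) %>>% po("featureunion")

# Level 1: a meta-learner trained on those out-of-fold predictions
stack <- level0 %>>% lrn("classif.log_reg", predict_type = "prob")

graph_learner <- as_learner(stack)   # the whole pipeline behaves as one learner
graph_learner$train(task)
```

Each `po()` step could be swapped for feature filtering, tuning, and so on, which is exactly the flexibility being described here.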
The spatial package is for spatial tasks, basically the cases where you have to consider the spatial autocorrelation between the points. This package, the mlr3 spatiotemporal one, is something that Patrick Schratz wrote. And I think he will come to the summer school in September, right Tom? Yes. Yeah, and he will present these new packages and how spatiotemporal autocorrelation is handled in machine learning problems. This was a work-in-progress package for two or three years, and it has been published since, I think, January or February. So everything is available now, which is a considerable improvement over what was already available in mlr.
And there are specific examples that you can try, and of course references, and you can see the differences between, for example, spatial and non-spatial problems. This is what I was mentioning before with the spatial cross-validation. If you divide your dataset into four parts, you don't just take random observations from the whole dataset to create those four parts; you create spatial folds. So you divide the dataset into four or five parts, and four parts are used for training, in this case the blue ones, and one part is used for testing. This is repeated until all parts of the dataset have been used. In the example in the mlr3 book this is done with the coordinates, so with X and Y; what we did instead is use a tile ID column, where each tile is one of these little squares that I was showing you before, so that all the points belonging to the same little square are kept together when the dataset is split into training and testing.
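A minimal sketch of both options, assuming the package in question is mlr3spatiotempcv and that the data has a `tile_id` column (the built-in `ecuador` task stands in for the forest data):

```r
library(mlr3)
library(mlr3spatiotempcv)   # spatiotemporal resampling methods for mlr3

# a) Coordinate-based spatial CV, as in the mlr3 book example
task_sp <- tsk("ecuador")                        # built-in landslide example task
rr <- resample(task_sp,
               lrn("classif.rpart", predict_type = "prob"),
               rsmp("spcv_coords", folds = 4))   # folds built from x/y coordinates
rr$aggregate(msr("classif.auc"))

# b) Tile-based CV: give the tile column the "group" role so that all points
#    from the same tile always end up in the same fold
# task$set_col_roles("tile_id", roles = "group")
# resample(task, learner, rsmp("cv", folds = 5))
```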