
SDM modeling with R


Formal Metadata

Title
SDM modeling with R
Title of Series
Number of Parts
15
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Year
2023
Production Place
Wageningen

Content Metadata

Subject Area
Genre
Abstract
Daniele Da Re is a Postdoctoral Researcher at the University of Trento, Italy. Cedric Marsboom is CTO of Avia-GIS, Belgium. During the 2023 MOOD Summer School, these two experts hosted a session giving an introduction to correlative SDMs in R, using virtual species as an example. A tutorial was provided and both the GLM and RF methods were explained. Each algorithm is first analyzed separately and then ensemble models are produced. The final part of the lesson focuses on discussing different predictive accuracy metrics.
Transcript: English (auto-generated)
So, as I was saying, there is this folder on Google Drive where you can access the bioclimatic variables and the script; the link is on Mattermost. And this is the tutorial Cedric and I have prepared over the past days.
So as I said, we are going to have a practical session on SDMs in R, and for simplicity we will work with virtual species. So we will create our species in silico to avoid some of the biases we discussed
yesterday. And we will try different packages. We will try to rely on the terra and sf packages as much as possible, which is what Carmelo
showed you before. But we are actually in a period of transition, so not all the packages already work with terra rather than raster, so we will have to do a bit of back and forth sometimes. What we are going to check in detail will be a classical generalized linear model
and a random forest. And then, at the end, if we have time, there is a first combination of the two together and an exercise. So this is what we are going to try to cover in the next hour and a half. So to start, let's start by loading some of the packages.
I will rely on geodata, this library, to download the bioclimatic variables, so these combinations of climatic variables mentioned by Carmelo before.
geodata is a nice library that allows you to get access to several different predictors. You just have to specify what you want, basically. You can work at global, country, or tile level.
You can specify here if you want minimum temperature, maximum temperature, average temperature, precipitation, wind, and, for bio, the bioclimatic rasters; then you have different spatial resolutions.
We are going to work at roughly 0.1 degrees of resolution, so roughly 10 km of spatial resolution at the equator, just to gain a bit of computational time.
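A minimal sketch of this download step, assuming the geodata package and a local download path; the resolution value in arc-minutes is an assumption, pick the one used in the session:

    library(geodata)
    library(terra)

    # Download the 19 WorldClim bioclimatic rasters (global coverage).
    # res is in arc-minutes; 5 arc-minutes is roughly 0.08 degrees (~10 km at the equator).
    wc_clim <- worldclim_global(var = "bio", res = 5, path = "data/")
    wc_clim   # a SpatRaster with 19 layers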
So let's start. I have my WorldClim data, it's a SpatRaster, as was shown by Carmelo, with a resolution of roughly 0.1 degrees, the desired dimensions, and a global extent; you can plot it.
Bad idea. OK, it displays the average temperature in degrees Celsius. So we are going to crop it, again just to gain a bit of computational time.
So I am going to crop it to the extent of roughly continental Europe. The ext function works with these numbers, which represent the minimum x, the maximum x, the minimum y, and the maximum y. So if I crop it, I have my m_data, OK, cropped to the extent of Europe, roughly. There is a bit of delay, yeah, OK. You see the names of this data set; I mean,
I simply don't like them, and I am going to change them to bio1 to bio19, just because then
for modeling it will be easier to just type bio1, bio2, so nothing strange. And then, as I said, if you want to work with the virtualspecies package, unfortunately we have to convert our data set from a SpatRaster, so an object of the terra package, to a raster
stack, an object of the old raster package. So simply using raster stack, I will convert my m_data to a raster stack, OK.
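A sketch of the crop, rename, and conversion steps; the extent coordinates below are illustrative, not the exact ones typed in the session:

    # Crop to roughly continental Europe (xmin, xmax, ymin, ymax)
    m_data <- terra::crop(wc_clim, terra::ext(-10, 40, 35, 70))
    names(m_data) <- paste0("bio", 1:19)   # shorter names for the modeling formulas

    # virtualspecies still expects the old raster classes, so convert the SpatRaster;
    # with older raster versions you may need as(m_data, "Raster") instead
    env_stack <- raster::stack(m_data)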
So here, just to show you a table about the meaning of the different ecological, bioclimatic variables we are going to use. As you can see here in the R markdown, bio1 is the annual mean temperature,
and then you have the mean diurnal range, isothermality, temperature seasonality, a couple of things. Then from variable 12 we have annual precipitation, precipitation of the wettest month, precipitation of the driest month, and so on. We chose these variables because they are quite easy to get access to
and play with. At this point, we can create our virtual species, which is quite a straightforward approach, but it has some technicalities. And there is a great vignette about the virtualspecies package; the link is in the HTML.
And basically what we do, we create a species with some characteristics that we want. So we can decide their response curve to temperature, to precipitation, and other variables.
You can use the variables you want, so also the remote sensing variables shown by Carmelo before, OK. So this is just an interesting method, an approach, to benchmark your models, because you can
play knowing what the reality of your species is. Again, how to create this? We decided to use just three bioclimatic variables, Bio1, Bio12, and Bio15: Bio1 is the annual mean temperature, Bio12 is the annual precipitation, and Bio15 is precipitation
seasonality. We can have a look at these three variables; we can plot them to get an idea of how they are distributed. So there are different spatial patterns.
So we can see that the rainiest areas in Europe according to this map are northeastern Italy, the northern part of Portugal, Scotland, Norway, and so on, OK.
So how do we create it? We decided simply to create our species having a normal response curve with the mean,
the optimum, at 12 degrees and a standard deviation of 4 degrees, a quadratic function for precipitation, and again a normal distribution for the precipitation seasonality. And what I do is I prepare my response curves, so if I inspect this object, it's simply a
list with the information of my response curves, and then I use the function generateSpFromFun. And I specify the raster stack, so our environmental data raster stack with the three environmental
variables we want to use, my parameters, and the formula: for example, I decided that the distribution of my species is created by the interaction between Bio1 and Bio12, plus Bio15. I'm going to inspect it, I'm going to run it, sorry.
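A hedged sketch of this step with the virtualspecies package; the Bio1 values follow the session, while the quadratic helper and the Bio12/Bio15 parameter values are illustrative assumptions only:

    library(virtualspecies)

    # simple quadratic response helper (hypothetical; any custom function works)
    quad_fun <- function(x, a, b, c) a * x^2 + b * x + c

    my_parameters <- formatFunctions(
      bio1  = c(fun = "dnorm", mean = 12, sd = 4),            # optimum at 12 degC, sd of 4
      bio12 = c(fun = "quad_fun", a = -1e-6, b = 2e-3, c = 0), # illustrative coefficients
      bio15 = c(fun = "dnorm", mean = 40, sd = 20)             # illustrative mean and sd
    )

    my_first_species <- generateSpFromFun(
      raster.stack = env_stack[[c("bio1", "bio12", "bio15")]],
      parameters   = my_parameters,
      formula      = "bio1 * bio12 + bio15",  # interaction of bio1 and bio12, plus bio15
      plot         = TRUE
    )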
These are my response curves. So my Gaussian for temperature, my quadratic function for Bio12, you see it's truncated here because my bioclimatic raster spans from roughly 500 to 2500, so I'm missing that part
of the curve according to the data I'm using to generate the species. And now the Gaussian curve for Bio15, where on the y-axis I have the suitability, okay? Now I'm going to create my species, and I have my first species where I have, sorry,
I got a bit lost with the code. So I create my first species, again, which is this one, where I have Bio1, Bio12, the
the response functions I have defined above, and other information. I can have a look at different information about my function, my species, all the details. You can access all the information you want, and you can plot it.
And you can plot your suitability raster. So this is the suitability distribution of our species. Is it okay? Is everything clear?
I didn't get it, sorry. So the my_parameters statement, I'm just wondering why you're setting distributions for each
of the bioclimatic... Just because I can create what I want, so I just decided to put these values. There is no reason for that. Okay. So, I mean, if you play a bit, if you change the values, you will see that the output,
this output, will be different. So that's the only reason. The point here is knowing in principle the biology of your species. That's it. In the sense that, for example, as in the example
of Carmelo, you have your ticks. You know the biology of the species from experiments, okay, by running experiments, by going into the field and so on. With virtual species, we skip all this, okay, and we decide a priori how our species
will be. So in this way, when we do modeling, we have our reality that we generated, and we can assess the predictive accuracy of our model by comparing it to that reality. And why didn't you generate points?
I will generate the points. You will generate them. Okay. So, response curves here, this is actually what we're going to do. So we need to reclassify our raster. So from a suitability layer, ranging from 0 to 1, to a probability of occurrence layer,
which we will then convert into a presence-absence map. Okay. To do that, we can use this function, which uses different methods to rescale
the suitability to a probability of occurrence. What we will use is the probabilistic method, which basically uses a logistic function to convert our environmental suitability into a probability of occurrence.
And by changing the alpha and beta parameters of the logistic function, we can change the conversion rules a bit. And in this case, for example, this is a nice, yeah, it's better, okay.
And so in this case, for example, you have that at 0.6 environmental suitability, you
will have an 88% chance of having a species presence. Okay. So this is exactly what we are going to do. We're going to convert it: we provide my first species, we define beta as random, and we also set the species prevalence to 0.3, so how much the species will be spread in the geographic space.
To give you an example, we convert it and we plot it. So you see, according to the rule, so this logistic curve that we have defined, this was our suitability raster, which gives a probability of occurrence, which is then converted into a presence-absence raster.
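A minimal sketch of the conversion step, assuming the my_first_species object from above; the argument names follow the virtualspecies documentation and the values follow the session:

    # Convert suitability to probability of occurrence and then to presence-absence,
    # via a logistic conversion with a random beta and a fixed species prevalence
    my_pa <- convertToPA(my_first_species,
                         PA.method = "probability",
                         beta = "random",
                         species.prevalence = 0.3,
                         plot = TRUE)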
So this spans from 0 to 1. If we play with the species prevalence, so I just set it to 0.3, but you can, and this is the interesting part, because you can play around and decide and test different things. If I put it to 0.8, for example, what you get is a completely different probability
of occurrence. Okay. Let's put it back to 0.3. Okay. At this point, we can sample our occurrences. So we need our points. Okay. We need our presence and absence data set. So knowing our reality, so our probability of occurrence of the species, and knowing our
presence-absence map, we're going to sample those layers. And here there are different things we can do. We can work with presences and absences, sampling 600 points in this case, with a fixed sample prevalence of 0.5, which means that we're going to have 300 ones and 300 zeros.
So perfect. I need a very even and balanced sample size. Let's do it. You can play with it, in the sense that if you check the function, you can also work
with presence-only data and then test different methodologies to generate background points and pseudo-absences. Okay. That's the power of this package.
You can correct your sampling effort by suitability, by detection probability. So there are several things you can play with. So let's just generate it, and voila. I will generate my presence points, where the black dots are my presences and
the white dots are my absences. Okay. I just convert this object. I just create my data frame, which is a data frame of zeros and ones with the coordinates
of my observations in the geographic space. As I said before, we have an even sample because we have 300 zeros and 300 ones, and then I just
convert it into a spatial object with the sf package, similar to what Carmelo showed you before. So I have my table with presence-absence and the coordinates. I specify my coordinates argument to use the X and Y columns and I specify the coordinate
reference system, which in this case is WGS84. Voila. So my pres is now an sf object with 600 features and one field, which is Observed.
The geometry type is point, so it's a vector object, with X and Y dimensions and the extent of all of our data.
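A sketch of the sampling and sf conversion, assuming the my_pa object from above; the element names of the sampleOccurrences output follow the virtualspecies documentation as I recall it:

    library(sf)

    # Sample a balanced presence-absence set (600 points, 300 ones and 300 zeros)
    occ <- sampleOccurrences(my_pa, n = 600,
                             type = "presence-absence",
                             sample.prevalence = 0.5)

    # Plain data frame of 0/1 plus coordinates, then promote it to an sf object in WGS84
    pres_df <- data.frame(Observed = occ$sample.points$Real,
                          x = occ$sample.points$x,
                          y = occ$sample.points$y)
    pres <- st_as_sf(pres_df, coords = c("x", "y"), crs = 4326)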
Okay. So again, note that we're working with an ideal situation, because we have generated the points. We know where they are. We created a balanced sample. This is often not true. We saw yesterday that there are a lot of biases that we can have in our sample, in our training data set.
So if you have presence-only data and you're going to use an algorithm that requires zeros and ones, you need to create the zeros. To create the zeros, you can use pseudo-absences or background points. And so what we encourage you to do, as an exercise, is to try again to run the vignette up to here
and then just change things a bit and play. So you can sample the occurrences with just 300 presence-only points. And then you can rerun it sampling pseudo-absences in geographic space, or not sampling the pseudo-absences in a portion of the geographic space using, for example, a buffer-out approach.
So you mask out some of the areas in the geographic space where you don't want to sample pseudo-absences. Or you can use this approach we have developed called USE, uniform sampling of the environmental
space. You have all the links here. This is the exercise. You have all the links to the vignette and tutorial of the USE package. So it's a similar example.
Okay. Is everything clear? Can we enter into the modeling section? Are there any questions? Again, as we said yesterday, there are different main steps about the modeling workflow.
Let's say we have the conceptualization, the data preparation, the model fitting, the model assessment and the prediction. We are going to go through all of these steps, but first remember that it also depends on what we want to do:
if we want to estimate the correct parameters of the response curves of the species, if we want to interpolate and have an estimate of the distribution of the species in space, so-called interpolation, or if we want to forecast. So that depends a bit on what we want to do. Today, as an exercise, we want to produce spatial outputs.
We want to get maps. So we're going to try to interpolate and predict the distribution of the species. And so, for what we are going to try to do, we have to decide from which side we want to approach all of this modeling: if we want to start from the biology of the species, or if we want to try to select the
best variables that explain our data set. If we have time, at the end, you will do the exercise starting from the biology of the species. Now we're going to have a look at the different steps to find the best model that
fits our data. So we are going to extract all the environmental covariates that we have, using again this terra extract function as shown by Carmelo before. So we specify the environmental data, so our raster stack that we used before,
and our presences, and we want to get a data frame back. We then combine our data frame, trainDF, which is a data frame with 19 columns
plus one, which is the ID of each observation; we don't need it. So I combine my observations, so my zeros and ones, and I call it PA, presence-absence, with my training data, so with the variables extracted by the extract function.
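A minimal sketch of this step, assuming the m_data and pres objects from above; the column and object names are assumptions:

    # Extract the bioclimatic values at the sampled points (terra::extract returns an
    # ID column first, which we drop) and bind the 0/1 response
    train_df <- terra::extract(m_data, terra::vect(pres))[ , -1]
    train_df <- cbind(PA = pres$Observed, train_df)
    head(train_df)   # 20 columns: PA plus bio1..bio19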
What we get is a data frame of 20 columns, where we have the presence-absences and the associated values of the different bioclimatic covariates. So we will now work with a classic GLM, a generalized linear model with a binomial family, and we
will divide the modeling approach into different steps: avoiding multicollinearity, model selection, best-model diagnostics, non-linear relationships, prediction and goodness of fit, and predictive
accuracy estimation. Let's start with avoiding multicollinearity. If variables are highly correlated with each other, we can lose some explanatory and predictive power, so we want to remove some of these variables.
There are different ways; I just first compute a correlation matrix, just to get a graph here, and we start to see that there are some variables that are highly correlated
with each other, okay? So we can remove some of these by just selecting a cutoff threshold, so removing those variables that on average have a correlation coefficient above, say, 0.6, and in this case it's going to select these variables for me, okay?
Otherwise you can use other approaches, such as, for example, the variance inflation factor. I put here some of the different ways to use it, for example. The interesting thing is that if you use the correlation coefficient or the variance inflation factor, the variables they are going to select, so those variables you should remove,
might turn out to be different. So that's also an important methodological step to take into account. Accordingly, we can remove those variables and keep these ones.
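A sketch of the correlation-based filtering, assuming the train_df built above; caret::findCorrelation drops the variables with the highest mean absolute correlation, and the VIF alternative is shown commented out:

    library(caret)

    cors <- cor(train_df[ , -1])                     # correlation matrix of the predictors only
    to_drop <- findCorrelation(cors, cutoff = 0.6)   # indices of variables to remove
    train_df_sub <- cbind(PA = train_df$PA,
                          train_df[ , -1][ , -to_drop])

    # Alternative based on the variance inflation factor:
    # usdm::vifstep(train_df[ , -1], th = 10)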
So at the end we will, those that are less collinear will be BIO2, BIO3, BIO6, BIO8, BIO13, and BIO15. Have you noticed something strange in this selection?
No? Yeah, exactly, BIO1 is missing, and also BIO12. So, depending on the sampling that we perform in the geographic space, we might not keep those variables, the variables that we used to create the species.
So this is a first potential issue that you have to take into account. But we will, in this way we don't know that BIO1 and BIO12 were the most important variables
and so we keep going with the exercise, selecting those six variables. Yes? Just one clarification, so you did the VIF or the correlation on the raster stack only or on the points with covariates?
On the points with covariates, which is our training dataset, let's say. Okay, so I get rid of the correlated variables just using a subset operation on the data frame,
and our trainDF.sub dataset now has all the presence-absences and the six covariates that are not collinear. So as an exploratory analysis, what I often do is just try to get an idea of how my presences
and absences are distributed along our environmental gradients. So you can copy paste this function but basically we are going to plot the density of absences and presences along the different environmental gradients for the six variables we have selected.
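A minimal sketch of this exploratory plot, assuming the train_df_sub data frame from above:

    library(ggplot2)
    library(tidyr)

    # Density of presences (1) and absences (0) along each selected bioclimatic gradient
    train_df_sub |>
      pivot_longer(-PA, names_to = "variable", values_to = "value") |>
      ggplot(aes(value, fill = factor(PA))) +
      geom_density(alpha = 0.5) +
      facet_wrap(~ variable, scales = "free") +
      labs(fill = "Observed")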
So if we plot it, I was trying to see if I can make it bigger.
Okay, so it's kind of interesting: we can see, in the case of BIO2, that we have more zeros in this area and most of the presences are here. For other variables we have a kind of overlap, as in this case, so the signal is not so clear,
and also for BIO13 it is not so clear, but for BIO15 we see that most of the absences are at low values of precipitation seasonality and the presences are skewed to the right side of the graph.
Okay, so this is just an exploratory analysis of our training dataset. Okay, model selection. So we have our model. I'm going to select my variables, my_vars, just BIO2, 3, 6, 8, 13 and 15, and in a nerdy way, because I'm lazy, I just create the formula here.
Okay, just pasting the plus between the different variables, so I create my formula, which is PA depending on all the other variables. Okay, now I can create my full model with this structure,
where I have my GLM: it's just my formula, the family is binomial because I have zeros and ones, and then the data. So I check the summary of the model, and what I can see is that some of the
variables I've selected are not significant, so I can do something better. So again I have to go through model selection, and to go through model selection we try to find the best model, so the model that has the lowest AIC, okay, and that is the best
at explaining our data. I use this function, it's called step, which is interesting because I can decide the direction of the stepwise selection, I can remove and add variables, and it's going
to automatically test different models and give me the best model according to the AIC in this case. And so if I run step, what it does is: it starts with this model, our full model, and
then it's going to remove bio 15, bio 13; it tries to remove bio 8, but you see the AIC when removing bio 8, bio 3, bio 6 or bio 2 increases, so that's not better than our full model. So it's going to focus on removing bio 13 and bio 15, and bio 13 produces the lowest AIC,
and it tries again: it removes bio 13, runs the model again, tries to remove bio 15 and gets the lowest AIC, then it tries to add back bio 13, which was a variable that it
removed in the previous step, and it does this iteratively a couple of times until it finds the best model, which is this one, having bio 2, bio 3, bio 6 and bio 8. Again, so this is our model.
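A sketch of the formula building, full model fit, and stepwise selection, assuming the train_df_sub data frame and the variable names used above:

    # Build the formula programmatically, fit the full binomial GLM, then stepwise-select on AIC
    my_vars <- c("bio2", "bio3", "bio6", "bio8", "bio13", "bio15")
    form <- as.formula(paste("PA ~", paste(my_vars, collapse = " + ")))

    full_mod <- glm(form, family = binomial, data = train_df_sub)
    summary(full_mod)

    best_mod_1 <- step(full_mod, direction = "both")  # adds and removes terms, keeps lowest AIC
    summary(best_mod_1)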
I can put it again into the binomial GLM and check. Most of the variables are significant except bio 8, so we can start with some diagnostics of the model. There's something weird also in this model. Just on the covariates: you looked at how they were correlated, and then when there
was a high degree of collinearity, did you remove both variables that were correlated or just one of them?
No, just one of them. And which one? The one that on average has the highest correlation coefficient. Okay, all right. So the function, this is also something that depends, no? This is another methodological decision, no? The function I use here, where is it?
Yeah, this findCorrelation, yes, exactly. It finds, on average. So you compute the average correlation coefficient of each variable and then you set the cutoff at 0.6, okay? In the SDM literature, the cutoff that is suggested, for example, is 0.7.
But there is a debate. You can put it at 0.3 or 0.5. I put 0.6, but again, this is another methodological decision, okay? Okay, so we have our best model. Let's run some diagnostics.
So we can have a look at the marginal plots, so the prediction of the model for each variable, keeping the other variables at their average, okay?
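A manual marginal-effect sketch (not necessarily the helper used in the session), assuming the best_mod_1 and train_df_sub objects from above:

    # Vary one predictor across its range while holding the others at their mean
    marginal_plot <- function(model, data, var, n = 100) {
      newdata <- as.data.frame(lapply(data, function(x) rep(mean(x), n)))
      newdata[[var]] <- seq(min(data[[var]]), max(data[[var]]), length.out = n)
      plot(newdata[[var]], predict(model, newdata, type = "response"),
           type = "l", xlab = var, ylab = "Predicted probability of presence")
    }
    marginal_plot(best_mod_1, train_df_sub[ , my_vars], "bio6")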
So this is kind of how the response functions of our species look in our model. In red we have the model, and in blue is the signal in our data set. We see, for example, that for Bio6 we have a linear relationship in our model.
Maybe the data suggest more of a quadratic effect, okay? So this is something that might be worth exploring. Yes, we will see.
We can investigate overdispersion, and I put it here, okay? But it's mostly for when we have Poisson distributions, okay? So just a reminder that in GLMs you should check for overdispersion; I put it here, but it's mostly for Poisson distributions, not binomial. And I also compute a pseudo R-squared using the modEvA package and the Nagelkerke approach.
So at the end, we get a table showing that our pseudo R-squared, so the variance explained by our model, is around 0.18, okay? Not great, but in species distribution models it's also difficult to get very high
pseudo R-squared or R-squared values. Residual plots: we check the independence of the residuals with respect to the different predictors we use. What we see here is mostly fine, we don't have strong patterns for Bio3
or Bio8, but for Bio6 and Bio2 we have a bit of a pattern here, which is not great. So we will try to see if we can remove this effect from the residuals. And so we might try to add a quadratic term on Bio6. To do so, I simply specify it; there are different ways, I chose this one.
So a poly(Bio6, 2), a quadratic effect. I run the model and, again, I see that in this case Bio8, which before was not significant, turns significant, and both terms of the quadratic effect are significant.
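A minimal sketch of adding the quadratic term, assuming the variable and object names used above:

    # Replace the linear bio6 term with a quadratic one and refit
    best_mod_2 <- glm(PA ~ bio2 + bio3 + poly(bio6, 2) + bio8,
                      family = binomial, data = train_df_sub)
    summary(best_mod_2)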
So we are getting, we're getting closer. So again, I check the marginal, the marginal plots. I put it maybe here. It's better. Okay, for Bio6 now, you see the shape is a bit better. Okay, so we're getting closer.
Maybe now we should also include a quadratic effect for Bio2. The rest seems more or less okay. Again, there are the other diagnostics, but let's just compare the R squared. You see the first best model was 0.18; now we have 0.23, so we are getting a bit better.
Okay, residual plots again. A bit better for Bio6 now that we have this quadratic effect; for Bio2 we maybe still have to work on it a bit. So before going further, we need to check whether we can move forward with best model two.
Yes, go ahead. I didn't understand what we are looking at in the residual plot. Can you briefly explain? Whether there are trends in the residuals. What we should observe for this kind of residual is that
the residuals are homogeneously distributed in the plot. If they are homogeneously distributed, we don't have missing covariates, we don't have missing quadratic effects, and so on. That's what we are trying to look at. It is a diagnostic we run on our selected best model.
So in this case, we see that it's okay for Bio3 and Bio8, and Bio6 is a bit better compared to the previous plots, but for Bio2 we still have this effect here. Okay, so I've understood the theory, but looking at the plot, what's the purple line and the...
So the purple line is the tendency of the residuals. Okay, so before moving forward, we have to test whether we can select best model two
instead of best model one. So I compare the AIC, and I see that the AIC of the second model is lower than the first one. And then I run an ANOVA with a chi-square test between the two models, and what I notice is that the residual deviance is significantly lower for best model two, so I can keep best model two instead of best model one.
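A minimal sketch of this comparison, assuming the two model objects defined above:

    # Compare the two candidate models: AIC plus a chi-square test on the residual deviance
    AIC(best_mod_1, best_mod_2)
    anova(best_mod_1, best_mod_2, test = "Chisq")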
According to the diagnostics, I can do a bit better. So, for example, I also add a quadratic term for Bio2. So I run the model.
Sorry, the summary of best model three, sorry. Again, everything is significant, also the polynomial effect on Bio2. Again, we can run some diagnostics.
It's getting better compared to the previous plot. Again, the overdispersion and the pseudo R-squared: we see that this is mildly better, okay, not that much. Independence: again, I still have this effect here on Bio2, which is not great.
But I want to compare the two models, so best model two and best model three. The AIC, again, for best model three is lower. And the ANOVA tells me again that the residual deviance for best model three is lower
than that for best model two, and the difference is significant. So I could keep going with model selection and test other things, such as outliers, or interactions among variables, which I have not included now and which might also explain that shape that we have in the residuals
of Bio2. There may be important missing variables that we haven't included, because we discarded Bio1 and Bio12 due to the collinearity analysis. Okay, so there are other things that we could check, but we need to move a bit further.
So we will keep model three as our best model for the GLM. And most importantly, what we did not do is that we didn't partition our data set into a training and testing data set.
So we have just used all the data we have. And so when we are going to assess the predictive accuracy of our model, we are going to predict also on the training data set, which is not great. But Yana, this afternoon, will show you how to do it. Okay, so if you want to have a look and dive a bit more into this topic, there are tons
of books about it. One is the book on species distribution models that you saw yesterday, the one with the fox. And another one, one of my favorites, is Mixed Effects Models and Extensions in Ecology with R by Alain Zuur.
It has penguins on the cover. And it's a very nice book, I really like it. Okay, then let's have a look at the predictions.
So we can use our best model and predict on our training data set, because we did not do the partition between training and testing. I use the predict function and set the type to response.
So again, I get a vector with the predicted values for my response variable. And then I can use different evaluation metrics.
We can use the AUC, which is one of the most famous ones, but it has some issues, as we saw yesterday: it's influenced by the sample prevalence, by the extent of your area of interest, and so on.
We will use the true skill statistic, which is a similar but less dependent metric for evaluating the accuracy of our prediction. In the case of AUC, the values span from zero to one: zero, terrible model;
0.5, your model is no better than a random model at predicting your observations; the closer you get to one, the better your model. Similar interpretation for the true skill statistic: minus one, terrible model; zero, your model is no better than a random model at predicting your data;
close to one, you are doing great. Then you have the Boyce index, which is another metric to evaluate the performance of your model. You can interpret it as a correlation coefficient between your observations, so only your occurrences, your ones, and the predicted values for your occurrences.
Okay, so let's have a look. I use the roc function from the pROC package to compute our AUC. Here I specify our training data, so best_mod_3$data$PA, and our fitted GLM predictions.
And I get an AUC of 0.74, and I think I can plot it.
Yeah, plot = TRUE, yes. Voila, so this is our ROC curve. This is the line around 0.5, where the model is no better than a random model, so our GLM is better than a random model, but it's not great.
For the TSS, I use the TSS function from the ecospat package, another package used for species distribution models. Again, your fitted values and your observed data, and what I get is a TSS of 0.37.
It's okay. And then the Boyce index: again, the fitted values, and then the fitted values only for the observations, and in this case I get a correlation coefficient of 0.95.
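A hedged sketch of this evaluation block; the object names are assumptions, and the TSS is computed by hand at a 0.5 threshold as a stand-in for the ecospat helper used in the session:

    library(pROC)

    fitted_glm <- predict(best_mod_3, type = "response")
    obs <- best_mod_3$data$PA

    roc_glm <- roc(obs, fitted_glm)   # AUC, around 0.74 in the session
    auc(roc_glm)

    # True skill statistic = sensitivity + specificity - 1, here at a 0.5 threshold
    pred01 <- as.integer(fitted_glm >= 0.5)
    sens <- sum(pred01 == 1 & obs == 1) / sum(obs == 1)
    spec <- sum(pred01 == 0 & obs == 0) / sum(obs == 0)
    tss <- sens + spec - 1

    # Boyce index: fitted values everywhere vs. fitted values at the presences only
    ecospat::ecospat.boyce(fit = fitted_glm, obs = fitted_glm[obs == 1])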
So apparently it's quite good at predicting the presences of our species. Good. Then there are the classical ones, the D-squared or R-squared, different methods for evaluating the pseudo R-squared;
there are different approaches, and I often use Nagelkerke. And then there was the RMSE, or maybe I forgot to put the RMSE here, sorry. So at the end we keep this model, and these are some evaluation statistics on our model.
We've seen it's okay, not great, but it's working, okay. And so now I will leave the floor to Cedric, who will go into the details of the random forest. But there is a question there, yes. We have run the GLM, but the data are spatially distributed, right?
So do we then evaluate the independence, for example, of the residuals in space, or things like this? Now we have predicted only on the observations; I haven't run a spatial prediction yet. Okay, but the independence in space, are we not evaluating it, or...
No, absolutely, I didn't. Okay, we are assuming that the observations are independent. Absolutely. Okay. Which is a strong assumption. Okay, Cedric, I leave the floor to you.
All right, so sorry all, I couldn't be there in person with you. And I do apologize for the coughing, because I'm a bit sick, but I will cover the random forest. So Daniele already explained the GLM, which is a statistical approach.
And so we thought it would also be nice to include a machine learning approach. Each approach has some advantages and disadvantages. A statistical approach is very good at trying to find the relationships between
the driving factors in your data, while machine learning approaches are better at creating more accurate predictions. But that comes at a cost: you don't really have that good a view
of the relationships between the covariates in your model. So every approach has its advantages and disadvantages. It's not that you would only use one approach or the other; you use the approach where appropriate, and for species distribution models,
where you are really looking for that accurate prediction, you would, for example, go for a machine learning approach. So the philosophy of the models is different, but the workflow is basically the same.
So you see here on the yellow screen, it's the same variable selection, you build the same type of formula, and you build your model based on that.
Something maybe to point out here, while Daniele runs the model, is that we go quite quickly over a couple of things here for a machine learning model. That's because I will explain a couple of things later on, but we will skip some steps here.
But I will tell you when that comes. So you build the same model: you specify your formula, you specify your data, the training data, and then you build your model.
And so you get your model output, which comes from the ranger package, where we build the model. And so you see we built a model with permutation importance, 500 trees and a sample size of 600, based on six independent covariates.
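A minimal sketch of this fit with the ranger package, assuming the train_df_sub data frame from the GLM part:

    library(ranger)

    # Same response and predictors as the GLM, now in a random forest
    rf_mod <- ranger(PA ~ ., data = train_df_sub,
                     num.trees = 500, importance = "permutation")

    rf_mod$prediction.error                        # out-of-bag mean squared error
    rf_mod$r.squared                               # out-of-bag R squared
    sort(importance(rf_mod), decreasing = TRUE)    # permutation variable importance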
We get a mean squared error of 0.13 and an R squared of 0.45. This is a different approach, so we don't get the coefficient estimates like Daniele did
in the GLM, but we can look at which variables are important in the trees that build the forest. That's also part of the ranger package. And so you see that certain variables are more important, like for example here:
Bio 15 is more important than Bio 6 in the model that we built. So that's, Daniele, can you maybe put it full screen so they see the x axis as well?
So you see that it just gives the relative importance of the variables versus each other. So we go now straight on; here we're going to skip a whole bunch of stuff in the interest of time,
because you have to do similar work here for the random forest as you have to do for the GLM. So you have to tune your model; this is something Tom and Carmelo will cover later on.
But it's not like a machine learning model is going to be quicker than a statistical model. You have to do the same amount of work, you have to tune your model, you have to
go through all the steps again, you have to do all the selections, but we'll skip that here in the interest of time and to avoid duplication later on. But there is, Daniele mentions here, a book that is a great resource if you want to have a look in the meantime.
It goes into more detail on how to tune your model. Yes, it's a very nice book. It starts from the basics and goes a bit further. About the tuning of the parameters, this afternoon we will come back to it. So we went quite quickly in
this session just to avoid overlap with the different classes. So yeah, I go back to the code, and here again we run the same thing: we predict again on our dataset and we compute the same AUC,
TSS and Boyce index as we did for the GLM. And we can also get the R squared and the RMSE for our random forest model. And we can compare them just here.
I have prepared a table to compare the prediction of the two models. And here we are. So we have just one partition, so we have all the data.
Cedric, would you like to comment, or shall we ask here in the room? We can first ask in the room for some interpretations, and then maybe we can respond to those and explain why we think these are not super great values.
Okay, so some ideas. There are clearly differences in the performance of the two models. Any idea? Yeah, so it seems like the random forest has overfit heavily, because an AUC of one is just unreal.
It obviously has a much lower error, but that's because of the overfitting. Yeah, exactly. That's the point. So again, in the way we have built the exercise, okay, what we are going to see soon is that we obtained this overfitting
also because we used just one partition. So the model is trying to perfectly predict the training data set, okay. So at this point we know this model is not great,
okay, because it's overfitting. What we do now, Cedric, I don't know, do you want to comment on this? Yes, I want to maybe point something out here. Here we also have metrics for sensitivity and specificity, where sensitivity looks at the presences and
specificity looks at the absences. Here we work with presence-absence data, but in a lot of cases you're going to have only presence data and you're going to have to go through a procedure to create pseudo-absences, which we mentioned in the text but skipped over in the interest of time. And then
looking at these two might also be very important, especially to check whether your pseudo-absences are performing well, if there is a huge disparity between your sensitivity and specificity. So these will become quite important. So don't only look at one metric; always look at multiple.
Don't, for example, just look at R squared or AUC, but look at all of them together and evaluate. For example, you can have a model with a nice-looking AUC,
but then a very bad R squared, sensitivity and specificity. So you still know it's going to be a poor model, while if you just looked at one value, you would think, oh, it's a nice model. So it's important to evaluate several different statistics. And of course, if you make a prediction map, also look at the map. For example,
for the random forest, you will see that it's going to be what I call a black-and-white model: either very strong presence or very strong absence, but very little in between. So you know that it's an overfitted model, for example.
Okay. So take home message. Yeah, please. Yeah. So maybe a question since we're using random forest, even though it has overfit, it does give us some information on the variable importance. And as far as I know, it's kind of agnostic to the shape of the data. So maybe we should use
the random forest to do the feature selection, as opposed to the stepwise regression, and then use those variables in the GLM. This is another approach, so it's fine, you could do that, for example. But also be aware that this was very quick;
we did not do any tuning on these models either, so you can probably also get a well-working random forest on this data. But you can also create ensemble models of the two different approaches, which is something we will maybe mention at the end if we have time. But yeah, that's a very
valid approach, to use one type of model to extract information and feed it into the other type. You can also do it the other way around: a GLM is good for identifying the relationships and the driving factors of a certain species, and you can then use that in a random forest to create
a better predictive model in the end. The microphone is working. So which covariates would you choose looking at the plot with the contributions?
Well, if I understood what you meant, we should run a random forest on the whole set of variables, right? And then we can give it a try if you want.
Wouldn't this also be affected by the variables that provide no signal at all, hypothetically? So this is...
our model with all the variables, so now we have 19 variables, 0.38, 0.58, so I think it's slightly better than the other, but not that much in terms of R squared. In terms of the importance, let's do it here.
Though again, as far as I remember, it may be sensitive to binary variables, so the importance of binary variables may be a little bit misleading. Yes, it can be, to categorical variables actually.
So yeah, in this case it's interesting because Bio1 is up here, so we should keep Bio1 and also Bio15. We are losing... it's also difficult then to decide a threshold. So in this case,
just looking at this graph, maybe we would say okay, let's choose the first five variables. The interesting thing in this case is that you have two of the initial variables in your model, so it might be interesting. Again, Bio12 is quite far down the list, so it's hard to predict.
Okay, so again, different methodological choices can lead to different models.
Maybe it's good to mention here that you can always keep tuning your model, and at some point you're going to have to say enough, this is going to be the model.
You can always keep working on your model and keep fine tuning and then keep building a better model, but at some point it's marginal gains. You're putting a lot of extra effort into very little gains, so at some point you're also gonna have to say this is the model or this is the best we realistically can do.
So in the last 20 minutes of this session we're going to have a look at spatial prediction first, and then at combining these two models together a bit and having a look. So spatial prediction
in the raster or terra package, once everything is up to date with the terra package, is quite straightforward. Again, you provide the raster stack, you provide your model, and then you run the predict function. So in the end, if I run these two commands, what I get is a raster layer
for the GLM spanning the extent of my area of interest. So the model has predicted one value for each pixel, basically, and it is the same for the random forest.
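A sketch of the spatial prediction step, assuming the m_data, best_mod_3 and rf_mod objects from above; the wrapper for ranger is an assumption about how its predictions are returned:

    # One suitability value per pixel of the predictor stack
    glm_map <- terra::predict(m_data, best_mod_3, type = "response")

    # ranger returns its predictions inside $predictions, so pass a small wrapper function
    rf_map <- terra::predict(m_data, rf_mod,
                             fun = function(model, ...) predict(model, ...)$predictions)

    plot(c(glm_map, rf_map))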
Okay, we can try to plot it together. It's not working. Oh, what's happening here? Yeah, I think we...
Voila. So again, the nice thing with the virtual species is that we have our observations, so we can compare our predictions to our observed
probability of occurrence of the species. So let's keep this one as a reference. The black dots are the real presences of the species, the white dots are the absences. This is the prediction of the GLM, and this is the prediction of the random forest, which is
exactly what Cedric said a couple of minutes ago. It's more dichotomous than the GLM model. Okay, so this is really something that we might also consider as another diagnostic of the performance of our model. These maps are quite ugly, I would say, so we can try to do a bit
better. You can try to use, for example, the stars package, which is another package to manage spatiotemporal arrays. It's not like terra, it's not like raster, it's another thing. I use it
often to plot maps because it can be easily integrated within the ggplot plotting structure. So what I do here is first stack my rasters, so my
observed probability of occurrence and the two predictions of the GLM and the random forest. I name them, so my stack is now an object of class RasterStack with my three layers, and then I convert it to a stars object using st_as_stars. So, the class is,
okay, and this stars object has a different structure compared to the raster or terra files, but there is a nice introduction to the stars package. Here I put the link to the
vignette, so we can have a look, and it explains very well how it works. Anyway, we can plot it. Again, the nice thing is that we can combine different spatial objects into a ggplot using the ggplot grammar: geom_stars for my stars object, for my
rasters, and geom_sf to plot the presences and absences. So again, I have to come back to an sf object from an sp object. If you remember, at a certain point
I had to convert my presences to an sp object, because otherwise some of the packages that are not updated to sf don't work. So now I come back, I colour it, I use scale_fill_viridis to have a nice palette for my habitat suitability values,
and white and black for my observed data. So let's plot it. Voila. And here we have another way to show our maps, to show the predictions of our model.
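A hedged sketch of this plot; the raster object names and the presence of a "band" dimension after conversion are assumptions, and older package versions may need explicit coercions between terra and raster classes:

    library(stars)
    library(ggplot2)

    # Stack observed probability of occurrence plus the two predictions, then convert to stars
    pred_stack <- raster::stack(pa_prob_raster,
                                raster::raster(glm_map),
                                raster::raster(rf_map))
    names(pred_stack) <- c("Observed", "GLM", "RF")
    pred_stars <- st_as_stars(pred_stack)

    ggplot() +
      geom_stars(data = pred_stars) +
      facet_wrap(~ band) +                                   # one panel per layer
      geom_sf(data = pres, aes(colour = factor(Observed)), size = 0.4) +
      scale_fill_viridis_c(name = "Suitability", na.value = "transparent") +
      scale_colour_manual(values = c("0" = "white", "1" = "black"), name = "Observed")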
Okay, now, in the last 15 minutes, are there any questions, or can I move to the last part? Questions? No. Okay, so again, what we did until now is use independently a GLM
and independently a random forest. There are packages in R that allow you to combine everything together and run everything at once, okay, but you don't have all the model
tuning, okay, so there are pros and cons. There is a great example, the sdm package; there are others, like biomod2, for example, which is another pretty famous one. So sdm is like a wrapper for several algorithms. Again, here it requires you to provide a testing and a training data set,
so here we just subset our data set five times using 70% of the data to train and 30% to test. We can parallelize the whole thing, so it's great, we specify five cores and again,
And again, I have to convert my presences back to an sp object so that I can run it. The package requires the data in its own format: sdmData takes my observations, that is, my response variable, the observed presences and absences of the species in a spatial format, plus my environmental data, the environmental layers in my raster stack. So I create this object and I have my d object: we see that we have one species, whose name is 'observed' (I could also model multiple species at the same time with this package), the number of features, meaning my predictors, the names of the features, the type (presence-absence), whether there is independent test data (no, because we are using an internal validation, not an external one), the number of records, and so on. I use all 19 variables here together, so without checking for multicollinearity, but again, I could do that and it would be better. Then the kind of model: I specify it in the sdm function. 'observed' is my response variable and here I use all the predictors; data is d, the object I have just created; methods, so I combine a GLM and a random forest; replication is the subsampling I defined, with the training/testing percentages and five replications; and then I set up the parallelization of my code. Okay, so I can run it; it will take a few seconds.
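A minimal sketch of the sdmData/sdm calls just described, assuming pres_sp is the SpatialPointsDataFrame with the 0/1 column observed and env is the environmental raster stack:

```r
library(sdm)

# bundle response and predictors into the package's own data structure
d <- sdmData(observed ~ ., train = pres_sp, predictors = env)
d   # prints species, number of features, record counts, etc.

# fit both algorithms at once, with 5 random 70/30 subsampling replicates,
# in parallel on 5 cores
m1 <- sdm(observed ~ ., data = d,
          methods = c("glm", "rf"),
          replication = "sub", test.percent = 30, n = 5,
          parallelSetting = list(ncore = 5, method = "parallel"))
m1
```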
So we have our model m1, which already shows us some information: how many species, the type of modelling methods, the subsampling scheme, all the information we defined at the beginning, the different methods, and then the performances of the two approaches on the testing data set: the average AUC for the two models over the five replicates, the correlation coefficient and the TSS. We can go a bit deeper and generate ROC curves with their AUC: you see that on the training data set our random forest model is overfitting, while on the testing data set it is not, but it is clearly very prone to overfitting, much less so the GLM.
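These curves are typically drawn directly from the fitted object; a brief sketch (treat the smooth argument as an assumption about the package version):

```r
roc(m1)                 # ROC curves per method, for each replicate
roc(m1, smooth = TRUE)  # smoothed version of the same curves
```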
Then we can inspect what is inside this m1 object, which is quite complex, in the sense that it is a nested object. So from m1 we can access the models, and we get a lot of information: replicates gives me the ID of the fold for each replicate, and we can access the settings, the info, the models, the observed data, and so on (printing the whole thing was a terrible idea). Then I can access the two models, for example the GLM, and within it the different replicates. For just the first replicate I can then access the evaluation, so all the metrics that are run automatically to evaluate model performance, and I can access the metrics evaluated on the training and on the testing data set. Let's take the testing one: I get a list with different values. I can access the correlation coefficient between the predicted and the observed values, with its p-value, or I can access the threshold-based metrics, which are sensitivity, specificity, TSS, kappa and so on, that is, what we saw before. So we have access to all of this information, computed automatically.
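For reference, the threshold-based metrics mentioned here can also be computed by hand from a confusion matrix; a minimal base-R sketch, assuming obs is the 0/1 response and p the predicted probability:

```r
threshold_metrics <- function(obs, p, thr = 0.5) {
  pred <- as.integer(p >= thr)               # binarise at the chosen threshold
  tp <- sum(pred == 1 & obs == 1)
  fn <- sum(pred == 0 & obs == 1)
  tn <- sum(pred == 0 & obs == 0)
  fp <- sum(pred == 1 & obs == 0)
  sens <- tp / (tp + fn)                     # sensitivity: true positive rate
  spec <- tn / (tn + fp)                     # specificity: true negative rate
  c(sensitivity = sens,
    specificity = spec,
    TSS = sens + spec - 1)                   # true skill statistic
}
```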
For the model performance: since I wanted a table with the average performance for each fold on the testing data set, I created a data frame and stored the results in it. Because this sdm object holds nested information, you have to play a bit with the data; I won't go through this chunk of code in detail, but basically it rbinds the means of lapply calls over the nested object. If I run it, I get my final table.
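As an alternative to walking the nested object by hand, the package's getEvaluation() and getModelInfo() can pull the same statistics per replicate; a sketch only, since argument and column names may differ slightly between versions of the sdm package:

```r
ev   <- getEvaluation(m1, stat = c("AUC", "COR", "TSS"),
                      wtest = "test.dep", opt = 2)   # test-set metrics per replicate
info <- getModelInfo(m1)                             # maps modelID to method
ev   <- merge(ev, info[, c("modelID", "method")], by = "modelID")

# average test performance per method over the five replicates
aggregate(cbind(AUC, COR, TSS) ~ method, data = ev, FUN = mean)
```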
It shows the same summary statistics for the previous models and for the models we have just run, and we see, for example, that the sensitivity and specificity of this model with five partitions are better for the random forest than those of the previous model, which was very prone to overfitting. Since we have 10 minutes left, we can move on to the spatial predictions. With the sdm package you can predict directly in a similar way, again with the predict function, as I showed you before.
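A minimal sketch of that call, assuming env is the same predictor stack as before:

```r
p_all <- predict(m1, newdata = env)   # one layer per replicate: 5 GLM + 5 RF
p_all
```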
It takes a couple of seconds and voilà: I have a raster stack object with ten layers, one layer for each replicate, so five for the random forest and five for the GLM, and I can now calculate the average prediction over the five replicates of each method. So I use the raster calc function over the first five layers and compute the mean, but I also compute the standard deviation if I want to produce an estimate of the error of my maps, and I do the same for the random forest. In the end I have mean_pred, a raster stack with the mean predicted values for my two methods.
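A sketch of this averaging step; the layer order in p_all is assumed here (GLM replicates first), so check names(p_all) against your own output:

```r
glm_mean <- calc(p_all[[1:5]], mean)
glm_sd   <- calc(p_all[[1:5]], sd)    # per-pixel spread, a simple error map
rf_mean  <- calc(p_all[[6:10]], mean)
rf_sd    <- calc(p_all[[6:10]], sd)

mean_pred <- stack(observed, glm_mean, rf_mean)
names(mean_pred) <- c("Observed", "GLM_sdm", "RF_sdm")
plot(mean_pred)
```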
So again we have the observed suitability together with the prediction of the GLM and the prediction of the random forest from the sdm package. We could also obtain the average prediction directly: instead of doing it manually we can call predict with mean = TRUE, if you don't want to do it by hand; but doing it manually gives a bit more control, because we can also compute the standard deviation, quantiles and so on. Or we can build an ensemble, combining the two predictions of the SDMs. There are several ways to do this: we can either take a plain average of the two predictions, or weight the average by different statistics. In this case I chose the AUC, but you can also weight by the TSS, by the correlation coefficient or by other metrics; in the end it computes a weighted mean.
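A sketch of the ensemble call with AUC weighting, following the setting argument of the sdm package:

```r
ens <- ensemble(m1, newdata = env,
                setting = list(method = "weighted", stat = "AUC"))
plot(ens, main = "Weighted ensemble (AUC)")
```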
Okay, let's do it with the ensemble function, and let's plot it: we have our observed virtual species suitability, the prediction of the GLM with the sdm package, the prediction of the random forest with the sdm package, and the ensemble of the two. With the next couple of chunks of code, let's also add the predictions of the previous models to our raster stack, so it now has six layers, and if we plot it, voilà: there are differences between the approaches we used; the GLM, you see, is quite different, and so is the ensemble.
To conclude, we can also get an estimate of the accuracy of the spatial predictions: since we have a virtual species, we can compute, for example, the RMSE and the correlation coefficient between our observed pixel-level suitability, the one we generated at the beginning of the class, and the predictions. I won't go into the details of this chunk of code: basically I convert our rasters into a matrix where each column holds the predicted values of one model (or the observations) and each row is a pixel, and then I compute the RMSE comparing each model's prediction against the observed, simulated suitability.
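A sketch of this comparison, assuming observed is the simulated suitability raster and all_pred is the stack holding the six predictions (the names are illustrative):

```r
library(raster)

vals <- values(stack(observed, all_pred))   # one column per layer, one row per pixel
vals <- vals[complete.cases(vals), ]        # drop pixels with missing values

rmse <- apply(vals[, -1], 2, function(p) sqrt(mean((vals[, 1] - p)^2)))
corr <- apply(vals[, -1], 2, function(p) cor(vals[, 1], p))

round(rbind(RMSE = rmse, COR = corr), 3)
```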
So let's run it, and voilà: for each model we get an estimate of the RMSE and an estimate of the correlation coefficient in geographic space. That's also the nice thing of having a virtual species: you can test it several times, changing your models, and compare against something that you actually know, because you built it yourself. For the GLM, in this case, we have an RMSE of 0.3, so there is an average error of 0.3 on the predicted values, which is a bit high. The lowest RMSE is the one of the random forest, but we know that this was the model that was overfitting. For the SDM ensemble, you see that the RMSE is a bit lower and the correlation coefficient is one of the highest, so this is just another way to compare the predictive performance of your models. I wouldn't choose any of these, because they were overfitting, we didn't split into training and testing properly and, as I said, there were other things we haven't checked. So this was not what we would call our best model; it was just to show you a kind of workflow.
A question from the audience: why choose RMSE? The prediction is a probability of occurrence, so a value between zero and one, and RMSE is usually seen with regression; is there a particular reason? Well, I used RMSE because I have seen it in other papers, but of course we can use other metrics; your imagination is the limit in this case. It was pointed out that, for probabilities, the log loss might be more appropriate. Well, as I said, it behaves like RMSE in the sense that the smaller the number, the better; we can change it to a log loss, that's not a problem, but in general this is what I have observed in other publications, and that's why I used RMSE. And it's something that people usually understand. Sorry, Cedric. RMSE is also something most people understand, so for a training course like this it's actually quite useful. Okay, so this is the end of the session. Are there any questions? Otherwise I'm happy to try to answer anything, and then there is lunch.
Okay, thank you then, and thank you, Cedric. No, thank you, Daniele, this was mostly all your work.