
Statistical methods in global air pollution modelling (part 1 - ensemble trees)


Formal Metadata

Title
Statistical methods in global air pollution modelling (part 1 - ensemble trees)
Number of Parts
27
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year
2020
Production Place
WICC, Wageningen International Congress Centre B.V.

Content Metadata

Abstract
High-resolution global air pollution mapping has significant social and academic impact, but it is a tremendously challenging task, especially in terms of data assimilation and analytics. In this workshop, I will introduce the most recent status of global air pollution modelling and the evolution of the data (from social science, Earth observation and numerical models), with a focus on explaining various machine learning algorithms (e.g. ensemble trees, deep convolutional neural networks) and overfitting-control strategies (e.g. regularization, post-processing), and how they can contribute to global air pollution mapping.
Transcript: English (auto-generated)
Okay, good morning, thank you for waiting, and welcome to the workshop on statistical methods in global nitrogen dioxide mapping. I'm going to have three sessions. I'm going to give an introduction to the mapping problem and the data we use, and then you're going to get started: you set up, download everything and explore the data. Then we'll focus on the machine learning methods. I'll separately introduce the ensemble-tree-based methods and the convolutional-neural-network-based methods, and after each of them you get an exercise.
For the ensemble-tree-based methods, it's going to be in R. For the deep convolutional neural network, I'm going to share a Jupyter notebook through Kaggle, which I already introduced this morning. It runs Docker in the back-end, so you just need to run the scripts I share without installing anything locally. And lastly, we'll look at the modelling process in practice: how we implement everything and do the bootstrap cross-validation and mapping. So let's get started.
So nitrogen dioxide mapping is a spatio-temporal regression problem: it goes from point-based spatial measurements to a continuous spatio-temporal field that gives an estimate for every location. Because the ground stations are located in different settings (here you can see some of them are in rural areas, like these red dots, and some of them are near traffic or industry, like these green dots), we need to go beyond interpolation purely based on distance. That is where the land-use regression model comes in.
It got its name because, in the beginning, the variables we used were mostly land-use variables, like traffic, road density, population and so on. It has since evolved, so we are now using more and more data, but the name has stuck.
It was originally a linear regression model, and that is actually still the most widely used model for prediction and for analysing the sources of air pollution. But there have also been lots of developments trying to make better predictions.
Those include integrating spatio-temporal dependency, like kriging-based methods; generalised additive models, which relax assumptions on the variable distributions; and machine-learning-based and hierarchical methods, which are also my focus.
As I mentioned in the morning, the geospatial data are a major driver of our global mapping; that is what makes the global mapping so promising. The predictors we use include road and industrial areas extracted from OpenStreetMap, which is becoming more and more comprehensive every month and year, especially in European countries. We also use Earth's night-time lights from satellite imagery and population to represent socio-demographic conditions, and we use wind speed, temperature and solar radiation from numerical models. Wind speed is quite easy to understand: it contributes to the dispersion of the air pollutant. Temperature and solar radiation mainly contribute to the chemical reactions of the species. Here we found something quite interesting. This is a time series for the city of Liverpool in Australia, which is much closer to the equator than the Netherlands or many European countries. You actually find that the lowest pollution occurs around noon or in the afternoon, not at night as we would usually expect. This is because solar radiation greatly shortens the lifetime of nitrogen dioxide.
There is also elevation: for example, if you are in a valley, the air is very hard to disperse. This comes from radar data with quite high spatial resolution, and recently there are also more elevation products, I think you shared some, taking the first and second derivatives of elevation, so we really know whether we are in a valley, or going downhill or uphill; that can also contribute a lot to the prediction. I also mentioned in the morning the TROPOMI instrument, which is a very promising product; it is quite new, launched in October 2017, and in 2019 it moved to an even higher spatial resolution of 5.5 kilometres. It measures a broad wavelength range from ultraviolet to shortwave near-infrared, so not only nitrogen dioxide but also other gaseous air pollutants like nitrogen oxides, carbon monoxide, sulphur dioxide, methane and so on. That can also help with the prediction because of the chemical reactions, so our interest is in how to best make use of this product. We also use the surface concentration product from a numerical model, the GEOS chemical transport model, which gives a global simulation at about 10 kilometre resolution, so a bit coarse. As for the ground station measurements, collecting them was actually a big challenge in the beginning; you have to reach out a lot and go through a lot of organisations, but a major part comes from OpenAQ. OpenAQ is an open air-quality community, and the data there are becoming more and more comprehensive, especially since 2016. Now you can probably get even more data.
Since 2017 it also has an R API, so you can query real-time data as well as data from the past three months, and I have an R script for it. You can already start from downloading the data in the first hands-on, to get set up. I think it would be great if you first look at the data and then I explain the statistical algorithms; it will be easier for you to understand. So, have you downloaded the material from this link? You can download the study material from this link. You all got it? And I have a description here. If you download it, you get the folder. In this R-script folder you see the introduction, and there is a GeoHub 2020 R Markdown file; you can look at the data using this file.
So you all know R Markdown, right?
Yes, for this workshop it is mostly annual based. We are currently also doing temporal prediction. Yes, then it becomes hourly based. For this workshop we mainly do spatial analysis. Yes, and actually our study separates daytime and night-time.
But for this study I am just going to use the annual model. I shall go through the script, and then you can do some exploration yourself.
So you can first install the packages. Here it is, APM2: that is the package I made to facilitate the air-quality modelling process. If you go to this GitHub page, you can also see the R scripts and descriptions. I also have data in this package. If you want to download an R package from GitHub, you can just use devtools and install_github. So I have the data in there. But because we might do a spatio-temporal prediction for the OpenGeoHub summer school, and this dataset might be slightly related to it, I added some noise to it. It won't affect your learning process, but just be aware that the results you get may be a bit different from the real data. Here you can see the description of the variables.
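A minimal sketch of the installation step just described; the GitHub account name below is a placeholder, not the workshop's actual repository, so use the URL given in the study material:

# Install the workshop package straight from GitHub.
# "some-account/APM2" is a placeholder; use the repository from the study material.
# install.packages("devtools")       # only needed once
library(devtools)
install_github("some-account/APM2")

library(APM2)                        # load the package (placeholder name)
ls("package:APM2")                   # list its exported functions and data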
And these are the functions in this package. Most of them try to streamline the process: for example, calculating the variable importance and the error metrics, so R-squared, root mean squared error and other indicators; some of them are for data pre-processing; and this one converts the units. The package is just made to facilitate the modelling process, and after you install it you can use the functions in there. Then we can load the data from it. The global annual data is the main dataset we are going to use. Here you can see the predictor variables; we also have longitude and latitude, and this value mean is the nitrogen dioxide concentration. Here are some statistics. We also aggregated the different road types: we have five road types extracted from OpenStreetMap, and we aggregated the last three, the tertiary and local roads, into a new group, but you can also try not doing that. Here is the part that makes some maps: if you run it, it will generate a map of the ground measurements. I don't know if you are all following, so, here.
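A minimal sketch of the loading and mapping steps just described, assuming a data frame with longitude, latitude and a mean-concentration column; the package, dataset and column names are placeholders, as above:

library(APM2)                         # hypothetical package name
data(global_annual)                   # hypothetical dataset name

# Quick look at the predictors and the response.
str(global_annual)
summary(global_annual$value_mean)     # placeholder name for the NO2 column

# Simple map of the station locations, coloured by concentration.
library(ggplot2)
ggplot(global_annual, aes(longitude, latitude, colour = value_mean)) +
  geom_point(size = 0.6) +
  scale_colour_viridis_c(name = "NO2") +
  coord_quickmap()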
If you go to the folder where you ran the scripts, it gives you a map, and you can see where the stations are and also their values. Okay. Then we can also look at the box plots, here, and the paired correlations, and the spatial dependency in terms of the variogram. So you can just go through these chunks and then move on to the modelling part.
I think we're almost there. Yeah. I'll give you about five minutes to look at the data. Also, in this data-visualise folder you see two R Markdown files: one on OpenAQ, for querying OpenAQ data, and one called cloud RS, for visualising remote-sensing images and also the output from the numerical model simulation.
Okay. Let's see if I can get a picture of the image so we can just check it. Thank you.
Is there a specific R Markdown you recommend we look at, or is it just time for us to check out the entire folder?
There are two sections. The first section is about this, so here: these three files, I want you to look at them. You can of course also look at the description, right here, to get an overview if you like. Have you managed to install the packages?
So we'll go on in five minutes.
Yeah.
Oh, the variogram. Yes, that's a good question. It is a quick way of examining the spatial correlation. The x-axis is the distance, and the y-axis the semivariance. If there is strong spatial correlation, then data points that are closer together have stronger correlation, so at short distances you have lower semivariance values: higher correlation at low distances. Data points that are very far away from each other, at long distances, have higher semivariance. So this shows the spatial correlation.
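As a reminder (this is the standard definition, not something specific to the workshop package), the empirical semivariogram estimates, for each separation distance h, half the mean squared difference between observations that far apart:

\hat{\gamma}(h) = \frac{1}{2\,|N(h)|} \sum_{(i,j)\in N(h)} \bigl(z(s_i) - z(s_j)\bigr)^2, \qquad N(h) = \{(i,j) : \lVert s_i - s_j \rVert \approx h\}

Low semivariance at small h therefore means strong spatial correlation.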
Yes, but for our case actually, I have them here for different countries. Here you can type in the country code, so DE for Germany, US for the United States, and you can also try other countries. Then you examine these variograms to see if there is strong spatial correlation. For example, here the correlation is not so strong. Yes, and it is also important to set the cut-off. By the way, the distance is in degrees; one degree is roughly 100 kilometres. You can change this to look at more local or more global correlation.
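A minimal empirical-variogram sketch with the gstat package (not necessarily the workshop's own code; the dataset, column names and cut-off value are placeholders, and the cut-off is in the same units as the coordinates):

library(gstat)
library(sp)

# Placeholder data frame with longitude, latitude and the NO2 mean.
d <- global_annual[global_annual$country == "DE", ]   # e.g. Germany
coordinates(d) <- ~ longitude + latitude              # promote to SpatialPointsDataFrame

# Empirical semivariogram of the concentration; cut-off of 1 degree (~100 km).
v <- variogram(value_mean ~ 1, data = d, cutoff = 1)
plot(v)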
This section here just does it globally. And here, in Germany, you see the spatial correlation is not so high, because you do not really get this classic shape. In China it is a bit better, and you can think about the reason by looking at the station distributions: by looking at this map, you can think about why.
In some areas there is stronger spatial correlation and sometimes not. Okay, shall we move on, or are you all good with this part? Yeah. Okay, you can let it keep running, and then in the next session I'll give you a bit more time for the explorations. So we're going to talk about machine learning. I don't know how much you know about machine learning; you have some experience with it? Okay. So machine learning is a very broad field and it is evolving really fast in the past several years. Here is a very useful link I found, the machine learning roadmap. If you click on it, it gives you a roadmap connecting many of the most important concepts; it is very comprehensive and shows you what is happening, and the resources to get into machine learning: problems, process, tools, mathematics and resources. You can also see it as an interactive map. It is really comprehensive.
So here, for example, you get the concepts and the process, what AI is, and then you also get lots of code: example code from the beginner level all the way up to things a CS student might not even learn. Very practical, and also linked to the mathematics and the machine learning algorithms. If you want to go more in depth into machine learning, it points you to the books, and to cloud services, because very soon things will not run on your local computer any more and cloud computation is maybe the most practical way to go. There are lots of datasets, so it is very fun to browse, and there are also lots of very new things, the newest developments. And I wanted to show you this one, Streamlit. It is Python based, but it is like R shiny apps, which help you show your scripts directly. So it is super good for data exploration, showing concepts and tuning different things. It is still in beta, but I think it is really helpful, so I encourage you to have a look, maybe after the course. So, in this lecture, hopefully I can give you a good understanding of some of the most classical and popular algorithms, strategies and tips we use to apply machine learning to the mapping problem.
The history I find quite fascinating. It goes back at least to the 1940s, with Walter Pitts and McCulloch raising the idea of the neural network. Then in 1950 there was a very important paper from Alan Turing introducing the Turing test: essentially, to determine whether a computer possesses human intelligence. Basically you are talking to a machine but you don't know what you are talking to, and if you can't tell the difference, if you can't distinguish machine from human, the machine passes the test. In the 1970s there was the AI winter, actually a neural-network winter, where people's interest started to shift to support vector machines, algorithms with more mathematical background which at that time also gave better predictions. Then came a series of milestones: Deep Blue beat the human chess champion, and IBM Watson beat the human players and won the game show Jeopardy, which is still quite fun to watch. And in 1997 the neural network came back with long short-term memory, the LSTM, a neural network that works on one-dimensional data. In 2006 it really started to show its potential and even got a new name, deep learning.
It also attracted lots of statisticians and branched into a field of applied statistics called statistical learning, where the focus is on statistical modelling and uncertainty assessment. So today I am going to walk you from the basics to the newest developments. We will start from the decision tree. It is a graphical model that asks a series of questions. For example, a person wants to decide what to do: he may think, am I hungry? If yes, do I want local food or not? And based on these answers, he is going to make decisions. (Sorry, the slide doesn't work so well; I don't know what happened.) I just wanted to say that this maps very well to the air pollution model: if an area has low population, then probably there is less air pollution, and if a place has high population, then probably we want to look at the traffic. If everyone is driving a car, then maybe the air quality is very bad. This kind of tree-based method has many nice features. It is non-parametric, so we are not restricted by the variable distributions. And, most importantly, redundant variables are simply ignored, so we do not need to worry about them. Remember that we really have problems with collinearity in simple linear regression, but if we just drop those variables we also lose information.
So a tree-based method may better integrate more information. It can also handle missing data easily: the missing data are just passed down a branch. For example, if population has lots of missing data, the algorithm will probably say, okay, then I am going to look at the next variable, and go on to make a prediction. But we also know that a single tree is not going to work, because of the model variance: a new dataset may always follow a different pattern.
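Before moving on to ensembles, here is a minimal single regression tree sketch in R with the rpart package (the data frame and column names are placeholders for the workshop data; rpart handles missing predictor values through surrogate splits, as described above):

library(rpart)
library(rpart.plot)

# Regression tree for NO2 with a few typical land-use regression predictors.
fit <- rpart(value_mean ~ population + road_density + nightlight + temperature,
             data = global_annual,                 # placeholder dataset name
             method = "anova",                     # regression tree
             control = rpart.control(cp = 0.01, minsplit = 20))

rpart.plot(fit)                                    # visualise the series of questions
predict(fit, newdata = global_annual[1:5, ])       # predictions for a few stations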
What we want to do is to leverage this idea: grow lots of trees and aggregate them, to reduce the model variance and make better predictions. The first method is bagging, where we first randomly shuffle the data and then take a subset of the data to grow each tree. By aggregating the trees, we may reduce the model variance. If we do it this way, we want the trees to be independent from each other. That is the idea of the random forest, because the more uncorrelated the trees, the more variance we can bring down. Random forest basically says: I am going to pick the best variable to split the node not from all the variables but from a random subset of the variables; in this way the trees we grow can be more independent. Boosting follows a different approach. The random forest grows trees independently, so a tree doesn't pay attention to the other trees, but boosting keeps looking at the areas where the previous trees didn't do well, and then tries to fit a better model there. Here is why the random forest works: if you have identically and independently distributed variables, each with variance sigma squared, then aggregating them brings the variance down to sigma squared divided by the number of trees. But if there is correlation between the trees, then it is this formula: rho indicates the correlation between trees, and you can see that if rho is very big, the right-hand term disappears and we are effectively back to growing a single tree.
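The formula referred to above is the standard result for averaging B identically distributed trees with variance sigma squared and pairwise correlation rho:

\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

As B grows, only the first term remains, so if rho is close to 1 the averaging gives almost no benefit; that is why random forest decorrelates the trees.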
The algorithm is quite simple: we first bootstrap a sample from the training data, then draw a subset of variables from all the variables, and pick the best of those variables to split each node.
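A minimal random forest sketch with the ranger package (dataset and column names are placeholders; mtry is the size of the variable subset tried at each split):

library(ranger)

set.seed(42)
rf <- ranger(
  value_mean ~ .,               # placeholder response and predictors
  data       = global_annual,   # placeholder dataset name
  num.trees  = 1000,            # number of bootstrapped trees
  mtry       = 10,              # variables tried at each split
  importance = "permutation"    # variable importance
)

rf$prediction.error                              # out-of-bag mean squared error
sort(importance(rf), decreasing = TRUE)[1:10]    # most important predictors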
And because the random forest has several limitations, for example the splitting is not based on a hypothesis test, so it tends to select variables with lots of missing values, and it does not assess uncertainty, there are many variations of it, like recursive partitioning variants, variations of the sampling and variable selection, and the quantile random forest, which try to address these limitations. The parameters that are not learned by the model but set before the model are called hyperparameters. I developed a shiny app for you to look at how the adjustment of the hyperparameters affects the model's cross-validation accuracy and also the spatial prediction pattern. This is actually very important, because many studies only look at the cross-validation accuracy. Especially later, when we talk about boosting, you will see that for some boosting methods, even with the optimal set of hyperparameters, the spatial predictions can have many artifacts. The gradient boosting is a bit more complex compared to the random forest.
I don't know if you know the algorithm of boosting; shall I go through it? Yeah. So in the first step, I initialise with a constant value; F0 means the zeroth tree. This constant value is found by finding the value that minimises the sum of the loss function between the observations and this value. If you take half of the squared error as the loss function, you see that this gamma, the value we want to minimise over, is actually the mean of all the observations. That is our zeroth tree, and then from the first to the last tree we iterate. The first step is to find the direction in which the cost function decreases most quickly. That is done by finding the gradient: the gradient is obtained by taking the derivative of the loss function with respect to the previous tree's prediction. Again, if you take half of the squared error, then this y tilde i m equals the difference between the observation and the previous tree's prediction; that is why this y tilde i m is called the pseudo-residual, with i for each observation and m for each tree. Once we have the pseudo-residuals, we fit an L-terminal-node tree to them, which divides the data into L regions R l m; here l indexes the terminal node and m again the tree. Then we pass the data down and again find the gamma that, added to the previous tree, minimises the sum of the loss between the observations and this update. Once we have found this gamma, we multiply it by the learning rate, which decides how big a step we want the cost function to take downhill, and use it to update the previous tree. Is that clear? Great.
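The steps just described, written out (this is the standard gradient boosting recipe with learning rate nu; with the squared-error loss L(y, F) = (y - F)^2 / 2 the pseudo-residuals reduce to ordinary residuals):

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)

For m = 1, \dots, M:
  \tilde{y}_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad i = 1, \dots, n
  fit an L-terminal-node tree to \{(x_i, \tilde{y}_{im})\}, giving regions R_{lm}, \; l = 1, \dots, L
  \gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} L\bigl(y_i, F_{m-1}(x_i) + \gamma\bigr)
  F_m(x) = F_{m-1}(x) + \nu \sum_{l=1}^{L} \gamma_{lm}\, \mathbf{1}(x \in R_{lm})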
The learning rate is a very important tuning hyperparameter, which you also see in deep learning and most machine learning methods that use gradient descent. If the learning rate is very high, you risk not finding the minimum, the optimum of your loss function, and risk overshooting. If the learning rate is very low, you usually need lots of iterations for the cost function to settle at the minimum. And the stochastic gradient...
Yes, because we want to minimise the cost function. The yellow curve is when the learning rate is very high, and then you can't find the minimum of your cost. The right one is not bad, but it means you might need lots of iterations for it to reach the minimum. Is this clear? Okay. Then stochastic gradient boosting borrows the same idea as random forest: instead of fitting each tree to all the data, each time I take a subset of the data to fit the tree. By doing it this way, the model becomes more general. And XGBoost takes the boosting a step further.
It has been the most popular, or at least one of the most popular, methods in the past two or three years, since it started winning Kaggle machine learning competitions. The biggest idea that makes it better than other algorithms is that it penalises not only the loss but also the model complexity: this Omega term here. Adding a penalty to the cost function is regularisation, so now we are not only minimising the loss but minimising the loss and this penalty term together. Another great feature of XGBoost is that it is scalable directly: you can just apply it and it can make use of all your free threads, or you can specify the number of threads you want to use, without needing other packages or doing the parallelisation yourself. XGBoost actually originates from a distributed machine learning project. It does the regularisation in two ways. The first is to regularise the number of leaves: it doesn't want the tree to grow very big, and this is done by passing a hyperparameter that tells the tree to stop growing if a split doesn't reach some improvement. The other way is by penalising large leaf values, and that can be done with the L1 norm and the L2 norm. The L1 norm takes the absolute value of what we want to penalise; the L2 norm takes the square of what we want to penalise.
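The regularised objective described above, in the form commonly written for XGBoost (gamma penalises the number of leaves T of each tree, while lambda and alpha penalise large leaf weights w through the L2 and L1 norms):

\mathcal{L} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma\, T + \tfrac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1

In the R xgboost package these correspond to the gamma, lambda and alpha parameters.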
Here I am showing the L1 norm in linear regression, because the cost function is easy to visualise: this paraboloid object here. For our tree-based methods this cost function can be much more complex. Figure b looks at figure a from the top, and these are the contour lines. Here we are trying to minimise two things. One is the loss function, indicated by this blue object or its contour lines, which shows how well we can reconstruct our data; you get its minimum at this dot here. The other is the regularisation term, this green shape, which has its minimum at the origin. The cost function is minimised where these two sets of contour lines touch, and because with the L1 norm this touch can happen on an axis, the L1 norm can enable feature selection: in linear regression, for example, coefficients can be shrunk exactly to zero. Is this part clear? I don't know if you know LASSO or ridge regression already. So the L1 norm, or LASSO, because it is used in LASSO regression, enables feature selection and is also more robust, meaning more resistant to outliers. That is not hard to understand, because ridge takes the squares, and squaring magnifies the outliers. But ridge also has some advantages: it is more stable, meaning more resistant to small data adjustments, so if you perturb your data slightly, ridge will still give a similar model. It is also computationally more efficient, because it is easy to differentiate the square term but not the absolute-value term; the absolute value takes a few more steps and is also not differentiable at zero.
That is how we control overfitting when we are growing trees. The other way to control overfitting is when we are aggregating trees. Remember that in random forest we just take the average of all the trees, which gives every tree equal weight; but the trees can be redundant. Knowing that LASSO regression results in feature selection, can we also use LASSO to aggregate the trees? Here p indicates the tree and alpha the coefficients for aggregating the trees. Can we penalise the alphas to shrink some of them to zero, so that there are fewer trees and we may also better control overfitting? How does it work? You can see it in your hands-on, still in this GeoHub 2020 R Markdown.
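A minimal sketch of this post-processing idea, not the workshop's exact implementation: take each tree's prediction from a fitted random forest and fit a LASSO on those per-tree predictions, so that redundant trees get a zero coefficient. Object names are placeholders.

library(randomForest)
library(glmnet)

set.seed(42)
# Placeholders: y is the NO2 concentration, X the predictor matrix.
rf <- randomForest(x = X, y = y, ntree = 1000)

# Per-tree predictions: an n x ntree matrix (one column per tree).
tree_preds <- predict(rf, X, predict.all = TRUE)$individual

# LASSO (alpha = 1) over the tree predictions; lambda chosen by cross-validation.
cv  <- cv.glmnet(tree_preds, y, alpha = 1)
fit <- glmnet(tree_preds, y, alpha = 1, lambda = cv$lambda.min)

sum(coef(fit)[-1] != 0)   # number of trees kept after the LASSO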
In the second part, modelling, you see this LASSO plus random forest section, which does the post-processing. I also have two shiny apps developed for you, to see how the adjustment of the hyperparameters affects the cross-validation results and the prediction pattern: one for random forest and one for XGBoost. For XGBoost I have published it at this link, so you don't have to run it locally if you don't want to. If you do want to run it locally, it is just these apps in this folder; I am not sure our internet is the best. Basically, you see a set of hyperparameters to adjust, and then, for example, this is the spatial prediction with one tree and this one with 100 trees. You clearly see that a single tree doesn't really work.
You get artifacts; you may also need to adjust the learning rate, for example. Oh yeah, here it comes. So you can adjust these different hyperparameters; here it gives the cross-validation accuracy and the prediction pattern, which is very important to look at. And if you adjust the number of trees this will all change. Okay, so you can go through these chunks now. I'll give you about 20 minutes and you can ask questions. So basically there are three files: one R Markdown and two shiny apps.
There's a quick question. Yeah, it's a trade-off between gradient boosting and XGBoost, but it's also a bit difficult to set up when you really need to tune your hyperparameters as well. Yes, even in this case I also have a description of all these hyperparameters. So if you don't tune them very well, even though your cross-validation accuracy may look fine, the prediction patterns can suffer, and that sort of thing. Yes, yes, that's about it.
Oh no, it's the same. You mean which one is more robust?
Oh, that's a very good question. I think the boosting is more robust because it builds on the previous predictions and it also has this regularisation, so three ways of doing the regularisation. The key point is really that if you don't tune the XGBoost well, then you'd rather use the random forest for this. Are there particular tuning methods you suggest? I think maybe I'll develop my own method, actually; for now I can comment on this.
I have the R Markdown file; it shows you how to tune it. We do a grid search: we use an R package and do a grid search over the hyperparameters.
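A minimal grid-search sketch using the caret package (caret is an assumption here, not necessarily the package used in the workshop; the dataset and column names are placeholders):

library(caret)
library(xgboost)

set.seed(42)
# A deliberately small grid of XGBoost hyperparameters to try.
grid <- expand.grid(
  nrounds          = c(200, 500),
  max_depth        = c(3, 6),
  eta              = c(0.05, 0.1),   # learning rate
  gamma            = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample        = 0.7
)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

tuned <- train(
  value_mean ~ .,               # placeholder response/predictors
  data      = global_annual,    # placeholder dataset name
  method    = "xgbTree",
  trControl = ctrl,
  tuneGrid  = grid,
  metric    = "RMSE"
)

tuned$bestTune                  # best hyperparameter combination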
Actually, you will find that at the optimal setting, when you reach the lowest cross-validation error, your prediction patterns are maybe not so good; you may still have artifacts with the hyperparameter setting that is optimal according to your cross-validation accuracy. That is why I developed this, to also look at the prediction pattern. Okay, I think I've got it. So here, in the modelling part, you see this hyperparameter optimisation, but I suggest you run it after the workshop, because it takes very long as it does many iterations. Then you have this running-the-models part; I think you can start from here: from regression tree to random forest and to this post-processing strategy, which is actually really new. We never saw it in any study, but it actually gives improved results. You can also see the number of trees that are dropped after running this LASSO plus random forest: we can reduce the number of trees from a thousand to something like 40 if you run it. Do you prefer to run it yourself, or would you like me to run it here? I think it's better you run it yourself, and ask questions.
If you can't run it, there are also these results here, so you can go through them if you have problems running it. This is the result of the R Markdown file: if you click on knit, you generate this HTML file. Here you can actually see the coefficients of the different trees. We can reduce the number of trees to only 100 if we do this LASSO plus random forest.
You mentioned...
Well, yes, it is sensitive, but it's also because of your hyperparameter setting. I meant that it still got artifacts when I had my optimal tuning from the grid-search optimisation. So I was really trying to motivate people to look at the spatial predictions as well. It's not that the model is not robust to outliers; it's more because of how we tune the hyperparameters. And of course, another thing is that we probably need more data compared to random forest.
From my experience, it's really a big improvement over random forest if you do it well.
You mean the... We use a set of metrics: root mean squared error, mean absolute error, and also relative values, so relative root mean squared error. You see them if you run this error-matrix function; that's in the script, and I put all these indicators in this package, which I wrote.
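A minimal sketch of such error metrics (my own illustrative helper, not the workshop package's function):

# Simple error metrics between observations (obs) and predictions (pred).
error_metrics <- function(obs, pred) {
  rmse <- sqrt(mean((obs - pred)^2))
  mae  <- mean(abs(obs - pred))
  r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  c(RMSE  = rmse,
    MAE   = mae,
    rRMSE = rmse / mean(obs),   # relative RMSE, scaled by the mean observation
    Rsq   = r2)
}

# Example use: error_metrics(y_test, predict(rf, X_test))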
Yeah, there is also ongoing research where we are trying to find better accuracy assessment methods that really take care of the spatial prediction patterns, so not only by visualising them but also by trying to find ways of quantifying them from a spatial perspective. Yeah, but that's the only place where you have observations.
Yeah, but we're also trying to get some indicator that can show you when there are lots of artifacts, for better accuracy assessment. So here it is, the error matrix; you can see the different indicators, and if you go to the function itself I think I have a description there.
Yeah, what you raised is a very good point. I think there isn't really a settled spatially-related cross-validation method yet, but people are trying to find one. Thank you.
Yes, that's a good strategy. We're also trying to do something going from the spatial predictions to the predictions around those points, to see if they are more consistent.
If you predict a certain value for a certain cell, does it count as a big error if the true value occurs in a cell further away? Like, do you take distance into account?
Can you repeat? Yeah, I'm not sure how to phrase the question, but for instance, if you are trying to map deforestation and you have a forest of, say, 10 cells, and you predict deforestation in one cell but it actually happens in the neighbouring cell, that would still be a pretty accurate prediction, just less accurate, right? Would you call that a smaller error than if it were predicted 10 cells away?
Yeah, okay, I see your point. I think for us this spatial pattern matters less, because our stations are far away from each other. But I think I can still consider the point. I don't know any software for doing it, but I think it is something we can program and see. Yes, yes, yes.
Okay.
Oh. From a Greek life, yeah, from a Greek life, and he also developed a library, but he's still using everything. So, in that line, also when you start to see, you should think of a way
that has to go, because it is, yeah, one thing is so close, it's so easy to
press. So maybe even, I mean the information of somebody first,
So there is a paper, there is this special number, so I think you can do it, but I don't Then I see that the outcome is something we don't want to know.
Yes, without the data. Thank you.
Yeah, thank you. Okay, was that more for validation or prediction?
Validation. Okay, yeah, thank you. We're actually also trying to do mixed-effect forests. That is for clustered data, so regions within cities, cities within countries, countries within continents and so on, to capture the similarities within countries and also the patterns between countries. Um, yeah, but that's not for validation.
We can check that paper and see. I think maybe the next one, the convolutional-neural-network-based method, better takes the spatial pattern into account, because it is purely spatially based, you know, convolutional filters. So instead of what we are doing now, calculating the road density, the road length in different buffers, you just go directly from the transportation network and apply the deep convolutional neural network to automatically extract features from there. So you no longer aggregate the data into a point, but look at the entire spatial pattern and scan over the data to do the prediction. But still, the validation is always point based; I think that is probably still a point-based validation.
I have really been thinking a lot about better accuracy assessment methods, and also thought about using TROPOMI, because it gives data with global coverage. The TROPOMI measurement, so the satellite, the Sentinel-5P measurement: it has a coarse resolution, but around seven by seven kilometres. We are also trying to use that to contribute to the validation, because you have it more or less everywhere. But the thing is, it measures the column density, which is not directly related to the surface concentration, and we are interested in the surface concentration.
No. Yeah. So we want to collaborate with some other teams so they can use a numerical model to do the surface prediction from it, and we can then also use it for validation, for example by matching the predictions to the TROPOMI pixels. So I think we can discuss validation further; very interesting.
Okay. Is everything running well with this? Anyone else?
Yeah, probably your virtual machine has a lower version. Okay. Then you can try to upgrade. Oh, yeah.
Yeah, I should have let you download it. But you can also look at this HTML file to get an overview.
Okay. Do you need any help? You're just waiting for it to run? I don't know about the last half hour; the programme says we are going to have a discussion, or we can go on. Will you all stay for the afternoon session? Okay, then I'll move the deep learning to the afternoon session, and also the modelling in practice. So for now you can just run your scripts and ask questions.
You want to... Oh, I have the example in the script, so after you finish that.
Pardon? Yes, I wrote the function. Yes, you'll see it in the script, so let's see what it is.
So this LASSO plus random forest, this chunk, this function: here you can see the cross-validation if you run this function. There are also two additional things in this other folder. I made two extra scripts, just for you to get a better understanding of the machine learning techniques. One is an example of how to aggregate random forest, XGBoost, all the different models together, so not just ensembling trees but ensembling models. This can almost always improve prediction. The problem is that it is slow, of course, because you have to run all the different models, but in machine learning competitions you see that the winners usually ensemble different models as well. So I have this small R script here for you to see how to ensemble models.
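A minimal model-ensembling sketch, a simple weighted average of two models' predictions with the weight chosen on a held-out set (the data objects are placeholders, and this is only one way to ensemble, not the workshop's exact script):

library(ranger)
library(xgboost)

set.seed(42)
# Placeholders: X_train/X_valid are numeric predictor matrices, y_* the NO2 values.
rf  <- ranger(y = y_train, x = as.data.frame(X_train), num.trees = 1000)
xgb <- xgboost(data = X_train, label = y_train,
               nrounds = 500, eta = 0.05, max_depth = 6, verbose = 0)

pred_rf  <- predict(rf, as.data.frame(X_valid))$predictions
pred_xgb <- predict(xgb, X_valid)

# Pick the weight w that minimises validation RMSE of w*rf + (1-w)*xgb.
rmse <- function(w) sqrt(mean((y_valid - (w * pred_rf + (1 - w) * pred_xgb))^2))
w    <- optimize(rmse, interval = c(0, 1))$minimum

pred_ensemble <- w * pred_rf + (1 - w) * pred_xgb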
And this other one is about the difference between statistical models, like linear regression, and machine learning models. Okay, I'll come back to that later; yes, you had a question?
Yeah, I think it was one or two years ago.
Yeah, they started. Yeah, I'm aware of that; I think they started a few months or a year ago, something like that.
Yeah. Yeah.
I just started taking it out.
Super learner? Yeah, I think I remember it, also from a few years ago or something.
I don't remember the results. No.
Yeah.
What for? Oh, yeah. And Google also has this automated machine learning, AutoML or something, from Google. I think they probably do things a bit better.
Yeah, you can probably try it. I'm actually thinking about moving my machine learning work to Python, because I sometimes feel it has better support and a more rigorous community; a few more people are working on machine learning in Python. Are you all R users? Good choice. Oh,
I'm still deciding, like, which packages to use; I don't know yet. Yeah, just try both and compare.
So this one, this script, I also try to ensemble myself, to try to optimise the prediction from different models. You can try it if you feel like it, but I think I'll give it to you as homework.
Does anyone want to run things locally? Because I also have instructions for installing Python if you don't want to run it online; that's for the afternoon, for the deep learning part, which we do in Python.
So I'm probably going to show it already. Maybe you can already try this link. I would recommend you create an account; then you can also use the GPUs and TPUs there. Otherwise you can only stay for about 15 minutes, and you can't use a GPU or TPU. But it's really the biggest community for machine learning practitioners, definitely, and it's very easy to create an account. Yes. Yeah, I'll share the scripts.
So if you have an account, or if you click to edit it, you can already run my scripts: either run everything, or run it as a notebook. Here you can set the accelerator: you can choose None, which means CPU, or GPU or TPU. The TPU requires a bit of configuration, but it's quite easy to do. So it's a really good starting point for people who don't have a GPU.
Yes, also R notebooks. Let me show you: you can choose to use a Jupyter notebook, or just a Python script, or R, or R Markdown. So there are four kinds of notebook, and you can choose whether you want Python or R. You can install packages, but it runs Docker in the back-end, so it already has lots of machine learning packages pre-installed. So, yeah, exactly: here, for example, it's the same Keras and TensorFlow, you can just import them the same way as with library() in R. And for things that are not there, you can easily install them; for Python it's pip. So I think for teaching and so on, that's really cool.
Maybe. No. Oh, yeah. Yeah, they are a bit similar, but this one has GPU and TPU, I don't know. It also has a good connection with Google, so you can use Google Cloud. So probably they are quite similar for you. I think maybe you can create a Kaggle account now, so that in the afternoon you can use it if you want to; otherwise you can only stay for 30 minutes or so on my scripts. You can still see my scripts and the results and so on, but if you want to run them, you need an account.
Did you mention the Keras example? The Keras example, you didn't mention it. The Keras example, oh, you did. Yeah, it's quite old, and it's actually quite slow. I have the scripts here, but I commented out all of the code because it's too slow. So maybe you can try it if you want to. I also have a mixed-effect model here; I think it'll be too much if I talk about it now, but that's also what we are currently doing.
The mixed-effect model, as I mentioned, is for hierarchical data, to better integrate more spatial information for prediction, because between countries there are disparities and within countries there are similarities.
So, the mixed-effect model: there is one development that combines random forest with mixed-effect models, which I think is super interesting, and I also have a Kaggle script for it, so you can try it out if you are interested. It actually gives better results than random forest too; I forget whether it also gives better predictions than XGBoost. But I think it's very interesting.
Yeah, I think it's not here, or the R one is not so good.