High performance computing in Python
Formal Metadata
Title: High performance computing in Python
Number of Parts: 24
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/59772 (DOI)
Transcript: English(auto-generated)
00:10
My name is Leandro Parente and I'm from OpenGeoHub. Today I will talk about high-performance computing in Python, focusing mostly on the context of
00:25
the Open Data Science Europe project and on multi-array and raster processing. I'm online on the Slido link,
00:41
so you can send questions at any time. The focus here is a hands-on training session, so you can just ask questions and I will constantly check Slido to reply and answer them.
01:03
I also put the link in the chat. Here is the short URL so you can access the slides. Yeah, I think we are ready to go.
01:20
Before I start, I would like to say that I'm sorry to not be present in the conference. I had some personal issues that didn't allow me to go, so that's why I'm doing my session remotely. But I hope you all enjoy the conference.
01:43
There are many interesting topics, training sessions and talks; I hope you enjoy the whole week. Let's start. In this tutorial, basically, I prepared a few slides to explain
02:05
about some conceptual background for processing raster data, but mostly multi-array structures. It will be about 20 minutes
02:22
of this conceptual explanation. In the last slide of this presentation, you will find a link for a Google Colab. We can do the hands-on part using this link.
02:42
I will do everything with you, execute the code and explain all the steps. In this conceptual part, I will be explaining embarrassingly parallel problems and possibilities to optimize a raster processing workflow.
03:03
Then just a bit about the BLAS and LAPACK implementations. If you work with any kind of array processing, for sure you are using these libraries under the hood. The last topic is how we can optimize
03:21
temporal reductions and numeric operations. When we talk about embarrassingly parallel problems, it's basically a problem that can be divided into completely independent parts.
03:41
If you think about it, you can have each part of this problem running in parallel, because of course you don't have any kind of dependency. In an ideal case, for example, if you think of raster data and each pixel,
04:02
if you want to do some processing just for that pixel, you could, in the limit, have one pixel being processed by each core, for example. In this case, hypothetically, it's possible to scale up all the processing and really
04:23
split all the parts of the problem across multiple cores and CPUs. You also have other types of problems; not all processing problems are really completely independent.
04:40
You also have nearly embarrassingly parallel problems, where you have some dependency, so you need to do some pre-processing, and later you can scale up the processing across multiple workers, as we can see here in this figure.
05:04
Most raster processing is a nearly embarrassingly parallel problem, because you need to organize some pre-processing and post-processing steps to really send, for example,
05:25
a command and do the processing on each part of the pixels. I will show some examples here so you can understand it with a concrete example. This is a time series
05:44
that we extracted from AVHRR data, so since 1982. For this time series, for example, we applied a smoothing filter. Of course, you have all this data
06:01
organized in mosaics, so for each point of this plot, you have one image, and here I'm illustrating, for example, in the Amazon region, the time series.
06:21
To really scale it up, at some point you need to coordinate the process, so you need to have a master process that will split parts of these images, chunks of the data, into different processing parts and send them to multiple cores.
06:43
For example, if you have just one CPU, you can use all your cores, eight cores for example, and each core will process just part of the data, so Brazil will be processed by one core and Europe will be processed by another core.
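To make the idea concrete, here is a minimal, hedged sketch (not the project's actual code) of splitting a raster stack into row chunks and processing each chunk in a separate worker process; the smoothing function is a hypothetical placeholder:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def smooth_chunk(chunk):
    # Placeholder "smoothing": a 3-step moving average along the time axis.
    kernel = np.ones(3) / 3.0
    return np.apply_along_axis(lambda ts: np.convolve(ts, kernel, mode="same"), -1, chunk)

if __name__ == "__main__":
    data = np.random.rand(1024, 1024, 20).astype("float32")    # (y, x, time) stack
    chunks = np.array_split(data, 8, axis=0)                    # one spatial chunk per core
    with ProcessPoolExecutor(max_workers=8) as pool:
        smoothed = np.concatenate(list(pool.map(smooth_chunk, chunks)), axis=0)
    print(smoothed.shape)                                       # same shape as the input
```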
07:02
But this is just a small example, if you think of a notebook. If you are working with a really large computing infrastructure, you could have thousands of cores, and as I explained, if we had 96 million cores, each core could manage one single pixel of this image,
07:23
and this would be the most optimized way to do the processing. Of course, this is a hypothetical situation, but in the end you can still have thousands of cores and split chunks of the data across each core,
07:43
and here we have an example of this processing running on our servers. For one server we have 96 cores, and what we did, basically, was split the world into different chunks of data,
08:00
as I explained, and each core is processing part of the data. You can do it in several ways: you can split, for example, by country or by continent, but considering the processing needs here, it's not really necessary to have any kind of administrative unit,
08:21
so basically we divided the world into a grid with regular cells, and we just sent each region to each process, to each core, sorry. So, considering it,
08:43
of course, if you want to really optimize it, you could, and when I say optimize it, imagine this processing, this smoothing filter on the time series. If we want to really speed it up,
09:05
we could, of course, use more CPU cores. We are limited by the number of cores in one single machine, right? So, for example, here you are seeing 96 cores, and actually, in this machine,
09:22
we have two Xeon Gold chips, each with 24 real cores and hyper-threading, so basically 48 threads per chip, and actually two separate physical CPUs.
09:42
So, in this case, it's really optimized, but of course it's not possible to put 1,000 cores in one single machine, considering these regular processing servers. So, if you have multiple servers,
10:01
I don't know, 10 servers like that, of course you need to transfer the data through the network, and you need to balance it, right? You could have multiple cores, but you also need to consider the speed of transferring the data and all these aspects.
10:21
The last possible way to optimize is actually to improve the processing code. You can keep the hardware as it is and just change your code, improving some algorithm or changing, for example, a function. There are several ways to do it.
10:40
And of course, we will focus here on this third part, on how we can improve the processing code. Considering that, of course, it's expensive to change the code, and in some situations maybe it's not possible;
11:01
you would need to re-implement some function from scratch. Maybe that's not the ideal situation, but you also have a kind of drop-in replacement alternative. What it means is that you can keep your code as it is, the original code, and just replace some low-level library.
11:23
So, for example, if you use Python, and for sure you have already used or heard about the NumPy library, you can replace the C functions that implement, for example,
11:42
the way that the NumPy operations are performed in a low-level library. If you do it, you can keep your Python code as it is and just replace the piece of software that is responsible for doing the hard work and processing the low-level mathematical operations.
12:02
This kind of optimization is really cheap and can produce good results. We call it a drop-in replacement, but of course you can also re-implement all your code from scratch. With Python, you can implement parts of the code in C,
12:22
integrating it with Cython, for example. And this is all new code implementation. So, considering this drop-in replacement, we actually have several options, but it's important to understand these two libraries.
12:43
We have BLAS, the Basic Linear Algebra Subprograms. It's a C library that provides a set of routines for basic vector and matrix operations. We also have LAPACK, the Linear Algebra PACKage, which is a Fortran 90 library
13:01
to solve linear equations. These two libraries are actually pretty standard when you think of numeric operations. And, as I said, they are kind of low-level libraries, so all the higher-level programming languages
13:24
like R or Python use them under the hood. So, you can really replace the implementation of these libraries and optimize part of your code.
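To see this on your own machine, a small hedged sketch: NumPy can report which BLAS/LAPACK build it is linked against, and any large matrix product is dispatched to that backend, so swapping the backend changes the speed without touching this code.

```python
import time
import numpy as np

# Report the BLAS/LAPACK implementation this NumPy build is linked against
# (e.g. OpenBLAS or Intel MKL); the output format varies between NumPy versions.
np.show_config()

# A large matrix product goes straight to the underlying BLAS (gemm),
# so this is a quick way to compare backends after a drop-in replacement.
a = np.random.rand(2000, 2000)
start = time.time()
a @ a
print(f"matrix product: {time.time() - start:.2f} s")
```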
13:40
I will show some examples of how to do it. But the important part here is that, as these two libraries are pretty standard, they define a really well-known interface. So, you have the same method signatures independent of the implementation. That's why it works if you change one implementation for another:
14:02
you can still execute the same code, because everything is tied together by the same method signature. And when I say method signature, it's something like this: this is the name of the method, here are the parameters, and you have the same signature
14:21
but with different implementations in C and in Fortran 90. Considering that, here we have some options that you can evaluate. We did it internally and I will show the result of our benchmark. You have R and Python,
14:41
and these are really high-level programming language frameworks, and inside R and Python there are several libraries that use LAPACK and BLAS. As I said, considering the same common method interface,
15:01
you can just replace, for example, the OpenBLAS implementation with Intel MKL, and you have the same method signature, and by doing it you can have better performance. You can also test, for example, the NVIDIA implementation.
15:21
We didn't test it yet, but you have the same method signatures, as I explained, implemented by NVIDIA and compatible with CUDA. So, for example, any kind of operation that you can perform in NumPy,
15:42
if you use this NVIDIA implementation, will be automatically parallelized and executed on a GPU. Of course, you need to have a GPU to do it, but it's also a way to optimize the processing; you just need the hardware here.
16:03
For Intel MKL, as it's an Intel library (MKL is the Math Kernel Library), you need to have an Intel CPU. Most of our internal computing is based on Intel.
16:21
So that's our current BLAS and LAPACK implementation. But you also have other open-source implementations, like OpenBLAS, and AMD is trying to develop its own version of these two libraries. That is also an option,
16:42
but the last time I tested it, one year ago, it broke in the middle of the process. So, I will show the results. Okay, considering the use case that I presented, this smoothing filter
17:02
and also a kind of gap filling in that time series, considering the AVHRR data: we had code at the time running in R and, as I said, we just replaced these three
17:23
low-level libraries using the same code. So it's really a drop-in replacement. Without any modification in the code, we discovered that, in our R code, MKL is three times faster than OpenBLAS for this specific application.
17:43
Of course, you could have a different application and, for sure, the result would be different. So the suggestion here is to really benchmark your own processing,
18:02
considering what you want to optimize. In our case, we did it because we were deciding between Intel and AMD CPU processors. In the end, we have everything running on Intel.
18:23
You will also find some blog posts explaining that it's doable and possible to run MKL on AMD, but this is just a hack. You can try to do it, but for production work I don't advise it, because MKL
18:44
is really optimized and uses some specific functionalities and resources that are available only in Intel hardware, and it's maintained by the Intel developers. So, you could try some hack like that
19:02
to run MKL on AMD, but it's not ideal or advisable for a production environment. Okay, considering that, we will now test how we can optimize some of the array operations
19:30
that we have in Python. For some of them, it's a kind of drop-in replacement, though not exactly as I presented here. And this is just an introduction
19:42
and some conceptual background to understand how you can optimize it. For the libraries that we will present here, we will focus on bottleneck, numexpr, and numba. For numba, for example, you need to really rewrite
20:04
part of your function, but it's not something that will really require a lot of effort, because the way these three libraries work is really close to the NumPy library.
20:23
So, I stopped the presentation here and you can click on this link.
20:42
So, you will see the same script as me, and it's a public script. So, before you start it, it's important to create your own copy to your drive.
21:01
So, please remember to click on this button; now I'm working in a copy of the script. And yeah, I hope everyone could access it.
21:27
So, just send a question or let me know if you have any problem. So, again, in the last slide of the presentation, you have this link. With this link, you will open a Google Colab
21:46
with all the code that we will execute during this tutorial. And you can just copy to drive to have your own copy and execute it in a proper way.
22:03
Okay, so now, in your copy of the script, if you scroll down, you see that all the cells are already executed, right? So, you need to click here, Clear all outputs.
22:26
So, now we will have a clean version of all the code that we will test. And again, clear all outputs.
22:40
And you can click here to start the runtime environment and really execute the code. So, again, clear all outputs
23:02
and just click here to start the runtime. I will increase my font, okay. So, I'm not sure if everyone here is familiar
23:25
with Google Colab, but here in Colab you basically have a kind of small virtual machine. You can see the resources here. So, basically every machine has a kind of space,
23:47
a specific amount of disk and memory. And it's important to explain that we will download some data here.
24:00
So, this is a completely empty environment, and all the data that you will use to execute the tests needs to be downloaded and organized in this environment. And if you want to keep or save some of the data outputs that we will produce,
24:22
you need to download them, because when you close the Google Colab everything will be wiped out and deleted, and basically you would need to rerun the whole script again to produce the same output. Considering that, this script
24:44
is divided into three main steps. The first step is to install and prepare the environment: we need to add some libraries that we will use, and I will show you how to do it. After the libraries are in the environment,
25:21
we need to prepare the data. We will access the data that is available in our SpatioTemporal Asset Catalog. It's basically a public URL where you can access, for example, different raster layers,
25:40
clip them using the code that I will provide here, and download them to the Google Colab environment. The last part is actually the optimization: we will execute some multi-array functions considering these three libraries
26:03
and implement a benchmark. These are the main steps. So, let's start. Basically I was explaining the structure of the notebook and the tutorial.
26:03
In the first part we will set up the environment and the libraries, and then we will set up the data. To set up the data, we will use STAC, our STAC catalog, basically accessing several cloud-optimized GeoTIFFs,
26:42
and we will do the optimization considering these three libraries and comparing the execution time. To set up the libraries: in the context of the Open Data Science Europe project,
27:02
we developed the eumap package; if you click here, you can see the package. Basically we have several functions to do gap filling, to access data and to implement different parallelization strategies.
27:22
In the context of this tutorial, we will access the raster data and save the local rasters using these functions. It's just a helper to access multiple raster layers, basically a wrapper around rasterio,
27:45
but with one single line you can access multiple rasters and organize the data. To do it, you need to install the library like that, and we will also use ipyleaflet to select some region of interest to download the data.
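As a hedged sketch of those install cells (the exact eumap version or branch pinned in the original notebook may differ, so treat these specifiers as assumptions):

```python
# Run inside a Colab/Jupyter cell; the '!' prefix runs a shell command.
!pip install ipyleaflet
!pip install eumap   # the notebook may pin a specific Colab-compatible branch instead
```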
27:45
So, just execute it. And yeah, it's better if I create a copy as I did; you can then execute this line here.
28:01
I will just create a new copy to have the same environment as you.
28:21
Okay, here I'm clearing all outputs and creating the copy, great. So here basically we are installing ipyleaflet and we are also installing the eumap library.
28:43
For the eumap library, we created a specific branch to match the Python dependencies with Colab, because in Colab we have some limitations, so basically we modified some of the dependencies. It's also important to mention that we also prepared a Python environment
29:08
to develop GIS and geospatial applications. Here you will have GDAL 3. We are changing the libraries here,
29:23
but this is just a Docker image that you can use to develop your own applications, and it's really simple. Here we have a JupyterLab ready to go with GDAL, rasterio, the scikit-learn library
29:45
and all the machine learning libraries that you could possibly need. We use it internally, but considering that it's important to have the same environment for everyone
30:00
in the training session, we chose to use Colab; you can work locally if you are familiar with Docker. So here the libraries are installed and you need to restart the runtime. You will see this in the next parts of the tutorial.
30:24
We will use this magic cell, %%timeit, just to benchmark and compute the execution time. So you also need to update the IPython version, because Colab has a pretty old IPython version.
30:44
And yeah, you just need to execute it. You also need to execute the install of PySTAC. I will open our Open Data Science Europe STAC
31:01
implementation in a second. But PySTAC is just a library to access this STAC, the catalog with all the raster layers, and select some specific layers that we will process here. I will restart the environment again.
31:20
The last step is just to import all the libraries. This is just a warning because of some dependency between Colab and eumap, but when you execute these functions here
31:42
you shouldn't see any error, so it means that everything is installed and ready to go. So yeah, you installed several libraries now and we can proceed with the dataset preparation.
32:03
This can take a while, so I will execute it now. In the meanwhile, I will explain. For the dataset preparation, as I mentioned to you...
32:22
yep, it's finished. As I mentioned, we need to prepare the data: we need to access some remote data, some raster layers in a public URL, and we need to prepare the data and save it here in the Google Colab, because we don't have any data in this environment.
32:43
So the second part is just to prepare the data. To do it, don't worry too much about STAC; we will have a session about it tomorrow. But you can enter here,
33:01
and I need to update it, this STAC, open data, this one. But if you click here, you can also access the STAC directly.
33:23
So basically our STAC is a SpatioTemporal Asset Catalog. We have almost all the layers that we produced in the context of the Open Data Science Europe project.
33:40
For all these layers you have metadata, you can see a thumbnail, and here is the metadata. This is just a browser to access the STAC. But most important, when you enter here,
34:00
you have multiple assets and you have this URL here. Basically, with this URL you can access the data directly from Python or QGIS, it doesn't matter. This is a really open and public URL
34:20
where you can access the data through a cloud-optimized GeoTIFF. This is all prepared: you have one mosaic over the whole of Europe, and you can access and visualize it in a fast way,
34:44
because it was prepared by us to provide different overviews, this pyramid layout that lives inside the cloud-optimized GeoTIFF format. So what we did here is basically consider this URL.
35:06
This is our STAC URL. As I mentioned, you have a session about STAC and all the concepts tomorrow, so this is just a really fast introduction.
35:23
Let me just check... okay, there is a question here. Let's see: "I'm not sure why we need to restart the runtime."
35:42
So, some of the libraries change. For eumap, I'm not sure actually; maybe some of these libraries use some operating system library,
36:03
but for IPython it's necessary, because when you update IPython you need to create a new session, since the runtime was started with the old version of IPython. I think it's version five. If you click here, you can see some information
36:23
about the previous version used by IPython. So in this case it's necessary, but in the case of eumap I'm not sure specifically why. Okay, so let's continue.
36:42
So, considering this URL, this is actually the URL to access the STAC. As I mentioned, this is just a browser, and if you copy this URL you can see all the layers in a JSON structure following the STAC standard.
37:02
Here I basically created access to the catalog, providing this URL, and I'm retrieving all the layers, all the assets that are inside this collection. So we have the red and the NIR band of the Landsat,
37:21
and we have the percentile 50 for the red and also for the NIR. And here we created all the URLs. To create these URLs, basically I accessed all the children here
37:41
of this catalog, considering all the assets of the red collection. So basically it's the same thing as entering here and copying the URL; of course, it's just a programmatic way to do it: go here and copy this URL for all the layers,
38:04
all the periods that I have here. And here you can see the first URL and the last one. If you check these URLs, we have a well-defined file name convention with the name of the product,
38:26
the spectral reflectance band, and here the date. It's basically starting in what we are calling the winter,
38:40
but it's actually the first quarter of 2000, and finishing in 2020. And you have it for the red and the NIR. Considering that this is a cloud-optimized GeoTIFF covering the whole of Europe, you can use this ipyleaflet map to really zoom in
39:03
and select some specific area. The area that you select will be used to clip all these remote files and save them locally, and we will use these files in the next steps.
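For reference, a hedged sketch of the catalog traversal just described, using pystac; the catalog URL placeholder and the collection filter are illustrative assumptions, not the exact identifiers used in the ODSE catalog:

```python
import pystac

# Assumption: catalog_url is the STAC catalog.json URL shown in the slides.
catalog_url = "https://example.org/stac/catalog.json"

catalog = pystac.Catalog.from_file(catalog_url)

# Walk the child collections and collect the asset URLs for the red band.
red_urls = []
for collection in catalog.get_children():
    if "red" not in collection.id:          # illustrative filter; real ids differ
        continue
    for item in collection.get_items():
        for asset in item.assets.values():
            red_urls.append(asset.href)

red_urls = sorted(red_urls)
print(red_urls[0], red_urls[-1])            # first and last URL, as shown in the notebook
```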
39:23
So basically I will zoom in here. And as I mentioned, please don't select a really large area, because it will take a lot of time.
39:41
The idea here is really to go to a specific area. This is a web map; you can navigate to any place in the world and just select a city or, I don't know, some forest area close to a city. Here I will select Wageningen.
40:00
Basically you just need to click here, draw a rectangle and define the region of interest. This line here I will not explain in detail; it's just a way to create an ipyleaflet instance
40:25
and provide some controls, but it's basically a way to interact with the map. Here I'm accessing it through the draw control, and I'm just getting the geometry that I drew.
40:44
Here I can see the bounds. As explained here, I need to convert these bounds to the projection system of the cloud-optimized GeoTIFF. And you can see here
41:02
that I'm doing it using this pyproj Transformer. Basically, I know that ipyleaflet is giving the coordinates in WGS 84, and here I'm converting them to the projection system
41:20
of the first URL. Here I'm just accessing it directly with rasterio. If you have used rasterio at some point, you know you can provide a local file, but you can also provide a URL. The URL that I'm providing is basically the first one of the red layer, this one.
41:41
Here I'm converting it, and after this conversion I'm creating a window object. If you want to know more about it, you can access this link, but it's basically an abstraction in rasterio to provide a window
42:03
into a specific raster, considering the dimensions of the raster and the size that you want to access. And here, with this window object, I'm basically using the read_rasters function.
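A hedged sketch of the bounds conversion and window creation just described (the drawn bounds and variable names are illustrative; red_urls is the list of red-band COG URLs from the previous step):

```python
import rasterio
from pyproj import Transformer
from rasterio.windows import from_bounds

# Illustrative WGS84 bounds, standing in for the rectangle drawn on the ipyleaflet map.
min_lon, min_lat, max_lon, max_lat = 5.6, 51.9, 5.7, 52.0

with rasterio.open(red_urls[0]) as ds:                      # first red-band COG
    # Convert the WGS84 bounds to the raster's own projection system.
    to_raster_crs = Transformer.from_crs("EPSG:4326", ds.crs.to_wkt(), always_xy=True)
    min_x, min_y = to_raster_crs.transform(min_lon, min_lat)
    max_x, max_y = to_raster_crs.transform(max_lon, max_lat)

    # Build a read window from the projected bounds and read only that part of the file.
    window = from_bounds(min_x, min_y, max_x, max_y, transform=ds.transform)
    chunk = ds.read(1, window=window)

print(chunk.shape)
```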
42:24
I will just execute it; it could take a while. This read_rasters function is from the eumap library, a function that we developed.
42:42
Basically it's accessing the red URLs, only the first ones, not all of them; here I'm limiting it because of the Colab memory limitation, and I'm passing the window. If you click here, you can access the documentation,
43:02
but, as I mentioned, it's a helper function to read multiple rasters. So here I'm reading 20 red bands; I'm accessing 20 red URLs,
43:25
different dates, sending this window, and I'm doing it in parallel. If you do it, you can see that this is the size of my data: this is the space and here is the time.
43:46
So this function is responsible for reading each individual file and putting them together in a three-dimensional array; it's a helper function. And now, with this helper function,
44:01
basically I have a NumPy array, a three-dimensional array, and I can save it directly in my local environment; in this case, the Colab. Here I'm just passing the working directory, and to save the rasters it's important to pass
44:21
a base raster, as I'm saying here. You can check the documentation of the function here. Again, it's a eumap function; we use it a lot in the project. Basically, if you do it,
44:42
it will take the projection system, pixel size, everything from this base raster, and it will save all the red rasters, this NumPy three-dimensional structure, with that same structure
45:01
in the local folder. If you do this, you can enter here and see all the files that were generated. Let me see if we have questions.
45:25
No questions, okay. So now we are ready to go, and it's good because we still have half an hour. The data is here in the Colab and you can download it if you want.
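For reference, a hedged rasterio-based sketch of roughly what the eumap read/save helpers do (read a stack of single-band files through one window into a 3D array, then write each time slice back out using the metadata of a base raster); the real eumap functions also handle parallelization and other details:

```python
import numpy as np
import rasterio

def read_stack(urls, window):
    # Read the same window from every single-band file and stack along a time axis.
    bands = []
    for url in urls:
        with rasterio.open(url) as ds:
            bands.append(ds.read(1, window=window))
    return np.stack(bands, axis=-1)                     # shape: (y, x, time)

def save_stack(base_url, window, data, out_files):
    # Copy projection, pixel size and data type from the base raster.
    with rasterio.open(base_url) as base:
        profile = base.profile.copy()
        profile.update(height=data.shape[0], width=data.shape[1],
                       transform=base.window_transform(window))
    for i, out_file in enumerate(out_files):
        with rasterio.open(out_file, "w", **profile) as dst:
            dst.write(data[:, :, i], 1)
```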
45:41
It's just the GeoTIFFs for the red band, clipped from that big mosaic that I showed in the STAC, considering this region. So now it's time to start the proper optimization, and to do it we will basically have two types of operation.
46:07
One is the array reduction, because here we have a three-dimensional array with the third dimension being the time. An array reduction is just an operation where we have, for example, a three-dimensional structure
46:24
and we reduce it to two dimensions. We apply some sort of mathematical operation along one of the dimensions of the array, reducing it. If you think about the data input and output,
46:40
the data input is a three-dimensional structure and the data output is a two-dimensional array. Here it's just a toy example: considering this two-dimensional structure, I'm reducing it to just one dimension. But of course we want to do it with proper data.
47:05
And now we have the data locally. I'm using this find_files function to access all the red files here, and now I'm sending to read_rasters
47:22
a list of local files instead of a list of URLs. Of course, I already had the data, because I used it to save the files, but the idea is that
47:41
you can execute this step from here if you have the data. It's just a different way to read the data, now considering our local folder. I also created this info function here just to present the shape of the array that we read,
48:02
the data type and the size. So now we have the red data here and, to evaluate it, we will use this %%timeit magic cell,
48:27
this thing here. If you click, you'll see more of the documentation, but it's just a command of the Jupyter notebook to benchmark some specific Python code;
48:45
basically, it's pretty generic. So the first thing: considering this red data, we have it here; it's a really small chunk of data, but we have 20 dates.
49:00
For each year we have four images, so it's about five years of data. Considering it, we have this three-dimensional structure and we are calculating a median operation
49:21
in the time dimension. So, per pixel, for all the pixels, we will reduce and calculate the median considering these 20 observations available.
49:40
The output will be a kind of median image for these five years of the red band. It's a common operation to do in this data cube, this multi-dimensional array structure. Here I'm just using a NumPy implementation,
50:02
and I'm using this function just to ignore the nodata values. If you check the time here, this is the time considering five repetitions: the same operation was performed five times
50:22
and you have the average time of this median operation, and here is the standard deviation. Okay, this is our baseline; it's how we perform this operation in NumPy. But we have an alternative library, bottleneck.
50:43
This is a fast implementation of some NumPy functions. You don't have all the functions available here, it's important to know that, but most of the functions you can find here in the documentation.
51:01
This is a kind of drop-in replacement because, as you can see, it's the same signature: the same name of the method and the same parameters. For this operation, we have 21 milliseconds here.
51:27
With the regular NumPy operation, we have 80. So yeah, it is about four times faster with bottleneck here. And again, it's the same data, the same operation,
51:42
and this is a nice improvement. If you select a bigger area, not too large, because of course you need to download and save the data locally, but if you select a larger area it will probably be even faster. I did it with several chunks of data of different sizes,
52:03
and you can see this improvement. And the point here is: considering that explanation about processing in parallel and sending different chunks of the data to each CPU, if you have an optimization like that,
52:22
it's a nice way to reduce the processing time and perform some operation for the whole of Europe in a really fast way, without changing the code too much; it's just a matter of changing the package here.
52:42
But again, we now have two results, and to really understand it... I don't know, maybe bottleneck implemented some operation that reduces the precision
53:02
or something like that. So in this case we are also testing if the arrays are really equal. Here you have the result of bottleneck and the result of NumPy, and I'm using a NumPy function to really check
53:23
if the arrays are exactly the same, considering all the values, and yes, they are the same. So the people that developed bottleneck did a pretty nice job here and implemented some of the common operations of NumPy in a fast way.
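A hedged, self-contained sketch of that comparison (random data stands in for the red stack, and the timeit module replaces the %%timeit magic so the snippet runs as plain Python):

```python
from timeit import timeit

import bottleneck as bn
import numpy as np

# Random (y, x, time) data standing in for the red-band stack read earlier.
red_data = np.random.rand(500, 500, 20).astype("float32")

t_np = timeit(lambda: np.nanmedian(red_data, axis=2), number=5) / 5
t_bn = timeit(lambda: bn.nanmedian(red_data, axis=2), number=5) / 5
print(f"numpy.nanmedian:      {t_np * 1000:.1f} ms")
print(f"bottleneck.nanmedian: {t_bn * 1000:.1f} ms")

# The drop-in replacement should produce exactly the same values.
print(np.array_equal(np.nanmedian(red_data, axis=2),
                     bn.nanmedian(red_data, axis=2)))
```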
53:43
Let me check the chat. Okay, no more questions. All right, okay, so let's go ahead.
54:10
We also have a different library called Numba. It's more complex and has really much more functionality than bottleneck,
54:22
and you can check it later; there is a pretty extensive documentation. But in the context of this tutorial, I will basically implement the same operation
54:41
considering this JIT decorator. Basically, it's a way to parallelize different NumPy operations inside this chunk of the data. So, for example, if you consider this
55:01
as the number of pixels, internally Numba splits it in parallel, but you need to modify the code to really allow the execution of this method. So what I did here: I created a function called reduce
55:24
and I'm using this decorator with the parallel option. But here it's a bit different, and you need to understand an important part:
55:41
as far as I know, Numba doesn't handle this kind of multi-dimensional structure directly. In this case we have a three-dimensional structure, and with more than two dimensions it's a bit tricky.
56:00
So what I did here, and that was the way I found, is basically changing the dimensions. I'm combining both spatial dimensions, longitude and latitude, X and Y, into one single dimension, kind of like a big table, right?
56:23
And the last dimension I'm keeping as the time. So basically I'm changing the shape of the data input like that and sending it to this reduce function, and inside this reduce function
56:43
Numba will parallelize. You need to use this prange function; it will parallelize over all the pixels. It will use all the cores, and it also does it in a really fast way, because they have some internal optimizations as well.
57:01
Here I'm calculating the median, filling this array and returning the result. This is a pretty simple function, but, for example, if you want to change the operation from the median to, I don't know, some average or max, you can just change it here
57:23
and keep the same function. So when you run it again, it will execute five times, and you can see that it's slightly slower
57:41
than bottleneck. And here, just to explain: this is the input with its three-dimensional shape, and here is the input for Numba with the two-dimensional shape.
58:01
The first dimension is the longitude and latitude combined. Now, again, you need to check the result, and you can plot it and compare. We already did a comparison here,
58:21
and it's the same image. And here you have the times. So basically bottleneck is the winner here, but for some reason Numba scales better.
58:41
What I'm trying to say is: if you had a bigger portion of data, a larger area here, probably Numba would be better. I did this test locally using, for example, a region of interest
59:01
of 1,000 by 1,000 pixels, and Numba was faster; maybe it's something in the Colab. But what really matters here is that both libraries are faster than the regular NumPy implementation.
59:21
And of course, Numba is a bit more tricky, so mostly I prefer to use bottleneck, because it's much faster to implement and integrate into my workflow. Okay, so this is the array reduction example.
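For reference, a hedged sketch of the Numba pattern described above: the two spatial dimensions are flattened into one axis and each pixel's time series is reduced inside a parallel prange loop (random data again stands in for the real stack):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def reduce_median(data_2d):
    # data_2d has shape (pixels, time); each pixel is reduced independently.
    n_pixels = data_2d.shape[0]
    result = np.empty(n_pixels, dtype=np.float32)
    for i in prange(n_pixels):                      # parallel loop over pixels
        result[i] = np.nanmedian(data_2d[i, :])
    return result

red_data = np.random.rand(500, 500, 20).astype("float32")      # (y, x, time)
flat = red_data.reshape(-1, red_data.shape[-1])                 # (y * x, time)
median_img = reduce_median(flat).reshape(red_data.shape[:2])    # back to (y, x)
print(median_img.shape)
```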
59:44
Let me see if I have questions. Okay, no. So I will proceed. Now it's the numeric operation. The numeric operation example
01:00:00
is a different operation, as I define here. You provide a two-dimensional array structure, this is a toy example, but in our case it will be three-dimensional, and you calculate some mathematical,
01:00:24
numerical operation, and you have an output of the same shape; the data input has the same dimensions as the data output. For example, in this tutorial we will use the NIR and the red band to derive a spectral index, a vegetation index, a new one, and we will save the result.
01:00:45
And here it's just a toy example of how you can take two datasets, perform some operation on them and have the output.
01:01:03
So yeah, I will change here to 20 NIR images; I already executed it with 40, but I tested this several times and sometimes the Colab crashed. I will change it and put it in the official link
01:01:26
after the presentation, so you can have the proper value later. So here I'm reading the red data and the NIR data,
01:01:40
and we will use these two spectral bands to calculate a vegetation index. There is a pretty new vegetation index that correlates better with biomass
01:02:01
and has less saturation in the higher values, called NIRv, the near-infrared reflectance of vegetation. If you click here, you can see the publication; basically I'm using this index, and here is the formula.
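As a hedged sketch of that formula in plain NumPy (NIRv is published as NDVI multiplied by the NIR reflectance; random stacks stand in for the real red and NIR data):

```python
import numpy as np

# Random (y, x, time) stacks standing in for the NIR and red bands read earlier.
nir_data = np.random.rand(500, 500, 20).astype("float32")
red_data = np.random.rand(500, 500, 20).astype("float32")

ndvi = (nir_data - red_data) / (nir_data + red_data)
nirv = ndvi * nir_data        # same (y, x, time) shape as the inputs
print(nirv.shape)
```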
01:02:22
If you think in a NumPy array structure, you can derive this index in a pretty straightforward way, as we are doing here. And yes, here you have the time, and it's just some regular
01:02:42
mathematical operations. Okay, we can do it with Numba also, and for Numba you need to use a different decorator; this decorator is called vectorize,
01:03:02
because in this case the input and the output will have the same size, the same dimensions. So it's important to use this different decorator, and inside here
01:03:23
you can just do the same operation that you perform outside with regular NumPy, but Numba will take care of the optimization. So let's execute it.
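A hedged sketch of the @vectorize variant: the function is written for scalar values and Numba compiles a parallel ufunc that is applied element-wise to the whole stacks (nir_data and red_data as in the previous sketch):

```python
import numpy as np
from numba import vectorize

@vectorize(["float32(float32, float32)"], target="parallel")
def nirv_numba(nir, red):
    # Same scalar expression as the plain NumPy version; Numba handles the looping.
    return ((nir - red) / (nir + red)) * nir

nir_data = np.random.rand(500, 500, 20).astype("float32")
red_data = np.random.rand(500, 500, 20).astype("float32")
nirv = nirv_numba(nir_data, red_data)     # same (y, x, time) shape as the inputs
```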
01:03:47
Okay, it's almost four times faster, so the same speed-up. And now, again, it's really important to check if it's exactly the same data;
01:04:01
otherwise, I don't know, maybe Numba is doing something that changes the precision of your data, but it's not the case here. So again, they did a pretty nice job. Now you can also test with numexpr. If you click here,
01:04:23
you can see the documentation reference. It's a different library and it works in a different way; basically it has one main method.
01:04:40
There are other methods, but the main one is evaluate. You can send the expression to evaluate, passing some parameters if you want; if you don't send any parameter, it will take them directly from the Python scope,
01:05:01
but here, to improve the readability of the code and understand better what we are doing, I'm explicitly sending the NIR and the red parameters, and you have this optimization argument. If you click here,
01:05:22
you will see the different arguments that you can send to evaluate, but basically doing it again, it's the same thing: you have an expression that will be executed.
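And a hedged sketch of the numexpr variant: the whole expression goes in as a string and is evaluated by numexpr's own multi-threaded engine; passing local_dict is optional but makes the inputs explicit:

```python
import numexpr as ne
import numpy as np

nir_data = np.random.rand(500, 500, 20).astype("float32")
red_data = np.random.rand(500, 500, 20).astype("float32")

nirv = ne.evaluate(
    "((nir - red) / (nir + red)) * nir",
    local_dict={"nir": nir_data, "red": red_data},
    optimization="aggressive",   # the optimization argument mentioned above
)
print(nirv.shape)
```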
01:05:42
And again, yeah, in this case the result is not impressive; I will check the benchmark here. So yeah, it's a bit slower. Again, I did it with different areas.
01:06:02
If you increase the area, probably, not for sure, you will have a difference; maybe here in the Colab there is some limitation in the runtime. But when I executed it on my local machine,
01:06:22
I got a similar time to Numba. So yeah, it's important to test, and if you run it on your local notebook you will probably have a different result. And again, I did the evaluation here
01:06:43
and I checked if the values are exactly the same. And now, yeah, here, again,
01:07:01
everything that we did here was for the 20 images. If you look at the output, we are calculating this spectral index for all the layers.
01:07:23
So it's not just the first image or the second image. Again, it's a three-dimensional structure and we are calculating this vegetation index considering the NIR and the red band, deriving the vegetation index for the whole structure.
01:07:44
In this case, it's plenty of observations over five years. I created this helper function here just to plot the data, and here you can see the vegetation index for the year 2000.
01:08:01
And I think we will have 2002, because we are accessing just the first five years. So basically this data structure has all the,
01:08:23
yes, it has all the results. And here, to finish, what we will do: considering the layers for which we calculated this vegetation index, we can save them.
01:08:43
I will use the save_rasters function again. We will have a base raster and the output files; I will send the vegetation index structure here. You can see that it's a three-dimensional structure
01:09:01
with 20 dates, and I will send the window that we defined. It's important to send the window here; in this case not really, because it has the same dimensions, but if you use a bigger base file,
01:09:21
like a mosaic for the whole of Europe as we did, it's important to send it. Here I'm writing the files, and now I have the vegetation index, and you can download these files
01:09:42
and check them locally in QGIS. So yeah, basically we tested all the libraries, and in our case here the winner for the numeric operation was Numba.
01:10:02
Again, this is for the Colab environment; if you test in a different environment, considering the hardware, it could be different. If you change the size of the data input, it could also be different. And for the array reduction, the winner was bottleneck.
01:10:24
Let me check the chat. I hope you are still with me, and yeah, that's my last slide. Sorry about the problem with my workstation.