
GDAL/OGR and PKTOOLS for massive raster/vector operations (Part 2)


Formal Metadata

Title
GDAL/OGR and PKTOOLS for massive raster/vector operations (Part 2)
Title of Series
Number of Parts
27
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Year: 2020
Production Place: Wicc, Wageningen International Congress Centre B.V.

Content Metadata

Subject Area
Genre
Abstract
GDAL and PKTOOLS are powerful command-line utilities mainly used for raster manipulation and analysis. In this workshop, we will explain the main principles and philosophy behind these tools by demonstrating simple geodata processing: raster cropping and re-projection, image masking, spatial and temporal/spectral filtering, and image classification. We will explain how to maximize computational efficiency and process raster data faster by building routines that save temporary raster outputs in RAM and use VRT files for tiling operations in a multicore environment. We expect basic command-line knowledge (any language is fine) and general know-how of geospatial data processing. Participants should bring laptops with 30 GB of free disk space, and we will assist them in installing the OSGeo-Live virtual machine.
Transcript: English (auto-generated)
So, this afternoon we are going to address pktools. We will stay with the same material as this morning, so we can keep working in the same framework and follow the same exercises. You can open your terminal again; I'm going to open mine and take a moment to set it up. In the meantime, while I fix my terminal, do we have any questions about this morning — some function you were happy to see, some potential application you have in mind? Let me know.
Not really, in the sense that it's a virtual machine, so you always have two operating systems running. As you can see, your RAM is going to be split between Windows and the Linux operating system, so it's very good for testing, trying things out, and teaching. But if you are going to use it in a production chain, then it's always better to switch completely to a Linux installation, and with OSGeo-Live you can do a full installation of Linux. Any other questions along these lines? I'm going to continue — let me also make the terminal font bigger — with pktools. We have been looking at GDAL; for pktools, you can quickly Google it and you arrive at the pktools web page. There is an installation procedure, but the package is already in the Ubuntu repository, so you can install it directly — very easy. Once it's installed, you have all the different available tools listed there. They all start with the prefix pk and, like GDAL, they try to tell you with the command name what they are going to do.
Some of them are really similar to what GDAL does — for example, gdal_polygonize goes from raster to polygon. The main ones were written when GDAL was already in development, so Peter developed these tools while GDAL was also evolving. Some really overlap, but others are unique and cover whatever you cannot do with GDAL. For example, somebody was asking me about mosaicking — what I'd call overlapping different tiles and getting the mean of the pixels, or the maximum value — or computing a histogram of the image, or extracting values, getting the average inside a polygon with pkextract, plus spatial and temporal data filtering, image filtering, and aggregation and disaggregation.
We'll see all these operations in the examples. Let's click on one of these and look at the syntax; for example, we click on pkfilter. It's written in C++ and compiled, so you can run it directly from bash. All of the tools start with pk and then take their options; pktools always requires -i for the input and -o for the output, and everything that is between brackets is optional. In this case, if we want to filter an image, you need to identify the input and output, and you also need to specify the dimensions of your filter. Down here you have all the explanation and also some example commands. For example, this one filters in the z dimension, so be careful: the filter can act not only in the x and y dimensions but also in the z dimension — for example moving through the temporal dimension one pixel at a time — and here you can also choose which kind of filter to apply. Let's look at this one, which is a bit easier to read: input and output, I'm going to apply a three-by-three moving window, and use the dilate filter for the filtering. Over here you have all the different filtering techniques you can apply — there are plenty of them — and the beauty of these is that they work in the dx and dy dimensions but also in the dz dimension, so along the temporal line. In a few words, you can even filter in a cubic space. Remember the data cubes that were presented a few days ago by Marius? In the same context you can use this filtering, and it's super fast, really efficient — it's amazing. You can also get the help here: if you run pkfilter --help, you can see all the different options, and you can see, for example under the -f option, that you can apply all these filtering techniques — dilation, percentile, mean, median, minimum, maximum, density, count, majority, sum, proportion, and so on — different filtering methods, some a bit complex. It's quite involved; you just need to dive in and read everything carefully. But we will see some exercises that really help in understanding all of this, and quite often there are really nice examples in the manual — the descriptive manual works very well, and sometimes you can find even more here. For example, here you can do a typical three-by-three window, but rather than square it's going to be circular, so it uses fewer pixels than the full square; and of course if you increase the size, it's not an 11-by-11 square anymore but an 11-by-11 circle. Again, if you use dx and dy you are filtering in the plane, but if you add dz, it's the z dimension — so in reality you can even calculate the mean of all the pixels in one sphere. These are used quite a lot when I do image composites. And something very cool: several satellite products, like MODIS and Landsat, have a QC layer — a quality flag layer — so you can even say: do the filtering, but if this band has a QC equal to three, don't count it. That is super useful, and you will see how fast it is and how it works in combination with GDAL. It can also do wavelet transformations — I have done it; it's complex, but it's possible — and you can build up splines, not only in the spatial domain but also in the temporal domain. So pkfilter and pkcomposite, in combination with CDO, the Climate Data Operators, let you do whatever you want in a multi-dimensional temporal framework. All of them are in C++, so super fast. The script I used today during my plenary talk, on one of the last slides — that is how you use GDAL, pktools, and CDO for this kind of operation. I really try to put emphasis on learning these tools because they are really promising.
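The filtering commands described above can be sketched roughly like this — the file names are hypothetical, pktools must be installed, and the flags should be checked against `pkfilter --help` for your version:

```shell
# Install on Ubuntu/OSGeo-Live (as mentioned above): sudo apt-get install pktools

# 3x3 moving-window mean filter in the x/y plane
pkfilter -i input.tif -o smoothed.tif -dx 3 -dy 3 -f mean

# Same idea with a larger kernel, circular instead of square
pkfilter -i input.tif -o smoothed_circ.tif -dx 11 -dy 11 -circ -f mean

# Filter along the z (band/temporal) dimension of a multi-band stack
pkfilter -i stack.tif -o smoothed_z.tif -dz 3 -f median
```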
Let's look at the syntax once more. Before, we already used gdalinfo: GDAL and then some command, right? In that case it was gdalinfo and then some flag, and the flag is identified by the dash. If you run gdalinfo and press enter, everything that is between brackets is optional — you can skip it. Over here it's the same: pkinfo, for example, also takes flags, and you can get the help by typing --help. You get the same kind of syntax, where everything shown like this is specified as an option. So, as I said before, if you do -mm with the input file, you get exactly the same minimum/maximum as with gdalinfo. Now, I don't have that image here; let me move to the right directory, which is up at the top, and we can continue. Before, for example, we were running gdalinfo — on this one, say. These are VRT files; let's see whether pkinfo can also read a VRT. I think so, though I don't remember if I've tried it. In this case I have to put -i, because the input is always identified by -i. Yes, it gives me the minimum and maximum directly. That's why it's really the brother of GDAL: it follows the same way of thinking, and some options are practically the same. Now, going back, let's start learning why these numbers are different. By default, gdalinfo does not include the no-data value in the minimum and maximum, whereas with pkinfo you have to specify what the no-data value is. So if you pass this long number that was reported here as the no-data value, now the minimum and maximum should be exactly the same. You have to adapt to how each developer works: gdalinfo says, whatever is no-data, don't count it; pktools says, you have to tell me which number you don't want in the minimum. That is a shift in mindset, and sometimes you just have to try.
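As a minimal sketch of the difference just described — both the file name and the no-data value here are placeholders:

```shell
# gdalinfo skips the declared no-data value when computing min/max
gdalinfo -mm input.vrt

# pkinfo needs to be told explicitly which value to ignore
pkinfo -i input.vrt -mm -nodata -9999
```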
It's not always written down; you need to try and understand what the tool is doing, so always pay attention to this. And again, everything with a dash is an option. For example, somebody was asking me how to get the bounding box, and this is a good example: how can we get the bounding box of a raster file and then use it to crop or to plot another one? The -e option gives you the extent — ah, sorry, that one reads the extent from a vector file, so there is probably another option. Anyway, if you ask for the upper-left corner of the bounding box and run it, it gives you the upper-left coordinates. Here I get some error; let me add the other corners, because usually you want all of them. Of course, these are tools in development, so sometimes you find a little error. I did not investigate this one, but it's still good to use them and see. For example, the number of rows and columns — let me check this one — yes, the number of lines and number of samples is working. So you can query and compute a lot of things; be sure to really explore the help file here as much as possible, and all the different tools.
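A hedged sketch of the queries mentioned — the file name is hypothetical, and the exact flag names should be confirmed with `pkinfo --help`, since, as noted, the tools are still in development:

```shell
# Bounding box of the raster (upper-left and lower-right corners)
pkinfo -i input.tif -bb

# Number of samples (columns) and number of lines (rows)
pkinfo -i input.tif -ns -nl
```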
pkcrop, for example, does exactly the same as gdalwarp and gdal_translate, so you will probably never need to use pkcrop — I don't use it. But you can also do something a bit more complex, like image classification with support vector machines and neural networks, and that is very appealing. One of you pointed out to me that you can run pkextract not only with a point file but also with a shapefile, and so on. So let's spend a bit of time learning how to search for a command. When you think, I need to do this — how can I do it? — you have to break the work into simple tasks in order to find the right command; you will not find one command that does everything you want. We've seen cropping, warping, extraction of points. Anybody else — some GIS operation that you usually do? Can it be done? Any ideas? The Spanish corner over there — any GIS operation that you usually do?
Okay, compositing — I already explained that one. Something else? Euclidean distance between raster bands? No, not that one directly. But, for example, if you want a Euclidean distance across three files, you can write the Pythagorean formula in gdal_calc.py and get the Euclidean distance for three files; for a fourth file, you have to think about how to build it up. So again, some things are not so direct — you have to think about how to do them — but most of the time, with three or four of these commands, you can get almost everything. Almost. Somebody else? Stacking layers — yes, in this case you can do it; we saw it today with the VRT, and the VRT is the most efficient way: if you have to maintain the stack as a stack, keep it as a VRT, and from this VRT you can then also compute, for example, the sum of all bands. In that case you can use pkstatprofile — you will see it in the exercises — or pkcomposite, or some other tools. Anyone else?
Yes, okay — you can run different classifications and then, for example, pkdiff. I haven't used pkdiff much, but with it you can identify the differences between two raster images. If you want to do change detection in the profile space, like what was explained with BFAST — no, that is a really specific algorithm that you would need to code in C++. But you could use pktools as a framework to add it, because fitting a spline between points, between bands, and doing BFAST is the same concept: all the bands are there, you just have to write the C++ code to add that functionality. So some functionality is not there — the very specific things are not there. Anyone else? Peru — I don't remember all your names, so it's easier by country.
Reprojection — we did that with gdalwarp. So let's go directly to some examples. Nobody asked me: can I compute a histogram of an image? Do you use image histograms? For example — especially for those of you using classical statistics on your imagery — the predictor should be normally distributed, so you can check whether your image is normally distributed; then you apply a transformation and check again whether it has become normal. We will see all these kinds of operations. Let's go to the exercises, scrolling down on the same page. Very good question — you can do that with pkcomposite; remind me when we get to pkcomposite. Sometimes you want to know when a specific value in a time series is reached: calculate the maximum, but also tell me on which day it occurred. I will go through it in the exercises — remind me as soon as we reach pkcomposite. Let's work from here, step by step.
As before, I will copy and paste. Today I was explaining how to mask, how to translate values — you remember that I ran this command and created a new file where, not everything, but anything meeting some condition was changed, and then I compared back. I told you that is a bad way to do it, though useful in some cases. Now let's see how I would do it with pktools. In GIS and remote sensing there is a concept called masking: masking is nothing more than using another layer to mask out all the pixels underneath it. You have mainly two commands for working with masks: pkgetmask, which is the one that creates the mask — be careful — and pksetmask, which is the one that applies the mask to another raster. For example, I have my Landsat image and I know that all the dark areas are water — below a certain threshold it's water — so I can use this command in particular. I compress — again, the same compression as in GDAL — and I say: all my pixels that are between one and two, set to one, and the rest to zero. So everything between one and two becomes one, and everything else becomes zero. Now, this output is going to be binary — look, I'm already thinking in that direction: it will be zero or one, so which data type am I going to use? Byte — we link that to today's explanation of data types. Now if we open it in OpenEV — oops, I get an error; I forgot the -i. There we go: from my original image, I have created a mask. Sometimes you see nothing in OpenEV because you didn't stretch the values; if you stretch, now you see it. This was my first image, and now all these pixels have become one or zero — this is my mask, based on some values that I chose; they are not scientific values, I just picked them for the explanation. So I can build up different masks. Be careful with this exercise — I want to emphasize that it is not always one inside the range: I can invert it, I can put zero inside the range and one, or no-data, outside. That's why you have maximum flexibility to create a mask where the value is whatever you want. So now suppose you create three masks based on different values: mask A, B, and C — three binary maps, zero and one, each in a different location. Now I want to apply these masks to an image. How? With pksetmask. I'm going to use the first mask with its no-data value, so every pixel where the mask is one will be converted to minus nine. What am I doing? I'm masking out the pixels I don't want by giving them a specific value — in this case all negative numbers, so that I know every negative number is masked, but one value comes from water, another from cloud, and another from buildings, for example. That way I can later retrieve which mask was applied just from the number. So now I can run the full computation.
Copy, paste — now you can see I'm working with four images: one input and three masks. Oops — I hope it's not a problem; probably there is some error in the copy-paste. One second while I find the line; let me open this. I think it's just a copy-paste issue — copying from the web is not really a nice way to work, so it's good to open an external script editor. Let me see if I can copy from there without problems. Maybe it's because I didn't run the full computation — probably because I didn't run all of this — and now it should work. So now it's creating the three masks, and now we can use pksetmask — it should be this one; I hadn't run it before. As you can see, I have one image and I've applied three independent masks. Now, if I open all the images, we can see — look, I can inspect each one of these, as I mentioned. You can see, for example, that here all my pixels have been changed: these areas became minus nine — you can see the pixel value down at the bottom as I move the cursor. This one became minus ten, this one minus nine, and somewhere here minus fifty, this one minus five. So each area has taken on a different pixel value, and you can play with this however you want. That's why I gave you this example — maybe a bit complex to understand, but it shows you the maximum flexibility: when I mask something out, I can assign any number. It is not a fixed default that masked pixels become zero — which would be a problem if zero is a valid value in your data. It's really flexible, because you can mask out and put in whatever you want. Is that cool? Can you think in that direction? Seen as a single command line, maybe it doesn't look very impressive, but try to apply it to a real system: MODIS, for example, has a QC band — a quality-control band — with different values for the four or five classes, depending on the product: one is good, two is semi-good, three, and so on. You can use that band with this command and create something specific. This is how you change pixel values in an image with a masking operation. And of course, I'm doing it now for one image; you would do it in a for loop, even in a multi-core environment.
There were several questions about compositing images: can I make a composite? There are two main commands for making a composite, pkcomposite and pkstatprofile; we will see the difference. You make a composite when, for example, you have MODIS tiles or Landsat scenes that overlap, possibly in different positions — same pixel size but different areas — and you want to overlay all of them and retrieve, say, the maximum value or the mean across the temporal series. That is when you use pkcomposite. But suppose that during this operation you want to exclude all pixels with a certain value. In this case, for example, I can create a specific mask for the Landsat image — this one is going to be my mask. So I created just a mask: I have 12 bands, and everything under the mask will not be considered; it will be masked out. The operation will therefore run over this area across the 12 stacked layers, because I used 12 months. I do it over here with the second command, pkcomposite. Let's see what this command does, step by step. First of all, I need to list my input files, which I do with this list. Now I have my 12 bands to composite, but I also want to use my mask and be sure that, during the mean, the mask's no-data, labeled as zero, remains zero. And I want to create another output that, rather than the mean, is the standard deviation across all these bands, using the same mask — but why do I put minus one here instead? Just for fun, compared to the other: can a standard deviation be less than zero? No. So if I use a negative value, I can immediately recognize that anything less than zero is a no-data value in the standard deviation output. Now I can run this command without problems; it takes 12 bands plus one additional mask layer. And look how fast it is — these are big files, something like 6,000 by 6,000 pixels; we can check. Usually composites are really computationally intense, but this way they're fast, and it also deals with gaps: if some tiles do not overlap continuously, it takes that into account and makes the no-data transparent. Now you can see immediately in OpenEV the value of the standard deviation everywhere — this is the standard deviation of my 12 months. And over here — actually, I opened the mean, the monthly mean. In the monthly mean I set the value equal to zero wherever the mask was applied, so I keep no-data there. And then you can build up your composite however you want — super flexible: you can use any no-data value, identify any value for masking, and apply any operation.
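A sketch of the two pkcomposite runs just described — the twelve monthly files and the mask are hypothetical names:

```shell
# Build the input list: one -i flag per monthly raster
INPUTS=$(for m in 01 02 03 04 05 06 07 08 09 10 11 12; do echo "-i month_$m.tif"; done)

# Monthly mean; pixels where the mask is 0 stay 0 in the output
pkcomposite $INPUTS -cr mean  -m mask.tif -msknodata 0 -dstnodata 0  -o mean.tif

# Standard deviation; -1 marks masked pixels (a stdev can never be negative)
pkcomposite $INPUTS -cr stdev -m mask.tif -msknodata 0 -dstnodata -1 -o stdev.tif
```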
You can even go further — a question for you: I computed the mean, but now let's compute the max instead, and I also want to retrieve in which month the maximum occurred. The option should be -file — let me just make sure — yes: file, write the number of observations or the sequence number of the selected file. So if I do -file 2, I'm going to get — we will check — one output with two bands: the first is the maximum value, and the second band is the month in which that maximum was achieved. Let's see. Now, even without opening the file, I can check whether it's correct: the first band gives me the value, and the second gives me the observation — the month in which the maximum was retrieved. Here we have to think about how the developer works: why do I get values from zero to eleven? Most people program in C and C++, where counting starts from zero, so in reality zero is January and eleven is December — pay attention to that. So this is going to be my map, with the second band — one thing I didn't mention before: if you right-click on this file, there is a slightly hidden option that lets you view the second band. If you move around, you can see there are different numbers; it's going to be 11 almost everywhere, but in a few pixels it will be another number — or maybe it really is 11 everywhere. Just to check whether there are other file numbers: with pkstat and -i, I can get the histogram of this file, and I can specify the band number, which in this case also starts from zero. Look — I can immediately see that almost all my pixels come from the December file, probably because December is the hottest, and the zeros are pixels that come from the adjacent month — January versus December. So, as you can see, in just a few lines, without even opening the file, I can verify it from the command line — even with a command I had never used. Be careful to play in this direction. We will see more below.
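The maximum-plus-month trick can be sketched as follows — hypothetical file names, and the composite-rule name may differ between pktools versions, so check `pkcomposite --help`:

```shell
INPUTS=$(for m in 01 02 03 04 05 06 07 08 09 10 11 12; do echo "-i month_$m.tif"; done)

# -file 2 adds a band holding the 0-based sequence number of the selected input,
# i.e. which month (0 = January ... 11 = December) produced the chosen value
pkcomposite $INPUTS -cr maxband -o max.tif -file 2

# Histogram of the second band (band numbering starts from 0) to inspect the months
pkstat -i max.tif -hist -b 1
```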
There is also another command, pkstatprofile, which works in a similar way. The difference: pkcomposite works with bands that can be in different locations, so it's a bit slower — slower compared to pkstatprofile — because it has to work out where each input lies. pkstatprofile instead works with tiles that are already aligned, so it already knows, and you should have done a good job of aligning all of them. You can again get the mean and the standard deviation, with gap handling, and you will get them as two bands of one output. So this one is much faster, because in a single read you get both the mean and the standard deviation, whereas before you ran one command line for the mean and another for the standard deviation, and so on. Of course, pkstatprofile does not accept a mask, but that's not a problem: you can run it and then apply the mask afterwards as a second step.
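For aligned stacks, the equivalent pkstatprofile call might look like this — the file name is hypothetical, with `stack.vrt` standing in for a 12-band stack of aligned months:

```shell
# One read produces a two-band output: band 1 = mean, band 2 = standard deviation
pkstatprofile -i stack.vrt -o mean_stdev.tif -f mean -f stdev

# pkstatprofile takes no mask, so mask afterwards as a second step, e.g. with
# pksetmask -i mean_stdev.tif -m mask.tif -msknodata 0 -nodata 0 -o masked.tif
```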
Now, a few questions: have any of you struggled making mosaics or composites in some other software — always saying, okay, I have all my MODIS images, I need to make a composite, how am I going to do it? With these composite operations, those struggles are not really necessary anymore. You can of course do it in Google Earth Engine, but if you really want to know what you are doing, this is the way to go; this is how you keep full control of what you are doing. Google Earth Engine, yes, creates a mean, but you are not 100 percent sure how. Whereas if you download from the MODIS archive and then do the mean with this operation, you can even check the math for one pixel: retrieve that pixel, take its 12 values, compute the mean — is it equal, yes or no? It's really easy to do this kind of benchmarking. Especially in the beginning, when we started using what Peter was building, I found a lot of bugs, but now it's quite stable.
Yes, please — yeah, look, here it's 12 bands. I've never benchmarked it properly, but this is a rough benchmark: it's 1000 by 1000 pixels, not super big, but already 12 bands, and you saw the pkcomposite — what was it, less than a minute? Thirty seconds. And if we check the RAM — actually we can check with top, or better, htop — look, this is the memory: it's now at 700, less than 800 megabytes. And when I run the command, look, it doesn't even take 20 or 40 megabytes — it stays under 100 MB of RAM. What does that mean? That even on this machine, to get up to 3000 MB, I could run 30 of these operations; so if I had 30 CPUs, I could work with 30 CPUs on this machine, because the RAM usage is very small. They are super efficient in this respect, so you can parallelize without problems. Okay — if it's a small raster object, whatever you do takes three seconds, five seconds; but as soon as the data gets bigger, the difference blows up,
and then you get two days instead of one hour. Yeah, perfect — if all your study cases from the lectures were done in class in R, even with the raster package, I would do the same, because they are study cases. But as soon as you move into a proper processing line, into massive geocomputation, then you need something efficient. For example, check this one — a very common operation: I'm going to do a mean aggregation. I have a file and I want the mean over a 10-by-10 window. If I add this flag here, my resolution becomes 10 times coarser, so I aggregate the pixels by the mean; if I leave it out, it's just a moving window. And for a moving window it's actually better to have an odd size, like 11 here. But I don't want a square moving window, I want a circular one — and if I also add dz 11, I get the mean of a sphere moving through my multi-dimensional array. So the data cube — the digital cube we were seeing the other day in Marius's talk — you can do exactly the same with these tools, in a super-efficient way. And then, over here, you can define the no-data: all the pixels that are zero I don't want used in the filter, so I put zero, but you can put any number — that's the flexibility — and of course the input and the output. You can see this one as well — the aggregation — quite fast, super fast:
look, less than maybe four seconds. This one is a bit slower. Why? Because I don't aggregate 10 by 10; I do a moving window and maintain the same pixel resolution. OK, so that one was faster. Here I keep the same pixels, but the value of each single pixel is the mean of a circle of 11 by 11 values, without changing the pixel size. And if you check all the options (bravo, yes), you have the full list of filter functions. Here I put a mean,
but you can put a percentile; there are plenty of them, OK, you can put really a lot of functions. So imagine: you put percentile, so it's calculating the percentile over all the pixels in the sphere; you put percentile 95, and it's going to give you the value at the 95th percentile in that sphere, and it moves on. So you can do a lot of operations in this way.
Is it influenced by the previous sphere or circle? No, it should not be. Yeah, good observation.
Why is it running on two bands? Probably because I have two bands in that file. OK, and again, I just showed a few options, but there are plenty more you can play with. OK, so something very important that we saw before, that you asked me about:
OK, see the histogram of an image: in the histogram I have the pixel values and how many pixels have each value. OK, so for example, this one. Can you tell if it's normally distributed or not? More or less, OK. But be careful: this one is also giving you values where you have zero observations. OK, to avoid that you can set src_min: you can put src_min equal to 60 and start directly from there. Actually you have to put 61, look, because the limit is inclusive, so with 61 you avoid all the 60s. So you can see it's more or less normally distributed, and you can use it in classical statistics. This is a text file, so:
can I do the sum of the pixels? If I want the sum of all the pixels in the image, can I do it? Can you tell me the math, the sum of all the pixels in the image? This is math, come on guys. What? Dollar one, column number one, times what? Times dollar two, column two. OK, so if I print this, I get, for each value, the value times how many pixels have that value. So now I just need to sum up these numbers. OK, so I can do it with a second instruction, sum += $1*$2, and this gives me the sum of all the pixel values: I do the multiplication of columns one and two, I accumulate the sum, and at the end of the file I print the sum. So you can really play as you want. And of course I do it here for one image, but with a for loop you can do it automatically for all of them, and as you can see it's nothing, and you have the sum of all the pixels. OK, so you have to play with the different commands and combine them in order to get what you want; you have to think a bit, it's not a straight line. OK. You can also get that one if you have
files that are integer. If you have files that are floating point, of course you cannot really get a histogram of floating-point values directly, because it would be too big. So you can specify the bins for your image, how many bins you want, and it's going to give you, for each bin, how many pixels you have. OK, and of course here too you can specify starting from a specific value. OK,
cool? No? Sorry: if you use a mask, perfect. For example, if you apply a mask and set all the pixels under the mask to zero, then when you do the sum the zeros will not count, and so you get the value for your area without considering the pixels that you don't want. So if you play in this direction, you can find a lot of ways to calculate what you need. Cool,
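The sum-of-all-pixels trick can be sketched in two lines of awk. The file hist.txt here is an invented stand-in for the real pkstat histogram output, just for illustration.

```shell
# Stand-in for a pkstat histogram: "pixel_value pixel_count" per line.
printf '1 3\n2 5\n4 2\n' > hist.txt

# Multiply value by count on every line, accumulate, print the total at the end.
awk '{ sum = sum + $1 * $2 } END { print sum }' hist.txt
# prints 21  (1*3 + 2*5 + 4*2)
```

Wrapping the same awk line in a for loop over many histogram files gives the per-image sums automatically.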
something that I also use quite intensively is pkreclass, for the reclassification of an image. When I say reclassification, I don't mean image classification, that one is a different technique. Reclass is when 2 becomes 4, 5 becomes 7, 8 becomes 10, and so on. So how do I do it?
I can get my histogram, like we said before, and I can exclude zero with this one. So I have my histogram of the image, OK? Pixel value here, and number of pixels. So now I want a text file that is going to reclassify these values, for example by some threshold, into 0 and 1: less than 75 becomes 0, and 75 or more becomes 1. OK, here I put 0 and 1, but you can put 0, 1, 2, 3 and 4; you put more thresholds, and so on. So how do you create it? You create a lookup table and you store it. This lookup table is nothing else than: 86 becomes 1, 87 becomes 1, and so on. And then with pkreclass you use this table and apply it to your image.
So this is very important, and here is a really classical example: you have land-use classes and you want to work with only forest, but you have five kinds of forest and you don't need the distinction; you want all of them together as forest. So if forest is classes 1, 2 and 3, you say: OK, 1, 2 and 3 become 1, and 4 and 5 become 4. This is how you do it. Now my image is going to have just two classes rather than all these pixel values.
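The forest example can be sketched like this. The lookup-table part is plain shell; the pkreclass call at the end is commented out, its flag name is from memory, so treat it as an assumption and check pkreclass --help.

```shell
# Lookup table for the land-use example: forest classes 1,2,3 collapse to 1,
# classes 4 and 5 go to 4. One "old new" pair per line.
seq 1 5 | awk '{ if ($1 <= 3) print $1, 1; else print $1, 4 }' > reclass.txt
cat reclass.txt
# 1 1
# 2 1
# 3 1
# 4 4
# 5 4

# Then apply it to the raster (assumes pktools is installed; flag from memory):
# pkreclass -i landuse.tif -o landuse_2class.tif --code reclass.txt
```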
OK, so now it is just 0 and 1. This is how you reclassify an image. And of course you can even build the table manually, in the sense that you create a file and you say: echo, all the 1s become 12. And this is going to be my text file; let's call it the same way, so I can reuse it. Then I can do it manually: 4 becomes, I don't know, let's put a big number, this way we can recognize it, and 80 becomes a big number also.
Now if we rerun my reclass and reopen the file, I can immediately see that some pixels, the ones that I specified, become this big number, 166, and the others stay the same because I didn't put any change for them. OK, this is how you can play, and this is how you change pixel values in a raster, in a TIFF, without the conversion to txt that I mentioned before. But it's good to know all the different options. Please? No, actually no, this one you have to check in the help.
You have to check in the help, I'm not 100% sure. Because look, these values are the same; so if you don't keep them the same, probably yes. Let me see, because you have the -c flag, I think. I remember with this one you can put each single class manually, without creating a lookup table: 1 becomes 2, 3 becomes 6. So I'm not sure about ranges; I don't know if you can put a range, I never did it, but it may be possible. If not, you create your lookup table without ranges, just as a full lookup table. OK, anyone thinking of or ready to use pkreclass and pkstat with the histogram? These are very cool tools. Something else extremely useful: we saw gdallocationinfo for point queries,
but we also have zonal statistics for vector files. So you have a vector file, you want to overlay it on a raster and get the mean of all the pixels inside each polygon. You can do it: for example, you can directly output a shapefile, and you can also output other formats. You can also specify the data that you don't want: get the mean, but do not include this value, the typical nodata. OK, this one, yes. So now I'm going to calculate it. This is a bit slow, just a bit, because whenever you cross a raster with a vector it is more complex:
one is a matrix, the other one is a vector, so it has to do a lot of intersection work. Now, if I want to see this new file and check the attributes: you saw this morning, for a shapefile, ogrinfo. So: ogrinfo, name of the file, then -al and -geom=NO, this way you skip the geometry and see only the attributes. Now it lists all the attributes, and you can see that new attribute columns have been added: the mean, the standard deviation, and the minimum for each polygon. So this one was the input shapefile, and this one was the raster file. In other words, I open them in OpenEV, and with this operation I calculate the mean, standard deviation and minimum in these three polygons, and the attribute table of these three polygons now carries those values. And this polygon that includes nodata, I exclude it with the nodata flag.
So it's super flexible. For example, maybe you can even put, rather than a single value, a less-than condition; not sure you can, but you can check. Probably not. But what you can do is mask beforehand: you mask everything you don't want, set it all to zero, then srcnodata equal to zero, and you are sure it's not going to consider those values. So you can play a lot with these tools. You can also create a CSV rather than a shapefile,
and this is really appealing because the CSV can be read immediately later on inside the bash environment. So if I don't need the spatial information any more, I just need the ID of the polygon and my values: one, two, and three. This one I can then manipulate directly with awk to do some operation, or it's already a clean CSV that I can import into R to do my computation. OK, so in this way you enter R with just a table. This way, the RAM for R is going to be just the RAM for that table, and all the rest is free for your computation.
Yeah, and you can also do zonal statistics as a point extraction. Like gdallocationinfo, which queries single point locations, but with this one you can extract at point locations and even around the point, considering two, three pixels, for example. So not only the pixel where the point falls, but also the pixels nearby in a little buffer around it, and you can get the mean of those values. So it's really flexible in this context. And with a bunch of these commands, you can build up your chain as you want, OK? So this is what I wanted to cover today
for GDAL and pktools. You can see they really work together: I was reading output from GDAL inside pktools, and from pktools you can read output back. You can use VRTs; for example, in the pkstatprofile, maybe I didn't mention it, but over here I was using a VRT. You have 12 files, I create a VRT, and then I use the VRT to calculate the mean. So they really work well together.
Cool? Question: they seem to both do the same thing; in terms of speed, are they the same? Pretty much, yes. Because, first of all, pktools uses GDAL for reading and writing the image, so in terms of reading and writing they are exactly the same. OK, so sometimes it's just a matter of which flavor you know already, and sometimes one saves you bash manipulation. For example, we saw before that pkinfo with the min/max option gives you directly the line with just the minimum and maximum, while gdalinfo gives you everything, so you have to grep and then do a second step. So with that one, if you want just the maximum,
you pipe into awk and print column number 4 (I forgot the print at first), and this gives you just the maximum value. And probably, if I remember correctly, I don't even need to do this: if I just ask for the max, I think it returns the maximum directly, so I don't need any bash operation at all. So then it's a matter of flavor,
whichever one you know better. Sometimes you don't remember the pktools option but you remember the GDAL one, so you just use GDAL. The cropping, for example: pkcrop I didn't use so much; rather gdalwarp and gdal_translate, so I can do something else in the meantime. So pkcrop I never really use. But the other way around: for example, the pkfilter filtering I cannot do in GDAL, apart maybe from one filter that they actually added just one year ago.
Sometimes, if the two commands offer the same option, it's also good to compare them; potentially you can find bugs. Sometimes, especially in the beginning when I was helping to develop (I mean, I was using it a lot and he was developing), I was even using R to benchmark: OK, I get 9.7, is this true or not? OK. I'm using it quite intensively, so there are not any more bugs, at least they are not so visible. And yeah, please.
Yes. So in pkcomposite you have this flag, and in this case you can do two things. The easy one would be masking out all the bad pixels in the image: NDVI above a set threshold, so that you capture only the pixels that have a good NDVI, OK? And then you do the composite. Or, yeah, this one I think is possible too. This is also quite cool: they have the minimum and maximum value options. For example, if you say minimum 0.8, it's going to make the composite using only the NDVI values bigger than 0.8. Yeah, you have to check a bit.
You can use an internal band as a flag for nodata. This is very useful for the MODIS QC; I use this option quite a lot. With the QC, the one I was pointing at, you can also do a composite and then weight the composite according to the quality of the pixels, and this you do with the weight option. What is it? I just showed it before. Anyway, somewhere over here there should be the weight option, this one. It takes one value per input: so it's not a weight per pixel, but a weight for each single TIFF, in the same order the files have been provided. So you can put 0.8, 0.2: the first one is going to be weighted 0.8 and the second 0.2. So you can even do a weighted mean.
You can do scaling; the scale, actually, you can also do in gdal_translate. I use most of these options, but not all of them. And in the meantime you can also crop, and you can change the output resolution. So it's really complex, what it can do.
Anyone else? Other questions? Yes, please. So yes, the resampling: also here, at the same time that you build the composite, you can define the pixel resolution. So yes, you can resample, and you can also define how to resample, bilinear for example, while you make the composite. OK, so yes, you can do this kind of operation.
No, no, because everything that you print in the terminal is text, OK, and you are able to parse that output with awk, which recognizes spaces. Perfect, ah, yes: is space the default separator? Indeed. People that really work with bash almost never create CSVs, unless there is a need, like the name of a region that sometimes contains a comma. Otherwise it's always space separation. OK, no, the other way around: if there is a name of a city with a space in it, then that one creates a problem, and then you have to create a CSV. But if not, it's just tables of numbers, and space separation is the default. It is the same for sort, the same for sed, and some others, cut, paste: most of the time the default is space, so you don't need to specify. But you can also create a CSV. If I print $1, $2 like this, it writes the two numbers separated by a space; but if I put a quoted comma between them, it writes a comma, and that creates a CSV.
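A minimal sketch of the space-versus-comma point, on a throwaway two-column line (the polygon name is invented):

```shell
printf 'polygon_1 42\n' > table.txt

# The comma in print means "output field separator", which is a space by default.
awk '{ print $1, $2 }' table.txt
# prints: polygon_1 42

# A quoted "," concatenated between the fields writes a literal comma: instant CSV.
awk '{ print $1 "," $2 }' table.txt
# prints: polygon_1,42
```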
Other questions? Can I do...? Come on, you have to take advantage of this situation. Could you set up parallel processing to carry out operations on multiple tiles at the same time? Very good, I was waiting for that one, even if it's a bit outside the scope. You mentioned cluster computing in general just this morning.
Yeah. So there is this exercise; I think I cannot run it because the data are not here, but I can point you to it and we can go through it without running, because it's very short: cluster processing. So here, in cluster processing, you have one of the first exercises: transform a simple for loop into a multicore for loop. So if you click this one, cluster processing, and then transform, OK, let me show you this.
Yes. OK, let's cover this one. Again, I cannot run it, but we have everything. So for example, if I have several TIFFs, this is the typical for loop: for file in this list of files (you have, for example, three files), do this gdalwarp operation, where the file changes each iteration. So this is a typical for loop where you use only one processor. You can see it here: in my case I have two processors, very simple. This one processes the first file, then the second file, then the third file, OK, so only one processor is working. Most of the time, you have to know, whenever you run something, unless it's something very specific or you code it that way, you use only one processor. Also in R: unless a function is explicitly parallelized, most of the time they are single-threaded. So then you use xargs. xargs I use a lot, continuously. So how do you do it? You first list the files; be careful, the syntax is a bit complex.
First you list the files, so you have three files listed, OK. Then xargs, which stands for "arguments": I'm going to pass one argument into the xargs loop, using two processors. So this line is exactly the same as the one before, but the beauty of it is that xargs sends the first file to one processor and the second file to the second processor. As soon as one of the two is done, it sends the third one. And over here you can see the two processors working together, one file each, and then the third file is picked up by whichever processor is free.
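A runnable sketch of the same idea, with echo standing in for the real gdalwarp call (which is also why the file names here are invented):

```shell
# Three fake file names stand in for: ls *.tif
printf 'glob_2020.tif\nglob_2050.tif\nglob_2060.tif\n' |
  xargs -n 1 -P 2 echo "would run gdalwarp on"
# each line comes out as e.g.: would run gdalwarp on glob_2020.tif
# (with -P 2 the output order is not guaranteed)
```

-n 1 passes one argument per invocation; -P 2 keeps two invocations running at once.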
So the syntax is like the for loop, but you need to change this part over here. Yes, in this case the question mark wildcard matches 2020, 2050, 2060. OK, so with this one you list them again; I don't have the files here, otherwise I would run it, and it would list three files. OK, I can even try the other one: so this one, for example, can be changed into xargs. So again, this number identifies how many arguments we want to pass into xargs, in this case only one, the name of the file, and then how many processors.
OK, so if you go down, you can see that you can even pass two arguments. So you can have file name and resolution, file name and resolution, file name and resolution: this one is going to be the first argument and the resolution the second argument. So you build up your table, and then you run again, but in this case with two arguments. Actually, this one I can run.
Well, I don't have the files, so it's not going to fully run, but it builds the table. I think if I do like this, yes, ah, it is building it, you're right, just with a question mark because I don't have the files. But it creates the table: file one, resolution one; file two, resolution two; and so on. Then the first argument is going to be the file and the second argument the resolution. So on this cluster-processing page there are plenty of examples, even how to combine xargs with parallel multicore R jobs.
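The two-argument version can be sketched the same way: bash -c receives each pair, with the first argument landing in $0 and the second in $1 (file names and resolutions are invented):

```shell
# Two columns per record: file name and target resolution.
printf 'a.tif 100\nb.tif 250\nc.tif 500\n' |
  xargs -n 2 -P 2 bash -c 'echo "would warp $0 to ${1}m"'
# e.g.: would warp a.tif to 100m   (order not guaranteed with -P 2)
```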
So if you have a multicore system, rather than calling multicore inside R, you call it outside and you run parallel R jobs. These are really promising ways of working. And in particular I want to point out this last one, for people that have access to high-performance computing: over here you have a really nice example of how to work on an HPC cluster. Even if your cluster is a different one, it's very much the same, and they identify four independent ways of working. So if you want to go in that direction, this is where I explain very well how to work in a multicore processing environment in a very simple way.
So I think I covered quite a lot today. I will be around for additional questions, even now: just ping me, stop me. And yeah, it's very nice to be here and to be able to transfer this part of the knowledge, the part that comes before all the data modeling we saw yesterday and the day before. In this way you have the first part, the data preparation, the data pre-processing; and if you do this one properly, then for the second one, the data modeling over there, you have all the RAM free for the computation.
OK, cool. So if there are no other questions... yes, please.
Yes. So for the modeling part: for all the machine learning, lasso, whatever, the first thing that you need is a table. Then you create your model, then you have to make the prediction. For the prediction you need to import the rasters. But then you create small tiles, as Tom mentioned today, and you import one small tile at a time, which will be fast for the prediction. I mean, there are other ways to make a prediction, but even myself, in the end, I always do it in R; I didn't find another way to do it outside R.
Probably, as I mentioned in my talk, we are going to investigate whether Julia is ready for making predictions on TIFF files. I think so, even if I didn't check, because they use it for image processing, to recognize a cat, a dog, or whatever. Even if you cannot ingest a GeoTIFF directly and you lose the geographic information, you can attach it back with gdal_edit. So we are going to investigate doing the prediction inside Julia, but it's a work in progress; I may be able to tell you next year. Yeah.
And also something to consider: the prediction, the full-raster prediction, is something that you do at the very end, when your model is ready to go. All the testing procedure you do on the table: training and testing, training for building the model, and then testing. As long as you keep playing in R with only the table, R is going to be super fast. So when you arrive at the moment of: OK, my root mean square error is very good, I did cross-validation, variable selection, blah blah blah, all the different statistical techniques, then you say: OK, now I'm ready, I do the full prediction for the globe. That one, even if it takes two weeks, is not a big deal, because you know more or less that it is going to work, and in the meantime there are the little tiles: you can check each single tile as it comes out. OK, please.
Sorry, output? Yeah. In terms of the size of the text file? Yeah. I mean, if you have a file that is formatted differently from what the tools expect, do you have to treat and process it, do you have to convert it? OK, I think what you want to ask is: which is the file size that I am
able to manage in terms of TIFF. OK, I was mentioning: try to never pass 1 GB per TIFF file. Remember that TIFF has a limitation of 4 GB by default; beyond that you have to apply the BIGTIFF creation option to pass the 4 GB, OK? But as soon as it becomes more than 1 GB, it's even difficult to open with OpenEV, the rendering is slow. So try to always make tiles that are 200, 500, 700 MB; it's not a big issue. OK, and then, if you don't really need to open them, because you know it's going to be fine, you can pass the 1 GB. In terms of text files, they can be almost any size, because a text file you can read in chunks. Whenever you do a text file... yes, let's do this one.
Maybe it's too small, but anyway, this one is fine. Whenever you read a text file, if you head only the first 10 lines, that file can be 200 GB, because you are only reading the first 10, OK? If you do tail, it's going to go through the full file before reading the tail, so it will take a while, but it's not going to choke: it gives you the last 10 lines. If you try to load a big file like this in Excel, it's going to crash. In R, as soon as it gets to, I think, 1 million rows, it starts to become a bit tricky.
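The chunked-reading point is easy to verify with a generated file; 100000 lines stands in here for the 200 GB case:

```shell
# Build a "big" text file stand-in: one number per line.
seq 1 100000 > big.txt

# head stops after the requested lines, so the total file size is irrelevant.
head -n 3 big.txt
# prints 1, 2, 3 on separate lines

# tail has to reach the end of the file first, but still prints only the last lines.
tail -n 3 big.txt
# prints 99998, 99999, 100000 on separate lines
```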
Over here it's not a problem, because it doesn't open the whole file: until you go in, it opens just what you are asking for. Whereas in R everything is in memory, so remember this. There is now this package, data.table I think it's called, that improved quite a lot in the last three, four years, so you can also work with big tables. But as I mentioned, I try to enter R with a table that is ready to go for my computation. Some data cleaning, identifying outliers or the threshold percentiles that you want to get rid of, is quite complex, so do that in R. Calculating a mean can be done on the command line; the standard deviation starts to become a big formula. So it's sometimes a trade-off, which one is faster. Even with awk: when I started, as I said, 15 years ago, I was writing 200 lines of awk because R was not able to calculate the mean.
OK, now there is no need. As soon as you do this, print columns 4 and 2, or you put an if condition: if $2 is bigger than 24, print it. OK, it was missing something, anyway, $2; yeah, now it's checking the bigger-than. So this is the maximum of awk that you need to know: print, $1, $2, an if condition, that's it.
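That maximum-of-awk claim in one runnable line, on an invented three-row table:

```shell
printf 'p1 30\np2 10\np3 25\n' > vals.txt

# Keep only the rows whose second column is bigger than 24.
awk '{ if ($2 > 24) print $1, $2 }' vals.txt
# prints:
# p1 30
# p3 25
```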
You don't need to write thousands of lines of awk any more. But it's good to be able to shrink your data as much as you can. More questions? Spanish corner over there?
OK, good. I think we can wrap up. So as I mentioned, try to use it even if it's a bit hard at first: do it, do it, do it. And the DokuWiki over here is full of examples. Let me also point you, on the main page, to four links to web seminars that I did in May. The first and second one are pretty much the same as what I ran today; today there was some problem with the streaming, so we are going to replace those YouTube videos with this morning's recording, but they're pretty much the same. But something quite different is the GRASS GIS one, to start using GRASS in a really proper, really simple way. And also this one, the exercise on geocomputation for high-performance computing, which I explain very well there. Actually, for the Spanish people: you can send it to all your colleagues. I did it in English and also in Spanish, so you can send it to the Spanish-speaking countries in South America and all over the world. OK, so I think we covered several points. I will be here for additional questions, so ping me, and so on.
Thank you.