
Working with harmonized LUCAS dataset


Formal Metadata

Title: Working with harmonized LUCAS dataset
Number of Parts: 57
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Transcript: English (auto-generated)
Today we have got a session dedicated to the LUCAS data set. Let me wrap up what we have done yesterday in this Python group.
You have learned about the data sets developed in the GeoHarmonizer project. You have learned how to run machine learning for land cover classification, and today we will speak about the LUCAS data set. You learned how to use Landsat and auxiliary data for machine learning, and you have probably spoken about how to run random forest
or other ensemble methods. Whenever you want to do such an exercise, you are supposed to use some training data set, and this is typically the weakest point of the whole exercise. There are lots of satellite images like Landsat, Sentinel, PlanetScope, et cetera,
and there are many frameworks for machine learning models, but typically there is a shortage of good, representative training data. And this session is dedicated to introducing LUCAS.
The session will have six parts. Initially, I will try to discuss and introduce you to the LUCAS data set. Then we will show you how to access such data sets using a Python API, software we have developed in the GeoHarmonizer project, and we will run Jupyter notebooks.
Then I'll tell you about the LUCAS data set harmonization. Then we will have a use case on how to use such LUCAS data for validation: land cover validation or other land product validation. And then in the last two parts, we will discuss a little bit more
how to use the LUCAS data set with other products like CORINE, how to translate the LUCAS legend to CORINE, and possibly how to perform some class aggregation to do some research on the influence of the class aggregation on the quality indicators.
And finally, possibly I will show you that there is also a QGIS plugin to interactively access the LUCAS data. To follow the workshop practically, you would need the ODSE virtual box,
which I believe you have already set up from yesterday and it's all running. That's the first thing. Then you would need the eumap library, which is the same library as you used yesterday; you just need to update it from the repository by running git pull, that's it.
And of course the Jupyter notebooks, which we have prepared in the same directory as yesterday. There is a Python training directory and you will find a 05_lucas subdirectory with a bunch of files: three or four Jupyter notebooks, some Python scripts
and a sample data set for the LUCAS validation. When we get to this point in the presentation, I will guide you; I will show you where it is and how to update this. So don't worry at the moment. Okay, so this is what we should do.
Briefly, what is the LUCAS data set? It's a Eurostat, or European Union, activity. It stands for Land Use and Coverage Area frame Survey. Originally this activity started in 2000, when the Commission decided that we need some
real in situ data sets. They started with mapping, or monitoring in situ, the agricultural fields. They collected information about the crops grown, but then they decided to generalize this task, and since 2006 the LUCAS data set is about generic land cover, and it has evolved up to 2018.
The main goal is to provide statistical values for estimating how European Union land cover and land use is changing. That was the primary goal: to really monitor the changes. So LUCAS is not a one-off activity,
but a continuous one; it started operationally in 2006 and is repeated every three years. And there are many topics that it covers. Later on, I will show you what you can explore. I want to mention that it is, to some degree,
a unique ground truth, because people really go into the field with GPS and record the collected information. If you look at many earth observation tasks or research projects, they start with some idea: let's monitor wetlands, or let's do forest monitoring,
and you decide, okay, we can do it because there is Landsat, there is Sentinel-2, et cetera, so you have lots of images. You have got machine learning models or algorithms ready to run, but then when the activity starts, you discover that for such tasks
you would need some training data, and typically you don't have it. So what do people do? They decide, okay, we don't have training data, so we can develop it using better-resolution images, orthophotos, manual editing and visual interpretation,
which is, first of all, subjective, because it depends on the interpreter, how they decide to collect the points and how they decide to label them. And it's time consuming. Imagine how many points you can do per day: 100, 200? And how many points can you do
if you go to the field? I have been working on several such tasks, and we have mapped the whole Czech Republic with such field trips, and if you want to really distribute the points across the country, regularly or irregularly, we have discovered that we can do something like 25 to 30 points per day, no more.
And if you look at what you need for machine learning, it should be something like hundreds, better thousands of points. If you go to deep learning, it's about 10,000, 100,000 or millions. So every time you run something in machine learning,
you are short of training data, or you are not sure about the quality of the training data. This is the weakest point, and most people do not care about it, including me. This is the first project I have worked on where we dedicated a whole task to the training data.
And if you look at LUCAS, I would segment the European community into three groups. There are people who don't know about LUCAS at all, and there are people who know it exists; they go to the webpage, try to download some data, then get stuck and never use it.
And there are a few people who pass these barriers and have used LUCAS data for the benefit of their mapping. There are few; if you look at the scientific papers published using LUCAS data,
I believe there are only two, and a third one is in progress. Yet if you learn what the content of the LUCAS data is, it's really fruitful and probably helpful. Another aspect of the LUCAS data set I should mention: if you really start some project
and you want to go to the field, you would have the data for that year, but you cannot go back to previous years. And if your task is dedicated to change detection, you are completely lost. Since LUCAS started in 2006 and is repeated regularly, I would say there is a treasure in the data set
which is not yet fully explored and used for the benefit of projects. So this is probably the advice: use the LUCAS data set. And let me introduce how it is done. At some point in time, around 2000, the Commission decided: let's run operational,
repeated monitoring of the European landscape. They created a two-kilometre grid over the whole landscape, so every two kilometres there is a point which was visited at least once or repeatedly. They put these points over the orthophotos, which are produced as part of INSPIRE every three years,
and started with a simple photo interpretation. They assigned a land cover class to each point, based on which they run a stratification. There are 1.3 million such points over Europe, and the Commission decided to visit only
a subset of these every three years, based on the stratification. The stratification is designed to monitor both stable and changed land cover, so there are more than 50,000 points dedicated only to changes. After the stratification, the white points that you see on the screen
are the ones selected for the field visits. People go to the field with a camera, GPS and instructions on what to collect in the field. After they collect it, they put it into a database and some quality assurance steps are performed.
So it's a really huge project, and if you look at the whole European landscape, you see that many teams have to work on this project to really collect the in situ data in one year.
When they enter the field, they first take photos, four photos in the four directions. These photos are archived and you can download them, so you can check these photos even later. So there is one set from 2006, then '09, '12, '15, and now '18. They first collect the land cover information,
then separately the land use information and the management of the field. Then they collect some structural elements in the landscape; for example, if you are interested in forestry, you have got the height of the trees and other information about heterogeneous classes, and a 500-gram topsoil sample is taken
at every tenth point. So there are fewer of those, but even for soil science there is plenty of information, and they collect photos, as I said. So the task is not a five-minute one, it takes a while. The instructions on what to do in the field
have evolved from a roughly two-page form up to really long instructions with a diagram of what you have to do. It's tough work, but it's still running and you have got lots of data there. Now a little bit about the LUCAS land cover nomenclature.
The land cover nomenclature does not follow the classical European CORINE nomenclature, and it does not follow the LCCS; it has its own nomenclature. But if you look at the content at level one,
it pretty much covers all the landscape elements, starting from artificial land describing the urban classes, then cropland, woodland, shrubland, grassland, bare land, water and wetlands. So it's everything you would expect. At the second level, I wanted to point you to agriculture.
LUCAS started as an agriculture project, so you already see greater detail at level two: you've got several root crops, non-permanent crops, and so on. So it is already much more detailed than CORINE or LCCS. And I'm showing here the legend only for the second
level, otherwise you wouldn't see anything. There are 76 classes, so there's great detail. I don't know whether it's intended that you would use every class at this detail, but maybe your project would be dedicated
to forestry or agriculture; then you can subset these details and keep the other classes at level one, say. So these are purely land cover classes, but then you have also got,
what happened? Okay, the land use classes. Stop sharing.
Is it back? Okay, we continue. The land use classes are divided into four categories. There is a primary sector dedicated
to agriculture, forestry, et cetera; a secondary sector dedicated to energy production and industry; and a third one that is more about artificial land use. What is interesting is that there is a fourth one, which is dedicated to abandoned areas.
This is interesting information that is not yet present in any land cover map or land cover monitoring activity across Europe. If you search the database, you find that there are more than 50,000 points where the land use is specified as abandoned,
but there's no product which would really consistently monitor this topic across Europe. Sorry. So it is quite detailed. If you combine 76 land cover classes with 41 land use classes,
that's great detail. It gives you more than 1,000 combinations, which is a lot of information you can use. Now, if you look at the Eurostat web pages, you find lots of information on how to download
and use the LUCAS data and what the content is. You find that the primary data for all the years 2006, '09, '12, '15 are available. You can select the country you are interested in. You can download the PDF files with the instructions
and the field form, to see how difficult it is to really fill it in and collect the data. You can download the classification PDF, which presents the classification nomenclature, and you can try to work with the data.
If you navigate to the database, you find that all you are allowed to do is download the CSV files where the data are prepared, and you can load them into a database yourself, but other information is not provided and there is no running online database
that would serve such data sets. So let me introduce a little bit of what we have implemented in the GeoHarmonizer project. Here is a simplified architecture of the system we have developed. It starts with the persistence layer:
we simply downloaded the CSV files, put them in the file system and uploaded them to a Postgres/PostGIS database. But then we found that there are lots of differences between different files and different years, so we had to perform some harmonization process
to really make it smooth over the whole data set as it developed. On top of this persistence layer we have put the application layer, where there is the eumap package with a component dedicated to LUCAS, and there is the LUCAS Python API,
which connects to a WFS service. You will learn about these services in the following session. And there is the client layer, which is the interface you would use to get data from this system:
first of all through the Python API, running a Jupyter notebook or a simple Python script, or through the QGIS plugin. This is the system which is running online at the moment. Now it's probably time to start the first notebook.
So the first notebook is the 01 LUCAS access API notebook. There's a snapshot from it. If you go to the terminal and you navigate to the ODSE working directory,
there is a directory which has several subdirectories,
and you should go into the ODSE workshop 2021 one, which is a git repository, and run git pull in this directory. It should download the rest of the data you will need for this session.
Okay, okay, maybe it was updated for the other sessions as well; I see GRASS and others. Then you go to python and 05_lucas.
And there should be a bunch of files dedicated to LUCAS.
Okay, let me check the Zoom, if there are any questions. Valentina, are you there? Yes, there's a chat, but I don't see any questions.
So it should be fine. Let me go on. So you can start JupyterLab. If you start JupyterLab and navigate
to this Python training 05_lucas directory, the first notebook is 00 update virtual machine, where you find two steps: to update the eumap library
and to update the OWSLib library. So please run these two simple cells,
just to make sure that you have the latest version of the software.
Did you succeed in updating the eumap library?
Okay, can I proceed?
Just waiting for the lady. Are you there? No? Some difficulty, because some of the training material has changed.
Then probably, or maybe Alejandro, can you give a hand?
Okay, then it's running. Great. So the first notebook is the 01 LUCAS access notebook. If you open this one, it starts with importing the LUCAS classes developed within the eumap library.
So if you run this cell, it should import without any error message; then it passes. The first one, just Ctrl+Enter, or just run it, okay?
Yes, if I know how.
Okay, let's go on. So as I said before, this notebook is dedicated
to demonstrating how to use this eumap LUCAS package to access the database through the WFS service. So you have imported the important parts of it. Then the first part shows how to define a request
to be sent to the server. You simply create an instance of the LUCAS request class and you define a bounding box with coordinates. You can change the coordinates, but they have to be within EPSG 3035,
which is the Lambert equal area, the European definition. Then you can run a method build, which just prints the request. So if you run it and there's no error message, then you know how to set the bounding box
and how to build the request. So the request is defined. The second important task is to download the data from the server to your computer, physically into the file system. You would use the request for this.
You first need to instantiate the LUCAS IO class and then run the download method with the predefined request. The third line is there just to make sure that some points were downloaded:
it counts how many features, or how many points, are present in the GeoPackage. So if you run this, it prints some messages from the request, and it says that there are 5,000-something points
if you did not change the bounding box of the request. So now you can navigate on your computer and you would see that there is a downloaded file. And then, if you want to work with this data set, you may want to transform it into different formats.
You can save it to a GeoPackage, but you might want to work with geopandas directly and continue with the machine learning; you can pass the geopandas data frame to machine learning methods. Here it converts to GML
and prints the first five lines of the GML. The next one stores the OGC GeoPackage file
and prints whether the file exists. So let me run this. It says true, the file exists: the GeoPackage file, which is named sample.gpkg. If you look to the left, in the directory 05_lucas,
there is the sample.gpkg. So it's there physically.
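As a rough sketch, the steps just described might look like this in Python. The import path and method names are taken from the spoken walkthrough and should be treated as assumptions; the 01 notebook has the exact API.

```python
# Minimal sketch of the basic access flow: define a request, download the points,
# and save them as a GeoPackage. Import path and method names follow the notebook
# walkthrough and may differ slightly in the installed eumap version.
from eumap.datasets.lucas import LucasRequest, LucasIO

request = LucasRequest()
# example bounding box in EPSG:3035 (ETRS89 / Lambert Azimuthal Equal Area)
request.bbox = (4504276, 3020369, 4689608, 3105290)
print(request.build())            # show the WFS request that will be sent

io = LucasIO()
io.download(request)              # fetch the LUCAS points from the server
print(io.num_of_features())       # check that some points were downloaded
io.to_gpkg("sample.gpkg")         # store them as an OGC GeoPackage
```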
Now a little bit more about the requests you can create. First, you can filter by some properties. In the database we will use, first of all, the attribute nuts0, which is the country level for the European Union. We have set it here to CZ, i.e. the Czech Republic. For this you would need to use the OWSLib functionality.
First the request is created, or instantiated. Then you set the operator PropertyIsEqualTo, meaning that you are basically creating the WHERE clause of the SQL on the database side. But since you are connected through the WFS,
you have to compose the filter in the request this way. So the property name would be nuts0, which is the attribute in the database, the literal is CZ, and you can run build to print the result.
So you see the request with the filter is presented. Of course, you may want to combine attribute values in some logical way. Let's say you want to download data for the Czech Republic and the Slovak Republic,
to have a Czechoslovak Republic. So you have to add a logical operator, which is OR, and you can set the literal, the values of the attribute, as a list of values.
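A small sketch of such an attribute filter, assuming the request attributes described above (operator, property name, literal, logical) and the operator classes from OWSLib:

```python
# Sketch of an attribute filter on the country code (nuts0), as described above.
# Assumption: the request exposes operator / propertyname / literal / logical
# attributes that are turned into a WFS filter, with operators from OWSLib.
from owslib.fes import PropertyIsEqualTo, Or
from eumap.datasets.lucas import LucasRequest

request = LucasRequest()
request.operator = PropertyIsEqualTo      # effectively the WHERE clause on the server side
request.propertyname = "nuts0"            # country-level attribute in the database
request.literal = "CZ"                    # Czech Republic only
print(request.build())

# combine values with a logical OR: Czech Republic or Slovak Republic
request.logical = Or
request.literal = ["CZ", "SK"]
print(request.build())
```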
In this way you can combine any number of attributes and attribute values you want to get from the database. Additionally, you can of course filter by years. For that, you should know in which years
the LUCAS data set was collected. So here we go, for example 2006 and 2009. If you do not filter the years this way, you get all the years at once, but you might be interested in changes between 2006 and 2009 only.
The bounding box is the same, so you can run it this way. Here's the combination of the Czech Republic and the Slovak Republic for the two years. There is an attribute in the database, survey date, and we have added an extra survey
year attribute to simplify the filter. Here you can again run build, and the request is printed.
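Combining the bounding box, the country filter and the survey years might then look roughly like this; the `years` attribute is an assumption based on the walkthrough:

```python
# Sketch of combining a bounding box, a country filter and survey years.
# Assumption: survey years are passed as a `years` list on the request.
from owslib.fes import PropertyIsEqualTo, Or
from eumap.datasets.lucas import LucasRequest

request = LucasRequest()
request.bbox = (4504276, 3020369, 4689608, 3105290)   # example extent in EPSG:3035
request.operator = PropertyIsEqualTo
request.propertyname = "nuts0"
request.logical = Or
request.literal = ["CZ", "SK"]
request.years = [2006, 2009]      # only these survey years; omit to get all years
print(request.build())
```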
Then finally, you can define your own area of interest. In this case, through the WFS, you would have to pass it in GML format to the server. And I have to say that for the WFS service there is a limit of 190 vertices,
so your AOI should be defined as a bounding box or a simple, possibly simplified, polygon. There's an example in GML. Next, if you look at the LUCAS data set
in its current state, 2018, and if you really load it from Eurostat as it is, you would find that there are nearly 100 attributes in the database. That is because the project evolved and people wanted more and more information.
And you find out that you can group these attributes, the information collected in the field, thematically. There is land cover and land use, which is the primary information. Then there is one section dedicated thematically to Copernicus, especially for the high resolution layers,
some forestry information, some INSPIRE information and soil information. In order to lower the data traffic, so that you don't need to download all the nearly 100 attributes at once,
we have created thematic groups. You can select the groups: Copernicus, forestry, INSPIRE, land cover/land use, and we also combine land cover/land use with the soil data. So you don't have to download everything at once; you can, but you can also say:
I'm interested only in land cover and land use, and you get a subset. It doesn't mean that you would get only the land cover and land use information; there is some compulsory information such as the spatial coordinates and other metadata, like when the point was visited and so on.
So you can divide it; there is an attribute to group it. In this case, it will download only the land cover/land use data set for the selected bounding box.
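A sketch of selecting one thematic group; the `group` attribute and the "LC_LU" code are assumptions based on the groups listed above, so check the notebook for the exact codes:

```python
# Sketch of restricting the download to one thematic group of attributes.
# Assumption: the group is selected via a `group` attribute and "LC_LU" stands
# for the land cover / land use group -- check the notebook for the exact codes.
from eumap.datasets.lucas import LucasRequest, LucasIO

request = LucasRequest()
request.bbox = (4504276, 3020369, 4689608, 3105290)
request.group = "LC_LU"           # land cover / land use attributes (plus mandatory metadata)

io = LucasIO()
io.download(request)
print(io.num_of_features())
```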
Now, the LUCAS data sets have been collected repeatedly since 2006, and these are stored separately in the database. If you want to really aggregate the data, so that you have got one record for a given point
with attributes for the repeated visits, you should apply what we call space-time aggregation. There's an attribute, st_aggregated, which you set to true and run it. Then it will merge the data sets and create one space-time data set,
so you can, for instance, use it for change detection monitoring. There's another example which shows you how to combine different years, a bounding box and the spatio-temporal aggregation.
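Such a combination might look like the following sketch; again, attribute and method names follow the walkthrough and are assumptions:

```python
# Sketch of the space-time aggregation switch: one record per point, with the
# attributes of the repeated visits merged. Attribute names are assumptions
# based on the walkthrough.
from eumap.datasets.lucas import LucasRequest, LucasIO

request = LucasRequest()
request.bbox = (4504276, 3020369, 4689608, 3105290)
request.years = [2015, 2018]      # e.g. only the last two surveys
request.st_aggregated = True      # merge repeated visits into one space-time record

io = LucasIO()
io.download(request)
gdf = io.to_geopandas()           # continue e.g. with change-detection analysis
print(gdf.head())
```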
In the first example, I have aggregated all the collected years; the second is dedicated only to 2015 and 2018. Okay? So this is all for the demonstration
of how to use the Python API to access the online LUCAS data set. Of course, you can combine different requests and different filters to get what you need.
Now, let me go back to the presentation. So you know how to access the data; a little bit more about what the content is. I mentioned the thematic grouping already. But then, if you look at the data set,
it covers the whole European Union. It is collected on a 2 by 2 km grid, 1.3 million points, and it evolved from 2006 to 2018. It has not covered all countries since the very beginning.
At the beginning there were only 11 countries, I think only part of the EU-15, 11 at that moment, and 160,000 points were collected. As time goes on, you see that in the last two surveys there are 28 countries, and each survey there are more than 300,000 points
collected in the field. I should mention that all these 300,000-something points are in the data set, but not every single point was visited in the field. There are some rules in the definitions: if you arrive at a point
and it's not accessible by car plus half an hour's walking distance, you skip this point and do only photo interpretation. If you look at the database attributes, there is an attribute which tells you whether it was really visited in the field or whether it was only photo-interpreted from the orthophoto.
Most of the points were visited in the field, but a few of them are only photo interpretation. And as you see, not only the countries were changing,
but there were other changes too, and the project evolved. So we discovered very early on that there is a need for harmonization. The harmonization, maybe I should go through it; there are a few topics of harmonization.
First, if you look at this table, it says that there were only 20 attributes in 2006; now we have got 97 attributes. So the attributes were evolving: there were some new attributes and some that were removed, and we had to decide which to keep. We decided to take 2018 as the reference,
so everything is harmonized to the 2018, or 2015/18, version. Then the attribute names were changing. These are small changes, but you have to somehow decide which one to use.
The most difficult part of this is that not only the names of the attributes changed, but also their definitions. Typically, at the beginning there were some continuous variables; for example, initially there was LC1_PCT, which was the percentage of the land cover at the point,
meaning that you arrive at some point and you see it is not a homogeneous one, so you should say to what degree it is this type of land cover. It can be 90% grassland with a small patch of arable land, so you fill in the percentage
of the main class that you have assigned. It expresses to what degree the point is homogeneous or heterogeneous. This used to be a continuous variable, zero to 100%, which evolved into LC1_PERC, which is a categorical value from one to eight;
I think number seven means more than 90% of the land cover category you have selected. Given these different definitions, we had to somehow harmonize to one version. Data types were also evolving:
there were differences between strings and integers, and a difficulty in the data set, in the CSV files you download from Eurostat, is that there is a lot of no-data marked with the number 8. There is no definition for the no-data values; there are many 8s or blanks,
and you have to somehow harmonize them and decide what to do with these. So as the data set evolved, we developed a number of procedures for how to harmonize it. This is shown on this slide: the graph shows you the pipeline for the harmonization. It's a lot of steps. At the top are the data sets, the CSV files.
We start initially with importing the primary data. Then we found that there are some errors in the coordinates: some of the coordinates were swapped and some were missing. So we first apply a correction of the coordinates.
Then we rename the attributes with 2018 as the reference. Then we harmonize the values to the same ranges, the same definitions. Then we have to retype some of the values because of the different data types. After this is done, we merge everything into one table
and we apply the space-time aggregation, which I demonstrated in the previous notebook. So these are the steps you have to take in order to use all the LUCAS data sets from 2006 to 2018 and have them in one harmonized database
usable for your machine learning project. We think this was a big barrier preventing people from using the LUCAS data. Now it is done and you can use it.
Yesterday you had a session on land cover classification model training. For today we have prepared a simple validation with the LUCAS data set. In the sample land cover directory we have prepared a TIFF file with a simple land cover map for the Czech Republic.
We have used OpenStreetMap as the base for this and we simply selected the appropriate classes at level one of LUCAS, so there are 10 classes. You can find it in this directory.
At the same time, we provide the LUCAS points in a vector file. I think it's not a GeoPackage but a shapefile at the moment, but it is a subset of the LUCAS points from the database I've shown you. And there is the 02 land cover
validation notebook, which you can run to see how to run the validation process with the LUCAS data set. So let me switch to Jupyter. It's the 02 land cover validation notebook.
Now the notebook should open. The first cell just imports what you need in this process: starting with os, then the GDAL library, geopandas, numpy and matplotlib for plotting.
I think I've removed one plot because it takes a while to render. One of the most important imports is the Validator from the validator module, which is a file in the same directory. There is a bunch of methods
to run the validation procedure; it's just a wrapper around other functionality based on GDAL and scikit-learn, and these are convenience functions. So when you start it, it should import smoothly. If there was some error, then something is wrong,
but we double-checked that it should be running. Before you run the validation, you should configure it. In the sample land cover directory there is a config.yaml; YAML is a simple text format where you define keys and values.
We use YAML for many different configurations, and I've used it for this one. Let me show you; the content is printed here, so we can simply run it. If you run this, it opens the config file. The first section, project, gives you some names, and it is important to fill in the run ID,
which in my case is the date. This is simply appended to the directory where the validation log files will be stored for later use, but you don't strictly need it. The most important part is to define the input.
You should define the path to the files; in this case it's sample land cover. You should define the input raster, which is the file here. You can additionally define the no-data value for this land cover product,
and the software will simply skip it: whenever it comes to a point where there is a zero in the land cover, it will skip it and not count it for the validation. Additionally, you can define the legend in a YAML file, which is also here. It's a very simple legend:
you see 10 classes following level one of LUCAS: artificial land, cropland, perennial crops, forest, shrubland, grassland, bare land, wetlands, water, and glaciers, which are not present in this land cover but are defined anyway. Then you need to define the vector reference,
which here is a downloaded subset of the LUCAS points for the Czech Republic, year 2018. And it says L1, so it's level one, which is consistent with the land product. For the LUCAS data, even if you subset
the land cover/land use topic, you have to define which attribute contains the information about the land cover. In this case it's named label_L1, to make sure that you are using the right attribute for the validation.
Additionally, there are some validation report settings. You can define in which directory you want to save the report, but you don't have to. You can select the name of the directory, and it will take this name and add the run ID to it,
so you can run different exercises with this validator. One extra thing is setting the validation points for GIS exploration. Whenever you run the overlay of the land cover product and the LUCAS points, you compare whether they are consistent
or not. If there are disagreements between the two products, you might want to explore in GIS where the disagreements happened, to understand what is wrong in the model or in the data, or just to explore it. So you can set the validation points file name;
it will be stored in this report directory. You can define the data type, the GDAL/OGR format, and you should set the projection.
In this case it's an EPSG code. So these are the settings which you can modify for different projects. You can simply use this config.yaml as a template and adapt it. It is prepared, we have done it. And first of all, you initialize the validator.
The Validator class takes the config file, or a JSON/dict with the same content, so you can edit it online as JSON; later on we will do that. So if I run it, it initializes the validation, and there's a validator object which already has some links and the configuration.
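A small sketch of this initialization step, assuming the Validator class lives in validator.py next to the notebook and accepts either a path to the YAML file or an equivalent dict; the exact constructor is in the 02 notebook:

```python
import yaml
from validator import Validator   # helper module shipped in the 05_lucas directory

config_path = "sample_land_cover/config.yaml"

# peek at the configuration first; section names follow the walkthrough above
with open(config_path) as f:
    cfg = yaml.safe_load(f)
print(cfg["input"])               # raster, no-data value, legend, vector reference, ...

validator = Validator(config_path)   # or Validator(cfg) with the same content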
The initialization tells you that there are two inputs, and you can check that everything is okay. Additionally, there are some convenience functions to check that the input data are readable and usable: you can open the raster, you can open the vector. This is very simple,
nothing special. Or you can go even deeper and look into the attribute table of the vector. Here I read it into geopandas and show part of the data set. You see that in the LUCAS data set
there's always a point ID, which is the unique ID of the LUCAS point. You get a survey date, altitude, lat/long, NUTS, and an observation distance, meaning the distance from the theoretical point created by the Commission. This means that the people collecting the data
did not reach exactly the theoretical point because it was not accessible, so they moved somewhere nearby. Sometimes it's zero and sometimes it is, in this case, 22 metres. And you have to be careful with this: if you are using these points for change detection,
you should make sure that the points you use for the change detection are within some buffer distance. Otherwise the changes recorded in LUCAS might be changes only because the people in different years collecting from the same point recorded possibly different land cover classes,
simply because they arrived at a different spot. This is a tricky part, so that's why you have got all these attributes. Then there is an obs_type, which is the observation type: if it's one, it was collected in situ;
if it's zero, it's photo interpretation. Then there is LC1_PERC, which is the percentage category: if it's five, it's more than 50% of the land cover which you have labelled. Now you see label_L1,
which is the attribute of our interest. You see here numbers two, three; two is, I guess, cropland, three is permanent crops, but there are also some Nones. That means it was not possible to simply translate it because there were some ambiguities, so you have to account for this as well.
And the last column is the geometry of the points, so you can see it. For convenience I have printed the legend here, just so you know it; you can of course explore what the content of the classes is.
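If you want to do this exploration and filtering yourself in geopandas before validating, it could look like the following sketch; the file name is hypothetical and the attribute names follow the walkthrough, so they may differ slightly in the shipped shapefile:

```python
import geopandas as gpd

# hypothetical file name -- use the shapefile shipped in the sample data directory
points = gpd.read_file("sample_land_cover/lucas_points_cz_2018.shp")
print(points.head())

# keep only points that were really surveyed in the field (obs_type == 1)
# and that have a usable level-1 label
field_points = points[(points["obs_type"] == 1) & (points["label_l1"].notna())]
print(len(points), "->", len(field_points), "points after filtering")
```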
So here's a simplified chart showing the distribution, the proportion of the classes. You see that most of the points are class number two, agriculture or cropland in this case. Second is number six, grassland. Then you have got 24% of number four, which is forest,
and there's nearly 5% dedicated to urban areas. The rest are small proportions, below 5%, say.
So you should also check that for your validation you have got enough points. In this case classes five, seven and nine are not well represented, but this is representative of the distribution
of the land cover classes in the landscape where we are running the validation. I commented out this part because I was presenting the points in a Leaflet map, but since there are more than 5,000 points,
it takes a while, and in the virtual machine you can see how it renders line by line. So you can run it on your notebook, but I skip it here because of the time limits. Now let's assume everything is prepared:
you have explored your data, you know what is in the land cover and what is in your LUCAS data set, so you understand what is there. The next step is to run the overlay of the LUCAS points over the raster. There is a method called overlay; if you run it, it's pretty fast.
You find that there are 4,930 reference points. Once it is processed, the validator object keeps the information and we need a report of what the accuracy is. First, there's a simplified short report
which gives you only the overall accuracy and the number of points which passed or failed the validation. This is simplified, but if you run the method report, you get the full information.
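Put together, the overlay and reporting calls look roughly like this; the method names follow the notebook walkthrough, and whether the short and full reports are separate calls or one call with a flag is something to check in the 02 notebook:

```python
from validator import Validator   # helper module next to the notebook, as above

validator = Validator("sample_land_cover/config.yaml")
validator.overlay()               # intersect the LUCAS reference points with the land cover raster

validator.report()                # per-class F1, producer's/user's accuracy, kappa, overall accuracy

validator.save_report()           # write the same summary as a text file into the report directory
```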
You see the overall accuracy is about 84%. You can look at where the F1 score deviates and what the support is. You would see that the first class has only 58% accuracy,
while cropland and forest are quite well classified. You can check the producer's and user's accuracy and kappa. It's not the highest accuracy, but to really understand where the confusions happened,
you should print a confusion matrix. Before we move on, there is a convenience method save report. If you run this, you will find in the directory a text file which summarizes what you have done:
what the project is, what the input data are, what the classes are and what the indicators are. So you keep it saved somewhere, not only in the notebook, and you can use it later on. So you save the report and you go on to plot the confusion matrix.
If you run this, you immediately see that the main confusion is between classes two and six. If you look at the legend, class two is cropland and six is grassland. As I mentioned before, we used OpenStreetMap for this task,
and OpenStreetMap is done by volunteers, and most of the work is done as photo interpretation with orthophotos. Recognizing what is arable land or cropland versus grassland from an orthophoto is tricky,
and most of the errors come from this. If you see something green on a field, you don't know if it's really grassland, or permanent grassland, or cropland. So not surprisingly the biggest error is here, but there is also confusion between number one and the others.
Number one is artificial land, but you should know that in urban areas there are some parks which appear as grassland or forest because there are trees and grass. So there's also confusion with numbers six and four,
typical problems of land cover. So this is done; you can of course also show the normalized matrix, which normalizes to the number
of reference points you have. This is all done. You can also run save confusion matrix to have it as a PNG file for reports and papers. Finally, you can run the save vector step.
It will save a vector with the validation points. And if you open it (what is the time? Yeah, time is running), you discover that what you have done in the validation
is saved in the vector. So you've got the point ID, the raster value, the vector value, and there's an attribute status which says whether the reference and the land cover have the same value or whether there is a deviation.
So if you select all the zeros, where the deviation is, you can navigate to the points where there is disagreement between the reference and the land cover, so you can understand it better. This is all saved and you can work with it. Additionally, we can play a little bit
with the class aggregation. If you look at this diagram, the confusion matrix, the highest confusion is between two and six: the true label says it is grassland, but the predicted label, the OSM, says
no, this is agriculture, cropland. So what you can do to understand it is to merge, or aggregate, classes number two and six together; you then see what the values of the other classes are and say: okay, two and six together is agriculture, and I do not care whether it's arable land or grassland,
it's just agriculture. Okay, let me show you how to do this step. At the beginning I said that the configuration can be YAML or JSON or a dictionary. So you copy-paste it and change it a little bit; the input stays the same.
You just change the name of the directory where the report will be saved; I've added an underscore aggregation suffix, so it's a slightly different test run, let's say. That's it, the other things are all the same. Okay, so if you run this,
you instantiate the validator again, but now as a slightly different object so you don't mix them up, and you pass the config aggregation, which is the JSON, as a parameter. So the validator now knows where to store the report.
That's it. Now we have decided to aggregate classes two and six into class two. There's a simple JSON for this: class number two is composed of two and six, a list of the classes. You create this, and you can of course
continue in this way with other classes, but let me keep it to two and six to see what the effect is. Okay, now you run the overlay again; let's assume everything else is the same and the inputs are working, and you pass this aggregation object to the overlay as a parameter.
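A sketch of this aggregated run, assuming the configuration keys from config.yaml and that the aggregation rules are passed to the overlay step; the exact key and parameter names are in the notebook:

```python
import copy
import yaml
from validator import Validator   # helper module next to the notebook, as above

with open("sample_land_cover/config.yaml") as f:
    cfg = yaml.safe_load(f)

config_aggregation = copy.deepcopy(cfg)
# change only where the report is written, e.g. append "_aggregation" to the run id
# (key names as in config.yaml; adjust to the real structure)
config_aggregation["project"]["runid"] = str(cfg["project"]["runid"]) + "_aggregation"

validator_aggr = Validator(config_aggregation)          # a dict with the same content as the YAML
class_aggregation = {2: [2, 6]}                         # class 2 := cropland (2) + grassland (6)
validator_aggr.overlay(aggregation=class_aggregation)   # assumption: rules passed to overlay()
validator_aggr.report()
```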
It runs the overlay with the same number of points, and you can see in the report what the effect is. Now the overall accuracy is 96%,
meaning that the main problem in the original land cover is this one confusion, and the other classes are classified quite well. So with this code you can play a little bit
and explore the content of the land cover, but you also explore a little bit the value of the LUCAS points, and you really should do this before you publish something from your research in this way.
Yeah, that's the confusion; I think we just confirmed what we already knew. Now the next step could be: okay, what is the confusion of numbers three, four, seven with the artificial class? These are most likely the parks, or maybe at the border of the urban area
there are always some small parks which are not used for agriculture and which some products include in the urban class. But if you go there as a LUCAS surveyor, you arrive at grassland, which can be, I don't know, a golf course; it is grass,
but then you should really find in the LUCAS land use that it is used for golf, not for agriculture. So this is the tricky part. Okay, we have got something like 13 minutes left.
Let me continue. So this was the validation with the LUCAS points, but in the same way you can use it for model creation, for calibration, either of a classification or a regression model.
Next, what we have prepared is how to analyse the LUCAS data; we call it analyze. There are two steps. First, nomenclature translation: as I showed you, there are 76 land cover classes and 40-something land use classes, and then you have got a land cover product that has a different nomenclature,
so you should translate to that nomenclature. So far we have prepared the translation table to translate LUCAS to CORINE, which is not always straightforward. We have taken all 1,400 combinations of LUCAS LC1 and land use and found a sensible class
at CORINE level three that would be appropriate for it. The translation is not always possible; then you get a None in the translation. So there is one notebook that shows
how to use this LUCAS class translate. The second half of this notebook is dedicated to aggregation. I showed you how to do the aggregation in the online validation, but there is another tool, the LUCAS class aggregate, which takes the same JSON as input.
Again, you select some code for the label, in this case A00, and you present a list of the classes that should be aggregated into this class. So there are two functionalities. Let me show you in Jupyter.
So first you again import the respective classes from eumap.
Then you define the request using OWSLib: in this case, the Czech Republic, nuts0, year 2018. You download the data. You can show what the content is; there is some warning, but in short there is an ID,
the point ID, and you have got the LC1 attribute. You can check the statistics of it, but this is pretty much the same as I used in the validation, so the pie chart is the same.
You define your aggregation mapping and you run apply, which basically takes the GeoPackage and adds another attribute with an _a suffix, for aggregated.
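As a sketch, the aggregation call might look like this; the class name and signature follow the walkthrough and are assumptions (the analyze notebook has the real API), and the code lists are just an example:

```python
from eumap.datasets.lucas import LucasClassAggregate   # import path as an assumption

# example mapping: collapse the artificial-land subclasses into a generic A00
mappings = {
    "A00": ["A11", "A12", "A13", "A21", "A22", "A30"],
}
aggregator = LucasClassAggregate("sample.gpkg", mappings)
aggregator.apply()    # adds the aggregated attribute (lc1_a) next to lc1 in the GeoPackage
```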
And you can work with it. Here it is, LC1 and LC1_a. Let's show the head of it: you see E20 changed to E00, this is the aggregation; C23 becomes just C00.
So the aggregation is performed. It is stored in the GeoPackage file, which you can move to your project and use to run the validation or calibration. Now the nomenclature translation. There's a class, LUCAS class translate.
You instantiate it with your LUCAS data and you set the translation. At the moment there is one CSV file that translates to CORINE Land Cover, but if you extend this CSV file in the eumap package to other nomenclatures like Urban Atlas,
the high resolution layers, or LCCS, which we plan to do, you could use a different translation and use the LUCAS data for the validation of the Copernicus products, for instance. So with set translation you say which type, which version of the land cover nomenclature to use,
and then apply. Again, this is stored in the object; you can load it into geopandas and inspect it.
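And a similar sketch for the translation tool; again, the class and method names are assumptions based on the walkthrough:

```python
from eumap.datasets.lucas import LucasClassTranslate   # import path as an assumption

translator = LucasClassTranslate("sample.gpkg")
translator.set_translations("CLC")   # identifier of the target nomenclature (assumption): CORINE Land Cover
translator.apply()                   # adds translated codes such as 321 (natural grassland) or 211 (arable land)
# afterwards save the result or load it into geopandas, as shown in the notebook
```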
You see here, for instance, that an E class (the original class is not visible) was translated to 321, which is natural grassland, if I'm not wrong, and a B class to 211, which is arable land. Okay.
Yeah, and you can of course download it, or save it to a different file format, or load it into a GeoPandas data frame. Okay, this was the translation; we are nearly finished with the presentation. And additionally, there is a graphical user interface.
All this functionality, to request and filter the data set from the LUCAS database, to subset it, aggregate it and translate it to a different nomenclature,
is available through the QGIS plugin that you find in this repository. Please try to use it and report it if you find any issue. We have tested this plugin in the virtual machine;
there's some issue with the memory, so you had better not use it in this virtual machine, but on your own computer, if you have QGIS installed, you can simply, in QGIS... okay, let me check the version, sorry. In the plugin manager
you say that you want to install from a ZIP file and you navigate to the ZIP file. Where did I put it? Somewhere here: the LUCAS download manager in a ZIP file.
You install it. I didn't try it before; this is the first time I'm trying it live. And there's a new icon with the plugin picture, download LUCAS points. And of course, you can select the country you want.
Croatia, for example. Yes, yes, I know why. Yes, in a moment.
Yes, okay, thanks. So what I did is the installation of the plugin from the ZIP file:
select, open, install it. Then there is a plugin icon, which I opened. In the first step of the download manager you select the country, for instance, or the map canvas, if you have opened a project
and you want just the data for the current canvas, or you provide a vector file. I selected a country, Croatia. You say you want to detect changes between 2006 and 2012, for instance; you're interested in land cover, and maybe other topics, maybe all of them,
maybe only land cover. You want to aggregate it because it's change detection, so space-time aggregation, and you want to save it to some file, test_lucas GeoPackage, and you run download. And it's downloading from the server,
which is at our university. It is just taking a moment.
Okay, I've got three minutes to go. It's downloading now. You can also do the class aggregation, which I have shown in the Jupyter notebook; you only have to save the JSON into a file, and you open the JSON from the file
and you apply the aggregation, which is basically everything from the presentation we have prepared so far. This is it. Let me just finally mention that all the work we have done
in this GeoHarmonizer project is in a software library, which is at the moment published as open source on the client side, but we will also publish the backend as well. So you can run it, and you can even change the harmonization process
as you need and run it on your own. Okay, thanks for your attention. I hope it was useful for you and that you will use the LUCAS data in whatever project you run. Okay, thanks again. If there are questions, we are here till Friday and we are open to discussion,
or at the beginning of the presentation you will see our email addresses at the university, so you can contact us by email or Mattermost. So it's running. Thank you. Thank you. Yes.
And do you validate or evaluate the positions of the surveyors? Or, when they collect data in the field, what GPS accuracy would there be for these activities? Yeah, the GPS accuracy is there in the attributes. It's there.
It is an attribute of the point. You can also measure how far the point is from the theoretical point; that's also there as an attribute. Yeah, like how far. You could also add a direction, for example an azimuth; you can calculate it as well,
because we keep both the theoretical points and the GPS physical points. So this is also there. Additionally, since 2015/18, there are kinds of bounding boxes describing to what extent the point is representative.
There's north, south, east, west; there's a bounding box saying this point is representative for, I don't know, 100 metres in all directions, so it is a plot already. Or you can use this: in the methodology, if you go deeper into the LUCAS methodology, you see how some points were collected.
It's a circle of one and a half metres radius, but if you enter grassland, shrubland or these heterogeneous classes, these are defined with a bigger radius, around 22 metres. So there is plenty of, I would say, metadata,
which together give you the representativeness of the point, and you can filter based on it. But how to validate LUCAS itself? Well, if you look at the validation process, we spot disagreements and you can explore them with the orthophoto.
It can happen that the points where you find a disagreement are located in some heterogeneous landscape: you have got, I don't know, fruit trees, then there is a small strip of grassland and there is arable land, and in between there is a tree,
which is typically heterogeneous. And for this there is the percentage value of the land cover class you have classified, so if you filter all these out, you have only the points that are well representative.
We did not try to validate LUCAS the other way around; you could take the point, go to the field and say: where are we, what is here? We plan to do some ad hoc validation of the LUCAS points with the national orthophotos.
That's the only way I can think of validating LUCAS. Yeah. Just try to see how you would collect 300,000 points across Europe by photo interpretation and manual work, and compare. I think the value is clear.
Yes. The slides? The slides, yes. We will put the slides into the directory.
They should be there. Should be there, yes: Python training, LUCAS. Are the slides already there? No, no, no.
Yeah, yeah, I did add them. You just need to pull this repository, and you should have this presentation in the directory as a PDF.