
Modern challenges of geospatial data science: the Open-Earth-Monitor project


Formal Metadata

Title
Modern challenges of geospatial data science: the Open-Earth-Monitor project
Series Title
Number of Parts
17
Author
License
CC Attribution 3.0 Germany:
You may use, adapt, copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
This talk addresses some of the main modern challenges of geospatial data science, grouped around five main aspects: (i) training data issues, including mismatches in standards when recording measurements, legal and technical issues that limit data use, and the high cost of producing new observational data; (ii) modeling issues, including point clustering and extrapolation problems, model overfitting, artifacts in input data, lack of historical data, lack of consistent global monitoring stations (Kulmala, 2018), and limited availability and/or quality of global covariate layers; (iii) data distribution issues, including unusable file formats, high data volumes, and incomplete or inaccurate metadata; (iv) usability issues, including incompleteness of data, unawareness of user communities and/or data limitations, and data being irrelevant for decision making; and (v) governance issues, including closed data policies of (non-)governmental organizations, datasets not used by organizations, lack of baseline estimates, and the absence of strategies for updating data products. OpenGeoHub, together with 21 partners, has launched an ambitious new European Commission-funded Horizon Europe project called “Open-Earth-Monitor”, which aims to tackle some of the bottlenecks in the uptake and effective use of environmental data, both ground observations and measurements and EO data. The partners plan to build upon existing Open Source and Open Data projects and to produce a number of tools, good-practice guidelines and datasets that can help cope with some of these modern challenges.
Transcript: English (automatically generated)
So, Open-Earth-Monitor is a project we just kick-started. It is a 30 million euro project under a Horizon Europe governance call, with more than 20 partners,
and we are happy to lead it. It will run until 2026 and we are just making the implementation plan. We have quite an ambitious agenda, and you can of course read about it on the project homepage, earthmonitor.org.
There you can read all about the project, what we plan to build, what the main objectives are, and so on. We build a lot on existing systems, mainly of course open-source and open-data-driven systems, but also some large commercial EO data systems like EODataCube
and Sentinel Hub. So we are building on top of several existing solutions, and we are trying to produce something that results in higher-quality, more usable environmental data: something we have spent the last decade exploring.
We would like to implement that and put it into action, so we can generate data at a faster and faster pace. We had a meeting on the 19th of July, a public seminar day with quite a few keynotes, and through these keynotes we also took a lot of notes.
We invited 14 keynote speakers, experienced researchers and project leaders who work in large European, national or international organizations,
including the UN, the European Space Agency and the European Commission. We asked them about data usability problems and how we could help, which infrastructure we could provide to make the data more accessible and more usable.
Some of the things that popped up: for example, Joanna Rauter from the Netherlands Space Office mentioned that the Dutch government spends quite some resources and gives funding for different projects,
and they would like to be able to track these projects in the future. So if there is a project about, I don't know, monitoring erosion, or some investment into regreening certain areas, they would like to be able to see on satellite images how that funding is spent.
That is something that popped up from her talk; I remember she said it is one of the key problems they have. Then we had a very interesting talk by Gilberto Camara. He was comparing local modeling versus going global, and he showed the differences not only in the accuracy of mapping, but also in the concepts and the legends used, in how you collect training data,
and in all these complexities. There were super interesting talks, and we are now writing a Medium article where we will put the video recordings of all the presenters and focus on the ten key challenges that we discovered during the meeting,
and, let's say, ten key messages. But now I am going to talk from our perspective about what we think are the main challenges facing modern geospatial data science; that is basically the idea of this talk. Later on in the summer school we will zoom in on several of these challenges, and a few of them come up in the hackathon.
So you are most welcome to follow the hackathon; you could address some of these challenges by joining it, and they are quite generic, as I said. There is also a Q&A piece in Nature Communications coming out
where you can read about many of the things I will discuss. This is a general workflow, basically my work over the last maybe 20 years, for how you can do ecosystem, vegetation and soil mapping,
and it has these main phases: starting from defining the area of interest, then making a plan for predictive mapping, then preparing the data, then maybe spending more time to prepare covariates
that better match the processes you want to describe. Based on that, you can make a sampling plan: if you have covariates, you can do some feature-space optimization to create the sampling plan, for example Latin hypercube sampling. Then you go and collect the data on the ground, then you can do the overlay,
fit the models, optimize the models, and get predictions. Then you check whether the predictions are accurate enough or not, and you can go back and repeat the whole process.
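As a rough illustration of the feature-space sampling step, here is a minimal sketch of Latin hypercube sampling over a covariate table using SciPy. The covariate array is simulated, and matching the hypercube points to the nearest candidate location is just one possible way to turn the design into actual sampling sites.

```python
# Minimal sketch: Latin hypercube sampling in covariate (feature) space.
# The covariate values are simulated stand-ins for a real covariate stack.
import numpy as np
from scipy.stats import qmc
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
covariates = rng.normal(size=(10_000, 4))        # one row per candidate location

# Draw a Latin hypercube in the unit cube, one dimension per covariate
sampler = qmc.LatinHypercube(d=covariates.shape[1], seed=42)
unit_sample = sampler.random(n=100)

# Scale the design to the observed covariate ranges
lo, hi = covariates.min(axis=0), covariates.max(axis=0)
targets = qmc.scale(unit_sample, lo, hi)

# Pick, for each design point, the nearest candidate location in feature space
tree = cKDTree(covariates)
_, idx = tree.query(targets)
sample_ids = np.unique(idx)                      # locations to visit in the field
print(f"Selected {sample_ids.size} sampling locations")
```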
These are the steps I usually run; that is how most of our projects look. In the Open-Earth-Monitor we basically mimic this setup and split the project into four main components. The first is the engine: the software libraries you develop to do the different processes,
from overlay and modeling to accuracy assessment, visualization, et cetera. Then there is a segment which is the ground truth data, and a segment which is the EO data. We call it EO data, but it is really all the gridded data we can prepare, whether it comes from satellite Earth observation or not.
From that you build gridded data cubes, and today they are usually space-time data. And then you have the use cases: once components one, two and three are working, a use case is where you generate models, produce predictions and serve those predictions for some purpose.
That is also predictive mapping in a nutshell. The first step is the preparation of data. Once the data is ready and you have done the sampling and so on, you do the overlay and prepare the so-called regression/classification matrix.
Once you have the regression/classification matrix, you can do fine-tuning, feature selection and model optimization, and then maybe ensembling, so model stacking. Once you have optimized that, you can do the accuracy assessment.
You want to test: maybe some sensitivity analysis, accuracy assessment, cross-validation, and then you estimate how good the model is. Once you are happy with the model, you go to production, or prediction, and you produce maps. These maps you can then serve through GeoServer
and all the WebGIS solutions. Then you can do some post-modeling diagnostics to see, again, whether there are issues, artifacts, odd outputs, et cetera. Once you have done the post-modeling diagnostics, you can do reanalysis and model improvements.
You can collect extra points if needed, then do a full reanalysis; you just go back and do the whole round one more time. And once all of this works, you can move to using the data for operational work. That is also Open-Earth-Monitor in a nutshell.
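To make that loop concrete, here is a minimal sketch in Python (geopandas, rasterio, scikit-learn) of the overlay, regression matrix, tuning, cross-validation and prediction steps. The file names, the column name and the model settings are hypothetical, and nodata handling is left out.

```python
# Minimal sketch: overlay -> regression matrix -> tune -> validate -> predict.
import geopandas as gpd
import numpy as np
import rasterio
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

points = gpd.read_file("observations.gpkg")              # hypothetical ground data
with rasterio.open("covariates.tif") as src:              # hypothetical covariate stack
    coords = [(p.x, p.y) for p in points.geometry]
    X = np.array(list(src.sample(coords)))                # spatial overlay
y = points["target"].to_numpy()                           # regression matrix = X + y

# Fine-tuning / model optimization
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      {"max_features": [0.3, 0.6, 1.0]}, cv=5)
search.fit(X, y)

# Accuracy assessment via cross-validation before producing any maps
print("CV R^2:", cross_val_score(search.best_estimator_, X, y, cv=5).mean())

# Prediction: apply the fitted model to every pixel of the covariate stack
with rasterio.open("covariates.tif") as src:
    stack = src.read()                                    # (bands, rows, cols)
    bands, rows, cols = stack.shape
    pred = search.best_estimator_.predict(
        stack.reshape(bands, -1).T).reshape(rows, cols)   # map ready to be served
```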
These are the processes we have been building over the last three years. We have been building this dashboard, the data portal called ecodatacube.eu, and all the data you see in EcoDataCube follows these principles. Most of the data we produce is based on machine learning,
primarily spatiotemporal machine learning run on really large datasets. We now have a couple of publications that came out. This one uses LUCAS and CORINE training points, almost four million points, I think, overlaid in space-time.
We also prepared a Landsat data cube for Europe, almost 20 terabytes of data. Then we do a space-time overlay, fit models, and produce predictions for billions of pixels. That is an example of how you can implement all of this in practice.
We also map forest tree species: we have 16 tree species mapped at 30-meter resolution for 20 years, with uncertainty, across Europe. If we had enough point data, we could move to mapping almost all of the tree species from the European Forest Atlas, which is 85 species;
here we had somewhat less point data than for land cover mapping, but I am still talking about millions of training points. So that is another example of applying machine learning to large datasets to produce something usable, let's say, and of interest to different professions.
Through doing all these projects with large data, we have really learned a lot about predictive mapping, and I feel mature enough to talk about the modern challenges of geospatial data science. Here the focus is on predictive mapping, but I think they are common to many geospatial data science problems.
When I thought about how to generalize this, I tried to put it into five categories. Of course there could be more categories; this is just an attempt at systematizing. But let's say five categories: training data issues,
modeling issues, data distribution issues (data distribution in the sense of developing data services), then usability issues and governance issues. These are the five groups of challenges that I think give a systematic overview
of what the problems usually are. The training data issues can be, for example, harmonization problems, legal and technical issues, and the high cost of producing new observations.
Modeling issues are different things: there can be problems with overfitting, extrapolation, artifacts in the input data, lack of historical data, lack of consistent global monitoring stations, and limited availability or quality of global covariate layers.
Then issues two to five. Data distribution issues: using file formats which are not very usable, maybe not cloud-optimized by today's standards; high data volumes, where you have fantastic datasets but problems accessing them simply because the volumes are so high; and fantastic data
with very incomplete or inaccurate metadata, so you have problems using it. Then usability issues: you can have fantastic data, but it just does not match your application, so it is not relevant for decisions; you can have data which is incomplete;
and there are the user communities: if you develop a dataset without doing your homework and exploring what the user community needs, then again you have developed data for yourself, and you get usability issues. Then the governance issues, kind of data politics issues:
you can have datasets which do not follow international standards, so international or national organizations do not want to use them. Many organizations want a baseline estimate, and if you do not have a baseline estimate, it is difficult to make something usable for their applications.
Sometimes you make a dataset but there is no strategy for what happens with it afterwards: who is going to maintain it, fix problems, convert it to some next-generation standard. These are all governance issues.
Within OpenGeoHub, I tried to list only the key issues we face on a daily basis, and if I had to boil it down to what really bugs me, it would be these three. First, data size: we now suffer with bigger and bigger data and have real technical challenges
with the data volumes. Second, we are really frustrated with spending too much time just cleaning other people's data. Third, we find it difficult to increase the usability of data without making significant additional investments. These are the three things
that Leandro, my colleagues and I face basically on a daily basis, every time someone asks us for something; things we have to be very careful about. How do you solve most of these challenges? Well, I usually say, especially to you summer school participants,
you can solve a lot just by becoming a better coder, so you have to increase your coding skills. Of course you need fundamental knowledge and an understanding of the problem, but becoming a better coder is, I think, certainly the first thing
you should consider. Then sometimes, even when you do the coding at the highest level, you eventually reach the technical limits of analysis, and in that case you also have to look at ways to get more computing power,
because sometimes, once you have the best data and the best code, the key to increasing accuracy is really just to have more computational power and more facilities. Beyond these technical problems, there are methodological problems
I would like to talk about: the overfitting problem in machine learning, the extrapolation problem, and training machine learning models using non-probability samples. These methodological problems are not easy to solve, and that is also why we put them in the hackathon; we are interested to see what comes out.
When you start doing some modeling, there is a lot of interest now in providing both the predictions and the uncertainty of the predictions, in terms of prediction intervals, confidence intervals or prediction errors, and then it becomes difficult.
The first difficulty I noticed is that if you want proper probability distributions, you also need to know the probability distributions in the input data, and often you do not know them. That is a real challenge with no easy solution: you get some layer and you have no idea what its uncertainty is.
Then you have overfitting problems. I have faced them a couple of times in my career: I heavily overfitted data, usually because of spatial clustering. It is very challenging, and sometimes you do not see it; you just overfit the data, everything looks fine, and then later, when you do proper cross-validation,
you see that your model is basically fake. It is overfitting. Then extrapolation in feature space: extrapolation in feature space is also a big problem, and I will show you some examples. Many machine learning models do not handle it very well.
Well, we say they do not handle it. There is a famous blog post, I think, that says extrapolation is tough for the trees, and the trees in this context are the random forest, so there are some issues there. It is not easy to solve, but it is important to be aware of it, and there are some remedies,
and we will keep developing more. I think it is a task of spatial data science to develop remedies and solutions for extrapolation problems. Then we have spatial clustering problems. And even when you do the uncertainty properly and get a really realistic estimate of uncertainty for every pixel,
how do you communicate that? How do you visualize uncertainty? That is another challenge of modern geospatial data science. Overfitting: as I said, I have overfitted many times. The first few times I was not even aware of it; I just thought my models were great, with an R-squared of 0.9 or higher.
I thought that was great, and that is why I started liking random forest and similar methods, but later on I noticed that this high R-squared was not due to actual model performance; it was due to problems with spatial clustering.
We had repeated observations at the same location, going down through depth. That also comes with the hackathon dataset, the USGS dataset, where you have three or four observations at multiple depths. If you look at my tutorial: when I fit the model ignoring the repetition in space, I get an R-squared of 0.7,
but when I take whole stations out, so whole points out, the realistic R-squared is 0.5. So you see there was overfitting, from 0.5 up to an apparent 0.7. One remedy for overfitting is to really try out different so-called resampling methods:
you take the training data and resample it using different principles, principles that match the properties of the data, of course, then repeat it several times and check, just to be sure you are not overfitting.
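As an illustration of such a check, here is a minimal sketch with scikit-learn that compares naive cross-validation with cross-validation that holds out whole locations, assuming each observation carries a station or profile ID. The data file, column names and model are hypothetical.

```python
# Minimal sketch: naive vs location-grouped cross-validation.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

df = pd.read_csv("soil_profiles.csv")        # hypothetical profile data
X = df[["covar1", "covar2", "covar3"]]
y = df["target"]
groups = df["station_id"]                    # all depths of one profile share an ID

model = RandomForestRegressor(random_state=0)

# Naive CV: depths from the same profile can end up in both train and test folds
naive = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0))

# Grouped CV: whole profiles are held out, closer to true spatial performance
grouped = cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups)

print(f"naive R^2 {naive.mean():.2f} vs grouped R^2 {grouped.mean():.2f}")
```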
I looked at a discussion on Twitter. A colleague, who is also one of the lecturers at the OpenGeoHub Summer School, Dylan Beaudette, posted a little simulated dataset. Twitter is a great place if you are bored at home.
So I looked at it and said, okay (and I think Hanna made the code for this), I will use the dataset. What you see here is the original simulation: it is just a linear regression plus simulated noise. You have two variables, X and Y.
Y is the target, X is the covariate. You simulate the values as pure, normally distributed noise and add it to the regression line, so you know this really is a linear regression with only noise around the line. Then you fit a random forest
with just the default settings. You can see the code; by the way, everything I do is in the blog. Then there is a package called forestError, I think.
They just published a nice paper showing that you can reconstruct the probability distribution of prediction errors for a random forest model, and it can be implemented relatively easily; it is really nice that you can compute these forest errors, and that is the broken line you see here.
The first thing that happens is that you see overfitting: this blue line artificially goes up and down, trying to get close to the points. We know this because we know it is just simulated noise, so we are really fitting the curve through noise. That is a classical example of overfitting.
Then, as you go to the edges of the feature space: a random forest, because it is a tree-based learner, basically assumes no structure in the data. It learns from the data and tries to get close to the data, and when it comes to the edge of the feature space, it uses the last decision at the edge
and just keeps applying that decision, because there is no more data. That is why you see this flat line here. And in this case the forest error package also produces an uncertainty which is almost narrower than the uncertainty in the middle, so it looks as if the extrapolation
is less uncertain than the interpolation. We know that is not true: if the red line is the real structure in the data, then this blue line is going really far off, and the answer should actually be much higher. So that is an example of overfitting and an example of extrapolation at the same time.
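That behaviour is easy to reproduce. Here is a minimal sketch (not the code from the blog post) that simulates a straight line plus noise and compares a default random forest with a simple linear model outside the training range.

```python
# Minimal sketch: random forest chasing noise and flattening outside the data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 5.0 + rng.normal(0, 2.0, 100)        # linear signal plus pure noise

rf = RandomForestRegressor(random_state=1).fit(x.reshape(-1, 1), y)
lm = LinearRegression().fit(x.reshape(-1, 1), y)

x_new = np.array([[5.0], [15.0], [25.0]])           # inside / outside / far outside
print("RF :", rf.predict(x_new))                    # goes flat beyond x = 10
print("GLM:", lm.predict(x_new))                    # keeps following the trend
```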
Now, what I did: I looked at the framework I use for my work. I like to use ensemble learners by stacking, and I put in, I think, four learners. One was a simple GLM,
and the others were more complex: random forest, XGBoost, some other methods. And this is what I got after ensemble machine learning: it does work a bit as a remedy for the overfitting.
You can see there is much less fitting through the noise if you compare the two lines, so much less overfitting. And as you go towards the edges of the feature space, the uncertainty grows, which is what we would expect. It is actually very similar to the plot of prediction errors
in linear regression, right? The coverage of the confidence interval widens as you move away from the center of the feature space. So that is just an example, and there are remedies if you want to deal with extrapolation problems. In this case the remedy was to use a combination
of simple linear learners and non-linear learners together. If you use that combination and do the ensemble by stacking, the stacking eventually picks the GLM as the best algorithm, not the random forest; you could actually take the random forest
out of this system. But it finds that automatically: I did not have to specify anything, it repeats something like five-fold cross-validation and finds it out by itself.
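Here is a minimal sketch of the same idea using scikit-learn's StackingRegressor as a stand-in for the ensemble framework used in the talk; the data are the same simulated straight-line-plus-noise example, and the learners and settings are illustrative.

```python
# Minimal sketch: stacking a simple linear learner with non-linear learners.
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 2.0 * x.ravel() + 5.0 + rng.normal(0, 2.0, 100)

stack = StackingRegressor(
    estimators=[
        ("glm", LinearRegression()),
        ("rf", RandomForestRegressor(random_state=1)),
        ("gbm", GradientBoostingRegressor(random_state=1)),
    ],
    final_estimator=LinearRegression(),
    cv=5,                                      # five-fold CV for the meta-learner
)
stack.fit(x, y)

print(stack.predict(np.array([[15.0]])))       # typically tracks the trend better than RF alone
print(stack.final_estimator_.coef_)            # meta-weights show which learner dominates
```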
Now, there is this criticism of extrapolation. Hanna, I think she was at the summer school in 2020, gave a talk about the area of applicability and all the problems of going outside of the training feature space. And there is this article that Hanna and Edzer published
in Nature Communications, a bit of a critique piece, where they ask whether extrapolation should be implemented at all; maybe you should just mask out those pixels. That is something we can discuss. Also, when you think about this, we have lots of machine learning papers where we predict the future:
we predict the distribution of something in the future, given climate scenarios, given changes in population, and so on. So we also do prediction in time. The question is: should we extrapolate not only in feature space, but also in time? And then, how do we communicate it?
If we do extrapolate, how do we effectively communicate the uncertainty and the overshooting? Here I show confidence and prediction intervals, but how do you do that with maps? How do you do it efficiently? So you can read that paper.
I don't know if you saw it, but in Nature Communications they discuss a lot of these problems. Here is a paper we found: a group at the Jülich Research Centre developed a framework
where machine learning is combined with uncertainty assessment, and they also produce these maps of applicability. Everything that is darker means that part of feature space has no training points. You can see that is a serious problem; many global datasets are like this. You do not have a probability sample,
but you have these usually large gaps, especially in the Northern Hemisphere and the tropics, large gaps in areas that are really not easy to access, with very few points. Apparently this paper, which I have not studied in detail, does a really thorough job
of checking the uncertainty of each input layer and then thinking about how to incorporate these uncertainties, to see how much they impact the final predictions. Another paper just popped up, by a colleague from Wageningen University and colleagues:
they published on the problem of clustered data: can you really do estimation, unbiased estimation, if the data is highly clustered? They simulated cases ranging from no clustering to really high spatial clustering, with different combinations in between. And I think the conclusion
they came to, as far as I understand it, is that if the data is highly clustered, like here, you cannot produce unbiased estimates; so basically, with this type of data, you should not be making maps at all. If I understood their claim correctly, there is no way to avoid the bias,
which would mean that you should just stop, go and get funding, and collect more point data. The problem is, if you want to do space-time modeling, how do you go back to the past to collect data? It is somewhat mind-blowing, but we cannot build a time machine, right?
So the question is: do we really just drop the whole field, or is it rather that we can do unbiased estimation but need a different method, because their conclusion only holds in the context of the method they used? That is a question. Can we have a remedy for that,
or do we really drop it completely and stop doing that kind of modeling analysis? This is a really serious challenge, because on one side you have, for example, policy-making and governance organizations that need data. They say: give us an estimate of how much forest we had in Europe 200 years ago,
give us an estimate of this, give us that, give us the data. They are data-hungry. On the other hand, we have strict statisticians saying: do not do the estimation, because you introduce bias. So it is really a challenge; you feel a bit torn between the two. I participated in this paper together with Surya Gupta.
He just finished a PhD at ETH, focused on soil water characteristics. He compiled this dataset; he spent, I don't know, two years collecting, cleaning and checking data, opening reports and papers, and this is the best he could create.
And there were a lot of problems: Florida, for example, had about ten times higher point density than the rest, so you have high spatial clustering. Then you have huge gaps, like Latin America. Luckily we got data for the Russian Federation,
but there are still lots of places with no data. For arid areas, deserts, we do not have any training data; people do not go to deserts to collect soil samples, and it would not make much sense either. So there were these huge gaps. The question is:
we made the maps, we published the paper, but how much does the effect of the clustering touch our estimates? Our estimates may be biased in some way. Did we, for some areas, produce something where the uncertainty is so high that we are making wild guesses?
And as I said, the big problem is the spatio-temporal models: how do you validate them? How do you go back to the past? It is not easy. Maybe for some things we will never know the truth, or even get close to it.
On the other hand, with space-time data you can sometimes build models in a useful way. For example, if you want to predict the future, you can go back and fit models over the last 30 years with the data we have,
and if you find significant relationships, that gives you some confidence to extrapolate into the future. As long as you can reconstruct the past, you have some confidence to predict the future. But the future is always nonlinear, so nothing is standard and it can get really complex. Visualization of uncertainty.
When I was a PhD student I looked at whitening techniques: you use the hue-saturation-intensity color model, with the color hues visualizing the values, bluish for low values and reddish for high values, and then you mix in whiteness.
So you kind of bleach the map. You can see different variables here: this variable has high uncertainty, and this one has higher certainty, so less uncertain pixels. These are the results: on the left you see the original predictions,
on the right the predictions visualized with uncertainty. Now, if you show maps like this, kind of bleached, it is a bit like buying a painting and, instead of having really nice colors, the painting is bleached. Many of you would probably say: well, I do not want this, can you give me the one without the whiteness?
Because it is a bit annoying, right? But this is the reality; it is like a cloud. If the uncertainty is higher than the variance in the data, you are really just making wild guesses, and that gives you the right to paint it white, because it basically means we do not know with high enough certainty what is happening there.
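Here is a minimal sketch of the whitening idea, using a standard matplotlib colormap rather than the HSI model mentioned in the talk; the prediction and uncertainty rasters are simulated stand-ins.

```python
# Minimal sketch: bleach a prediction map towards white where uncertainty is high.
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm

rng = np.random.default_rng(0)
pred = rng.uniform(0, 1, (100, 100))             # normalised predictions
err = rng.uniform(0, 1, (100, 100))              # relative uncertainty, 0..1

rgb = cm.viridis(pred)[..., :3]                  # colour carries the predicted value
white = np.ones_like(rgb)
bleached = (1 - err[..., None]) * rgb + err[..., None] * white   # mix in whiteness

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(rgb)
axes[0].set_title("prediction")
axes[1].imshow(bleached)
axes[1].set_title("prediction + uncertainty")
plt.show()
```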
This is visualization of uncertainty with prediction intervals, also a very useful technique. But how do you do that with maps? You would need something like three maps to visualize one thing. And how do you do it, especially if you have 3D data, or 3D plus time,
plus confidence intervals? You have these multiple dimensions, and it is not easy to visualize. Then, beyond the methodological problems, there are usability problems. There is a great paper by Julia Wagemann and colleagues; they did interviews and polls,
collected all of this, and came up with these challenges of bigger data, listed here: limited processing capacity on the user side; growing data volumes, something we experienced as well, and I think it is common to many fields; non-standardized data formats and dissemination workflows;
too many data portals, which is becoming more and more of a problem, since you get hundreds of data portals, and even within a single field you have today, I think, four or five global land cover mapping projects, and people are starting to get confused and lost (which one do I use?), so it just costs you time to decide
which data to use; and then difficult data discovery. This is an example: a fantastic dataset, a paper published in Scientific Data. We downloaded the dataset, which came as tiles, I think, and we built mosaics for the whole world.
This is the Sentinel-1 seasonal data, four seasons, so you can see for different seasons and different events how things change. But the problem is, when you download the data and create the global mosaic,
you discover all these gaps. It is fantastic data, based on Sentinel-1, cutting-edge technology, but it is only about 91% complete, so roughly 9% of the pixels are missing. We wanted to use this dataset in multiple projects, but because of this missing ten percent or so
we cannot use it: if we miss this data, it propagates into anything we want to do. So we had to drop the dataset just because it misses about 10%, because we cannot deliver something that is, on our side, also only 90% complete. These people managed to get the dataset out, but they did not, let's say, fill in the gaps.
I do not blame them; I think they did not have the Sentinel data to do it, but it is an example. And then you also have these artifacts, lines, et cetera. As I said, this is also a big challenge of data science: realistically, I think you will agree with me, about 60% of what we do is cleaning up other people's data.
You are lucky in this hackathon: we spent that 60% of the time to clean up and prepare the data for you. But in many projects, out of the first 12 months you will spend six months or longer just cleaning up, organizing and checking tabular data, layers, GIS files, importing and so on.
And it is a real pity. So that is it from my side. This was a broad overview from my 20 years of work experience and also from the EcoDataCube project, and it also serves to announce our Open-Earth-Monitor project. Now I pass it on to my colleague, Leandro, who is going to talk about the OpenGeoHub perspective on handling large geodata.
Basically, I will explain how we are trying to manage these problems that Tom explained, and what solutions we are using to really develop our applications
and produce our mapping products. So, yeah, we will start with this funnel pipeline. Basically, we have several applications
that we implemented internally at OpenGeoHub, and the code that we wrote is basically a kind of funnel pipeline that reduces the input data; mostly, in this example,
Earth observation data, climate data, satellite data. There are several kinds of raw datasets out there, and these Earth observation datasets are not a commodity at all: you need to do a lot of work,
not only for the modeling but also to do the prediction and generate complete maps. So, considering the data volume, I would like to emphasize that this funnel pipeline is a way to reduce this universe of raw data
to a lower, more compressed data volume in a way that allows us to really develop the desired applications. Of course there are implications (there is no free lunch),
and I will show two examples of that. For this funnel pipeline, I would like to emphasize that we use the available cloud computing,
so AWS, Google Cloud, Google Earth Engine, and more recently the openEO Platform. Some of the Sentinel and other satellite archives are petabytes of data, so you can use cloud computing to compress, reduce and export this data from these platforms
to in-house computing infrastructure. This is what is called a hybrid computing infrastructure: we combine the best of both worlds. And why would you have your own computing infrastructure?
For example, why we have ours: when we prepare a dataset, imagine Sentinel or Landsat, we prepare several collections for Europe,
and when you start modeling, you can use your own infrastructure to do the predictions, look at the results, and implement the whole workflow without spending credits in the cloud.
Of course, you cannot host the entire Sentinel or Landsat archive internally; we are talking about petabytes of data. Maybe with a lot of money you could have a computing center able to do it. So currently that is the architecture we have, and I think it is a nice setup,
because it allows us to play a lot with the data and, for example, implement different gap-filling approaches, not just for very small areas but for large areas, and see what the impacts are at continental and global scales. And of course, doing this properly improves the quality of your output.
There is some simple stuff that we do, and there is a cost to it that you need to consider. For example, with the Landsat and Sentinel data, most of it you can find as integer values,
but some data is available as 32-bit floating point. You can convert it to a more compressed range, and that reduces the data size a lot. Of course this introduces something that might be sensitive for some applications;
I don't know, maybe for some forest or land degradation analyses it could matter, so you definitely need to test. But doing this simple conversion, you can really compress the range and reduce the data size.
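Here is a minimal sketch of that conversion with rasterio, assuming an index such as EVI in the range -1 to 1 that is scaled by 10,000 and stored as int16; file names and the nodata value are illustrative.

```python
# Minimal sketch: float32 -> int16 conversion with a fixed scaling factor.
import numpy as np
import rasterio

with rasterio.open("evi_float32.tif") as src:            # hypothetical input
    data = src.read(1)
    profile = src.profile

scaled = np.round(data * 10000)                          # e.g. 0.4321 -> 4321
scaled = np.where(np.isnan(data), -32768, scaled).astype("int16")

profile.update(dtype="int16", nodata=-32768)
with rasterio.open("evi_int16.tif", "w", **profile) as dst:
    dst.write(scaled, 1)                                 # roughly half the file size
```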
Another thing that we implemented is temporal aggregation. For Landsat you have a 16-day temporal resolution, and for Sentinel now you have five days. Maybe you do not need such a dense time series; maybe a monthly time series would work,
or a bi-monthly time series. So you can process in the cloud and aggregate over different time intervals, generating, for example, one temporal composite for your application and downloading that. That is one approach we are also implementing. And in some cases,
although it does not fit all applications, you can also reduce to coarser resolutions. We did that with MODIS: we ran an experiment to harmonize MODIS data,
which is available since 2000, with AVHRR data, a long-term time series available since 1982 at about one kilometer. For that case a coarse resolution was perfect for trying to harmonize both products.
It is more limited, but it is also an option. Here is one example: we took the whole MOD13Q1 vegetation index product. We also have this product internally
in our infrastructure; you can download it all from NASA, but you will be downloading many tiles, HDF files, with aggregated bands, so there is quite some work to download and process it, about 16 terabytes. The first two steps here we implemented in Google Earth Engine:
we removed the low-quality EVI values based on the pixel reliability, which is a quality mask in this product, and we did a bi-monthly aggregation; it is a 16-day product, which we aggregated to every two months, and we exported it from Google Earth Engine.
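Here is a minimal sketch of that kind of reduction using the Earth Engine Python API. The collection and band names are given as assumptions (MOD13Q1 with its EVI and pixel-reliability bands), and the dates, region and export settings are illustrative, not the exact pipeline from the talk.

```python
# Minimal sketch: mask MOD13Q1 EVI by pixel reliability, build a bi-monthly
# composite in Google Earth Engine and export it for in-house processing.
import ee

ee.Initialize()

col = ee.ImageCollection("MODIS/061/MOD13Q1").select(["EVI", "SummaryQA"])

def mask_reliability(img):
    # keep only pixels flagged as good or marginal in the reliability band
    good = img.select("SummaryQA").lte(1)
    return img.select("EVI").updateMask(good)

start = ee.Date("2010-01-01")
composite = (col.filterDate(start, start.advance(2, "month"))
                .map(mask_reliability)
                .median())

task = ee.batch.Export.image.toDrive(
    image=composite,
    description="mod13q1_evi_bimonthly_2010_01",
    region=ee.Geometry.Rectangle([-10, 35, 30, 60]),     # rough Europe bounding box
    scale=250,
    maxPixels=1e13,
)
task.start()
```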
These two steps implemented in Google Earth Engine, including the export of the data, took one day. So in one day we selected the desired band, removed the cloudy pixels and the pixels with shadow,
and implemented the temporal aggregation. After that we had full flexibility in our internal infrastructure to run several gap-filling tests and to consider, for example, different land masks to reduce the pixels to land only, removing water pixels
and things like that. So in this example we used Google Earth Engine, exported the data, and did all the rest internally on our own infrastructure. Now that we have this data, when we do it for all the bands we will just use our internal infrastructure; but it is one example of a nice solution if you are interested in only one band.
So this is the output that Google Earth Engine exported to us, aggregated to two months, here January 2010. Here we used, not all of our servers, but all the threads of one
to implement the gap-filling. It is a temporal gap-filling: when we do not have an observation in January, for example, it will look at other Januaries and at neighbouring months in time, so it focuses only on other observations
of that pixel in different periods. And of course there is a cost to it, so for your application it may not be suitable; but we also generate a gap-fill flag, so you can actually know whether a pixel
is an original pixel or a gap-filled pixel, and you can decide whether to use it or not. We provide both the gap-filled product and this per-pixel metadata, so it lets the users take the decision.
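Here is a minimal sketch of that temporal gap-filling idea on a small simulated (time, y, x) cube, including a flag layer that records what was filled; this is an illustration of the logic, not the production implementation.

```python
# Minimal sketch: fill gaps from the same period in other years, then from
# neighbouring periods, and record a per-pixel gap-fill flag.
import numpy as np

def gap_fill(cube, periods_per_year=6):
    filled = cube.copy()
    flag = np.zeros(cube.shape, dtype=np.uint8)       # 0 = original observation
    t = cube.shape[0]
    for i in range(t):
        missing = np.isnan(filled[i])
        # 1) same period in other years
        for j in range(i % periods_per_year, t, periods_per_year):
            if j == i:
                continue
            fill = missing & ~np.isnan(cube[j])
            filled[i][fill] = cube[j][fill]
            flag[i][fill] = 1                         # filled from another year
            missing = np.isnan(filled[i])
        # 2) neighbouring periods as a fallback
        for j in (i - 1, i + 1):
            if 0 <= j < t:
                fill = missing & ~np.isnan(cube[j])
                filled[i][fill] = cube[j][fill]
                flag[i][fill] = 2                     # filled from a neighbouring period
                missing = np.isnan(filled[i])
    return filled, flag

cube = np.random.default_rng(0).uniform(0, 1, (12, 50, 50))
cube[cube < 0.1] = np.nan                             # simulate cloud gaps
filled, flag = gap_fill(cube)
```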
And this is the output. Basically, you can see this time series and how it is kind of the Earth breathing, like a heart beating or something; it has a similarity with real life.
Yes. For some areas you see these seasonality aspects, and this is the size of the dataset: a complete, consistent and fully gap-filled data cube. In the end we organize it as a stack. It is not yet official, this stack for OpenLandMap, we are working on it, but you have all these layers ready to go,
with the flag indicating whether a pixel was gap-filled or not, publicly available. Okay, so this was MODIS. Now, when I joined OpenGeoHub, I wanted to work with the Landsat data and go back to 2000
and really map the big shifts and what is happening at 30 meters. To work with the Landsat data: it is a great dataset, but you always need to remember that it is a long satellite program, the longest actually, and over that period you have different satellites
and different sensors. Since 2000 we have Landsat 5, 7, 8 and now 9. USGS does a lot of work to try to harmonize it and put all these observations together,
but you cannot really cross-check and get a consistent time series of observations, because the sensors are different. Landsat 5, for example, is old technology, and from 2000 to around 2010 you mostly have Landsat 5. So you need to consider that all the sensors
produce different datasets, and actually different values, for the same aspects of the landscape, and that is tricky. When you mix all these observations to do the spatio-temporal modeling that Tom explained, you are actually using one single model that processes Landsat 5
and Landsat 8 together and maps the land cover, the soil, or the tree species in a specific pixel. So you need a highly harmonized dataset; otherwise you will just see trends and patterns that come from the data, not from what is happening in the landscape.
Considering that, there are a few harmonization techniques available, and when I started, I found these two the most interesting. First there is the GLAD Landsat ARD, an initiative developed by the University of Maryland. They have the whole archive internally,
petabytes of data; they processed everything, did a proper harmonization, and produced a very nice product where you can really compare values across Landsat 5, 7 and 8. They use it for all their products, so it has been applied in several products
and it is a well-tested dataset. That is one thing; but while they release the data, they do not release the software. The software is private, proprietary to them, but the data is free: you can download it and access an API. However, if you want to download, for example,
the whole of Europe or an entire continent, the speed is not great, even inside the US, where we did some tests. Just to give an idea: downloading all this data for Europe, from 2000 to 2020, took one month.
So it is a large dataset, but there is this bottleneck in how you can access the data. That is one solution. The other solution is FORCE, a very nice software solution, initially developed at
Humboldt University. Here they also add Sentinel-2, so they have one approach to harmonize everything, Landsat and Sentinel, and they really do the reprojection and gridding and generate the temporal composites. But with FORCE
you do not have ready data available; they provide only the software, so you need to do the processing yourself. I know there are now some initiatives working with FORCE, processing the data and making it available. But most of the work we did used the GLAD Landsat ARD,
because the data was already there. And even with this data, we still need to remove the clouds and shadows and implement the gap filling; if we added a new harmonization pipeline on top, it would also be a lot of processing time.
Here is just a conceptual explanation. USGS mostly delivers level 1 and level 2 products, and I know they now also deliver some level 3 products, but FORCE and GLAD work mostly at level 3 and level 4, so temporal composites and gridded data. Here's an example: if you download one GLAD tile, you will have all the Landsat bands plus one additional QA mask, a quality assessment band indicating where there are shadows, clouds, and so on. And of course, if you remove the pixels where you have clouds, you need to gap fill. They make everything available, but you still need to do a lot of work even after the download.
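Conceptually, that first post-download step can look something like the sketch below. The band layout and QA codes here are illustrative assumptions, not the actual GLAD ARD specification, and the file name is made up.

import numpy as np
import rasterio

# Mask out cloud/shadow pixels in one downloaded tile using its QA band.
# Assumptions: the QA mask is stored as the last band, and a QA value of 1
# means "clear observation" (both are hypothetical, for illustration only).
CLEAR_CODES = {1}

with rasterio.open("glad_tile_example.tif") as src:
    bands = src.read().astype("float32")   # shape: (n_bands, rows, cols)
    qa = bands[-1]                          # assumed: QA mask is the last band
    reflectance = bands[:-1]

# Set every flagged (cloudy/shadowed) pixel to NaN so it can be gap filled later
clear = np.isin(qa, list(CLEAR_CODES))
reflectance[:, ~clear] = np.nan

print(f"{(~clear).mean():.1%} of pixels flagged and set to NaN for gap filling")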
They also organize it into composite intervals. So this is not one image for a specific day; it's a temporal composite, a level 3 product. For example, the first composite covers the 1st of January until the 16th of January, so you have observations from different dates inside that time window, but they are completely harmonized, including observations from different satellites. You can have one pixel from Landsat 8, one pixel from Landsat 7, and so on. So after the download we did cloud removal, temporal aggregation to quarterly composites matching the seasons, gap filling, and mosaicking. Even after all this harmonization we still did a lot of work. Here's one example. We are working with this data for several areas of the world: Europe, the US, Brazil, Canada, and this is an example from Canada, covering this period: a temporal composite from the 13th of September 2019 to the 1st of December. For all these blank areas we didn't have any observation, maybe because the satellite couldn't obtain an image, or because of clouds or low quality. So we used our gap-filling approach, considering all the observations in time, and we filled these gaps with a valid observation.
But again, the observation used here could be from one or two years ago, and that has an impact depending on your application. So we always provide the flag mask, the gap-filling mask, indicating which pixels are true observations and which pixels were gap filled.
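As a much-simplified illustration of that idea, not the production pipeline, a per-pixel temporal filler can take the nearest valid observation in time and record a flag for every value it fills:

import numpy as np

def temporal_gap_fill(ts):
    """Fill NaN gaps in a per-pixel time series from the nearest valid
    neighbor in time (looking both backward and forward), and return a
    flag array marking which positions were filled.

    Simplified sketch, not the operational implementation.
    """
    ts = np.asarray(ts, dtype="float32")
    filled = ts.copy()
    flag = np.zeros(ts.shape, dtype=bool)
    valid_idx = np.flatnonzero(~np.isnan(ts))
    if valid_idx.size == 0:
        return filled, flag                      # nothing to fill from
    for i in np.flatnonzero(np.isnan(ts)):
        nearest = valid_idx[np.argmin(np.abs(valid_idx - i))]
        filled[i] = ts[nearest]
        flag[i] = True
    return filled, flag

# Example: quarterly NDVI-like series with two gaps
series = [0.41, np.nan, 0.52, 0.55, np.nan, 0.38]
values, gap_flag = temporal_gap_fill(series)
print(values)     # gaps replaced by the nearest valid observation in time
print(gap_flag)   # True where a value was gap filled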
And so this is what we see with Landsat, and this is what we can see with Google Maps, so it matches. Of course, if you are looking at the middle of the Amazon forest and there is deforestation, and you had a gap in the same place where the deforestation happened, you will not see it; there is no magic. But for a lot of applications, I would say it will fit. Okay, and how are we doing it? Really using our own computing infrastructure. Basically we have several large-capacity servers. Instead of having, I don't know, 30 or 40 mid-range servers, we really invested in high-capacity servers with as many threads as possible, considering the budget, of course. Most of our servers now have 96 threads and half a terabyte of RAM. That's a lot of RAM and a lot of threads, but if you want to work with the servers in an integrated way, you need to set up an HPC infrastructure solution.
What we did, basically, is use SLURM to split the job and send each of the servers a chunk of data to be processed. I will show it with a conceptual example.
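As a rough sketch of that pattern (the paths, tile list, and sbatch call are hypothetical, not the project's actual scripts), each SLURM array task can read its own index from the environment and process only the corresponding tile:

import os
import glob

import numpy as np
import rasterio

# Hypothetical worker script submitted as a SLURM array job, e.g.:
#   sbatch --array=0-999 --wrap "python worker.py"
# Each array task picks exactly one tile from a shared list and processes it.
tiles = sorted(glob.glob("/shared/tiles/*.tif"))          # assumed shared storage path
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))   # index assigned by SLURM
tile_path = tiles[task_id]

with rasterio.open(tile_path) as src:
    data = src.read(1).astype("float32")
    profile = src.profile

# Placeholder for the real per-chunk work (gap filling, prediction, ...)
profile.update(dtype="float32")
result = np.where(np.isnan(data), -9999, data)

out_path = f"/shared/output/tile_{task_id:04d}.tif"        # assumed output location
with rasterio.open(out_path, "w", **profile) as dst:
    dst.write(result, 1)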
But you also need to send the code. You develop some R or Python script and send it to be processed, and that will be something like a prediction or a gap-filling approach. So all the servers need access to a kind of distributed file system, and we are using Samba. It's not the proper choice for high-performance computing, but we use it to have compatibility with Windows and to access these servers from Windows. We are not using Samba to send or access data, though; we just use it to distribute, for example, the scripts and the code to be executed. To move data, we are basically using MinIO. It's a self-hosted solution that implements the S3 protocol. Imagine all these cloud object storage solutions, Amazon S3, Google: they created one solution that you can deploy on your own infrastructure and that has all the advantages of this protocol. It's highly reliable for HPC. For example, when we have that MODIS image I presented with several gaps,
I can put it on S3 and have several servers accessing it. Of course, it could be a bit slow because of network bottlenecks, but it will not crash, and that's what matters most. It's highly reliable and it handles high-throughput data access really well.
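For illustration, reading one window of a cloud-optimized GeoTIFF stored on a self-hosted MinIO server could look like this; the endpoint, bucket, credentials, and file name are all made up:

import rasterio
from rasterio.windows import Window

# Several workers can read the same cloud-optimized GeoTIFF served by MinIO
# over the S3 protocol. All connection details below are hypothetical.
with rasterio.Env(
    AWS_S3_ENDPOINT="minio.example.internal:9000",  # assumed MinIO endpoint
    AWS_ACCESS_KEY_ID="EXAMPLEKEY",
    AWS_SECRET_ACCESS_KEY="EXAMPLESECRET",
    AWS_HTTPS="NO",                                  # internal network, no TLS
    AWS_VIRTUAL_HOSTING="FALSE",
):
    with rasterio.open("/vsis3/eo-data/modis_lst_mosaic.tif") as src:
        # Each worker only requests the window it needs; the COG layout means
        # only the relevant byte ranges travel over the network.
        chunk = src.read(1, window=Window(0, 0, 1024, 1024))
        print(chunk.shape, src.crs)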
To get all these libraries and the processing running, we are using Docker. We have our own containers with GDAL, the main R libraries like mlr and mlr3, and Python libraries like rasterio, and that's where we develop these scripts. We do the development with RStudio and JupyterLab, but to do the processing we actually send the command to all the servers and create a new container with the same libraries, so exactly the same environment in which we developed the script. That is also very reliable: we will not crash any machine because of a wrong library, because it's the same environment. So basically here
we have a short explanation of all the solutions. One thing that I would like to say: in MinIO we are mostly hosting cloud-optimized GeoTIFFs. We did some tests with TileDB, the Zarr format, and things like that, but for me, cloud-optimized GeoTIFF is the best solution for us right now, because it allows us to visualize the data. I don't want a TileDB representation of, say, a very dense time series, or even that MODIS data, that I cannot see, because I need to access the data. We work with geospatial data, so I need to access it, see it, analyze it, and look at it at a global scale. When we did this evaluation, cloud-optimized GeoTIFF was the only solution that provides the pyramid layers, so I can use the same data structure and the same data format for the processing but also to visualize it in QGIS. Of course, if I use TileDB, maybe it's more scalable and can handle more concurrent access, but it's not easy to visualize at a global scale. Imagine it holds all the pixels of Europe and you try to visualize that: it doesn't have an overview structure the way a cloud-optimized GeoTIFF does.
I will talk more about that during my Python session. And yes, all the Docker containers that we are using are publicly available on Docker Hub, so you can just start using them if you want. In these images you have the latest version of GDAL, mlr, everything ready to go, and the same for Python. Everything is really integrated; you just start the container and can start working.
This is one example of the processing that we do, with the Landsat data for Europe. We start by creating what we call a tiling system. The tiling system is basically the set of places where we have data, following a predefined grid cell. Here it's 30 kilometers by 30 kilometers, which means 1000 by 1000 pixels. We send chunks of this data, splitting it across the different CPU servers, and these CPU servers access the file on the storage server, so you have a kind of bottleneck in the storage. Of course, if you are doing a really minor computation, say some NDVI calculation, the access time might be larger than the processing time. But for most of our applications that's not the case. For example, the gap filling is computationally heavy, so once you load the data, you can load just that chunk, do the processing, and later write it back to the storage. So this is the first step: all the servers start processing different chunks of the data using the same script. It's parallel computing; basically the same code is executed on different chunks of the data. And here's an example of our infrastructure in action.
Here we are using all the threads, and this analysis was performed mostly to calculate the gap filling and also the per-pixel trend analysis. After that, we just write the tile that we processed back to S3. In the end you have all the small tiles, not the big mosaic, sitting in S3 ready to be mosaicked. To do the mosaic we basically use GDAL: we build a VRT and use gdal_translate or gdalwarp to transform all these tiles into one cloud-optimized GeoTIFF, write the big mosaic, and later just drop the individual files. You could try to use some parallel writing; I know there are some solutions trying to do that. You could have a big mosaic and have all the servers write into different positions of it. But again, even with something like that, in the end you still need to convert to a cloud-optimized GeoTIFF to have a full picture of your result.
That's why we are just using cloud-optimized GeoTIFF internally. We started testing TileDB, but for me cloud-optimized GeoTIFF is still the nicer solution, because it also brings the pyramid layers and an overview of the whole area.
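A minimal sketch of that mosaicking step with the GDAL Python bindings; the paths are hypothetical, GDAL 3.1 or newer is assumed for the COG driver, and the actual pipeline may call the command-line tools instead:

import glob
from osgeo import gdal

gdal.UseExceptions()

# Hypothetical paths: the processed per-tile outputs and the final mosaic.
tiles = sorted(glob.glob("/shared/output/tile_*.tif"))

# 1. Build a lightweight virtual mosaic that just references the tiles.
vrt = gdal.BuildVRT("/shared/output/mosaic.vrt", tiles)
vrt = None  # flush the VRT to disk

# 2. Translate the VRT into a single cloud-optimized GeoTIFF
#    (the COG driver builds the internal tiling and overviews for us).
gdal.Translate(
    "/shared/output/mosaic_cog.tif",
    "/shared/output/mosaic.vrt",
    format="COG",
    creationOptions=["COMPRESS=DEFLATE", "BIGTIFF=YES"],
)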
Most of this processing is implemented in a Python library we call eumap. There are also several tutorials there, for the gap filling and the overlay; I used this library to implement the overlay for the hackathon. We also have a class called LandMapper that uses scikit-learn models to do the modeling and also the prediction. It's a kind of wrapper around scikit-learn in a geospatial context. We also have some wrappers for geopandas and rasterio, but mostly we try to put all the routines that we implemented in there and keep using it, mostly for our own applications.
Just to conclude. You saw some of the challenges we have: some methodological, some coming from the complexity of the data, or from spending a lot of time cleaning other people's data. There are also many technical challenges, and Lando showed you with real examples the challenges we have, for example, with something as well known as the Landsat and Sentinel missions. Once you get into them, if you want to do space-time analysis and modeling, you immediately notice what we saw: there are harmonization issues, there are gaps, there are artifacts in the images, and it also eats up time, because you have to do this gap filling and aggregation. So there are a lot of challenges. And I think the challenges that will remain are also the five groups we mentioned: training data issues, modeling issues, data distribution issues, usability issues, and governance issues. They will change. Maybe I missed some aspects here; if you come up with something, please let me know. And we live in amazing times. I've been in this field for almost 25 years now, and it's a really amazing time: satellite data is now publicly available, there is the open source software with its amazing developments, and hardware-wise we now have the possibility to process large datasets even with no budget, just using Google Earth Engine or open cloud platforms. Without any budget you can really do magical analysis. But still, even though it all looks bright and great, you will see that when you start doing real projects,
it takes careful planning and testing, and that's why we also do the hackathon. We created the hackathon out of this lecture; it's really connected. In the hackathon, whichever of the hackathons you choose, you have to do very careful planning. You first have to see what's possible and whether you will have enough time. And then for the second hackathon you do a sensitivity analysis, so you create functions where it's not about maximizing accuracy but about checking how badly your modeling could go and how sensitive the data, the software, and the algorithms are, whether to extrapolation, to spatial processing, et cetera. And if you work on tiles, how do you deal with the edge effects?
So if you just do a global map or something, how do you solve that? Yes, it's a good question, absolutely. There are applications where, let's say, the edge effects are a marginal problem, right? But there are applications where you have this proximity and spatial dependence. If you tile, you lose the spatial dependence, and then you will have artifacts. And I had exactly the problem you're asking about with a hydrological model on terrain data. When you do hydrological models, the watersheds go beyond the tiles. So when I started doing the tiles, of course the first time I just tested it: okay, this is ridiculous, I get the lines, so it's not going to work. So in my landmap package I made a function where, when you do the tiling, you can also specify the overlap. You have these tiles that overlap, and then you model with that overlap, you cut out the center, you glue it together like that, and then you don't see the lines. That's one way to hack that problem, let's say. But yeah, it's not simple. So one solution would be to have tiles with overlap, and you can even have the whole tile overlap,
so it becomes like a moving filter, right? You have a window of three by three tiles, you do the analysis, but you only take the results from the middle tile, and then you move to the next one. That works; I actually tried it like that for the whole of Africa.
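A minimal sketch of that overlap-and-crop idea on a single in-memory raster; the tile size, overlap, and placeholder operation are arbitrary illustration values:

import numpy as np

def process_with_overlap(raster, tile=256, overlap=32, func=np.sqrt):
    """Apply `func` tile by tile with an overlapping buffer, keeping only the
    central part of each processed tile to reduce edge artifacts.

    Simplified illustration: `func` stands in for any neighborhood-dependent
    operation (filtering, terrain analysis, ...).
    """
    rows, cols = raster.shape
    out = np.empty_like(raster, dtype="float32")
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Read a buffered window around the target tile
            rb0, cb0 = max(r0 - overlap, 0), max(c0 - overlap, 0)
            rb1 = min(r0 + tile + overlap, rows)
            cb1 = min(c0 + tile + overlap, cols)
            buffered = func(raster[rb0:rb1, cb0:cb1])
            # Keep only the central (non-buffer) part of the result
            r1, c1 = min(r0 + tile, rows), min(c0 + tile, cols)
            out[r0:r1, c0:c1] = buffered[r0 - rb0:r1 - rb0, c0 - cb0:c1 - cb0]
    return out

demo = np.random.rand(1000, 1000).astype("float32")
result = process_with_overlap(demo)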
Well, I guess it also depends on the spatial range and the dependency you have in the datasets, so short dependencies or long dependencies. And for the gap filling, are you doing it in space as well as in time? No, so the question is about the gap filling that Lando showed: the algorithms they use now are purely temporally based. They look at the temporal neighbors, so either January of last year and of the year after, or February, March. But it could be combined with the spatial domain, so you do both spatial and temporal. For the spatial part, though, it's very important that it's really just the edge around the pixels, because as you move further in space, you could really be extrapolating something. So I believe in what Lando does now, this temporal filtering, especially when you have something like 30 years; then you have so many neighbors in time that I don't think you need the spatial part, but it could also be combined. There are also functions we made, I think in the project, where we force pixels to be removed: we have the pixels, but we remove them randomly or something, and then we do cross-validation and see whether some algorithm, for example, is better than the others, whether something sticks out. We did that, right? Maybe you already said something about this. Yeah, we did that to analyze it, because you can also just do a kind of linear model.
Considering the whole time series, it's not perfect. You can also have, for example, a gap and just fit a linear model between the closest observations, like a straight line. So there are several ways to do it in the temporal domain. Basically what we are doing is expanding the window of temporal neighbors and taking the average, not only from the pixels in the past but also from the future. Depending on the application, that might have some implications. For example, a more conservative approach, still in the temporal domain, is to just keep the last valid observation. Then you make sure that, okay, this is the last land use and land cover, and I will not overestimate or bring some land cover change from the future and propagate it back to the past. So it depends. But the way to assess it objectively
is to create artificial gaps, test these different models, and compare them with the ground truth at the artificial gaps.
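A small sketch of that evaluation idea, reusing the simple temporal filler sketched earlier; the gap fraction and the error metric are arbitrary choices for illustration:

import numpy as np

# Assumes temporal_gap_fill() from the earlier sketch is available.

def evaluate_gap_filling(ts, fill_func, gap_fraction=0.2, seed=0):
    """Punch random artificial gaps into a complete time series, fill them
    with `fill_func`, and report the RMSE against the known true values."""
    rng = np.random.default_rng(seed)
    ts = np.asarray(ts, dtype="float32")
    holdout = rng.random(ts.shape) < gap_fraction      # which values to hide
    gapped = ts.copy()
    gapped[holdout] = np.nan
    filled, _ = fill_func(gapped)
    rmse = np.sqrt(np.nanmean((filled[holdout] - ts[holdout]) ** 2))
    return rmse

# Evaluate the simple nearest-neighbor filler on a synthetic seasonal series
t = np.arange(120)
series = 0.5 + 0.3 * np.sin(2 * np.pi * t / 12)
print("nearest-neighbor filler RMSE:",
      evaluate_gap_filling(series, temporal_gap_fill))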
For the spatial domain we also have some approaches, but it's a kind of weighted average, and when you have huge gaps it doesn't work really well; it's just a kind of interpolation. In some cases you don't have a pixel at all, maybe because the satellite doesn't cover that area, or it's close to the coast and the data is limited there. So in some cases we still use the spatial approach, but after the temporal gap filling. Yeah, I was thinking about a different spatial use case: the neighboring pixels differ at some point from the one you are missing, so you could also use the spatial neighborhood to indicate an error and improve the gap filling.
Not at all. You can construct these in any direction, I guess. The best is to do this as sensitivity and accuracy testing: take some kind of probability sample, take out actual pixels, and then test different gap-filling algorithms. That's, I think, the best way to go, and then apply it. But it also depends on the context, as Lando said; it depends what the application is. If the application is only deforestation, maybe some other gap filling will work better for that. But if the objective is just to fill in all the pixels for all the land cover, and as fast as possible because we don't want to compute for a week, right, then you can take something simpler, I think. But what Lando said is also important, let me emphasize it one more time: everything we gap fill, we provide the mask and we say this pixel is gap filled. So that has to stay. It's very important for science that you keep a record of any manipulation you do and that you specify that some gap filling was done, so it's clear it's not the actual image, it's a gap-filled value. Then, if there are issues, people can trace it back and say, oh yeah, I have this problem here, but I see it's because of the gap filling; something happened during that gap filling so that, for example, you don't see some deforestation. You see, one little thing about TileDB: I think it could be faster for some things, and imagine, it's beautiful.
You can take this multivariate, space-time data and put it in one file; it's a database file, TileDB. Then you put it in S3 and many thousands of people can access it and program against it in Python and R. But when you want to visualize it, there is a problem: you have to export from TileDB into GeoTIFF, then put it in QGIS, and then you can see it. And that little thing got us not to use it, because we tested it. Actually, Lando put the whole MODIS time series, for example, space-time, everything in one TileDB file. It's beautiful: you just take all the MODIS temperatures and put them in one TileDB file. It works like a database, so you can connect to it, you can extend it, you can search it, you can subset, you can do overlays; the sky's the limit. But to visualize it, you have to get it into GeoTIFF, and then we realized, oh, it's a pain. So we dropped it and just stayed with the cloud-optimized GeoTIFF.
And cloud-optimized GeoTIFF, I think most of you use this. Anybody not using it or needing some introduction, just raise your hand, please, so we know. So, yeah, Lando will do a gentle introduction to cloud-optimized GeoTIFF. In the end it's nothing special, but it is magical that you have just an image file which basically has indexing inside; it has pyramids, it has a tiling system. So when you start doing visualization, overlay, or subsetting, it goes so fast, and you can have, again, 1,000 people connecting to the same file and using it without downloading the whole file. That's a session coming on Wednesday, so you can follow that. So, I have a question about patterns.
You probably know that in ecology it's often not only about the value of certain things or the land cover category, but also about the spatial relationship between those categories. So there is different biodiversity depending on spatial patterns, and so on. From what I've heard so far, I guess you are not considering that while building the model, but do you plan to do so, or do you consider it for some future steps? That's a good point, if I understand you correctly. In the first hackathon we're just asking who will get the best predictions at test points, right? But you say, well, you estimate quantities, but what about the patterns and shapes and so on? Yes, I'm not sure what the best metric for that is. Maybe you have a robust metric for how well you match the shape? Imagine you have a land cover system and you want to map buildings, let's say; you know the buildings are usually squarish, and of course they cannot be too thin, they cannot be too long. So what is the one robust measure for that, do you maybe have one? Is there one robust measure of pattern accuracy? There are some ideas, but there is nothing which is one perfect solution; for patterns, such a metric doesn't really exist, so that's also going to be a huge problem for mapping land cover. But for sure, for land cover and also deforestation, the shapes and the spatial proximity and all these things matter, absolutely. We ignore it; many projects just ignore it. They just use points and check whether the point is correct or not. And I think that would be especially useful with your land cover for the last years: if you have the same product for 20 years, keeping this spatial consistency could be important.
That's an idea. Absolutely, I agree. It's just that we don't know how to do that yet. Okay, more questions? Marcus. So you were saying that 60% of the time was data cleaning, and the people here are lucky because you prepared the data for the hackathon, but in the usual case everyone does that for themselves. So it's not one person spending 60%, but 20 people each spending 60%. How could this be avoided? Is there some way to automate it, or is it so individual that it's not possible? Well, yeah, it's quite sad. For me also: I get a project and do some processing, and I look at myself and say, well, this week, out of five days I spent three days just checking, cleaning, removing. People are very creative. They don't know how to record a missing value, so they put a zero, I don't know, for missing values. And if I blindly use the data, it completely messes up the model. So how to automate that? Well, you could automate many things. You could make functions; I think in the tidyverse in R there's quite some functionality
where, while you load the data, you can add some cleaning and validity checks, for example for dates, whether the date is in the correct format. You could automate that, absolutely.
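A tiny sketch of that check-on-load idea in Python with pandas; the file, column names, and rules are hypothetical examples, not a specific dataset:

import pandas as pd

# Hypothetical CSV of field observations; column names and rules are examples only.
df = pd.read_csv("field_observations.csv")

# Dates: parse strictly and flag anything that is not a valid ISO date.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
bad_dates = df["date"].isna().sum()

# Suspicious zeros: a measured value of exactly 0 is often a disguised missing value.
suspect_zeros = (df["soil_organic_carbon"] == 0).sum()

# Coordinates: must fall inside valid longitude/latitude ranges.
bad_coords = (~df["lon"].between(-180, 180) | ~df["lat"].between(-90, 90)).sum()

print(f"{bad_dates} unparseable dates, {suspect_zeros} suspicious zeros, "
      f"{bad_coords} invalid coordinates")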
But some things are so complex that you really need a human with knowledge of biology or ecology to look at them. We now work with health data, and you really need a human to look at it because it's just too complex to automate. I guess that's mostly about how users clean the data; so not about automating that process, but maybe about the data providers providing the data in a certain way. Oh, this is all about standards: people using standards, harmonizing to the same standard, and then having software that pushes you. If you had software where, when everybody loads their data into it, it runs so many validity checks that the users are pushed to tidy up the data. And then you have international organizations like GBIF, and many organizations that force you to enter all the metadata, to fill in all the fields, and that goes through validity checks, and that's the way to increase the quality, yes. Okay, thanks. Martin, sorry. I would have one question on the data sharing that you mentioned,
and that's data licensing and data use policies. We are working in the context of global air quality, for example, and there you experience that many countries have either no open license or not even a license concept at all. That can make it very complicated if you pull data together and then want to use it and combine it. Okay, but you said it's for the air quality data? This was specifically air quality data, but I believe the same exists in many other fields once you start collecting field data. Of course, if you're working with remote sensing data, it's easy in a sense, because you only have to deal with a few space agencies. But if you have to connect to many individual research groups all over the world, that can be a potential problem. Yes, I'm aware of that, yes. We are pushing through OpenLandMap, one of our flagship projects,
and through Open-Earth-Monitor. We will be pushing to really get people to accept standards and simple setups, license standards, and also tools that can be used to import, harmonize, and bind data in an easier way. Then the data becomes more usable. And it has to go to GitHub; it has to be almost read-and-write, like a Wikipedia approach, for everyone, so that people can fix and improve it and it just constantly gets better and better. Yeah, that's the more technological aspect. The other aspect is, I guess, more the recognition aspect, which goes into the reward scheme in science: people who produce data feel they are not rewarded sufficiently if someone else then starts taking the data, using it, and publishing papers with it. So I think in that sense some education is necessary,
but also, I guess, one should not underestimate the time it takes to convince people to actually open up data and make it available. Yes. I just had some people send me an email; they did a project in Africa, because we built an Africa data cube, an agronomy data cube. And they said, oh, we have these points, would they be useful for you? Typically what I tell them first: get a DOI for your data, put it on Zenodo, get the DOI, get the publication if you can, right? And then just take some metadata standard, take some standard data format, something geo-compatible like GeoPackage, or some simple format like CSV, and put your data into that. Then put it there, pick some metadata standard, get the DOI, and get the recognition.
Absolutely. The problem is citations of data. People are still, well, we are funded by government, and government wants publications in Nature and Science. If you say, look, I made 50 datasets used by 10,000 people, they say, no, no, no, where's the publication? So you do have to publish as well. Luckily now there are the data journals, but getting the open access publication costs too much, if you ask me. That's a problem: we have these data journals, but they still want something like 2000 euros for a publication. So, but yeah, my first recommendation is: just register the data on Zenodo, put at least a minimum there. And Zenodo now works as an HTTP service. So if you put a cloud-optimized GeoTIFF there, I don't know if you know that, but if you put it on Zenodo, you have free hosting of cloud-optimized data. It's free, basically paid by the European citizens, because Zenodo is a side project of CERN. So if you put it there, it will work as a cloud-optimized solution. You can put, for example, a 50 gigabyte file there, and people don't have to download 50 gigabytes; they can just load it into QGIS and work with it.
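For illustration, streaming such a file without downloading it could look like this in Python; the Zenodo record URL below is made up:

import rasterio
from rasterio.windows import Window

# Hypothetical example of streaming a cloud-optimized GeoTIFF hosted on Zenodo
# (or any plain HTTP server) without downloading the whole file.
url = "/vsicurl/https://zenodo.org/record/0000000/files/example_cog.tif"

with rasterio.open(url) as src:
    # Read a coarse overview for a quick look, plus one small full-resolution window,
    # instead of pulling the whole raster over the network.
    preview = src.read(1, out_shape=(512, 512))
    window = src.read(1, window=Window(0, 0, 256, 256))
    print(src.width, src.height, preview.shape, window.shape)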