
Implementing an openEO compliant back-end for processing data cubes on the JEODPP


Formal Metadata

Title: Implementing an openEO compliant back-end for processing data cubes on the JEODPP
Number of Parts: 295
Author: Pieter Kempeneers
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Language: English

Content Metadata

Abstract: Funded by the European Commission as a H2020 project, openEO aims for a new web service standard to process Earth observation data cubes. At its core, openEO provides an HTTP application programming interface (API) that defines how users can discover Earth observation data cubes and process them on compliant cloud back-ends. The back-ends run their own API instance, translating the HTTP requests to their environment. The front-end API implementation can serve different languages (R, Python, JavaScript) on the client side. The openEO project foresees that APIs are implemented under an open source license (Apache 2.0). A number of back-ends compatible with this proposed openEO standard are already available. For instance, within the Joint Research Centre (JRC) of the European Commission, an API implemented in Python is being developed that can discover and process geospatial data collections available in the JRC Earth Observation Data and Processing Platform (JEODPP). For the back-end, an in-house library is being developed under the European Union Public Licence (EUPL). In this presentation, we will highlight the openEO concepts and focus in particular on the JRC back-end implementation details.
Transcript: English (auto-generated)
He's going to be talking about implementing an openEO-compliant back-end for processing data cubes on the JEODPP. Please join me in welcoming Pieter.
Thank you for this introduction. My name is Pieter Kempeneers. I'm with the Joint Research Centre of the European Commission, located in Ispra. I'm going to talk about the implementation of an openEO-compliant back-end for processing data cubes on our data and processing platform.
I would like to acknowledge my colleagues: Thomas Kliment, André, who is also in this room, Davide De Marchi and Pierre Soille. This is the outline of my talk.
I'm going to introduce this back-end, this data and processing platform. I'll also introduce you to the concept of openEO. There was a talk by Jeroen Dries on Wednesday, but in case you missed it, I will go over the concept here again.
Then I'll talk about how we try to incorporate openEO to make our back-end compliant with the concept. Then I will show you some results on the building blocks we have built, both for openEO and for our internal use.
There is a data catalogue to query our image collections. This is quite close to the previous presentation, which was also on data cubes; we take a similar approach. We have also been working quite hard on an open source Python package called pyjeo, which is also used for the results I show you at the end.
Hopefully, if there is some time, I will show two short videos live. What is this JEODPP? It's a processing platform with about 12 petabytes of data.
Most of it is Copernicus Sentinel-2 data. The data are connected to the processing nodes with 10-gigabit switches. For the processing itself, we have about 1,500 cores.
Then I will go through the components of this data and processing platform, which we call JEO-desk, JEO-batch and JEO-lab; especially the last two I will cover in a bit more detail.
Here is a schematic overview of what the JEODPP is about. As the entry point from the internet, we have the web services. There is an internal part of the web services, which you see here as the JRC web services.
Then there is the openEO part, which is supposed to be the entry point for users in general. This is a proof of concept, so it's not yet public, let's say, for now. There is the data catalog that queries the data we have stored on our platform.
One layer below, we have the openEO back-end that makes things happen for openEO, and below that the web services for internal use: JEO-desk, JEO-batch and JEO-lab.
All of this is served by the pyjeo library in Python that we've designed. Very briefly, what are these internal web services? There is JEO-desk, which is nothing other than a clientless desktop gateway.
You have the impression of working on your own desktop, but in the end it is just a web service. The advantage is that all the software is already pre-installed and there is a very fast connection to the data.
Because it's located near the data, it's much faster than having your own desktop and trying to access the data over the network. Then there is JEO-batch, which allows users to process data at scale. And there is JEO-lab, which is similar to Google Earth Engine.
It's a visualization and analysis platform that processes in deferred mode: it only processes what you see, at the resolution you see it in your viewer.
Let's go into a bit more detail on how to make this compliant with openEO and the idea behind it. The general concept is that clients can access openEO in their own programming language, be it R, Python or JavaScript, and push their code onto any of the compliant back-ends, making it interoperable. What makes this happen is a standardized communication interface in between that translates the client-side programs. How does it work?
The clients translate R, Python or JavaScript code into an intermediate language in the form of a graph, which is then transformed into a JSON file. The different back-ends are then responsible for translating this standardized graph, defined in what we call the core API of openEO, into their own implementation.
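To make this concrete, here is a minimal client-side sketch using the openeo Python client package; the back-end URL, collection name and extents are placeholders, not the actual JEODPP endpoint:

    # Hypothetical client-side script; URL, collection and extents are placeholders.
    import openeo

    # Connect to an openEO-compliant back-end.
    connection = openeo.connect("https://openeo.example.org")

    # Build a lazy data cube; nothing is processed on the client side.
    cube = connection.load_collection(
        "SENTINEL2_L2A",
        spatial_extent={"west": 25.9, "south": 44.3, "east": 26.3, "north": 44.6},
        temporal_extent=["2017-06-01", "2017-08-31"],
        bands=["B04", "B08"],
    )

    # Downloading serializes the graph to JSON and executes it on the back-end.
    cube.download("subset.tif")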
This is how it looks in more practical terms. We start with a collection. This is more of an abstract idea, as we saw before with the data cubes. Rather than a file-based approach, we're not talking about files anymore; users think in terms of a collection, which can be, for example, Sentinel-2 Level 2A.
The collection is then filtered by a geographical bounding box and a date range. This data is loaded into on-the-fly data cubes, where the user specifies a spatial resolution, a temporal resolution, a projection, and perhaps a resampling method if it's not the default nearest neighbour, and we create a data cube on the fly. This is the internal data model: we're working with three-dimensional data cubes, where the third dimension is used for time, x and y are the spatial dimensions, and the spectral bands are internally separate data pointers.
It's in fact a multi-band, three-dimensional data cube.
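As an illustration only (not the actual pyjeo data structures), this data model can be pictured in NumPy terms roughly as follows:

    import numpy as np

    # One three-dimensional (time, y, x) array per spectral band; together
    # they behave as a multi-band, three-dimensional data cube.
    n_time, n_y, n_x = 12, 1024, 1024
    bands = ["B02", "B03", "B04", "B08"]
    cube = {b: np.zeros((n_time, n_y, n_x), dtype=np.uint16) for b in bands}

    # A per-band reduction over the time dimension, e.g. a median composite:
    composite = {b: np.median(data, axis=0) for b, data in cube.items()}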
This data cube is then further processed by interpreting the graph sent via openEO to the server. The graph is parsed and translated into the back-end's own implementation details. This is the internal kitchen of the back-end, up to the point where the last node says: I want to save this data. The data is then fed back to the user or visualized on screen.
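For reference, such a process graph looks roughly like the following JSON, shown here as a Python dict; the node names, collection identifier and extents are illustrative:

    # Illustrative openEO process graph: load a collection filtered by a
    # bounding box and a date range, then save the result.
    process_graph = {
        "load": {
            "process_id": "load_collection",
            "arguments": {
                "id": "SENTINEL2_L2A",
                "spatial_extent": {"west": 25.9, "south": 44.3,
                                   "east": 26.3, "north": 44.6},
                "temporal_extent": ["2017-06-01", "2017-08-31"],
            },
        },
        "save": {
            "process_id": "save_result",
            "arguments": {"data": {"from_node": "load"}, "format": "GTiff"},
            "result": True,  # marks the final node of the graph
        },
    }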
To make this happen, we've created the pyjeo library in Python, which serves not only openEO but also the other internal services for our users at the Joint Research Centre.
I'll go into a bit more detail about this library. It's a Python package for the analysis of geospatial data. It's developed in-house, and we are in the process of making it open source under the European Union Public Licence.
It is able to bridge to other libraries, such as NumPy, without duplicating the memory. At its core it is written in C and C++ for performance, and we have bound it automatically through SWIG for flexibility, which makes it easier to prototype. It can be used both in batch mode and in interactive mode, as we will see later in the examples. Just a few words on how this is done. What you see on the left are the core libraries written in C and C++.
There is an interface file you have to write manually. This is then used by SWIG to create a SWIG module, and compiling and linking all these things together gives you a Python package.
The magic is all done by SWIG. What you define in this interface file is a list of the functions you want to bind; there is also a whole world of pain you can get into if you define typemaps and want to customize things. Other than that, it works quite well out of the box. It's still magic to me how it works, but in the end you get a Python package from your C++ library.
Then, because this is an automatic system, the Python package you end up with is not so easy to work with. So what we've done here, and this is where André comes into the scene: he's done a great job putting this all into Python modules.
For the user, there's only what you see on the right-hand side in yellow: these Python modules, which access the jiplib module here, the SWIG module, which in turn accesses the dynamic libraries that the C++ code was compiled into.
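The layering can be sketched like this; this is a hypothetical illustration only, where the low-level names (jiplib, createJim) are assumptions rather than the actual pyjeo source:

    import jiplib  # assumed name of the SWIG-generated low-level module


    class Jim:
        """User-facing raster object delegating to the low-level SWIG object."""

        def __init__(self, filename):
            # Assumed low-level factory function exposed by the SWIG module.
            self._jim = jiplib.createJim(filename)

        def crop(self, ulx, uly, lrx, lry):
            # Translate a friendly Python call into the C++-style signature.
            self._jim = self._jim.crop(ulx, uly, lrx, lry)
            return self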
Let's see how this can be used. The example I'll give here is something we've been working on for quite some time: creating a global Sentinel-2 cloud-free composite based on Level 2A, so atmospherically corrected, data, produced at full spatial and spectral resolution. For each of the spectral bands, we use the native spatial resolution of 10, 20 or 60 metres. At the time we built this, we used data from 2017.
The atmospherically corrected data were not yet delivered by ESA, so we had to produce them on our platform with the Sen2Cor software. For those of you who are interested, we have published the result, and all this data you can download. For now, it's still all in tiles, but we're also planning a web map service
so that you can visualize the data more easily. Internally, however, here are some details on the large-scale processing. We have kept the original resolutions of the MGRS tiles.
There are about 30,000 tiles covering the landmass of the world, and it took about 15 hours to process them all. To visualize this internally, while all the data are still in tiled format, we have grouped the tiles that share the same projection, created a VRT file for each group, and created overviews for each of those VRT files. By creating a collection, closing the loop as you were mentioning in your presentation,
we get a collection back. We can use it for visualization, but also for processing. This is how it's done. If it works, I would like to show you a short video of how we visualize this on our platform.
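As a sketch of this approach (paths and overview levels are placeholders), the VRT files and overviews can be created with the GDAL Python bindings:

    # Build one VRT per group of tiles sharing a projection, then add
    # overviews so a viewer can browse the mosaic at any zoom level.
    import glob
    from osgeo import gdal

    tiles = glob.glob("/data/composite/utm32n/*.tif")  # placeholder path
    vrt = gdal.BuildVRT("/data/composite/utm32n.vrt", tiles)
    vrt.BuildOverviews("NEAREST", [2, 4, 8, 16, 32])
    vrt = None  # close the dataset so everything is flushed to disk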
This should be this one here.
Somewhere hidden. Maybe I have to click on one of those.
Somewhere on the other. The room next door, maybe.
Here, this is in a Jupyter notebook.
We create a collection in the upper-left cell, and then the tiles are accessed. We can zoom and browse, as we know it from Google Earth Engine. This is the actual product that has been created. You see the visual bands here at 10 metres spatial resolution.
I zoom here into the area of Bucharest. The nice thing is that you don't have to prepare anything other than creating these virtual files. With their overviews, you can just browse, and whenever you zoom, the right tiles are selected and loaded. Besides just visualizing, what we can also do is
interactively process the data with the pyjeo library. This is done in JEO-lab.
At the upper left, we create a collection, zooming into an area and selecting Sentinel-2 data. Here we write the code, importing the pyjeo library; this is the same code we would normally use for the large-scale processing. Here, it is used in the deferred processing mode: if we create a map, zoom into an area, or pan the map, only the area in view is processed; the code is executed in deferred processing.
This allows you to prototype as if you were processing the entire dataset for the whole world, which is, of course, much faster. Otherwise, you would have to process the entire globe only to find out at the end that your algorithm does not work in this or that part of the world. It's much easier in the interactive mode, where you can just browse to an area and the data are processed only at that resolution and in that area.
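Conceptually, and as a sketch of the idea rather than the actual JEO-lab implementation, deferred processing means reading and processing only the window currently in view, at the viewer's resolution:

    # Sketch of deferred processing: only the visible window is read, at
    # screen resolution, and the user's function is applied to that window.
    from osgeo import gdal


    def process_view(path, user_func, ulx, uly, lrx, lry, width, height):
        ds = gdal.Open(path)
        # Extract just the visible bounding box, resampled to screen size.
        view = gdal.Translate("/vsimem/view.tif", ds,
                              projWin=[ulx, uly, lrx, lry],
                              width=width, height=height)
        return user_func(view.ReadAsArray())  # e.g. an NDVI or compositing step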
I'll end with a conclusion, and then maybe we can show the other video. I've gone into some details of the JEODPP back-end, showing what the pyjeo Python package is all about. It can serve both the batch and the interactive processing mode.
We have seen the components that make it openEO compliant, for example the collection catalog that is written as a RESTful API; we can create on-the-fly data cubes from collections and then process these data cubes.
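For illustration, discovery against such a RESTful API follows the standard openEO endpoints (the URL below is a placeholder):

    # Discover an openEO-compliant back-end over its RESTful API.
    import requests

    api = "https://openeo.example.org/v1.0"  # placeholder endpoint
    collections = requests.get(f"{api}/collections").json()  # data discovery
    processes = requests.get(f"{api}/processes").json()      # process discovery
    # A process graph is submitted as JSON, e.g. POST to f"{api}/jobs" for
    # batch jobs or to f"{api}/result" for synchronous processing.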
There's still some work in progress; we're not there yet. We still have to work on the job queues, the graph parsing, and the user authentication. To finish off here, I would like to see if this other video is also working.
I'm not sure. This is this one. I'll leave this up to you. Here you see, I'm creating a collection here on the upper left.
This is the actual code that's going to be executed. If we copy this code, it can also be used to run on the full tiles at large scale. Here, this is to compare two different processing approaches.
In a split window, we will see the difference between two different implementations. We create a map, and what you see is again the deferred processing mode. The left part is one implementation, the right part is another implementation,
and we can see which of the two performs better in certain areas. To give an example, one is actually using some of the classification files produced by Sen2Cor: there is some cloud information, some vegetation information, classifications.
On the right is an actual distance-to-cloud map. You can see some dark spots over there, which come from a shadow: the larger distances are not taken into account on the left, but they are on the right. There is also a flaw in the processing on the right, where some remaining clouds were not detected by the cloud algorithm. Even if you take the distance to cloud into account, if the cloud was not detected, the distance doesn't bring anything. The other algorithm, for example, took the maximum NDVI into account, favouring the pixels with maximum NDVI,
and so did not pick that cloud. This just shows that each of the algorithms has its own pros and flaws, and with this interactive approach it's much easier to design your algorithm. With this, I'd like to end.
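As a sketch of the max-NDVI compositing rule mentioned above (assuming per-band arrays stacked over time; this is not the actual JRC code):

    import numpy as np


    def max_ndvi_composite(red, nir):
        """Keep, per pixel, the observation with the highest NDVI over time.

        red, nir: arrays of shape (time, y, x); returns two (y, x) bands.
        """
        red = np.asarray(red, dtype="float32")
        nir = np.asarray(nir, dtype="float32")
        ndvi = (nir - red) / np.clip(nir + red, 1e-6, None)  # avoid zero division
        best = np.argmax(ndvi, axis=0)             # index of best date per pixel
        rows, cols = np.indices(best.shape)
        return red[best, rows, cols], nir[best, rows, cols]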
Great. Thanks. Well, thank you, Pieter. I think that's really interesting. I think the interactive, on-the-fly processing and visualizations are really cool. We don't have that. I'm jealous. We've got questions in the audience.
How well is the NetCDF format supported? All the presentations I see with open data cubes are just processing Landsat and Sentinel data, but no one actually speaks about NetCDF files, which are a very complex file format. For now, we're not using NetCDF. We're using GDAL as the backbone to access all the data.
Normally, all GDAL formats should be supported. Here, we try to use the formats as they come from the data provider. In the case of Sentinel-2, we directly use the JPEG 2000 files, and we store them as such.
As those are the majority of the data we're using, we're not using NetCDF for the moment. Right. Thanks.
If I've understood correctly, you write an algorithm in Python, and it's transcribed to some JavaScript. Is that right? The library is a bit like what you have with the GDAL bindings for Python.
The user is writing everything in Python. The pyjeo library is used by the user, and under the hood, it accesses the C++ code.
My question was more about the algorithm. I'm not completely familiar with openEO. Is the Python way of expressing your algorithms something that is standardized by openEO, or is it your own API?
That was my question. I understand the question better now. Part of it is what I've shown at the end. The example was not really an application of openEO; it was not ready yet for openEO. It was written in pure Python.
However, if we solved it the openEO way, which is not fully there yet, the user would write their own code in their own language. It could be R, Python, or JavaScript.
Then there's an interpreter in between: each of those languages has its own client to translate this into a middle layer. The middle layer is a kind of graph where you define all the different steps that are needed, in a standardized language. This is the core API.
Each of the back-ends interprets this standard language into its own implementation. That could be whatever language; it's up to the back-end to do this work. The code we have been using here to create those global mosaics was not yet ready for openEO. That was just plain Python.