
Bringing the App to the data - State of Play for EO data exploitation


Formal Metadata

Title
Bringing the App to the data - State of Play for EO data exploitation
Title of Series
Number of Parts
57
Author
Ingo Simonis (Open Geospatial Consortium)
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Place
Wageningen

Content Metadata

Subject Area
Genre
Abstract
Ingo Simonis is chief technology innovation officer at the Open Geospatial Consortium (OGC). At the ODSE conference, he explained how platforms for the exploitation of Earth Observation (EO) data have been developed by public and private companies in order to foster the usage of EO data and expand the market for Earth Observation-derived information. His talk described the general architecture, demonstrated best practices, and included recommendations for application design patterns, package encoding, and container and data interfaces for data stage-in and stage-out strategies. The session further outlined how to interact with such a system using Jupyter Notebooks and OGC Web APIs.
Keywords
Physical system
Transcript: English (auto-generated)
Let's get started here. What's the motivation? Well, one example is understanding Earth system processes, a classical scenario where we have lots of data involved to fully
understand how the system functions. And all these different data sets are located in different archives, very often cloud-based these days. And one of the key aspects is really that we not only have so many different data sets,
but we have lots of processes that actually work on these data sets. These processes then need to be applied to very heterogeneous data. And often enough, it is not that we download the data anymore, but we need to somehow bring the application to the data to enable the in-cloud processing.
The key challenge then is, well, which of the processes can actually be applied to what data? This mapping between processes and data is extremely important. So let's look at a typical in-cloud scientific workflow scenario. So here's a scenario I received from the German Aerospace Center, DLR.
It's a multi-data processing environment that uses data from different satellites, conducts a number of pre-processing steps as we see on the left-hand side, like cloud masking, NDVI calculation, temporal aggregation
statistics, and so on. And then there is a second major processing step where the actual classification of settlements based on a random forest classifier is executed. And then we get a world settlement footprint as the final product. So a classical workflow.
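To make the shape of that workflow concrete, here is a minimal sketch in Python. It uses synthetic arrays in place of the real satellite scenes and scikit-learn's RandomForestClassifier as a stand-in for DLR's actual settlement classifier; none of the variable names, thresholds, or labels come from the talk.

# Minimal sketch of the workflow described above (synthetic data, not DLR's pipeline).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stack of multispectral scenes: (time, band, y, x); band 0 = red, band 1 = NIR.
scenes = rng.random((6, 2, 64, 64))
cloud_prob = rng.random((6, 64, 64))

# 1. Cloud masking: discard pixels the (synthetic) cloud mask flags as cloudy.
clear = cloud_prob < 0.8
red = np.where(clear, scenes[:, 0], np.nan)
nir = np.where(clear, scenes[:, 1], np.nan)

# 2. NDVI calculation per scene.
ndvi = (nir - red) / (nir + red + 1e-9)

# 3. Temporal aggregation statistics over the time axis.
features = np.stack([np.nanmean(ndvi, axis=0),
                     np.nanmin(ndvi, axis=0),
                     np.nanmax(ndvi, axis=0)], axis=-1)
X = np.nan_to_num(features.reshape(-1, 3))

# 4. Random forest classification into settlement / non-settlement.
#    The labels here are random placeholders; in reality they come from training data.
y_train = rng.integers(0, 2, size=X.shape[0])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_train)

# 5. The final product: a settlement mask in the original raster shape.
settlement_mask = clf.predict(X).reshape(64, 64)
print(settlement_mask.sum(), "pixels classified as settlement")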
If you go back to our data cloud environment, well, we do have three different data sets. We do have the major processing steps, and we want to produce our results. So if we decompose this now a little bit with our actual scenario, we do have Sentinel-2, the multispectral data.
We have analysis-ready data coming from Landsat Collection 2. And we have some SAR data from Sentinel-1. We run all these different processes that we can see here. And then at the end, we get the settlement mask. We can now do all of this in a cloud environment.
Well, we hold all the components in an environment, the data and the processes. So we basically need to design our workflow, our scientific workflow, or our processing graph. And then we have the results. As long as this is all in one toolbox, like, for example, if you try to execute something like that in Google Earth Engine, it's rather simple.
But then we are fully limited to the functionality and the data that is provided by this toolbox. What we want to have instead is we want to have multiple environments. And for sure, then, we need some intercloud communication.
That's for sure. We process different data sets at their native location. And then we may need to exchange some results. And we do have a number of, let's say, widely accepted algorithms, for example, the ESA SNAP package for satellite data
processing. Now, it would be much easier for these different cloud environments if we would have a single API supporting, well, both of them in this case, or maybe all three of them if we consider three processing clouds, like in our example.
So this API could support some sort of processing, some sort of widely accepted processing. And it needs to allow the fine-grained access to those results that come out of these different data processing steps. Then, what in addition do we need? Well, we need a mechanism to develop additional applications
because, as we have seen in our scenario, some are very standard processes like cloud masking. But others, for example the random forest classifier, are tools that need to be developed by experts and are very custom-made.
And what we need to do is, well, we need to develop this tool somewhere and then deploy it in the clouds. And then we want to execute it there next to the physical location of the data to avoid these costly data transfers.
In the Open Geospatial Consortium, what we did over the last couple of years, with support mostly from ESA and some support from NASA and Natural Resources Canada, was to develop an architecture and an API. The API is called DAPA, the Data Access and Processing API. It allows access to the data sets,
and it provides support for a number of classical processing steps or functionalities like statistical values, min, max, average, temporal aggregation, and so on. In addition to that API, we developed a second one,
which is called ADES, Application Deployment and Execution Service. And that one now allows us to develop an application, submit it to that cloud via a standardized interface, request its deployment and then its execution with a specific parameterization.
So we do have two APIs that complement each other. One is providing the access to the data and the access to, let's say, built-in functions. And the other one is an API that allows us to develop an application, describe it with all necessary detail, and then submit it to the cloud environment
where it can be used to process the data that is already part of that cloud environment. Here's what it looks like. So we do have these two processes on the left-hand side. We have the application developer. That application developer develops the application
maybe locally and then packs it into a container, ships the container together with some metadata to the cloud environment. And there it gets deployed and then executed on demand. On the right-hand side, we have the application consumer
who on the one side uses the DAPA API to access the products that are already there, or uses the functions that are already supported, these classical widely accepted functions. But on the other side,
this application consumer can even say, hey, I want to use applications as developed by these application developers. And the beauty is that we have two completely decoupled cycles here. So this works pretty much like any app store we are used to from our cell phones, where someone develops an application,
the consumer discovers this application, thinks, oh, this random forest classifier could be used for my world settlement application. So I want to give that a try and request the deployment within the cloud and then execute it to see what products come out of it. So this is actually a first very important step
towards a marketplace for Earth observation applications. And it's a great chance for application developers because whenever they have produced something that they think could be of value to others, they can make it available.
On the one side, I mean, they could do this before as open source, but here they have a chance to, in addition, make it available as an application that can be loaded on request, and then they can sell their knowledge, and the work they invested into the development of that application
actually gets reimbursed. So it's the generation of a new market that we see here. If you want to try things out, well, there are options to do so. One is a product coming out of an OGC initiative called Testbed-16.
So you see the link here. That environment, developed by EOX in Austria, allows you to play around with the Data Access and Processing API. It makes data from a number of collections available, Landsat, Sentinel, MODIS,
and it provides a Jupyter environment to run all the experiments. It's all cloud-based. It's all for free. You can apply for a free account. And then you can execute these different processing workflows. There are a number of tutorials available.
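As a hedged illustration of what such an experiment might look like from one of those Jupyter notebooks, here is a sketch of a DAPA-style request; the base URL, the endpoint path, and the parameter names are assumptions for illustration, not the exact Testbed-16 interface.

# Hypothetical DAPA-style request: ask the server for a temporal aggregate over an
# area of interest instead of downloading whole scenes to the notebook.
import requests

BASE = "https://dapa.example.com"  # placeholder, not the actual Testbed-16 endpoint

params = {
    "bbox": "11.4,48.0,11.8,48.3",        # area of interest (lon/lat)
    "datetime": "2020-06-01/2020-08-31",  # time range to aggregate over
    "fields": "NDVI",                     # assumed field name
    "functions": "min,max,mean",          # assumed aggregation functions
}
resp = requests.get(
    f"{BASE}/collections/sentinel-2-l2a/dapa/area:aggregate-time",  # assumed path
    params=params,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"NDVI": {"min": ..., "max": ..., "mean": ...}}

The point is that the aggregation runs server-side, next to the data, and only the small statistical result travels back to the notebook.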
The data sets are described in detail. Everything runs in a Jupyter notebook or a couple of these. So it's straightforward to experiment with it. And it's quite interesting to see how easy it can be these days to actually use something that is developed
by someone else for your own scientific workflows. There are further experiments that we did. One is the Testbed-16 that we have already mentioned. We have three major engineering reports
that came out of this initiative. One is about data access and processing, so how can I apply functions in the cloud? The other one is about the API that we have seen, so what does the API actually look like, how can I interact with it?
And the third one is specific to the usage of what we call Earth Observation Application Packages, which package the application, the container, the software container, and all the metadata you need, with Jupyter notebooks, which are probably among the best mechanisms
we have these days to interact with those different applications. Then everything was stress tested, so everything we have seen in the last couple of slides, the data access and processing as well as the application deployment and execution,
in an initiative called the Earth Observation Applications Pilot. If you go to ogc.org/eoapps, you find quite a number of engineering reports that describe this in detail. You find a number of YouTube videos,
if you follow the link, that demonstrate how you can now package your application into a container, you describe your application in terms of what it produces and what it requires. You describe how the cloud environment
mounts the data for your application. So keep in mind that your container is a very volatile thing: it is loaded, then it exists, but then it disappears after its termination. So somehow you need to make the data available to this environment, to this containerized application.
That is all described in this application package. And if you deliver that, together with the description, what additional parameters a user needs to set in order to execute it, then you're already all set to allow the user to make use of your application.
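To make the deploy-then-execute interaction tangible, here is a hedged sketch in the style of OGC API Processes; the server URL, the package reference, and the payload field names are illustrative assumptions rather than a verbatim ADES request.

# Hypothetical ADES-style interaction: deploy a packaged application, then execute it.
import requests

ADES = "https://ades.example-cloud.org"  # placeholder exploitation platform

# 1. Deploy: point the platform at an application package that bundles the
#    container reference plus the metadata describing inputs and outputs.
deploy_payload = {
    "executionUnit": {"href": "https://example.org/packages/settlement-classifier.cwl"},
}
requests.post(f"{ADES}/processes", json=deploy_payload, timeout=60).raise_for_status()

# 2. Execute: run the now-deployed process next to the data, with user parameters.
execute_payload = {
    "inputs": {
        "aoi": "11.4,48.0,11.8,48.3",
        "time_range": "2020-06-01/2020-08-31",
    }
}
job = requests.post(
    f"{ADES}/processes/settlement-classifier/execution",
    json=execute_payload,
    timeout=60,
)
job.raise_for_status()
print("job status URL:", job.headers.get("Location"))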
There is future work that is currently ongoing in Testbed-17. One part is about geodata cubes. And even though at first sight it appears that a data cube is just a multi-dimensional structure
that allows us to access data in a very efficient way, well, that's what we thought at the beginning of this initiative. And now what we see is a move away from a cube that's just a multi-dimensional structure, or just an indexing system to some data located in the cloud.
We see that data cubes are often snapshots within scientific workflows or within any processing workflows because the modern conception of a data cube is not only a storage for data,
but a combination of storage for data with a set of functions to interact with that data cube. And the interaction is not limited to retrieving subsets of data from a cube; it again allows in-cube processing.
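As a small illustration of that combination of storage plus functions, here is a hedged sketch using xarray with a purely synthetic cube; the dimension and variable names are made up for the example.

# Toy data cube: subset retrieval plus a simple "in-cube" computation with xarray.
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
cube = xr.Dataset(
    {"ndvi": (("time", "y", "x"), rng.random((12, 50, 50)))},
    coords={
        "time": pd.date_range("2020-01-01", periods=12, freq="MS"),
        "y": np.linspace(48.3, 48.0, 50),
        "x": np.linspace(11.4, 11.8, 50),
    },
)

# Retrieve only a temporal subset instead of the whole cube ...
subset = cube.sel(time=slice("2020-06-01", "2020-08-31"))

# ... and run the aggregation "in the cube" rather than after a bulk download.
summer_mean = subset["ndvi"].mean(dim="time")
print(summer_mean.shape)  # (50, 50)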
And we say in-cube processing, well, it may happen directly in the cube or it happens in the cloud environment around the cube, but we see that the concept of geodata cubes is more and more aligned actually with the idea that you generate an application locally,
you submit it to a cloud environment, there you can embed it into a workflow and then you can make individual snapshots of your processing chain available as data cubes and allow the consumers to even interact with the snapshot so that you may generate different visualizations
or you may continue the workflow differently within a specific user community. So this is a very exciting field where we initially thought, well, we just generate an API to a cube; what we now see is that it actually continues
this application-to-the-data paradigm and mechanism. And then, for sure, there are some other interesting elements like Cloud-Optimized GeoTIFF and the Zarr format to store the data, and both play a role in this context,
because, first of all, lots of data these days is available as Cloud-Optimized GeoTIFF, so you can access it very, very efficiently, which is interesting when bringing the application to the cloud environment. And second, you need some mechanism
to deliver your results, and interoperability to a large extent depends on your consumer actually understanding what you are delivering, and we see Zarr possibly playing an important role in this context.
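As an aside on why these cloud-optimized formats matter, here is a hedged sketch of a windowed read from a hypothetical Cloud-Optimized GeoTIFF using rasterio; only the requested block is fetched over HTTP, which is what makes this kind of in-cloud access efficient. The URL is a placeholder.

# Hypothetical windowed read from a Cloud-Optimized GeoTIFF: only the requested
# block is fetched over HTTP range requests, not the whole file.
import rasterio
from rasterio.windows import Window

COG_URL = "https://data.example.com/sentinel2/B04_cog.tif"  # placeholder URL

with rasterio.open(COG_URL) as src:
    # Read a 512x512 pixel window from band 1.
    block = src.read(1, window=Window(col_off=1024, row_off=1024,
                                      width=512, height=512))
print(block.shape, block.dtype)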
And I think that concludes my presentation. So I hope I could convey a bit of this application-to-the-data status quo in this presentation. And yeah, thank you very much for your attention. And I hope I stayed more or less in time. Yes, and I actually have to say I'm very sorry.
I told you 15 minutes. Your specific talk actually has 25 minutes. But it's excellent. So we have more time for questions. So we're opening the floor. Questions. Okay, I will start. So you mentioned FAIR, right?
So FAIR is also like, FAIR means, so I mean, I'm just trying to think, how do you connect FAIR and business? Because FAIR is a recipe for how you do business. So how does the OGC survive by supporting both business and FAIR?
If you understand what I mean: if somebody makes a FAIR app, like in the workflow you mentioned, if you make a FAIR app, then it means I can come and copy that app and make another app. And that's correct. Yes, and of course businesses wouldn't like that. So how do you basically have both?
Right, so imagine you develop an application and you register it with a cloud environment. The consumer side is not actually retrieving this application, right? The consumer sees the application
in a registry with a clear description. So my application, for example, produces the settlement map on a global scale. It requires the following input data and it produces the following output data. So the consumer can actually request
the deployment and execution of that app without ever seeing the app because the app itself is deployed and executed within the cloud environment. And then someone needs to pay for it, right? At the moment, what do we have? We have different business models in the cloud space
from credit systems to flat rates, to paying by area, by processing cycles, by storage. I mean, at the moment we see a very, very interesting mix of business models. The way we see it is that the cloud then offers the execution of that application
to the consumer at a specific price saying, okay, you can execute for your area of interest and your time window. You can execute our application for the following costs. If you agree with it, well then provide your credit card details
and hit the execute button. And so from the OGC side, what's important for us is on the one side, we generate fair interfaces that make the application and the data products findable. We make it accessible because you can interact with it via a standardized interface.
We make it interoperable because it is well described in terms of what it needs, what it produces, what formats it delivers. And it is reusable because once developed, the application can be reused by as many consumers as possible.
But the consumers cannot necessarily directly copy the application, right? This is prevented by protecting the application within this cloud execution environment. Okay, yeah. That sounds like, theoretically, you have solved it.
I just worry a bit about practice, because of course, as I said, there are businesses that don't want to share their recipes. So eventually, if you make some app or map, you know, you cannot see behind it. But I have another question for you. Actually, that's from a colleague connected online
from Hannes Reuter. So how do you see the OGC API in the future, especially for Earth observation? So I think Hannes wants to know if you could maybe predict what's the next frontier. We already see that on the one side,
we started with rather generic applications like OGC API for features, or for coverages, or for records and tiles and maps. What we saw start right away was the development of standardized, or specific, profiles of these, like the Environmental Data Retrieval API, EDR.
And I think what we will see in future is more of these types of profiled APIs that use elements from specific APIs, combine them with specific resource models
to serve the specific needs of the community. So the APIs, as we see them right now, have the advantage that you can bring them live very efficiently, very quickly, and then you can do, well, a couple of things with them,
but in order to really serve the specific needs of a community, you will need to develop specific profiles. So you will need to say, okay, well, this API always delivers these types of results. And that's the situation that we currently see. And then we have a spectrum, right?
I mean, you can argue that basically everything is a process. Even the production of a map could be considered being a process. But on the one side, if you think of the full spectrum of everything is a process and needs to be handled like a process, and on the other side, all you want to have is a map.
So what we see is that we will have specific APIs within this spectrum. Some are very convenient. We call them the convenience APIs. These APIs do a very specific thing and nothing else. Like with the OGC API maps, you can get a map,
but you cannot generate a multidimensional tile pyramid or something like that. Then we do have APIs that are fully generic. They allow you to execute any type of process. This is OGC API Processes, which then requires that you describe the process in detail, a rather tedious, detailed description.
You have lots of levels of freedom, but almost every process is supported. And application program interfaces like EDR, the environmental data retrieval, or DAPA, the data access and processing API, they are somewhere in between these extremes, right?
And I think what we will definitely see is that the full spectrum will be covered and specialized APIs will emerge like the EDR API already did. And we will see more and more of them.
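A hedged sketch of the two ends of that spectrum: the URL patterns follow the published OGC API Features and OGC API Processes interfaces, but the server, collection, and process identifiers are placeholders.

# Two ends of the spectrum: a "convenience" request versus a generic process execution.
import requests

BASE = "https://api.example.org"  # placeholder OGC API endpoint

# Convenience end: OGC API - Features does one specific thing, return features.
features = requests.get(
    f"{BASE}/collections/buildings/items",
    params={"bbox": "11.4,48.0,11.8,48.3", "limit": 10},
    timeout=60,
).json()

# Generic end: OGC API - Processes can execute any described process,
# at the price of a much more detailed process description.
job = requests.post(
    f"{BASE}/processes/settlement-classifier/execution",
    json={"inputs": {"aoi": "11.4,48.0,11.8,48.3"}},
    timeout=60,
)
print(len(features.get("features", [])), job.status_code)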
And one of the key challenges for OGC, and that's why we are investing heavily into semantic reference and linked data mechanisms at the moment, is to make sure that you understand what your profile is actually doing, right? What is your profile based on?
To make sure that any user of a specific API knows what they're actually dealing with. All right, just the last question. So Cloud-Optimized GeoTIFF, we use it now. It's our main data format. I really wish this thing had existed 10 years ago,
but what about the vector data? What's the cloud format the OGC recommends for vector data in the cloud, so fully scalable? Yeah, please ask me in about two months' time. Yeah, right now we are working on that topic. We know that COG works well for the raster data.
And we have now started an internal initiative where we look at data warehouses that are storing vector data. And we are currently comparing the big vendors' data warehouse solutions, like from Google, from Microsoft, from Oracle,
from the Apache side. So we are comparing all of these. We are investigating what mechanisms they offer to support geospatial properties in the vector space. And then we compare their result sets. And we are currently busy doing this comparison.
And once it is completed, which I think will be happening within the next two months, then I think we are much closer to defining or providing recommendations for the ideal cloud-optimized vector format.