Our journey into an OGC-compliant Processing Engine
Formal Metadata

Title: Our journey into an OGC-compliant Processing Engine
Number of Parts: 156
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/68508 (DOI)
Transcript: English (auto-generated)
00:00
So my name is Miguel, and I work at UP42. If you haven't heard of UP42, it's a company that offers an Earth observation platform that helps organizations order, access, and analyze Earth observation data. Our platform simplifies data access and management and enables imagery processing at scale.
00:29
In this presentation, I'm going to tell you how community specifications helped us design our products, in particular STAC and the OGC API Processes core.
00:44
In particular, STAC and OGC enabled us to scale. It took time to adjust our product to the STAC and OGC specifications, but once that was done they were adopted very quickly by our users, and they enabled our engineering teams to design and develop faster.
01:07
I will also show you a little bit of our implementation. I won't talk about system architecture, I won't showcase processing capabilities live or anything, and I won't show you any code.
01:26
Also, I won't show you any memes today, which is something we usually do in our presentations; I just can't do it here. So, we like to describe ourselves as provider agnostic, because our users can leverage a wide diversity of data providers, ranging
01:49
from Airbus to Planet to Sentinel or BlackSky and more; we have dozens of data providers on our platform. And we also offer best-in-class algorithms for pre-processing and processing of the data that
02:09
our customers buy, so we have a pretty unique offering that enables very interesting use cases downstream. You can go to our blog and website to see some of what our users are doing with our platform.
02:29
I'm going to start by showing you some screenshots of our platform, so you get a feel for it, and then I will go into more detail on how we implemented the specifications.
02:45
So, as I said, we have a diversity of commercial and governmental data sources. The flow on our platform normally starts in the catalog. Here we have STAC search implemented (STAC-ish, so to say), and you can define your AOI,
03:16
and then on the left you get a list of scenes from providers that offer images for that AOI,
03:25
and for your search parameters like cloud cover and so on. Then you select your configuration and order the data. After you order the data and it arrives in your storage, it's magically mapped into
03:44
STAC, so our data management system is 100% STAC compliant, and it has pretty cool search functionality. As you see on the left here, you will see a list of the STAC items that you would have in your storage after you order them.
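(For illustration, a search like that could look roughly as follows from a client's point of view, using the pystac-client library; the endpoint URL, token, and filter values are placeholders, not UP42's actual API.)

```python
# Sketch of a STAC API search against a hypothetical storage endpoint.
# URL, auth token, and filter values are illustrative placeholders.
from pystac_client import Client

catalog = Client.open(
    "https://api.example.com/stac",                  # placeholder endpoint
    headers={"Authorization": "Bearer <token>"},
)

search = catalog.search(
    intersects={"type": "Point", "coordinates": [13.4, 52.5]},  # the AOI
    query={"eo:cloud_cover": {"lt": 20}},            # e.g. under 20% cloud cover
    max_items=10,
)

for item in search.items():
    print(item.id, item.properties.get("eo:cloud_cover"))
```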
04:13
You see some of the fields that are offered by STAC: this is ground sampling distance, this is cloud
04:24
cover, this is the projection from the projection extension, like the EPSG code; this is all populated by our pipelines. And on the right, after you pick one of those items, you will see a list of STAC assets
04:42
that you can visualize on the map. Our backend is offering our own implementation of TiTiler, and you can stream the data. We require anything that is in our storage to have
05:04
common names for bands, so they are easily searchable, so we offer a pretty nice user experience there. We are also an API-first company, so everything that you can see
05:23
here also has an endpoint, conforming to the STAC API. If you want to go on your processing journey, you select your item for processing, and then you will see a list of processes that you
05:47
can select. You can already see that this could be referencing the OGC processes endpoint, but I will show you this later. At this point, we are performing validation ahead of
06:06
execution, so we are checking whether your item is compatible with the process that you are choosing. Since our metadata in storage is so rich, we have very high requirements regarding what is allowed to go into
06:22
storage, so we are pretty sure that once you select your item and your process, the process execution won't fail. With this kind of validation, we are guaranteeing very low failure rates.
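(A minimal sketch of what such ahead-of-execution validation could look like; the requirement fields and thresholds here are invented for illustration and are not UP42's actual rules.)

```python
# Hypothetical pre-execution check: does a STAC item satisfy a process's
# input requirements? Field names and thresholds are illustrative only.
def validate_item_for_process(item: dict, requirements: dict) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    props = item.get("properties", {})
    for field in requirements.get("required_fields", []):
        if field not in props:
            errors.append(f"missing required field: {field}")
    max_gsd = requirements.get("max_gsd")
    if max_gsd is not None and props.get("gsd", float("inf")) > max_gsd:
        errors.append(f"gsd {props.get('gsd')} exceeds limit {max_gsd}")
    return errors

# Example: a process that needs projection info and at most 2 m resolution.
requirements = {"required_fields": ["proj:epsg", "gsd"], "max_gsd": 2.0}
```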
06:42
However, we came from a very different place, and that's the story I'm going to tell you. It all started very differently. Our legacy platform, when the company was founded a few years ago, was not exactly data-provider agnostic. Right now we have so many providers that we
07:08
are offering, but at the time we had actually only one provider that was delivering data in one input format. Because it was just one format, there was actually no need for very sophisticated data management,
07:25
because it was just a list of archives: each scene was stored in one archive, and it was only one format. So our workflow was just order data and go to processing immediately. The issue with
07:43
that is that our processing capabilities were often designed to be compatible with this one input format, and that caused a lot of failures, because once there was any change in the input format, we had an issue.
08:02
So they were tailored to our only data provider. But because we wanted to be provider agnostic, we needed to do something about it. So we tried again, first of all by introducing a new data catalog, and this new data catalog was, in fact, provider agnostic.
08:26
The problem with that was that, because we didn't have data management in place yet, our processing engine at the time just didn't work with all these different providers.
08:42
So we needed to do something. We failed again, and because we failed again, it was time to fail better, and that's where we are now.
09:00
After introducing the data catalog that I showed you before, we also introduced the data management capabilities that I showcased, and a new processing engine. What's special about our
09:21
data management and processing capabilities is that we don't allow a STAC item, or an asset, to be uploaded into our storage without a minimum set of metadata fields. We are very strict about that, and that ensures a lot of quality in our processing engine.
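(As a rough sketch of such an upload gate, assuming pystac for spec-level validation; the minimum field set below is an invented example, not UP42's actual list.)

```python
# Strict ingest gate: validate against the STAC spec itself, then enforce
# a stricter in-house minimum metadata set. MINIMUM_FIELDS is hypothetical.
import pystac

MINIMUM_FIELDS = {"gsd", "proj:epsg", "eo:cloud_cover"}

def admit_to_storage(item_dict: dict) -> None:
    item = pystac.Item.from_dict(item_dict)
    item.validate()  # raises STACValidationError if the item breaks the spec
    missing = MINIMUM_FIELDS - item.properties.keys()
    if missing:
        raise ValueError(f"rejected: missing required metadata {sorted(missing)}")
```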
09:43
So, on data management we are compliant with the STAC API, and we have an OGC API Processes compliant processing engine. Now we're going a bit more into detail, just illustrating how
10:05
an item would look, and in particular how the assets look. On our data management, I think there was a presentation last year by Batouan about how our data management component works, explaining all these details.
10:24
I'm just going to touch on it very superficially. As I told you, any image that we sell or that we produce must always be referenced in our STAC catalog. And we require any STAC item to provide fields like band description, projection, ground sampling distance, and so on.
10:51
So we are stricter than the STAC specification, because we want to ensure that things are well documented in our storage.
11:01
And if you look here at our assets, at this asset object, we have a type, which is always the same type: we convert everything to COG, and this is also verified whenever something is uploaded to storage.
11:24
We also have a very precisely defined set of roles that we allow, because that enables better searchability and thorough validation ahead of processing.
11:44
So we allow, for example, data and metadata as enums for roles, plus preview, thumbnail, panchromatic, multispectral, and these can be combined with each other to enable better searchability.
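(Put together, an asset entry under this scheme might look roughly like the following; the href and the exact role vocabulary are placeholder assumptions.)

```python
# Illustrative STAC asset object: always a COG media type, plus combinable
# roles and common band names. Values are placeholders, not real data.
asset = {
    "href": "https://storage.example.com/scenes/scene-1/multispectral.tif",
    "type": "image/tiff; application=geotiff; profile=cloud-optimized",
    "roles": ["data", "multispectral"],  # roles can be combined
    "eo:bands": [
        {"name": "B04", "common_name": "red"},
        {"name": "B03", "common_name": "green"},
        {"name": "B02", "common_name": "blue"},
    ],
}
```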
12:04
As for processing, we don't have a list-processes endpoint yet. The process descriptions right now conform to the OGC API Processes specification, but they
12:24
are internal to our server, because we have a limited list of processes. But this is coming soon: in the next few weeks, we will have a public API with a GET /processes/{processId} endpoint.
12:42
What I want to focus on today is more the inputs part of the process description. Here you already see POST /processes/{processId}/execution, for when we want to execute
13:06
a process. For the inputs, we are leveraging STAC and putting it together with the OGC API Processes specification: we require an item ID referring to an item in our storage, and this item is
13:26
a STAC item, obviously; every process that we execute needs to start with a STAC item. Then, for after execution, we also implemented GET /jobs and GET /jobs/{jobId}, in order to check
13:46
the status of your job and get the job metadata. Looking at an example of job metadata, this is a change-detection process, so change detection between two images,
14:04
so typically the inputs would be two items, which you reference in your storage. And then the results: here we deviate a bit from the OGC specification,
14:23
but this is also not a must in the specification. We call it results, and it's a STAC collection, so here we produce a collection ID, because we are not sure whether every process will output exactly one item, that is, one specific spatial and temporal extent;
14:49
the process might choose to output, say, three different items, so we need to work with a collection. But the baseline is that we have a STAC object as input and a STAC object as output; that was our objective.
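(Pieced together from the talk, the job metadata for such a change-detection run might look roughly like this; all IDs and field names are illustrative.)

```python
# Rough shape of the job metadata described above: STAC items in,
# a STAC collection out. IDs and names are illustrative placeholders.
job = {
    "jobID": "job-1234",
    "processID": "change-detection",
    "status": "successful",
    "inputs": {
        "items": ["item-id-before", "item-id-after"],  # STAC items in storage
    },
    "results": {
        "collection": "result-collection-id",  # a process may emit several items
    },
}
```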
15:12
Here is how our API reference looks. I really love this page, because
15:22
you can go through all our endpoints, like catalogue, orders, processing, assets, and you can also translate the requests into different languages,
15:41
like Python, PHP, Ruby, Node, so it's pretty cool, and you can experiment with our API already; I have the link in the final slide. So, just to recap, our processing engine basically follows this flow: you select a process (this
16:08
is the part that is not an endpoint yet; it will be out in a few weeks, but you can already list the processes via the front end).
16:25
Then you have execution, a POST process-execution endpoint, which works in a STAC-in, STAC-out kind of framework: everything that goes in must be a STAC object that is in your storage.
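(A client-side sketch of that execute-then-poll flow against an OGC API Processes style service; the base URL, auth, payload shape, and process ID are assumptions for illustration.)

```python
# Execute a process with STAC items as input, then poll the job status.
# Base URL, auth, process ID, and payload shape are placeholders.
import time
import requests

BASE = "https://api.example.com/v2/processing"
HEADERS = {"Authorization": "Bearer <token>"}

resp = requests.post(
    f"{BASE}/processes/change-detection/execution",
    json={"inputs": {"items": ["item-id-before", "item-id-after"]}},
    headers=HEADERS,
)
resp.raise_for_status()
job_id = resp.json()["jobID"]

# OGC API Processes job statuses: accepted, running, successful, failed, dismissed.
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS).json()
    if job["status"] not in ("accepted", "running"):
        break
    time.sleep(10)

print(job["status"], job.get("results"))
```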
16:48
So we combined the OGC API Processes specification with the STAC spec, and in the job metadata you will have inputs and results as STAC objects.
17:00
The thing about having STAC in, STAC out, and basing the job definition on that, is that it makes the validation I talked about very easy: our STAC objects are very rich in metadata, so it's very easy to check compatibility between the process and the STAC item,
17:29
and we can fail ahead of execution, so we guarantee quite low failure rates. Also, because we adopted these specifications, the design effort for our engineers was much lower. In the end,
17:46
it's a very intuitive solution for users as well, and that's one of the lessons learned from this. Going into the lessons learned: by combining OGC and STAC, our design became kind of self-evident and
18:07
intuitive, and the specifications solved a lot of the problems we had before, as you saw with the legacy platform. So, basically: do use community specifications and tooling, and you don't need to reinvent the wheel. That's true
18:23
for 99% of your applications. And design your solution carefully; it comes with positive side effects later. By investing in comprehensive STAC metadata requirements for our data management system,
18:43
we, without knowing it, enabled a validation service that is now central to our processing engine. So there are always positive side effects. It was great fun being here; I loved FOSS4G, it was my first time,
19:04
and it's very thought-provoking. I had amazing conversations, and I'm looking forward to coming next year. Thank you very much. Thank you very much, Miguel. I loved your STAC-in, STAC-out approach, it makes a lot of
19:23
sense. So I guess you also track all the processing steps in the STAC metadata, right? So, for each job, traceability is ensured via a job registry service; the job metadata would always refer to the
19:51
ID of what's coming in and the ID of what's going out, so with the next job we will be able to track that. Yeah, that's very cool. Thanks.
20:09
I did not quite get how you ensure that the inputs are compatible with your process, because you also said that you don't provide any process description, so I'm wondering
20:25
where your platform takes the metadata from to ensure that the process will work. Yeah, so you mean regarding the process description, like this process endpoint?
20:43
How can you validate, and where do you find the information for your validation before executing the process, if you don't have the metadata stored in a process description? So, the process description is there; we currently don't have the endpoint for exposing it. That's the only
21:04
part that is missing in the whole thing. We have a task execution service, and there we store the process descriptions statically. The next step would be enabling partners to provide their own algorithms
21:24
and their own process descriptions. For now we have them static, because there are only a few. Can you show us again the job status, the slide where you had the job status?
21:43
Yes, yeah. Is the definition given as input actually used to start the execution of your process? The first step is not the execution; we first hit another service, called the validation service, and then
22:03
only after that do we update the status to valid. So the status "successful" you see here comes after the run is successful; the first step would be valid or invalid. If it's valid, then we go on to execution. Right, but the definition field right here, the definition object that you have in
22:26
the status information, does it correspond to the input that was provided during the execution? Yes. So it's some kind of lineage. And you are using a results field, which is an object with a collection ID, but I don't see any links to the result?
22:44
Yeah, so here there would be a collection ID. Since this is an engine that is running on the UP42 platform, you can use this ID to refer... no, actually, you're right, this is not necessarily an ID.
23:07
I'm not sure now if we expose it; I think we expose it as an ID, but it's a reference to our asset service, our STAC API. Okay, so then why don't you use the specification, which mentions that you can
23:21
add links inside your job status to point directly to your collection and collection ID? Yeah, we are pointing to our special asset service where our STAC API is hosted. Thank you very much.