
Standardized Data Management with STAC


Formal Metadata

Title
Standardized Data Management with STAC
Title of Series
FOSS4G 2023 Prizren
Number of Parts
266
License
CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
FOSS4G 2023 Prizren
STAC is a well-known and acknowledged spatiotemporal metadata standard within the community. There are many adoptions of it around open-source data; however, few premium satellite imagery providers have adopted it. At UP42, we adopted STAC as the core metadata system within our applications and provide a STAC API for users to manage their data easily. The ongoing adoption challenges with multiple data providers taught us many lessons that we would like to share with the community.
- UP42: a short introduction
- Data management challenges at UP42
- Solution with STAC & cloud-native asset format
- STAC implementation: lessons learned
- Current state and way forward
Transcript: English (auto-generated)
All right, good morning, everybody. Another continuation of STAC; I do feel like this presentation should have been in the afternoon, so that every time the word STAC comes up we could do a shot or something. That could be a good way to get drunk quickly and get to know each other. We're going to be talking about how UP42 is using STAC and some of the challenges we faced. But first of all, who are we? This is my first FOSS4G, and I'm probably one of the least technical people in the room, so please go easy on me. My focus as a product manager is on the user experience. What's really critical for us, because we're all familiar with data and the challenges data presents, some of which start the moment assets arrive in your storage, is to make the user's life as easy as possible: to standardize, and to really facilitate a good user experience. For us, STAC has been very important. Batu is the real expert who has been facing the challenges and implementing it for our users, so he'll be doing most of the talking today. First, a short introduction to who we are; then we'll look at the challenges, then the solutions, and finally some of the lessons learned. And of course we continue to evolve along with STAC, so we're very keen to keep the momentum going and see what comes next. If you were at the presentations we gave last year, I think you'll see we've already made big steps forward in our implementation.
First of all, if you're not familiar with UP42: we are a commercial company, but we also have non-commercial data available, such as Sentinel-2. What's really important for us is bringing together all the different satellite providers, and aerial data providers too, and additionally providing analytics, or the ability to add your own analytical solutions to our platform, so you get the most out of the data. And there's no lock-in: once you've downloaded the data, you can take it and use it freely. Many people go straight into QGIS, of course, but you can also use our platform as the first starting point to amalgamate and find all of this data, and hopefully then derive insights. For us it's mainly searching our catalog, and also tasking, so you can request future data too; for premium commercial sources, you place your orders on the platform. What's going to be critical in this talk is what happens once you've ordered your data and it has been delivered to storage: the download, the management, the access, and the further use, which is of course why you get the data in the first place. As mentioned, you can also upload your own custom algorithms; there are many directions you can take with the data. To briefly mention how we're available: there is a console interface, but also an SDK and an API, and we're available as an ArcGIS plugin. Finally, let's talk specifically about storage.
What was really critical for us: once you realize how many different data providers there are, and we're constantly adding more to our catalog, I was really shocked at how differently providers deliver the data when you press request. Maybe I was naive, assuming it would be a nice GeoTIFF or something like that and everything would be really easy. And I think that's really important, because not everybody is as familiar with the technical side of the data, and data should be accessible and usable for the wider community; particularly as we saw in the keynote earlier, it's used for many very good things in life. So for us, making data easy and accessible is really critical. When you're dealing with data at scale, across multiple providers, you can't be expected to understand every provider's way of working. That's where STAC really adds a benefit for us: indexing and navigating the data, creating that single point of interaction, and making it convenient for you as a user. Interoperability is also critical, because data isn't used in isolation. This is really where STAC comes into the picture for us. The journey began a couple of years ago, and now it's really our way forward. With that, I'll hand over to Batu, who will give the more technical information and walk you through it.
Okay, I hope I get this right. I want to start with a simple question. Usually, when you search for satellite data in a provider's catalog, you filter by dates or cloud cover. But once you have downloaded the data, say two datasets from two different providers, how do you tell whether an image was acquired after a certain date, or whether its cloud coverage is below some number? For that, providers give you metadata like this. Here are two examples: one is for PHR (Pléiades), and one is for SuperView. Your job is to find the right keys in this, in my opinion, horrible XML; XML is still what they mostly deliver. You need to find where the date information is: if you look carefully, you'll see that on the left side it's under the imaging date, as the acquisition date, and on the other side it's under the start time. And our problem was basically in storage: people download a lot of data, from multiple different providers, and at some point they want to know which of these datasets are after some date, or which have less cloud coverage. In that sense, I need to write a program that does these filtering operations for them. But should I go and detect, for each provider, under which key to search for the imaging date or the cloud coverage? Programming that many control flows is not easy. That was our problem, basically.
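To make the pain concrete, here is a minimal Python sketch of that per-provider branching; the provider names and XML paths are made up for illustration, since each real product uses its own, often undocumented, key names.

    import xml.etree.ElementTree as ET

    # Hypothetical per-provider lookup table: every provider buries the
    # acquisition date under a different element name.
    DATE_PATHS = {
        "provider_a": ".//IMAGING_DATE",
        "provider_b": ".//acquisitionStartTime",
    }

    def acquisition_date(provider: str, xml_file: str) -> str | None:
        """Dig the acquisition date out of provider-specific XML."""
        root = ET.parse(xml_file).getroot()
        node = root.find(DATE_PATHS[provider])
        return node.text if node is not None else None

Every new provider means another entry, another code path, and another chance to get it wrong.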
Additionally, sometimes you want to provide an AOI and say: give me all the images over Berlin, for example. We need to program something efficient for users to do that too. If you don't have a standardized metadata system like STAC, then each provider gives you the area as 'geometry', 'location', 'bounds', or 'AOI'; sometimes they use camelCase, and they use different keys to define the same information for the date, the item ID, and many other metadata fields. They all follow different conventions. In STAC, by contrast, you just ask for the geometry and you get the geometry, because everyone knows geometry lives under the 'geometry' key; likewise you just check the datetime key and you see the datetime. It's much simpler.
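With normalized items, one code path serves every provider. A minimal sketch, assuming the items are STAC-compliant JSON files on disk:

    import json

    def geometry_and_datetime(item_file: str):
        """Read the two fields from any STAC item.

        The "geometry" and "properties.datetime" keys are fixed by the
        STAC spec, so no per-provider branching is needed.
        """
        with open(item_file) as f:
            item = json.load(f)
        return item["geometry"], item["properties"]["datetime"]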
If every provider returned metadata like that, it would be really easy, and I wouldn't lose any more hair; I'm just 27 years old and it's already going. Now, the implementation question. When we first started, it was like what Mattias talked about, catalog, collection, item, but our case is a bit different: we are not a provider, so we don't have many files from a single source; we have multiple providers, each with its own file system and file structure, and we needed a standardized version of that structure as well. The second big question is that this complexity doesn't go away: we have to carry the information over from each provider's metadata system into the STAC metadata system. And the third problem, the biggest one as I remember it, is that STAC doesn't care about UP42, but we still need UP42-specific information, asset IDs, account IDs, workspace IDs, for you to navigate around. The overall solution looks like this; we worked on it for about seven or eight months, through a lot of iterations, but this is the general structure now. We have a platform, and you have assets; for these assets, on the left side, we also keep some metadata for our own systems. Each UP42 asset equals one STAC collection, and under it there are multiple items and assets. The way we extend it with UP42 information is through STAC extensions: the asset ID and similar fields go in as extension properties. And all of our storage is one catalog.
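A minimal pystac sketch of that layout, one catalog for all of storage and one collection per delivered asset; the up42:asset_id property name is illustrative, not the exact extension schema.

    from datetime import datetime, timezone
    import pystac

    # One catalog for the whole storage.
    catalog = pystac.Catalog(id="storage", description="All assets in storage")

    # One collection per delivered UP42 asset.
    collection = pystac.Collection(
        id="example-asset",
        description="One delivered asset",
        extent=pystac.Extent(
            spatial=pystac.SpatialExtent([[13.0, 52.0, 13.8, 52.7]]),
            temporal=pystac.TemporalExtent(
                [[datetime(2022, 6, 1, tzinfo=timezone.utc), None]]
            ),
        ),
    )
    catalog.add_child(collection)

    # One item per geometry in the delivery, carrying a platform-specific
    # field as an extension-style property (name is hypothetical).
    item = pystac.Item(
        id="scene-1",
        geometry={
            "type": "Polygon",
            "coordinates": [[[13.0, 52.0], [13.8, 52.0], [13.8, 52.7],
                             [13.0, 52.7], [13.0, 52.0]]],
        },
        bbox=[13.0, 52.0, 13.8, 52.7],
        datetime=datetime(2022, 6, 1, tzinfo=timezone.utc),
        properties={"up42:asset_id": "hypothetical-id"},
    )
    collection.add_item(item)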
Let me make this more visual. Say you order one item, one UP42 asset. (For legacy reasons we were storing assets zipped; I'll get to that in a moment.) For that one UP42 asset we provide you one collection, as you see here, and one item is defined per geometry, so you have three items under this collection. If you check the metadata (the colors may be a bit hard to see), the left-hand side shows the collection, and the other side shows an example of one item, with type 'Feature'. As you can see, we can add UP42-related information here, and we can add STAC extensions to validate that information, plus standard information such as datetime, title, and description. On the item, coming back to the first question, we added start_datetime, end_datetime, and datetime, which are the same for all providers in the system, so it becomes much easier to filter and search through this metadata. There is much more metadata you can add, of course; this is shortened for presentation purposes.
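Sketched in Python, a normalized item could look roughly like this; all values and the up42: field name are illustrative, not UP42's exact schema.

    # Roughly what one normalized item looks like.
    item = {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": "scene-1",
        "geometry": {"type": "Point", "coordinates": [13.4, 52.5]},
        "properties": {
            # Core fields, identical across providers:
            "datetime": "2022-06-01T10:00:00Z",
            "start_datetime": "2022-06-01T10:00:00Z",
            "end_datetime": "2022-06-01T10:00:05Z",
            "eo:cloud_cover": 12.5,
            # Platform-specific field carried via an extension:
            "up42:asset_id": "hypothetical-id",
        },
    }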
With this kind of system in place, I can use CQL filters, shown on the left-hand side. The filter is a bit big, and also complex, but what it says is: give me 10 items after 2022, where the cloud cover is less than 20%. And our STAC API delivers exactly that, the 10 items on the right-hand side; if you go into the metadata of those items, you can see that they actually comply with the filter operations.
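A sketch of that search using pystac-client with a CQL2 JSON filter; the endpoint URL is hypothetical, and exact operator support depends on the server.

    from pystac_client import Client

    client = Client.open("https://api.example.com/stac")  # hypothetical endpoint
    search = client.search(
        max_items=10,
        filter_lang="cql2-json",
        filter={
            "op": "and",
            "args": [
                {"op": ">", "args": [{"property": "datetime"},
                                     {"timestamp": "2022-01-01T00:00:00Z"}]},
                {"op": "<", "args": [{"property": "eo:cloud_cover"}, 20]},
            ],
        },
    )
    for item in search.items():
        print(item.id, item.properties.get("eo:cloud_cover"))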
The asset model was the next step. We had a problem here: STAC assets are essentially downloadable links in a STAC API, but what we stored was a single zip, so you could only download that one zip. There is a way of addressing files inside a zip, but it's a complex operation for an API, so we had only one asset: you download the whole zip. That meant we couldn't really make use of STAC assets before. What we're going to do next is extract everything, so that you can also access the individual files in the system. Once we have that, you can easily read, for example, just part of an image, only the red band, and you can easily build downstream applications that work with this metadata. These are building blocks that make all the other operations easier to program.
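For instance, once the imagery is exposed as cloud-optimized GeoTIFFs, a client can read a single band window over HTTP with rasterio; the asset URL here is hypothetical.

    import rasterio
    from rasterio.windows import Window

    # With a COG, rasterio fetches only the byte ranges needed for this
    # band and window instead of downloading the whole file.
    url = "https://storage.example.com/scene-1/red.tif"
    with rasterio.open(url) as src:
        chip = src.read(1, window=Window(col_off=0, row_off=0,
                                         width=512, height=512))
    print(chip.shape)  # (512, 512)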
Now, the actual challenges. On the tooling side, mapping the information into STAC, taking everything from the providers' metadata and putting it into the STAC standard, was not easy. It's still not easy; we are still struggling with it, because providers don't really deliver a schema for their metadata system, they may not be consistent, and sometimes they don't deliver metadata at all. That's the worst case: you have no information. How much should I map? They use different semantics, different heuristics, and it's hard to program that mapping operation. We use a library for it, which is private right now; I can give more details later. On the API side, the problem is that delivery structures are very different in the premium world. It's not like Sentinel or Landsat; there are more complexities. There are tasking cases with different structures. We had problems with zipped assets; we now solve that by ignoring the zip and extracting all the information first, but it's still a bit awkward to handle. There are tiled assets and tri-stereo asset pairs, which you mostly don't find in open data; they come from premium tasking operations. And since one UP42 asset is one STAC collection, it would actually make sense to search through collections too, but right now there is only an API for item search; I believe there is an RFC for collection search, and we'll see whether we implement it. Finally, authentication: it's not really a free-software topic, but we also had to implement something to authenticate, so that you only reach your own assets and nobody else's.
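One simple way to enforce that at the API layer is to AND a workspace constraint into every incoming filter. A sketch with hypothetical field names, not UP42's actual implementation:

    def scoped_filter(user_filter: dict | None, workspace_id: str) -> dict:
        """Combine the caller's CQL2 filter with a workspace constraint."""
        scope = {"op": "=", "args": [{"property": "up42:workspace_id"},
                                     workspace_id]}
        if not user_filter:
            return scope
        return {"op": "and", "args": [user_filter, scope]}

    # Example: the earlier cloud-cover filter, now scoped to one workspace.
    print(scoped_filter({"op": "<", "args": [{"property": "eo:cloud_cover"}, 20]},
                        "workspace-123"))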
So, six lessons learned, on the bright side and on the dark side. The good thing is that STAC makes it really easy to communicate about things like satellite data. In our company it's not only geospatial people who work with this: there are product people, there are computer scientists, and when you have to explain ten different metadata schemas from ten different providers, it gets really hard. Communication among the teams improved a lot. STAC is also fully flexible: you can structure things for your own use case. In our case that's the collection/item/asset structure, but you can shape it for yours. And there are great tools: there are already backend frameworks you can implement on top of easily, and the Python packages are really good. The evident negative side is that onboarding is not easy, especially for the non-geospatial side of the world; you need to be a bit of a nerd to understand all of this. It's fine for us, but people struggle a bit. For premium data there may be missing parts: the current specification may not be adequate to describe premium products. And on backend performance, we started with pgstac and then had to switch to SQLAlchemy. The problem is that we store data as many collections with a few items each, while providers typically store one collection with many items, so it was a bit hard to make that work well. Those are the general lessons; I guess we can talk more after the session.
As for the way forward: as I said, we are now building the building blocks for easy access and easy, fast processing. Following more standards makes it much easier for our applications to go in that direction. What's next is basically following and applying more of these standards; in that sense it's going to become much easier on the algorithm and processing side as well. That's basically it. Thanks.
Thank you. Do we have any questions?
[Audience question, inaudible.]
We had to invent it in the company. I can explain the details, but making the library open source is something we are still discussing; currently it's not available as open source, but I can explain the idea a bit. The problem in our case is maintaining the mapping for all these different providers, so it's built mostly for our use case, and I'm not sure many people would be interested. This is actually a good chance to find out whether people are interested in general. We don't want to put out an open source project without following up on it, without being sure it will be maintained and will work for people.
But I can give the details, for sure.
I think you might notice also, can we go back to the earlier slide with the JSON? We also had to add a lot of our own information. Here you can see we had to add all our own UP42 specifics, because of course we handle things like workspaces and account management too. We're also now trying to make open source data more readily available on our platform, so the Sentinel-2 data is free, and you don't have to pay credits to access it.
That is one of the challenges as well: blending the worlds of open source and premium together to give a product that fits all, because there is no one size that fits all. That's the TL;DR. We have another question here.
Thanks for the talk. I'm just wondering, have you provided feedback to some of your providers, particularly the ones that are not so far along, the ones where you act as an intermediary to the client? Have you given them feedback on what they could produce?
Yeah, exactly. The product managers are mostly in contact with the providers, across the different teams: data, tasking, and our analytics and storage. It's not very easy for the providers to act on it directly; there is a bit of a learning curve, and they already have established metadata systems with many applications that have to keep working, so they can't adapt easily. Some providers, like Capella, already return STAC; that makes my life easier, it's basically copy and paste. For the others it takes a bit of time.
We have a question right here. Same question? Okay. Any other questions? I have one: you have some metadata that you defined for UP42; have you specified that in an extension and published it?
We have extensions, but we keep them in internal repositories right now; we can definitely publish them. It's mostly opportunity-related. Actually, sorry, it is available; yes, it's open.
I also have to admit that the metadata in the presentation is not 100% exact; I had to simplify it a bit to make it readable for the audience, that's why it's not complete. Okay, all right, so thank you again.