
STAC: Search and discovery of geospatial assets


Formal Metadata

Title
STAC: Search and discovery of geospatial assets
Subtitle
Introducing a new cloud-native cataloging specification for geospatial data
Title of Series
Number of Parts
490
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
The talk introduces STAC, the SpatioTemporal Asset Catalog specification. It aims to enable a cloud-native geospatial future by providing a common layer of metadata for better search and discovery. It is an emerging open standard to catalog and expose geospatial data from different sources either in a static or dynamic way. We’ll cover the core set of metadata fields for STAC Catalogs, Collections, and Items first, along with available extensions for describing different types of data (EO, SAR, Point Cloud, etc.). With the basics of STAC in hand, the talk will go through the Open Source ecosystem for working with STAC metadata: validators, graphical user interfaces and client command line tools and libraries for search, access, and exploitation. The SpatioTemporal Asset Catalog (STAC) specification is an emerging standard to catalog and expose geospatial data from different sources. It aims to enable a cloud-native geospatial future by providing a common layer of metadata for better search and discovery. This talk gives a detailed overview of STAC and the way it allows for static and dynamic implementations at the same time. The simple concept of static catalogs living alongside the data on cloud file storage (e.g., AWS S3, GCS) by adding small JSON files is highlighted before talking through the dynamic searchable APIs built on top of the new OGC API – Features standard. The talk will cover the core set of metadata fields for STAC Catalogs, Collections, and Items, along with available extensions for describing different types of data (EO, SAR, Point Cloud, etc.). With the basics of STAC in hand, the talk will go through the Open Source ecosystem for working with STAC metadata: validators, graphical user interfaces and client command line tools and libraries for search, access, exploitation and API generation. The specification is an open standard developed on GitHub by a wide range of organizations with a strong focus on extensibility to support various domains. 
It encourages interested parties to extend the specification for their needs for a future of interoperable discovery and work with geospatial data. An ecosystem of Open Source tooling is evolving around the specification.
Transcript: English (auto-generated)
Good morning. Today I'm talking about STAC, the SpatioTemporal Asset Catalog. Has anybody heard about STAC already? Well, yeah, that's great. Where have you heard about it?
Anyway, this is an introductory talk, so I'm just giving the basics today. And the starting point for this was:
what is annoying about metadata? If you have data and want to expose it to search engines so that users can use it, you basically need to know a metadata standard and expose it, or write XML and stuff like that. And if you use the data, you also need to understand the metadata format.
And we're trying to tackle that with a focus on search and discovery of metadata. We may fall into a trap with this, of course: if there are 14 metadata standards and we create a new one, then there are 15 competing metadata standards. So that might happen,
but we tried to avoid this and, of course, give good reasons for why we're doing this. So first of all, at the moment, when you're trying to search for geospatial data, you probably end up visiting any of these portals that are out there,
dozens of portals where you can download your data, like the Copernicus Open Access Hub for Sentinel data, or the NASA CMR, and so on. But of course, to find your data, you need to know all these portals, right? Otherwise you won't be able to find the data anyway. We could also go and see
whether there was something like a directory where everybody puts in their data, or a person who views all the data and puts together a directory of it. But we already realized that there is too much data out there for that;
Yahoo, where people looked at the data and put everything into a directory, is not a thing anymore. So we now have crawlers, like Google's, for example, where everything is in one place and you can just visit Google and find everything. So we think finding things via Google or any other search engine is better than going through all of these portals
to find data. And of course, there are so many satellites going up at the moment that there are petabytes of data, and you just need good tools to find it all. For example, if you're looking into Sentinel data, you get this from ESA:
if you just want a single granule, a single tile, you get all this data. What you really want in the end is maybe just the metadata and the actual data file. So it's just these two things out of this whole bunch of information that you get.
And then you have to look through all these files. If you have Sentinel-2 metadata XML files, it's 20 megabytes of XML that you need to go through. For STAC, that's only 22 kilobytes, which you, as a normal user, can really understand.
For comparison, the plain-text Bible is just 4,000 kilobytes. So that's quite a lot to read if you want to understand the data. And you even need to find some kind of documentation on how this all works and what the data and metadata you're finding actually are.
On top of that, there are now many standards and also proprietary solutions for APIs, like the portals, with very similar scopes and capabilities. It would be a good idea to unify them and make them interoperable so that a client can access all these APIs
and all this data. That's a barrier to adoption, and so we thought STAC could be a good idea to develop. So what is STAC, actually? It's basically defining a metadata standard, or rather a specification; it's not a standard because we're not working for a standardization organization.
It's just a specification of what we think is useful for geospatial catalogs and assets, with a focus on search and discovery. So in most cases, you won't find any information on how to process the data. You can still link from STAC to the original metadata for processing, but to actually first find and discover the data,
you can use STAC. It's very simple. It's JSON-based, and most people can read JSON, as it's just a very thin layer on top of the metadata. And it's extensible.
For example, previously, when you had XML, you needed to write an XSD schema and adapt it so you could add things, but now it's just JSON, where you can put in your own things in addition to what we have standardized already. Another difference from previous standards is that you also have a static catalog that you can crawl.
So you can basically put your metadata files together with your data or next to it. If you have exposed Sentinel data, for example, in an S3 storage bucket, you can open another S3 storage bucket with your static files that conform to STAC,
and then you can crawl through all these metadata files. They are connected with links, and it's designed so that you don't need a server to run it. You can just put it on your file storage, and then it's there, and Google can make use of it. You don't need to write any software for that or anything like that.
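The crawl described above can be sketched in a few lines of Python. This is a minimal sketch, not an existing STAC tool: the URLs, the IDs, and the in-memory FILES dictionary (standing in for JSON files on a storage bucket) are hypothetical, and only the link relations "child" and "item" are taken from the STAC specification.

```python
from urllib.parse import urljoin

# Hypothetical static catalog: each key is the URL a JSON file would have on
# file storage, each value is the parsed content of that file.
BASE = "https://example.com/stac/"
FILES = {
    BASE + "catalog.json": {
        "id": "root",
        "description": "Example root catalog",
        "links": [{"rel": "child", "href": "sentinel-2/collection.json"}],
    },
    BASE + "sentinel-2/collection.json": {
        "id": "sentinel-2",
        "description": "Example collection",
        "links": [
            {"rel": "parent", "href": "../catalog.json"},
            {"rel": "item", "href": "items/tile-1.json"},
            {"rel": "item", "href": "items/tile-2.json"},
        ],
    },
    BASE + "sentinel-2/items/tile-1.json": {"type": "Feature", "id": "tile-1", "links": []},
    BASE + "sentinel-2/items/tile-2.json": {"type": "Feature", "id": "tile-2", "links": []},
}


def crawl(url, fetch):
    """Yield the IDs of all items reachable from the document at `url`."""
    doc = fetch(url)
    if doc.get("type") == "Feature":  # items are GeoJSON features
        yield doc["id"]
        return
    for link in doc.get("links", []):
        # Only follow links that lead further down the tree.
        if link["rel"] in ("child", "item"):
            yield from crawl(urljoin(url, link["href"]), fetch)


# A real crawler would fetch each URL over HTTP; here a dictionary lookup stands in.
item_ids = list(crawl(BASE + "catalog.json", FILES.__getitem__))
print(item_ids)
```

Because relative links are resolved against the URL of the file that contains them, the same files work unchanged on any file storage, which is what makes a server unnecessary.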
Those are the static catalogs. And then there are dynamic APIs as well, of course, because if there are thousands and thousands of files, you probably need to put them into a database and index them so that you can search them better. So you also expose an API, which is based on the recent version of OGC API - Features,
the successor to WFS, the Web Feature Service from the OGC. We just put a thin layer on top of this standard to make it searchable. And it's an open specification, of course. That's why we're here: open source. Everybody can contribute.
And what is it not? It's not a full-fledged metadata standard. As I said, it's not for processing and so on. It really focuses on search and discovery, although you can put your processing information into it if you want; it's extensible. It's also not a replacement for the data provider's internal metadata.
From your item file, you can link to your other metadata or other files that you have, previews and stuff like that. STAC is not the single source of truth in this case, and it's also not for all kinds of datasets.
It's just for spatiotemporal data. You can link to additional documents, but it's not meant for putting things other than spatiotemporal data into your STAC catalog. As such, it's also not a replacement for ISO standards,
for example, or OGC CSW and the like. There is a recent innovation plan for OGC catalogs or records, which is also a new API, and we are trying to align with this effort as well. So what's the state of STAC at the moment?
At the moment, we're at version 0.9, released just a few days ago, and we're heading towards releasing the first stable version, 1.0, in the third or fourth quarter of this year. There are also plans to separate
the actual specification work for the metadata and for the API in the coming weeks so that each is more streamlined towards its use cases. So what is this specification actually about? What do we expose there? There are catalogs, collections, items,
the API, extensions, and best practices. So what is all this about? A catalog is basically a very small, lightweight thing for cataloging. You can group your collections and items with it. It's very simple: basically just an ID, a description,
and links to whatever you want to group. A collection is basically an extension of a catalog. It extends it and adds collection-level metadata, for example the spatial and temporal extent, the license, the providers, and all these things that you have.
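To make the relationship between the two concrete, here is a minimal sketch of a catalog and a collection as Python dictionaries. All values (IDs, hrefs, provider name, dates) are hypothetical; the field names follow the description above and the 0.9 version of the specification as I understand it, so treat the exact shapes as an assumption.

```python
# Hypothetical values throughout; only the field names are meant to mirror STAC.
catalog = {
    "stac_version": "0.9.0",
    "id": "example-catalog",
    "description": "Groups the collections of an example data provider",
    "links": [{"rel": "child", "href": "./sentinel-2/collection.json"}],
}

# A collection carries the same fields as a catalog plus collection-level metadata.
collection = {
    "stac_version": "0.9.0",
    "id": "sentinel-2",
    "description": "Example Sentinel-2 collection",
    "links": [{"rel": "parent", "href": "../catalog.json"}],
    "license": "proprietary",
    "providers": [{"name": "Example Provider", "roles": ["host"]}],
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        # None as the end of the interval means the collection is still growing.
        "temporal": {"interval": [["2015-06-23T00:00:00Z", None]]},
    },
}

# Everything a catalog has, a collection has too; the rest is the extra metadata.
extra_fields = sorted(set(collection) - set(catalog))
print(extra_fields)
```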
For example, if you want to expose Sentinel data, you want to describe what the Sentinel data actually is: which platform it uses, which temporal coverage and spatial coverage it has, which license, where you can find licensing information,
who the provider is, and so on. A collection can also be used standalone. If you don't expose any assets or granules, you can use it just to expose your collections. For example, Google Earth Engine, if you know it, just exposes its collections,
and then you need to use their tools to actually use the data. So you can find the data as a collection, but it then tells you that you need to use their tools. So even if there is data that you can't download in the traditional sense, you can at least use it with whichever cloud provider exposes it as a STAC collection.
Collections are also useful for summarizing the actual item data that is exposed. And the items themselves are the actual granules, the individual tiles. Items are basically GeoJSON features. So the feature's geometry is basically the geometry
of the asset that is exposed. An asset in an item could be, for example, the file for band one, and another asset is band two, and so on. All these assets can then be downloaded, and the item can provide additional links, for example to the provider-specific metadata
in ISO format or whatever. And this is actually very nice if you combine it with cloud-optimized GeoTIFFs. As you can see here, it's basically just a browser that is working on a GeoTIFF,
yeah, on a cloud-optimized GeoTIFF. A cloud-optimized GeoTIFF is basically a GeoTIFF that is structured a bit differently, so that with HTTP GET range requests you can browse it on a map without any server software. So for example, this is, I think, Leaflet,
and as you can see, when you zoom in now, it loads just the data it needs. There could be a 500-megabyte file behind that, and it just downloads the parts you see here. That's, of course, pretty nice if you don't want to set up a WMS
specifically for that. You can just download the data that you need and view it while discovering it, to see whether it contains the data you need or not. The API itself is, as I said, aligned with OGC API - Features. It's pretty simple, I think. There is a landing page with capabilities.
There are collections that you can expose; for example, that would be Sentinel-1 and Sentinel-2. The items would then be each granule that you can download here as data. And then there is the STAC-specific search endpoint, where you can search for whatever is in the files, whether that's the cloud cover, the extent, the provider, the license, and so on.
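As a rough illustration of what a client sends to such a search endpoint, the sketch below builds a JSON request body. The helper name build_search_body is hypothetical, and the parameter names (bbox, datetime, collections, limit) reflect my understanding of the STAC API and OGC API - Features; the exact names and the endpoint path depend on the API version, so check them against the server you are using.

```python
import json


def build_search_body(bbox, start, end, collections, limit=10):
    """Build a JSON-serializable search body (parameter names are assumptions)."""
    west, south, east, north = bbox
    # west may exceed east when a box crosses the antimeridian, so only
    # the latitude order is checked here.
    if south > north:
        raise ValueError("bbox must be [west, south, east, north]")
    return {
        "bbox": [west, south, east, north],
        "datetime": f"{start}/{end}",  # time interval as start/end
        "collections": list(collections),
        "limit": limit,
    }


body = build_search_body(
    [5.9, 50.7, 6.2, 50.9],
    "2020-01-01T00:00:00Z",
    "2020-01-31T23:59:59Z",
    ["sentinel-2"],
)
payload = json.dumps(body)
print(payload)
```

The payload would then be sent with an HTTP POST to the search endpoint of a STAC API; the request itself is left out so that the sketch does not depend on any particular server.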
That's defined as OpenAPI documents, so it's pretty easy to implement with the OpenAPI ecosystem as well. And then, for items, the metadata fields are very slim; there is title and extents,
and you can specify some things like when the metadata was created or updated. But the core is very slim, and then you extend it with extensions. So for example, for content, we have extensions for describing data cubes.
For EO data, which is, in this case, electro-optical. Then for machine learning, to specify the labels. For point cloud data, for SAR data, and then we have a specific one for satellite data, which basically inherits from EO and SAR. And for scientific data, like exposing DOIs
and stuff like that. And the API, of course, is also very slim in the core, and you can extend it, for example, via fields, where you can say, I only want a certain set of fields in my response so that it gets smaller. You can query it by specific properties. You can sort it.
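What these three API extensions do can be emulated client-side in plain Python. This is only a conceptual sketch: the function apply_extensions, its parameters, and the sample items are hypothetical and do not reproduce the request syntax of the API; "eo:cloud_cover" is the cloud cover property of the EO extension.

```python
def apply_extensions(items, max_cloud_cover=None, sort_by=None, include=None):
    """Emulate, client-side, what the query, sort and fields extensions do."""
    result = list(items)
    if max_cloud_cover is not None:  # "query": filter on a property value
        result = [i for i in result if i["properties"]["eo:cloud_cover"] < max_cloud_cover]
    if sort_by is not None:  # "sort": order the results by a property
        result.sort(key=lambda i: i["properties"][sort_by])
    if include is not None:  # "fields": keep only the requested properties
        result = [
            {"id": i["id"], "properties": {k: i["properties"][k] for k in include}}
            for i in result
        ]
    return result


# Hypothetical search results with a few properties each.
ITEMS = [
    {"id": "tile-a", "properties": {"datetime": "2020-01-03T10:00:00Z", "eo:cloud_cover": 40, "platform": "sentinel-2a"}},
    {"id": "tile-b", "properties": {"datetime": "2020-01-02T10:00:00Z", "eo:cloud_cover": 5, "platform": "sentinel-2b"}},
    {"id": "tile-c", "properties": {"datetime": "2020-01-01T10:00:00Z", "eo:cloud_cover": 12, "platform": "sentinel-2a"}},
]

slim = apply_extensions(ITEMS, max_cloud_cover=20, sort_by="datetime", include=["datetime"])
print([i["id"] for i in slim])
```

In a real API the server does this work, which is the point of the extensions: less data crosses the network than when the client filters and trims the response itself.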
And there is a transaction extension to basically add, remove, and update items and stuff like that. And of course, there's also one for versioning: if there are different versions of assets, you can version them. There is a growing ecosystem behind STAC. You already have, for example, a validator
where you can just put in your catalog, and it validates whether it's okay and conforms to STAC. There is an extension for Intake. I don't know Intake, but I was told it's a big thing in the Python world. Then there is PySTAC for catalog creation and general work with STAC catalogs.
And sat-stac works similarly. Then there are a number of clients, for example STAC Browser, which you already saw when we had the cloud-optimized GeoTIFF preview. It's basically a human-readable version
of the catalog, of the JSON files, which also exposes, for example, schema.org translations so that they can be crawled by Google and its new Google Dataset Search. There is a QGIS plugin, sat-search for searching data, and sat-fetch for fetching or downloading the data.
And then there is sat-api-browser. STAC Browser is more for the static catalogs and sat-api-browser is more for the API part, because there you can also search, while STAC Browser is just for going through the links that are in the data.
Then there are a couple of server implementations that you can use to expose your data, for example staccato in Java; sat-api, which is, I think, a Node.js application; and sat-api-pg, which is basically, I think, Python with a PostgreSQL database behind it. And this, for example, is the QGIS plugin,
where you can just specify your required parameters, and then it searches for data and loads it directly as cloud-optimized GeoTIFF into your QGIS instance to work with. And as it's a cloud-optimized GeoTIFF, you can also directly zoom in,
and it loads the appropriate data, and so on. We're working on making several catalogs openly available. At the moment, there are Sentinel-1 and -2 and Landsat 8, and USGS Landsat Collection 2 is directly offered as STAC and COG catalogs by USGS.
There's CBERS-4, which is a China-Brazil satellite for Earth observation. NAIP and the NASA CMR are also translated into STAC, and there are a couple more things that are coming and are in preparation. And maybe in the future also your data;
it's pretty simple to expose such things. So if you have data that you want to be found, then I think it's a good idea to expose it as a STAC catalog. Here's an overview of who is already exposing their data as STAC and working with STAC. Quite a number of entities.
And now maybe I'll just show you a single example here. This one, for example, is the catalog. It's basically just JSON. I guess most of you can read what is in there, and it gives you an idea.
It's Copernicus S2 here, and then the title and description of what it is about, a license, keywords, provider information, the extent information, temporal, and all the other summary data, for example how to cite it, what the ground sampling distance for the individual things is,
the constellation, platform names, projection information, and the bands, for example which bands are in the actual assets that you can download, band one, band two, and so on, with common names and wavelengths, and links that you can visit to get more information.
That's a collection, and now you can also look at the items, each of which is an individual granule. It says, for example, which ID it has and which collection it belongs to, and there are, again, links to get additional information, the bounding box of the granule, the geometry, and the assets that you can download now,
which are just files, what is it, JPEG 2000, either the thumbnail or the actual data. And is there anything else in here? Yeah, some additional properties like cloud cover values, and so on.
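An item like the one walked through above can be sketched as a small GeoJSON feature. The values (ID, coordinates, hrefs, cloud cover) are hypothetical and much shorter than a real Sentinel-2 item; the helper bbox_of is an illustration of how the bounding box relates to the geometry, not part of any STAC library.

```python
# Hypothetical item: a GeoJSON feature with STAC-specific members added.
item = {
    "type": "Feature",
    "stac_version": "0.9.0",
    "id": "tile-1",
    "collection": "sentinel-2",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[6.0, 50.0], [7.0, 50.0], [7.0, 51.0], [6.0, 51.0], [6.0, 50.0]]],
    },
    "properties": {"datetime": "2020-01-15T10:30:00Z", "eo:cloud_cover": 7.5},
    "assets": {
        "thumbnail": {"href": "./tile-1/preview.jpg", "type": "image/jpeg"},
        "B01": {"href": "./tile-1/B01.jp2", "type": "image/jp2"},
        "B02": {"href": "./tile-1/B02.jp2", "type": "image/jp2"},
    },
    "links": [{"rel": "collection", "href": "../collection.json"}],
}


def bbox_of(geometry):
    """Compute [west, south, east, north] from a Polygon's outer ring."""
    ring = geometry["coordinates"][0]
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return [min(xs), min(ys), max(xs), max(ys)]


item["bbox"] = bbox_of(item["geometry"])

# Pick the JPEG 2000 data files and skip the thumbnail.
data_assets = sorted(k for k, a in item["assets"].items() if a["type"] == "image/jp2")
print(item["bbox"], data_assets)
```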
So that's how it's working. And yeah, I'm happy to take your questions if there are any. Thank you very much for listening to this talk.