STAC Best Practices and Tools
Formal Metadata
Number of Parts | 351
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/69198 (DOI)
Production Year | 2022
FOSS4G Firenze 2022 (46 / 351)
Transcript: English (auto-generated)
00:00
Yeah, so Matt was listed first on the talk, so if you're expecting to hear Matt, feel free to leave and come back in an hour if we're talking about it. So my name is Pete Gadomski. I'm a geospatial software engineer for Element 84. Sorry. And today's talk is practical STAC, or STAC best practices and tools.
00:24
So I spend most of my time working with the Microsoft Planetary Computer, helping them ingest data sets. I also help maintain a lot of the open source software packages that are part of the STAC ecosystem. And so today's talk is going to be mostly derived from my experiences.
00:42
That is to say, I'm going to kind of focus on rasters, data structures, and best practices. I'm not going to talk as much about API implementations and databases. There will be other talks for that, hopefully at this conference. So I'm going to really try and focus things either on data producers or data consumers, so folks that are creating STAC metadata or folks that are using STAC metadata.
01:05
So just to start things off, what is STAC? So it is a metadata specification, but I really think it's more than that. I think STAC really is all the stuff that's around the specification, the tooling, the community, the things that take the specification from something interesting and turn it into something useful.
01:24
So kind of in my own words here, I'm going to describe what I think STAC is for and then what it actually is. So first, STAC is for public data discovery. So even if your assets are restricted or authorized or not open to the public, if you make your STAC API, if you make your STAC catalog public,
01:42
it's like an advertisement for your data. You let other people see what you have and maybe make it interesting for them to explore more and maybe use your data. STAC also enables efficient data access. So one of the key parts of STAC is it pulls really critical data out of your data assets and up to the metadata level.
02:00
This allows well-constructed tooling to access your data in an intelligent way, in a cloud-native way, without actually having to touch all of your data asset files. So if you use STAC well, you allow people to use your data better. STAC is also for seamless data interoperability. Again, because you're pulling the essential bits out of your data files and into the metadata,
02:25
you can use the same tooling across data providers or data platforms. So what is it? So first is specifications. There's four core specifications, three kind of data structure specifications and one API spec.
02:43
But STAC is also, in my opinion, more than just the specs. It's about the tooling. And I'm going to actually put up a quote here from Paul Ramsey because I like it so much and I'll just read it. "One of the quiet secrets of the cloud-optimized geospatial world is that while all the attention is placed on the formats, the actual really, really hard part is writing the clients that can efficiently make use of the carefully organized bytes."
03:06
I love that quote. Paul was talking about a hypothetical cloud-optimized shapefile in a JavaScript reader, but I think it applies to all geospatial software and STAC tooling in particular. We need to make sure that we make ergonomic, efficient software
03:23
so that we can really use STAC to its full potential. STAC is also all about the open community. The core spec is intentionally very limited in scope, and so a wide suite of extensions allows the community to help tailor the actual use cases of STAC to their specific domains and their specific uses.
03:43
So just to prove that I'm not just waving my arms and talking about vaporware here, here's just a little bit of a screenshot from stacindex.org of some of the available STAC catalogs and APIs that you can go explore and use data from.
04:03
There are STAC catalogs from government sources such as the USGS for Landsat, the NASA CMR STAC, there's private holdings. And then Big Tech has open offerings. Google Earth Engine has a STAC catalog. The AWS public datasets program is available via Earth Search.
04:20
And then Microsoft has their Planetary Computer. So that's kind of the broad overview of what STAC is and what we're doing. So now let's get technical. So as I mentioned, there's four key components to the STAC specification. The three data structure specs are item, collection, and catalog. And then there's an API specification as well.
04:43
So this is the item spec. There's an item on the left and a picture of it on the right. I say items and assets. So assets really are the key to STAC. I think it's kind of funny that asset is in the name of STAC, you know, SpatioTemporal Asset Catalog, but it doesn't have its own actual specification.
05:00
It's all rolled into the item spec. But assets essentially map one-to-one with files. That means that one asset can have multiple bands if we're talking about rasters. This is a little bit of a different model than other systems use, such as Open Data Cube. And there are extensions in the STAC ecosystem to allow you to describe what bands are available in your assets.
05:23
All three of the data structure specifications are defined in terms of JSON, and they all come with JSON Schema for automatic validation. And items are GeoJSON features, so you can use them just like any old GeoJSON feature. For instance, you can drag and drop one into QGIS and just look at it, which is kind of handy.
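To make the "items are GeoJSON features" point concrete, here is a minimal sketch of an item written as a plain Python dictionary. The id, coordinates, datetime, and asset href are all made up for illustration; only the field names come from the item spec.

```python
import json

# Hypothetical item: the id, footprint, datetime and href are invented for
# illustration. The top-level field names are the ones the item spec requires.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [11.2, 43.7], [11.3, 43.7], [11.3, 43.8], [11.2, 43.8], [11.2, 43.7],
        ]],
    },
    "bbox": [11.2, 43.7, 11.3, 43.8],
    "properties": {"datetime": "2022-08-24T00:00:00Z"},
    "links": [],
    # One asset maps to one file, even if that file holds several bands.
    "assets": {
        "data": {
            "href": "https://example.com/example-scene.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["data"],
        }
    },
}


def looks_like_geojson_feature(obj):
    """Check the three members every GeoJSON Feature carries."""
    return obj.get("type") == "Feature" and "geometry" in obj and "properties" in obj


print(looks_like_geojson_feature(item))
print(json.dumps(item)[:60] + "...")
```

Saving the JSON to a file gives something a GeoJSON-aware tool can open directly, which is why the drag-and-drop trick works.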
05:43
The next data structure is collections. Collections are a set of related items, and they summarize information about those items. I like to think of them as namespaces. So within a collection, you cannot duplicate IDs, but IDs can be duplicated across collections. So it's like a Python namespace or whatever you want to do.
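The namespace idea can be sketched in a few lines: an id only has to be unique inside its own collection. The item dictionaries below are hypothetical stubs, and `duplicate_ids` is an illustrative helper, not part of any STAC library.

```python
from collections import Counter


def duplicate_ids(items):
    """Return the (collection, id) pairs that occur more than once."""
    counts = Counter((item.get("collection", ""), item["id"]) for item in items)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical item stubs: only the fields needed for the check are present.
items = [
    {"id": "scene-1", "collection": "landsat"},
    {"id": "scene-1", "collection": "sentinel-2"},  # fine: different collection
    {"id": "scene-1", "collection": "landsat"},     # clash inside "landsat"
]

print(duplicate_ids(items))
```

Only the pair from the "landsat" collection is reported, because the same id in "sentinel-2" lives in a different namespace.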
06:04
Collections are really useful for discovery, so they include spatial and temporal extents. Oh, my mouse does still show up. Cool. They can also summarize attributes of the data and then describe what assets are available in the collection as well.
06:20
Finally, the third data structure is the catalog. Catalogs are basically just folders. They don't have too many fields in the actual specification, and they're mostly used for organization. They may contain many, many collections. For instance, the Planetary Computer catalog contains up to about 100 collections within it.
06:42
It shares a lot of fields with the collection, but they aren't the same thing, which is a subtle difference, but matters for some implementations. So here I'll call out Matthias' pretty shiny project called STAC Browser. So it's really nice to be able to look at STAC catalogs as raw JSON,
07:02
but it's even nicer to be able to do it in a visual and explorable way. And so this tool, STAC Browser, is a great way to do that. You can build and install it yourself, or you can use a hosted version. There's a URL there at the bottom of the page. You can point it at any STAC API or any static catalog.
07:20
And as a data provider, so as a person who creates STAC, I use this to audit the STAC that I make, so I check to make sure my spatial extents are good, that all of the text that I've written is good copy, things like that. And then as a data consumer, as somebody that's trying to use STAC, I use STAC Browser to see what's available from a catalog that I may not be that familiar with. So because you can point it at any catalog,
07:41
you can just go see, hey, cool, these are the assets available. It implements these extensions, so on and so forth. You'll notice that I said that STAC Browser can load both static catalogs and APIs. This is a pretty significant distinction, and I'm going to spend a second on it.
08:01
So STAC API is the fourth of the four STAC specifications. I'm not going to dive into its details here, but I like to think of it as basically a catalog with superpowers, so it behaves like a catalog, but then it has the ability sometimes to search or to filter and do other things. As a data consumer, so as a person trying to use STAC data,
08:23
it can be really hard to tell whether a given URL points to a static catalog or a dynamic catalog. One way to do it, and it's a little awkward, but you can do it, is to actually load the catalog itself and look for a conformsTo member. So on the right is the Google Earth Engine catalog.
08:42
You can see that it doesn't have a conformsTo member, which means that it's probably a static catalog. So if you're going to do any searching or filtering of the items within that catalog, you're going to have to do it yourself because there's no brains behind that catalog. On the left is the Planetary Computer, which is a full API and implements item search and other, they're called conformance classes.
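The conformsTo check can be sketched as follows. The two root documents are hypothetical and hard-coded so the example runs offline; in practice you would fetch the JSON from the catalog URL first. The helper names are illustrative, and matching on the "/item-search" suffix is an assumption about how the conformance URIs are written.

```python
ITEM_SEARCH_SUFFIX = "/item-search"  # assumed tail of the item search conformance URI


def is_probably_api(root):
    """True if the root document advertises any conformance classes."""
    return bool(root.get("conformsTo"))


def supports_item_search(root):
    """True if one of the advertised conformance classes is item search."""
    return any(uri.endswith(ITEM_SEARCH_SUFFIX) for uri in root.get("conformsTo", []))


# Hypothetical root documents, trimmed to the members the check looks at.
static_root = {"type": "Catalog", "id": "static-example", "links": []}
api_root = {
    "type": "Catalog",
    "id": "api-example",
    "links": [],
    "conformsTo": [
        "https://api.stacspec.org/v1.0.0-rc.1/core",
        "https://api.stacspec.org/v1.0.0-rc.1/item-search",
    ],
}

for root in (static_root, api_root):
    print(root["id"], is_probably_api(root), supports_item_search(root))
```

A missing conformsTo member is only a hint that the catalog is static, which matches the "probably" in the talk.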
09:02
But basically that means you can then ask the API to do some of the searching and filtering for you. So that's an important thing to know, and depending on whether you're working with a static catalog or an API, you're going to choose different Python tooling. So right now, STAC is at version 1.0,
09:20
has been for over a year at this point. STAC API is at release candidate one of its version one. I think there might be a version two coming up soon, I'm not sure. Both specifications are hosted on GitHub, and both specifications have an extensions organization. So again, the extensions, which I'll come back to later as well,
09:40
are where we do like dynamic additions to the specification. And Phil Varner just recently split out all of the STAC API extensions to their own organization. So that's the STAC description. Now let's get into some of the tooling. So there's a stac-utils GitHub organization that hosts some,
10:04
but not all, STAC-related software packages. There's also STAC-related software packages all over. I put four on the screen here that I'd like to call out specifically, three of which are in stac-utils and one which is in a different organization. So PySTAC is the foundational Python library.
10:22
It is just an API; it does not have a command line interface. It is used to read and write static STAC objects, so catalogs, collections, and items, that should say items there. This is kind of the core that all the rest of them are built upon. pystac-client is for APIs. So PySTAC is for kind of the core data structures,
10:42
and if you're dealing with an API, you're going to want to use pystac-client. That has both a command line interface and a Python API. stactools is built on PySTAC. So PySTAC is explicitly very low on dependencies. It tries to be really light and easy to install and usable in almost any environment. If you're trying to do anything with STAC items, catalogs, or collections
11:03
that requires other dependencies, whether it's Shapely, pyproj, Rasterio, whatever, those utilities live in stactools, because stactools basically takes all the dependencies it needs and is a little bit of a Swiss army knife, let's be honest. And so there's a lot of good utilities there.
11:20
Finally, I'd like to call out odc-stac, which is a relatively new project. It comes out of the Open Data Cube ecosystem, but I will say it does not depend on the Open Data Cube software system. So odc-stac is a standalone package that just has Open Data Cube branding, but doesn't have the big dependency tree of Open Data Cube.
11:42
And it is one of the tools out there that can load STAC data into xarray, so very useful for data consumers and data scientists. There's a bunch more projects in the stac-utils GitHub organization, and then STAC Index lists a bunch of other software projects as well.
12:01
So a little bit more on odc-stac. So again, a lot of folks are used to working with xarray, GeoPandas, data structures like that. Getting STAC assets into those data structures is a little tricky. There's a couple projects out there to do it. stackstac is one of them. odc-stac is a newer one that has been supported in development by Microsoft.
12:26
This is just a bit of a gif of how it works. I won't go into the details, but I'll just point you to both the documentation for odc-stac, and then Element 84 has put together some geo notebooks that also demonstrate its usage on GitHub,
12:41
so I encourage you to look at that. We're also going to be updating some of the Planetary Computer example notebooks to use odc-stac as part of the fall release as well. So there will be more demonstrations of its use there. So odc-stac is a cool tool, stackstac as well, because it reads data in a cloud-optimized way.
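Cloud-optimized readers depend on what the item metadata says about each asset's grid and data type. As a rough sketch, the check below looks for projection and raster extension fields on one asset. The field list (proj:epsg, proj:shape, proj:transform, and a data_type on each raster:bands entry) is an assumption based on the version 1 extension names, not a copy of the speaker's slide, and the item and the helper `missing_read_metadata` are hypothetical.

```python
# Assumed field names from the projection extension (version 1 naming).
PROJ_FIELDS = ("proj:epsg", "proj:shape", "proj:transform")


def missing_read_metadata(item, asset_key):
    """List the fields a metadata-only reader would lack for one asset.

    Projection fields may sit on the asset or in the item properties.
    """
    props = item.get("properties", {})
    asset = item["assets"][asset_key]
    missing = [f for f in PROJ_FIELDS if f not in asset and f not in props]
    bands = asset.get("raster:bands", [])
    if not bands or any("data_type" not in band for band in bands):
        missing.append("raster:bands[].data_type")
    return missing


# Hypothetical item: the values are invented for illustration.
item = {
    "properties": {"proj:epsg": 32632},
    "assets": {
        "B04": {
            "href": "https://example.com/B04.tif",
            "proj:shape": [10980, 10980],
            "proj:transform": [10.0, 0.0, 600000.0, 0.0, -10.0, 4900020.0],
        }
    },
}

print(missing_read_metadata(item, "B04"))

# Adding the raster extension information fills the gap.
item["assets"]["B04"]["raster:bands"] = [{"data_type": "uint16", "nodata": 0}]
print(missing_read_metadata(item, "B04"))
```

A data producer could run a check like this before publishing; a data consumer could use it to decide whether extra configuration has to be supplied to the reader.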
13:00
It doesn't go and fetch all of your data assets. It only grabs the bytes that it needs. But in order to do that, STAC items need to have certain extensions on them to allow those readers to be smart about what they're doing. So basically, they need to know where the assets are in space and what the data types are of those assets. Otherwise, it has no idea what to go ask for
13:21
when it tries to fetch a remote resource. So I'm going to call out two STAC extensions that are critical for that, the projection extension and the raster extension. I've listed here the fields that are needed. This is not all of the fields that are enabled by those extensions, but just the ones that are required to load data into xarray
13:41
or to use it in a cloud-optimized way. As a data consumer, odc-stac allows you to provide this information if it's not on the STAC items already, and that's documented as part of the odc-stac documentation. But as a data producer, if you're making STAC items,
14:01
if you add these attributes, you will make your users a lot more happy because a lot of the tooling will just use them automatically. And yeah, that's what I'll say about that. All right. So I thought as part of this talk, I'd do a case study, so a practical example from a GitHub issue that illustrates some of the issues
14:23
that arise as a data consumer when the data producers don't do things quite right. So this is an issue that was raised on the odc-stac repo, but it didn't end up actually being an odc-stac problem at all. It was a problem with the STAC items themselves. Essentially, when the user loaded data with odc-stac,
14:44
it had a nodata buffer around it. So they just loaded a single asset, tried to make a pretty picture out of it, and there was this gap of nodata cells around the actual data. The user thought it was an odc-stac problem.
15:00
However, it actually ended up being a problem with the item's geometry versus the actual asset bounds. So if you look at the pink square rectangle prism, that is the item geometry, and then the black is the asset itself. You can see that the pink item geometry
15:23
is rotated relative to the actual asset. This is a byproduct of the fact that STAC geometries are always latitudes and longitudes based on WGS 84. So because the item geometry is a square, I assume that the person created the item geometry
15:40
from the bounding box of the asset, and they did so when the bounding box of the asset was reprojected to WGS 84 before creating the geometry. This means that the geometry actually kind of overestimates the size of the asset. odc-stac doesn't know about the asset's geometry itself
16:01
because, again, it's a cloud-optimized reader. It's trying to just learn that information from the STAC metadata, and so it assumed that the bounds of the asset actually go all the way to the edge of the pink, which is a little bit bigger than the asset itself. The solution is relatively simple. As a data producer, you simply just need to create the item geometry in the asset's native projection
16:22
and then re-project the geometry itself. As a data user, fixing this sort of problem can be a little tricky. You can go back to the data producer and ask them to fix stuff, or because you're going to be touching the assets themselves, you can actually just correct the items before you read them in. But that's a bit trickier work
16:40
and, again, it's nicer if the data producer can make things line up. Doing operation... Oh, sorry, jumped ahead of myself. A similar sort of problem comes when you're trying to display STAC assets on a map. This is a MODIS tile. MODIS is projected onto a sinusoidal grid, and so you can see when it's displayed on a map,
17:01
it tends to appear curved. And so the blue box, which is on the outside here, is the item geometry that was created by the process I described on the previous slide. But it still doesn't actually look like it's capturing the actual data bounds
17:20
because of that distortion induced due to the projection. A way to fix that is to actually densify, is the term that we use, which is basically add a whole bunch of points to the item geometry before re-projecting to help it suck in against the actual data itself, which is the green box there.
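Both footprint problems can be reproduced with a little arithmetic. The sketch below builds three footprints for the same raster and compares their areas in square degrees: the bounding box of the reprojected corners, the four reprojected corners as a polygon, and a densified ring reprojected point by point. It uses a hand-written spherical sinusoidal inverse instead of pyproj so that it runs with the standard library alone, it ignores the difference between that sphere and WGS 84, and the raster bounds are hypothetical, not a real MODIS tile.

```python
import math

R = 6371007.181  # sphere radius of the MODIS sinusoidal grid, in metres


def sinusoidal_to_lonlat(x, y):
    """Inverse sinusoidal projection on a sphere; returns degrees."""
    lat = y / R
    lon = x / (R * math.cos(lat))
    return math.degrees(lon), math.degrees(lat)


def densify(ring, n):
    """Insert n - 1 evenly spaced points along each segment of a closed ring."""
    out = []
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        for i in range(n):
            t = i / n
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    out.append(ring[-1])  # close the ring again
    return out


def shoelace_area(ring):
    """Planar area of a closed ring whose first point is repeated at the end."""
    total = sum(x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in zip(ring, ring[1:]))
    return abs(total) / 2.0


# Hypothetical raster bounds in sinusoidal metres.
xmin, ymin, xmax, ymax = 1_000_000.0, 4_500_000.0, 2_000_000.0, 5_500_000.0
native_ring = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]

# 1. Reproject the corners and keep only their bounding box.
corner_ring = [sinusoidal_to_lonlat(x, y) for x, y in native_ring]
lons = [lon for lon, _ in corner_ring]
lats = [lat for _, lat in corner_ring]
bbox_ring = [
    (min(lons), min(lats)), (max(lons), min(lats)),
    (max(lons), max(lats)), (min(lons), max(lats)),
    (min(lons), min(lats)),
]

# 2. corner_ring itself: geometry built in native coordinates, then reprojected.

# 3. Densify in native coordinates first, then reproject every point.
dense_ring = [sinusoidal_to_lonlat(x, y) for x, y in densify(native_ring, 50)]

area_bbox = shoelace_area(bbox_ring)
area_corners = shoelace_area(corner_ring)
area_dense = shoelace_area(dense_ring)
print(f"bbox of reprojected corners: {area_bbox:.2f} square degrees")
print(f"reprojected corner polygon:  {area_corners:.2f} square degrees")
print(f"densified, then reprojected: {area_dense:.2f} square degrees")
```

The bounding box is clearly the largest footprint. The four-corner polygon is closer but still has straight sides, while the densified ring follows the curved edges of the data. The choice of 50 points per edge is arbitrary.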
17:41
This is a relatively non-trivial operation, and this is kind of where stactools comes into play. Again, to do those operations, you're going to be using Shapely. You're probably going to be using pyproj, Rasterio, and those are things that we don't want to put into PySTAC because we're trying to keep PySTAC relatively light. And so we've added a whole bunch of utilities to stactools to help you do things like this,
18:00
to create raster footprints from the data themselves, densify geometries, and a whole bunch of other things. So I encourage you to check those out if you are creating STAC and want to make sure that it's looking good. stactools also provides a namespace for what we're calling stactools packages.
18:22
I need to wrap it up here. So I'll jump through these last couple slides relatively quickly. So stactools packages, there's a whole bunch of data set specific packages here. So when we create stuff for the Planetary Computer, whether it's MODIS STAC or Sentinel-2 STAC or whatever, we put those repos into stactools packages.
18:41
If you're creating STAC items for a specific open data set, check out the stactools packages and see if we've already done the work for you. And if you are creating your own and it's not in stactools packages, consider contributing back to the community there. Okay, just rolling things up. There's one frequently asked question and a couple other items.
19:03
First, how do I create a STAC item? I get this question a lot. If you're using a Python API, use PySTAC as mentioned. There's two command line tools that I like to recommend. One is stactools, and the other one is a project called rio-stac. stactools creates a relatively simple basic item.
19:20
rio-stac creates a much richer one, and it's more configurable. So if you're just trying to see what a STAC item looks like for a raster, that's a quick way to do it. For data consumers, these are some points; I've mentioned most of them. The only thing I'll call out right now is item six. We've seen cases of assets themselves being reprocessed
19:41
or disappearing from underneath folks. Cough, cough, Landsat, cough. And so be aware that this could happen if you're trying to publish your data as a part of a paper or something like that. It may behoove you to actually archive the data somewhere else to make sure that doesn't go away on you. It's not a super common problem, but if you're trying to be really exact about stuff,
20:00
that's a good thing to be aware of. For data producers, the only thing I'll call out here is stac-check and stac-validator, which are two tools that have been developed recently that go beyond JSON Schema validation and do some best practices checks for your items and collections.
20:23
Finally, please help us define extensions that help everybody. If you bring your use case to an extension, that's where the spec and the community will evolve to describe what you're using your data for. I'll skip that last point, but you can ask about it later, because I'm running out of time. All right, a quick run-through of the resources used to make this talk
20:42
and that you can use. The main website is amazing. I really like it. I reference it more than I would like to admit. stacspec.org. The best practices document is part of the spec, but it is not versioned like the spec, so we try and keep it up to date as we learn things or develop new things. That's in the GitHub repo.
21:02
There's a Gitter channel. Come ask questions, ask for advice. There's a discussions board for longer-form discussions where we talk about new extensions. Right now we're talking about the best practices for operational forecast data, so it's a good place to look if you have a bigger question.
21:22
Finally, the STAC Index provides lists of catalogs, software, and other resources. Sorry to blow through the last of those. I think that's it. Thanks for your time and your attention.