
XDGGS: A community-developed Xarray package to support planetary DGGS data cube computations


Formal Metadata

Title
XDGGS: A community-developed Xarray package to support planetary DGGS data cube computations
Title of Series
Number of Parts
156
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
## 1. Introduction

Traditional maps use projections to represent geospatial data in a 2-dimensional plane. This is both very convenient and computationally efficient. However, it also introduces distortions in area and angles, especially for global data sets (de Sousa et al., 2019). Several global grid system approaches such as Equi7Grid or UTM aim to reduce these distortions by dividing the surface of the Earth into many zones and using an optimized projection for each zone. However, this introduces analysis discontinuities at the zone boundaries and makes it difficult to combine data sets of varying overlapping extents (Bauer-Marschallinger et al., 2014). Discrete Global Grid Systems (DGGS) provide a new approach by introducing a hierarchy of global grids that tessellate the Earth's surface into equal-area grid cells at different spatial resolutions, together with a unique indexing system (Sahr et al., 2004). DGGS are now defined in the joint ISO and OGC DGGS Abstract Specification Topic 21 (ISO 19170-1:2021). DGGS serve as spatial reference systems that facilitate data cube construction, enabling integration and aggregation of multi-resolution data sources. Various tessellation schemes such as hexagons and triangles cater to different needs - equal area, optimal neighborhoods, congruent parent-child relationships, ease of use, or vector field representation in modeling flows. Purss et al. (2019) explained the idea of combining DGGS and data cubes and underlined the compatibility of the two concepts. Thus, DGGS are a promising way to harmonize, store, and analyse spatial data on a planetary scale.

DGGS are commonly used with tabular data, where the cell id is a column. Many datasets have other dimensions, such as time, vertical level, or ensemble member. For these, it was envisioned to use Xarray (Hoyer and Hamman, 2017), one of the core packages in the Pangeo ecosystem, as a container for DGGS data. At the joint OSGeo and Pangeo code sprint at the ESA BiDS'23 conference (6-9 November 2023, Vienna), members of both communities came together and envisioned implementing support for DGGS in the popular Xarray Python package, which is at the core of many geospatial big data processing workflows. The result of the code sprint is a prototype Xarray extension, named xdggs (https://github.com/xarray-contrib/xdggs), which we describe in this article.

## 2. Design and methodology

There are several open-source libraries that make it possible to work with DGGS - Uber H3, HEALPix, rHEALPix, DGGRID, Google S2, OpenEAGGR - and many if not most have Python bindings (Kmoch et al., 2022). However, they often come with their own, not always easy-to-use APIs, different assumptions, and different functionalities. This makes it difficult for users to explore the wider possibilities that DGGS can offer. The aim of xdggs is to provide a unified, high-level, and user-friendly API that simplifies working with various DGGS types and their respective backend libraries, seamlessly integrating with Xarray and the Pangeo open-source geospatial computing ecosystem. Executable notebooks demonstrating the use of the xdggs package are also being developed to showcase its capabilities. The xdggs community contributors set out with a set of guidelines and common DGGS features that xdggs should provide or facilitate, to make DGGS semantics and operations usable via the user-friendly Xarray API for working with labelled arrays.

## 3. Results

This development represents a significant step forward. With xdggs, DGGS become more accessible and actionable for data users. As with traditional cartographic projections, a user does not need to be an expert on the peculiarities of various grids and libraries to work with DGGS, and can continue working in the well-known Xarray workflow. One of the aims of xdggs is to make DGGS data access and conversion user-friendly, while dealing with the coordinates, tessellations, and projections under the hood. DGGS-indexed data can be stored in an appropriate format such as Zarr or (Geo)Parquet, with metadata describing which DGGS (and, where applicable, which specific configuration) is needed to address the grid cell indices correctly. An interactive tutorial on Pangeo-Forge is also being developed as an open-access resource to demonstrate how to effectively utilize these storage formats, thereby facilitating knowledge transfer on data storage best practices within the geospatial open-source community. Nevertheless, continuous efforts are necessary to broaden the accessibility of DGGS for scientific and operational applications, especially in handling gridded data such as global climate and ocean modeling, satellite imagery, raster data, and maps. This would require, for example, an agreement, ideally with entities such as the OGC, for a DGGS reference systems registry (similar to the EPSG/CRS/PROJ database).
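To make the intended workflow concrete, here is a minimal sketch of how such a DGGS-aware data cube might be opened and used. It assumes a hypothetical Zarr store and variable name, and the `xdggs.decode` entry point described in the project README; the exact API is still evolving.

```python
import xarray as xr
import xdggs  # https://github.com/xarray-contrib/xdggs

# A DGGS-indexed data cube: one spatial dimension of cell ids, plus e.g. time.
# The store path and variable name below are placeholders.
ds = xr.open_zarr("air_temperature_h3.zarr")

# Make the dataset DGGS-aware: xdggs reads the grid metadata stored on the
# cell id coordinate (grid name, refinement level, ...) and attaches an index.
ds = xdggs.decode(ds)

# From here on, regular Xarray operations work on the labelled cell dimension,
# for example a monthly mean over time for every cell.
monthly_mean = ds["air_temperature"].groupby("time.month").mean()
```

The point of the design is that, after decoding, everything else is plain Xarray.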
Keywords
Transcript: English (auto-generated)
Hello everybody. Thank you very much for the introduction, Luis. I would like to present to you today our work on xdggs, a community-developed Xarray package to support planetary DGGS data cube computations. It's a really long title, with even more authors. I'll go a little bit into the background first, just to catch everyone up.
So, Xarray is a really widely used Python library for working with arrays. In the Earth observation community it is used a lot in the Pangeo and Jupyter ecosystems for data cube calculations, and in the climate space as well.
DGGS means Discrete Global Grid Systems, and it's a new form of spatial reference system. At the BiDS conference, Big Data from Space, last year in November in Vienna,
shout out to Stefanie Lumnitz, three big communities came together: the OGC, the Open Geospatial Consortium, where, in full disclosure, I'm also co-chair of the Discrete Global Grid Systems working group;
then there's also Peter Strobl from the JRC; and Pangeo developers, especially Anne Fouilloux, Tina Odaka and Ryan, who are here. This all happened at the joint Pangeo-OSGeo code sprint; I was sitting with Tom Kralidis trying to hack something into pygeoapi as well. So, what is Pangeo?
So, because we are at FOSS4G, which is an OSGeo conference: Pangeo is an international community that is a bit more geoscience related, building open source, inclusive, and scalable software for planetary-scale and large environmental data computation.
It is strongly Python, but there's also Julia and R, so it's more about computation and programming. And there's a huge ecosystem; I'm quite sure most of you will have been in touch with some of it, if you have worked with Jupyter, scaling out with Dask, Xarray,
and the whole Matplotlib and Python visualization stack. The nice thing with this type of ecosystem is that you can develop on your own computer and then scale out on HPC or in the cloud using the same stack, which is really nice. So, a quick background on Xarray. Xarray is a bit of a foundation in many scientific workflows.
It's an open source project that has been around for a while, and it's for n-dimensional labeled arrays. If you have ever loaded a raster as a NumPy array into memory, then you sort of know where we are at, and this takes it to a whole other level.
You can chunk it and work out of core on larger and larger data. And it provides labels, so you don't have to know the index number and the axis order of the array; this really helps you to program with it. And it's integrated with a lot of other packages.
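As a small, generic illustration of what labelled arrays give you (the dimension names and values here are made up, not tied to any particular dataset):

```python
import numpy as np
import pandas as pd
import xarray as xr

# A labelled 3-D array: instead of remembering that axis 0 is time and
# axes 1/2 are latitude/longitude, you address the data by name and label.
temperature = xr.DataArray(
    np.random.default_rng(0).normal(18.0, 3.0, size=(4, 3, 3)),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2024-07-01", periods=4),
        "lat": [58.0, 58.5, 59.0],
        "lon": [24.0, 24.5, 25.0],
    },
    name="air_temperature",
)

# Label-based selection instead of positional indexing:
print(temperature.sel(time="2024-07-02", lat=58.5, lon=24.5).values)

# Out-of-core, chunked computation works the same way (requires dask):
# temperature = temperature.chunk({"time": 1})
```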
Okay, so that is sort of the background: the status quo, widely used. Now to discrete global grid systems. As I said, a DGGS is a new type of spatial reference. You can see it as a bit of a hybrid between vector and raster, with the idea that the cells or pixels that cover the whole globe
have approximately the same area. The Open Geospatial Consortium has already put that into a so-called abstract specification, which is also becoming an ISO standard. They define a discrete global grid system as a spatial reference system that uses a hierarchical tessellation
of cells to partition and address the globe. So, what does that mean? Hierarchical means you have, as you can see on the right side, cells at different resolutions, very similar to zoom levels with web tiles, for example. Partition means you have subsets that cover,
that refer to, a certain area on the globe. And address means each cell, each pixel, has a unique ID. This property gives you some really nice data management possibilities, because you can associate lots of data not only with a certain point
but with a certain area on Earth that is uniquely identifiable and indexable. This has also been recognized by the United Nations experts on global geospatial information management, UN-GGIM: we indeed need to combine more and more tabular and spatial data.
But the challenge we often run into, especially in terms of spatial units, is that we have polygons of different shapes and sizes, administrative boundaries that of course do not have the same area, so when dealing with area statistics you always have a couple of trip wires; you have to normalize and so on. Then we have rasters, and even if, like we do in data cube computations,
we have rasters of the same resolution, let's say 10 metres, the rasters also have to be aligned and in the same coordinate reference system in order to do a proper drill-down. Choosing a DGGS, that is sort of taken care of for you, because, as I said,
the cells are uniquely identifiable and always in the same place. If you then decide to integrate data with one of these grids, you can also do summary aggregations to the coarser resolutions. So, this provides a nice framework, especially for data integration.
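To make the "hierarchical tessellation with unique cell addresses" idea concrete, here is a tiny sketch using the H3 library, one of several DGGS backends. It assumes the h3 v4 Python API, and the coordinates are just an arbitrary example point.

```python
import h3  # Uber H3 Python bindings, v4 API

# A point on Earth (lat, lon) mapped to a cell id at refinement level 5.
cell = h3.latlng_to_cell(58.38, 26.72, 5)
print(cell)  # a unique cell address (hexadecimal string)

# The hierarchy: the coarser parent cell and the finer child cells.
parent = h3.cell_to_parent(cell, 4)
children = h3.cell_to_children(cell, 6)
print(parent, len(children))  # one parent at level 4, seven children at level 6

# Each id can always be turned back into geometry:
print(h3.cell_to_latlng(cell))    # cell centroid (lat, lon)
print(h3.cell_to_boundary(cell))  # cell boundary vertices
```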
So, maybe to visualize the actual gridding problem a little bit: one big use case is increasingly with the statistical agencies in Europe. We have Eurostat for all of Europe.
They, of course, have a statistical grid that is based on the European coordinate reference system, ETRS89-LAEA (EPSG:3035). Then if we want to use satellite data, we have Sentinel-2, for example, in different UTM zones, and as we go further north
those zones even overlap, so there's oversampling. It has to be resampled into your data cube, because you usually have to decide which projection you use, a continental or a local projection. So, you usually have to fit everything into that one,
and that projection can usually not cover too large an area, otherwise you run into areal distortions. On the right side, for example, in a hexagon-based DGGS, the cells would always look the same and the vertices would always be in the same location, so you wouldn't run into such issues anymore.
So, in terms of hierarchy, or multi-zoom, which sounds a bit fancy here: as you aggregate data, you can use different data variables at different zoom levels. Then you zoom in, and this would be Estonia,
and as you keep zooming in you still have a structured grid, which is good. And as a helper, you have cell IDs that refer to a place on Earth, and you can associate arbitrary information with them, structuring your table however you want.
Another thing is combining data from different sources. Let's assume, like ESA is currently studying, that Sentinel data were available in a DGGS: you wouldn't download a whole UTM zone of 3 to 7 gigabytes,
you would just select the data for these and those zones, for these and those cells, and then you maybe have population data from the statistical office, and because of the IDs you can do a simple table join. Dealing with discrete global grid systems, the things you typically have to do are, first, producing data sets:
right now we're in the situation that we obviously don't have data collection at the DGGS level, we only have vector data and raster data, so we have to do re-gridding. Funnily enough, the Pangeo community, which is very active in the climate and ocean space,
does re-gridding all the time, because all their climate and ocean grids are actually at different resolutions and on different grids, so they know about that. For many statistical things, species, population, we do binning, like we already do nowadays when we bin point data into spatial bins like squares or hexagons.
Then the things you typically want to do are aggregations over larger areas; you want cell boundaries and neighbors, and a nice thing with hexagon topologies is that you don't run into the four-versus-eight-neighborhood problem anymore,
because technically you always have direct neighbors across all edges. Other things you can typically do are of course geometric operations, like cell overlap, intersection, and so on. And you eventually have to store all of that. With xdggs now being an extension to Xarray,
you basically still work in Xarray, so you have your coordinates and your dimensions, but the main dimension that we came up with would be the cell ID, which is basically a 1D array, and all your data is associated with the cell ID.
Then you just have to store it in a meaningful container that supports array data and meaningful metadata. The NetCDF model has been an inspiration for this type of data. We would of course take it further and store it in chunked arrays in Zarr; I'll come to that a little bit later.
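A minimal sketch of that data model: all data variables hang off a single one-dimensional cell id coordinate, and the grid definition travels as metadata. The attribute names (`grid_name`, `level`), the sample points, and the store path are illustrative assumptions, and H3 is used here only as an example backend.

```python
import numpy as np
import xarray as xr
import h3  # Uber H3 bindings, v4 API

# A handful of example cells (H3 cells at refinement level 5 around sample points).
points = [(58.38, 26.72), (59.44, 24.75), (58.59, 25.01)]
cell_ids = np.array([h3.latlng_to_cell(lat, lon, 5) for lat, lon in points])

time = np.array(["2024-07-01", "2024-07-02"], dtype="datetime64[ns]")
data = np.random.default_rng(0).normal(20, 3, size=(len(time), len(cell_ids)))

ds = xr.Dataset(
    {"air_temperature": (("time", "cells"), data)},
    coords={
        "time": time,
        # The grid definition travels as metadata on the cell id coordinate;
        # the attribute names here are an assumption, not a fixed standard.
        "cell_ids": ("cells", cell_ids, {"grid_name": "h3", "level": 5}),
    },
)

# Chunked, compressed, cloud-friendly storage of the cube:
ds.to_zarr("dggs_cube.zarr", mode="w")
```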
But here, anyone who has ever opened a data set with Xarray knows what this means: you have the coordinates, and then you have the data variables, so here it's the air temperature that is indexed against the cell ID only. And the nice thing is that from the cell ID you can always go back to the location,
so we don't need lat-longs or anything else here. One thing we do have to do right now, when we start, is convert data into a DGGS. What we currently do for rasters, for example, is take the centroids and then derive the cell IDs.
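As an illustration, deriving cell ids from the centroids of a small lat/lon raster tile might look like the following sketch (h3 v4 bindings, made-up extent); a real conversion needs more care about matching resolutions and aggregating multiple pixels per cell, which is exactly the topic mentioned above.

```python
import numpy as np
import h3  # v4 API

# Centroids of a tiny 0.1-degree raster tile (made-up extent).
lats = np.arange(58.0, 58.5, 0.1)
lons = np.arange(26.0, 26.5, 0.1)
lon2d, lat2d = np.meshgrid(lons, lats)

# Derive one cell id per pixel centroid at a chosen refinement level.
level = 7
cell_ids = np.array([
    h3.latlng_to_cell(lat, lon, level)
    for lat, lon in zip(lat2d.ravel(), lon2d.ravel())
])

# Several pixels may fall into the same cell, so values are typically
# aggregated per cell id (e.g. a groupby/mean) rather than copied 1:1.
print(len(cell_ids), len(set(cell_ids)))

# The inverse direction is always available: cell id back to geometry.
print(h3.cell_to_latlng(cell_ids[0]))  # centroid of the first cell
```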
There's a whole other topic, which I probably have to write about, on the different things to consider when converting different types of data, especially at different resolutions, into a DGGS. One nice thing here is, as you can see, although the data is DGGS-indexed
and you only have cell IDs, you can still do a query on coordinates, because I know Tartu is at around 27 by 57.6 or something, lat-long, and that would be converted into the corresponding DGGS cells,
the zone identifiers, and this way you can still do a spatial subset. So finding a location via cell ID is still at the base of it. Inversely, if you need to go back to lat-long space,
you can always get the cell boundaries, which are polygons, or the cell centroids, as lat-long, because the DGGS libraries intrinsically know that conversion. It's basically like projections, where you know that at a given index,
with some geo origin, you can convert back and forth; the math is, so to say, stable. So, in order to save that meaningfully, we have been experimenting with Zarr.
I mean, Zarr itself doesn't need experimenting anymore, it's a fairly trusted data format for storing arrays, and it chunks the data, so it's cloud optimized: you can put it in cloud buckets, for example, and similar to cloud optimized GeoTIFFs
you don't need to read the whole file, you only go, based on the metadata, to the chunks that you need. It also supports compression of the chunks, so space-wise it's actually pretty okay.
And similarly, being inspired by, sort of inheriting from, NetCDF and the CF conventions, you can have a lot of metadata associated with it. This is really important, because right now the state of defining the configuration of a DGGS is still a little bit, I wouldn't say fragile, but every library and every system needs a couple of specific parameters.
It would be nice to have a common approach at some point, or something in this direction, like a nicer catalog of these. I mean, there are not so many, maybe a handful, but some of them have a couple of configuration options:
where the origin is, a rotation, or whether the indexing follows a space-filling curve this way or that way, nested or not. That needs to be stored, of course, as metadata, as a sort of coordinate reference information within the container,
and Zarr provides the means for that. Right now my colleague and I have been working with DGGRID, which is currently still a command line tool; we are also working with Kevin Sahr on making it more of a library, so then you would have a nice C++ library for that.
H3 might be something you have already come across, and others are HEALPix and rHEALPix, which actually originate from astronomy, from looking outward and tessellating the sphere as you look out, and then sort of making it useful for looking at the Earth.
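For completeness, HEALPix indexing itself is a one-liner with the healpy package. This is only a sketch: healpy uses an `nside` resolution parameter rather than a refinement level, and the nested ordering shown here is one of two supported orderings.

```python
import healpy as hp

nside = 2 ** 6           # HEALPix resolution parameter (12 * nside**2 cells globally)
lon, lat = 26.72, 58.38  # degrees

# lonlat=True lets us pass geographic longitude/latitude directly;
# nest=True selects the hierarchical ("nested") ordering of cell ids.
cell = hp.ang2pix(nside, lon, lat, nest=True, lonlat=True)
print(cell)

# And back: the centre of that cell, again in degrees.
print(hp.pix2ang(nside, cell, nest=True, lonlat=True))
```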
One interesting bit that we also did in the paper: I had a paper a few years back where I assessed the areal distortions of different DGGS, and at that time I missed HEALPix,
so we added it in this paper as well. As you can see, here is HEALPix, and it is really good too, similar to the hexagonal ones,
so that's pretty good. That is also documented in the article. And why is it interesting? Because although HEALPix is mostly used in astronomy, some digital twin efforts, especially in Europe,
with Destination Earth, the Green Deal data space, and the climate data spaces, and colleagues here, Tina from Ifremer, took HEALPix and are doing some amazing things with it.
That's why I want to show this here as well, and that's why we found it really important to mention HEALPix as an additional DGGS. And then a quick update on DGGRID: DGGRID recently had its 10-year anniversary,
and version 8.1 is now almost released, with some additional improvements on ISEA7H, which is a similar configuration to H3, but H3 is not quite equal-area, so that's why we put this extra effort into working with DGGRID.
There are some nice addressing schemes for ISEA3H and ISEA7H that give the indices meaningful parent-child relationships, so you can find the children that are related to one zone at the next resolution, and vice versa you can find the parents,
so you can do a group-by on parents and so on (a small sketch of that follows below). And we're working, as I said, towards Python bindings; having Python bindings in this case actually means having C-compatible bindings, which we hope will open up a whole new usability in different programming languages.
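As a sketch of what such a group-by-parents aggregation can look like once the data sits in Xarray (H3 is used here purely for illustration; with DGGRID and the ISEA grids the parent relation would come from the DGGRID bindings instead):

```python
import numpy as np
import xarray as xr
import h3  # v4 API

# A toy 1-D DGGS-indexed variable at H3 level 6 (made-up points and values).
points = [(58.38, 26.72), (58.39, 26.73), (59.44, 24.75), (59.45, 24.76)]
cell_ids = [h3.latlng_to_cell(lat, lon, 6) for lat, lon in points]
da = xr.DataArray(
    np.array([1.0, 2.0, 3.0, 4.0]),
    dims="cells",
    coords={"cell_ids": ("cells", cell_ids)},
    name="some_value",
)

# Map each cell to its parent at the coarser level 5 ...
parents = xr.DataArray(
    [h3.cell_to_parent(c, 5) for c in cell_ids], dims="cells", name="parent_ids"
)

# ... and aggregate: one value per coarser cell.
coarse = da.groupby(parents).mean()
print(coarse)
```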
Some examples and inspiration for how this has been used, not only by us: on the lower right is a suitability analysis for fuel station locations in Northern Europe, a classic suitability analysis like you would do with rasters,
where you have all these variables and then calculate the weighted suitability on top, which is easy because it is so simple to tabulate in a store like a ClickHouse database. Other companies and startups also use hexagonal grids to integrate different types of satellite imagery, like Planet and Sentinel, for their modeling.
You can find this particular paper here, and you can pip install xdggs; I think that currently only comes with H3 by default, so we are working on the DGGRID implementation,
and Ifremer is working on the HEALPix support, so some things are obviously still moving. And with that, I have used my 20 minutes.
We have been given some questions. I was wondering if you have compared the efficiency of compression algorithms
when you store a DGGS-indexed data set rather than a classical raster. Does it affect classical compression methods, like deflate or whatever? And most certainly there are some options you cannot use, like algorithms that are designed for 2D data,
like JPEG compression or JPEG 2000, stuff like that. So what are your thoughts about compression and DGGS-friendly storage? Yeah, that's a very good question, thank you. So, the format option versus the data indexing type, so to say.
So, right now we mostly work with two main formats. One is Zarr, and Zarr compression is fairly good; you use the compressors that come with the Zarr package,
like Blosc and zlib or Zstd. At a similar resolution, so that you have a similar number of pixels, so to say, and a similar number of values, that behaves fairly similarly.
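For reference, choosing a compressor is a per-variable encoding option when writing the cube; the following sketch assumes xarray with zarr-python 2.x and numcodecs, and uses placeholder data and store names.

```python
import numpy as np
import xarray as xr
import numcodecs

# A stand-in DGGS cube: 2 time steps x 4 cells (placeholder values and ids).
ds = xr.Dataset(
    {"air_temperature": (("time", "cells"),
                         np.random.default_rng(0).normal(20, 3, (2, 4)))},
    coords={"cell_ids": ("cells", np.arange(4, dtype="uint64"))},
)

# Per-variable Zarr encoding: Blosc with the Zstd codec, plus explicit chunking.
encoding = {
    "air_temperature": {
        "compressor": numcodecs.Blosc(cname="zstd", clevel=5),
        "chunks": (1, 4),  # (time, cells) chunk sizes, tune to the access pattern
    }
}
ds.to_zarr("dggs_cube_zstd.zarr", mode="w", encoding=encoding)
```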
The other one would be Parquet. Parquet also has pretty good compression, but it is purely tabular; in Zarr you can have additional dimensions like time, which you can't really model in Parquet directly. And last but not least, purely operationally, there is ESA and the Testbed 16 with GeoServer, and as an operational data store
we use ClickHouse, which is an OLAP database and also has really good compression. So, loading the data in, purely from a raster point of view, it stays at a similar compression. But the concept is slightly different:
it's not purely a binary block that is compressed like a TIFF, for example, or something like that. A more interesting question that I got the other day is computational efficiency. Certain algorithms that work on rasters,
based on 2D access, would obviously not work for hexagons. They could work for quadrilateral cells, and there are also algorithms for hexagonal grids, but the data access, I think, is not as mature yet.
With xdggs, being able to use the general scientific ecosystem of Xarray, and the way Xarray shuffles data around, is a good way right now to explore that a bit more in an operational space, so to say. So, we are very lucky that since last year
we have actually moved on from a more scientific, theoretical point of view. I mean, Luis has followed the progress; we're now going to try to really do things with it. Let me just add to this that if you're still storing this data in the raster concept,
usually what happens is that you transform a zone into a one-dimensional array, and when you do that transformation there are some methods from traditional raster processing that you can still use. And Jérôme St-Louis from GNOSIS is working on this.
Any other questions? Yes, please, let's wait for the microphone. Thank you for your presentation. I'd like to ask about HEALPix: is it supported right now,
or is it still going to be developed? So, you're asking if this can be used with HEALPix? Yeah, yeah. Well, Ifremer is using it, but I'm not quite sure; Justus, I'm not sure if Justus has done a couple of pull requests. He and Benoît are the main maintainers
who decide which code goes in, so I'm not sure at which stage the operational HEALPix support is. We have a little bit of a situation right now because of some heavy dependencies: each of these DGGS libraries, except maybe H3, comes with its own heavy dependencies. HEALPix, or cdshealpix, for example, is a fairly heavy package,
and DGGRID comes with its own challenges. So, the way the Python packages are currently set up with their dependencies, we don't want to pull in all of those things, so we have to modularize this a little bit. But in general, they are working with it and they do that. Okay, great. Thank you.