XDGGS: A community-developed Xarray package to support planetary DGGS data cube computations
Formal Metadata
Number of Parts | 156
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/68575 (DOI)
FOSS4G Europe 2024 Tartu, part 154 / 156
Transcript: English (auto-generated)
00:00
Hello everybody. Thank you very much for the introduction, Luis. Today I would like to present our work on XDGGS, a community-developed Xarray package to support planetary DGGS data cube computations. That is a really long title, with even more authors. I'll go into the background a little first, just to catch everyone up.
00:22
So, Xarray is a really, really widely used Python library for working with arrays. In the Earth observation community it is widely used, within the Pangeo and Jupyter ecosystems, for data cube calculations, and in the climate space as well.
00:48
DGGS means Discrete Global Grid Systems, and it's a new form of spatial reference system. And at the BiDS conference, Big Data from Space, last year in November in Vienna,
01:03
shout out to Stefanie Lumnitz, three big communities came together: the OGC, the Open Geospatial Consortium, where, as a disclosure, I am also co-chair of the Discrete Global Grid Systems working group.
01:20
Then there's also Peter Strobl from the JRC, and the Pangeo developers, especially Anne Fouilloux, Tina Odaka and Ryan. And this all happened at the joint Pangeo OSGeo code sprint. I was sitting with Tom Kralidis, trying to hack something into pygeoapi as well. So, what is Pangeo?
01:41
So, because we are at FOSS4G, which is an OSGeo conference: Pangeo is an international community, a bit more geoscience related, for open source, inclusive, and scalable software for planetary, large environmental, etc.
02:02
data computation. It is strongly Python, but there's also Julia and R, so it's more about computation and programming. And there's a huge ecosystem. I'm 100% sure most of you will have been in touch with some of it, if you have worked with Jupyter, scaling out with Dask, Xarray,
02:23
and the whole Matplotlib and PyViz visualization things. The nice thing with this type of ecosystem is that you can develop on your computer and then scale out on HPC or in the cloud using the same stack, which is really nice. So, a quick background on Xarray. Xarray is a bit of a foundation in many scientific workflows.
02:45
It's an open source project. It has already been around for a while. And it's for n-dimensional labeled arrays. So, if you have worked with rasters and have loaded a raster as a NumPy array into memory, for example, then you sort of know where we are at. And this takes it to a whole other level.
03:02
You can chunk it and work out of core on larger and larger data. And it provides labels, so you don't have to know the index number into the array. So, this really helps you to program with it. And it's integrated with a lot of other packages.
03:21
Okay, so this is sort of the background. This is the status quo, widely used. Then, discrete global grid systems. It's, as I said, a new type of spatial reference. You can see it as a bit of a hybrid between vector and raster, with the idea that the cells or pixels that cover the whole globe
03:44
have approximately the same area. The Open Geospatial Consortium has already put that into a so-called abstract specification, which also becomes an ISO standard. And they define it like this: a discrete global grid system is a spatial reference system that uses a hierarchical tessellation
04:03
of cells to partition and address the globe. So, what does that mean? Hierarchical means that you have, as you can see on the right side, different resolutions, very similar to zoom levels if you have web tiles, for example. Partition means you have subsets that cover,
04:23
that refer to, a certain area on the globe. And address means each cell, each pixel, has a unique ID. This property gives you some really, really nice data management features: you can associate lots of data not only to a certain point,
04:40
but to a certain area on Earth that is uniquely identifiable and indexable. So, this has also been recognized by the United Nations Committee of Experts on Global Geospatial Information Management, UN-GGIM. We indeed need to combine more and more tabular and spatial data.
05:00
But the challenge we often run into, especially in terms of spatial units: we have polygons of different shapes and sizes, administrative boundaries, of course not of the same area. So dealing with area statistics, you know, you always have a couple of trip wires; you have to normalize and so on. Then we have rasters. Even if, like we do in data cube computations,
05:23
we have rasters of the same resolution, let's say 10 meters, the rasters also have to be aligned and in the same coordinate reference system in order for you to do a proper drill down. When you choose a DGGS, that is sort of taken care of for you because, as I said,
05:43
the cells are uniquely identifiable; they're always in the same place. If you then decide to integrate data with one of these, you can also do summary aggregations to the coarser resolutions. So, this provides a nice framework, especially for data integration.
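The roll-up to coarser resolutions mentioned here can be sketched in a few lines. This is only an illustration, not the behaviour of any particular DGGS library: it assumes a hypothetical addressing scheme in which a cell ID is a string and the parent ID is obtained by dropping the last character. Real systems such as H3 or DGGRID provide their own parent functions, and the population values are made up.

```python
from collections import defaultdict


def parent(cell_id: str) -> str:
    """Hypothetical rule: the parent ID is the child ID minus its last character."""
    return cell_id[:-1]


def aggregate_to_parent(values: dict[str, float]) -> dict[str, float]:
    """Sum fine-resolution values into their parent cells."""
    totals: dict[str, float] = defaultdict(float)
    for cell_id, value in values.items():
        totals[parent(cell_id)] += value
    return dict(totals)


# Made-up population counts at a fine resolution.
population_fine = {"A10": 120.0, "A11": 80.0, "A12": 50.0, "A20": 10.0, "A21": 40.0}
population_coarse = aggregate_to_parent(population_fine)
print(population_coarse)  # {'A1': 250.0, 'A2': 50.0}
```

Because every fine cell belongs to exactly one parent, the totals at the coarse level are preserved without any resampling step.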
06:06
So, maybe to visualize the actual gridding problem a little bit. One big use case is also, increasingly, with the statistical agencies in Europe. So, we have Eurostat for all of Europe.
06:22
They, of course, have a statistical grid that is based on the European coordinate reference system, ETRS89-LAEA, EPSG:3035. Then if we want to use satellite data, we have Sentinel-2, for example, in UTM, with different UTM zones, and if we go further to the north,
06:44
those zones even overlap, so there's oversampling. It has to be resampled into your data cube, because you usually have to decide which projection you use, a continental or a local projection. So, you usually have to fit everything into that one,
07:00
and that projection can usually not cover too big an area, otherwise you run into areal distortions. And on the right side, for example in a hexagon-based DGGS, it would always look the same; the vertices would always be in the same location, so you wouldn't run into such issues anymore.
07:21
So, in terms of hierarchical, so multi-zoom, that sounds a bit fancy here. As you aggregate data, you can use different data variables at different zoom levels. Then you zoom in, and this would be Estonia,
07:41
and as you zoom in further, you still have a structured grid, which is good. And here, maybe as a helper, you have cell IDs that refer to a place on Earth, and you associate arbitrary information with them. You can technically make your table as wide as you want it to be,
08:06
and another thing is combining data from different sources. So, let's assume, as ESA is doing right now in a study, that Sentinel data is available in a DGGS. You wouldn't download the whole UTM zone of 3 to 7 gigabytes;
08:21
you would just select: I need only the data for these and these zones, for these and these cells. Then you maybe have population data from the statistical office, and because of the IDs, you can do a simple table join. Dealing with discrete global grid systems, one thing you typically have to do is produce data sets,
08:43
and right now we're in the situation that we obviously don't have data collection at the DGGS level. We only have vector data and raster data, so right now we have to do re-gridding. Funnily though, the Pangeo community, which is also very active in the climate and ocean space,
09:01
they do re-gridding all the time, because all their climate and ocean grids are actually at different resolutions, on different grids. They do that all the time anyway, so they know about that. For many statistical things, species, population, we do binning, as we already do nowadays: we bin point data into spatial bins like squares or hexagons,
09:26
and then, typically, you want to do aggregations over larger areas, and you want to have cell boundaries and neighbors. A nice thing with hexagon topologies is that you no longer run into the four- or eight-neighborhood question,
09:42
so technically you always have direct neighbors all around. Other things you can typically do are, of course, geometric operations like cell overlap, intersection, and so on. And you have to eventually store that. With XDGGS now being an extension to Xarray,
10:04
you basically still work in Xarray. You have your coordinates and your dimensions, but the main dimension that we came up with is the cell ID, which is basically a 1D array, and all your data is associated to the cell ID,
10:21
so then you just have to store it in a meaningful container, one that supports array data and meaningful metadata. The NetCDF model has been an inspiration for this type of data. We would of course take it further and store it in chunked arrays in Zarr. I'll come to that a little bit later.
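The "simple table join" on cell IDs described earlier can be illustrated without any geospatial library at all. The sketch below assumes two independent tables keyed by the same made-up cell IDs; the IDs and values are placeholders, not real data.

```python
# Two independent tables keyed by the same (made-up) cell IDs.
air_temperature = {"c001": 14.2, "c002": 15.1, "c003": 13.8}
population = {"c002": 1200, "c003": 310, "c004": 45}


def join_on_cell_id(left: dict, right: dict) -> dict:
    """Inner join: keep only the cells that are present in both tables."""
    return {cell: (left[cell], right[cell]) for cell in left.keys() & right.keys()}


joined = join_on_cell_id(air_temperature, population)
for cell in sorted(joined):
    print(cell, joined[cell])
```

No geometry is involved in the join itself: since a cell ID always refers to the same place, matching keys is enough to combine the two sources.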
10:42
But here, anyone who has ever opened a data set with Xarray knows what this means. You have the coordinates, and then you have the data variables; here it's the air temperature, which is indexed against the cell ID only. And the funny thing is, from the cell ID you can always go back to the location,
11:02
so here we don't need lat-longs or anything else. One thing that we have to do right now, when we start, is convert data into a DGGS. What we do right now, for example for rasters, is take the pixel centroids and derive the cell IDs from them.
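The two directions described here, deriving a cell ID from a centroid and going back from a cell ID to a location, can be sketched with a toy grid. Note the assumptions: this is a plain equal-angle latitude/longitude grid with a hypothetical 0.5 degree cell size and row-major integer IDs. It is not equal-area and not a real DGGS; it only shows the shape of the conversion.

```python
CELL_SIZE_DEG = 0.5  # hypothetical cell size, chosen only for illustration
N_COLS = int(360 / CELL_SIZE_DEG)
N_ROWS = int(180 / CELL_SIZE_DEG)


def latlon_to_cell(lat: float, lon: float) -> int:
    """Map a point to the integer ID of the toy grid cell that contains it."""
    row = min(int((lat + 90.0) // CELL_SIZE_DEG), N_ROWS - 1)
    col = int(((lon + 180.0) % 360.0) // CELL_SIZE_DEG)
    return row * N_COLS + col


def cell_to_latlon(cell_id: int) -> tuple[float, float]:
    """Return the centre (lat, lon) of a toy grid cell."""
    row, col = divmod(cell_id, N_COLS)
    lat = -90.0 + (row + 0.5) * CELL_SIZE_DEG
    lon = -180.0 + (col + 0.5) * CELL_SIZE_DEG
    return lat, lon


# Made-up raster pixel centroids as (lat, lon); the first two fall in the same cell.
centroids = [(58.38, 26.72), (58.39, 26.73), (59.44, 24.75)]
cell_ids = [latlon_to_cell(lat, lon) for lat, lon in centroids]
for cell_id in sorted(set(cell_ids)):
    print(cell_id, cell_to_latlon(cell_id))
```

Several centroids can map to the same cell, which is where the choice of target resolution and aggregation rule starts to matter.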
11:21
There's a whole other topic that I probably have to write about: the different things to consider when converting different types of data, especially at different resolutions, into a DGGS. So one funny thing here is that, as you can see, the data is indexed by DGGS cell;
11:44
you only have cell IDs, but you can still do a query on coordinates, because you know, I know, Tartu is around 27 by 57.6 or something, lat-long, and that would be converted into the corresponding DGGS cells,
12:05
the zone identifiers, and this way you can still do a spatial subset. So finding the location via the cell ID is still at the base of it. Inversely, if you need to go back to lat-long space,
12:23
you can always get the cell boundaries, which are the polygons, or the cell centroids, as lat-long, because a DGGS intrinsically knows that conversion. It's basically like projections, where you know that at that index,
12:43
you know, with some geo origin, the conversion, you can go back and forth. So the math is, so to say, stable. Now, in order to save that meaningfully, we have been experimenting with Zarr,
13:02
I mean, Zarr itself doesn't need experimenting anymore; it's a fairly trusted data format for storing arrays. It chunks the data, so it's cloud optimized. You put it in cloud buckets, for example, and similar to cloud optimized GeoTIFFs,
13:21
you don't need to read the whole file; based on the metadata, you only go to the chunks that you need. It also supports compression of the chunks, so space-wise it's actually pretty okay,
13:40
and, being inspired by, sort of inheriting from, NetCDF and the CF conventions, you can have a lot of metadata associated. And this is really important, because right now the state of defining the configuration of a DGGS is still a little bit, I wouldn't say fragile, but every library and every system needs a couple of specific parameters,
14:07
it would be nice to maybe have that approach at some point, or something in this direction, like a nicer catalog of them. I mean, there are not so many, maybe a handful, but some of those have a couple of configuration options:
14:21
where the origin is, a rotation of the thing, or whether the indexing follows a space-filling curve, like this way, or this way, nested, or those types. That needs to be stored, of course, as metadata, as sort of coordinate reference information, with the container,
14:42
and Zarr provides the means for that. Right now, you and my colleague have been helping; we're working with DGGRID, which is currently still a command line tool. We are working with Kevin Sahr on making it more of a library, so then you would have a nice C++ library for that.
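The two storage ideas just described, grid configuration kept as metadata next to the data and chunked access that only touches what is needed, can be mimicked with plain Python. This is a toy in-memory stand-in, not the Zarr API; the attribute names (`grid_name`, `resolution`, `indexing_scheme`) are hypothetical and do not follow any agreed convention.

```python
import json

CHUNK_SIZE = 4  # cells per chunk, tiny on purpose

# Hypothetical grid description; the key names are illustrative only.
grid_attrs = {"grid_name": "example-grid", "resolution": 5, "indexing_scheme": "nested"}

cell_ids = list(range(100, 112))        # 12 sorted, made-up cell IDs
values = [float(i) for i in range(12)]  # one value per cell

# Toy "store": metadata as JSON text plus a mapping of chunk number to values.
store = {
    "attrs": json.dumps(grid_attrs),
    "chunks": {
        number: values[start:start + CHUNK_SIZE]
        for number, start in enumerate(range(0, len(values), CHUNK_SIZE))
    },
}


def read_value(store: dict, position: int) -> float:
    """Read one value by array position, touching only the chunk that holds it."""
    chunk_number, offset = divmod(position, CHUNK_SIZE)
    return store["chunks"][chunk_number][offset]


position = cell_ids.index(106)
print(json.loads(store["attrs"])["indexing_scheme"], read_value(store, position))
```

A reader that finds the grid description in the attributes knows how to interpret the cell IDs without any side channel, which is the point of storing it with the container.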
15:03
H3 might be a thing you have already come across, and others are HEALPix and rHEALPix, which actually originate from astronomy, from looking outward and tessellating the sphere as you look outward, and then sort of making it useful for looking onto the Earth.
15:26
So, one interesting bit that we also did in the paper: I had a paper a few years back in which I assessed the areal distortions of different DGGSs, and at that time I missed HEALPix,
15:42
so we added that in this paper as well. And as you can see, where's the thing here, this is the, yeah, no, actually it's a point here, so here's HEALPix, and you see it's really, really good as well, similar to the hexagonal ones,
16:03
hexagonal here, hexagonal ones, so that's pretty good. That is also documented in the article. And why is it interesting? Because, although HEALPix is really used in astronomy, some digital twins, especially in Europe,
16:22
we have the digital twin thing going on with Destination Earth, the Green Deal, and climate data spaces, and the colleagues here, Tina from Ifremer, they took HEALPix and they do some amazing things with it,
16:40
so that's why I want to show this here as well, and that's why we found it really important to also mention HEALPix as an additional DGGS. And then a quick update on DGGRID. DGGRID is actually 10 years old, it recently had its 10-year anniversary,
17:00
and version 8.1 is now just about released, with some additional improvements on ISEA7H. That is a similar configuration to H3, but H3 is not so much equal area, so that's why we put this extra effort into working with DGGRID,
17:22
and there are some nice addressing schemes for ISEA3H and ISEA7H that give these indices really meaningful parent-child relationships. So you can find the children that are related to one zone at the next resolution and, vice versa, you can find the parents,
17:43
so you can do a group-by on parents and so on. And we're working, as I said, towards Python bindings. Having Python bindings in this case actually means having C-compatible bindings, which we hope will open up a whole new usability in different programming languages.
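What "meaningful parent-child relationships" in an index can look like is sketched below. The scheme is hypothetical and is not the actual DGGRID addressing: a cell ID is a base-cell letter followed by one digit (0 to 6) per resolution level, mimicking a hierarchy in which each cell has seven children.

```python
APERTURE = 7  # seven children per cell in this hypothetical hierarchy


def children(cell_id: str) -> list[str]:
    """IDs of the cells one resolution finer that belong to this cell."""
    return [cell_id + str(digit) for digit in range(APERTURE)]


def resolution(cell_id: str) -> int:
    """Resolution level: the number of digits after the base-cell letter."""
    return len(cell_id) - 1


def ancestor_at(cell_id: str, target_resolution: int) -> str:
    """Climb the hierarchy to the ancestor at a coarser (or the same) resolution."""
    if not 0 <= target_resolution <= resolution(cell_id):
        raise ValueError("target resolution must not be finer than the cell itself")
    return cell_id[: target_resolution + 1]


print(children("B3"))
print(ancestor_at("B352", 1))
```

With an index of this kind, finding children and parents is string manipulation rather than a geometric search, which is what makes a group-by on parents cheap.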
18:03
Some examples, for inspiration, of how this has been used, not only by us. On the lower right, it's a suitability analysis for fuel station locations in Northern Europe, so a classic suitability analysis like you do with rasters:
18:21
you have all these variables, and then you calculate the weighted suitability on top. And because it's so easy to tabulate in a store like the ClickHouse database, other companies, startups, also use hexagonal grids to integrate different types of satellite imagery, like Planet and Sentinel, to do their modeling.
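A weighted suitability calculation of the kind mentioned here reduces to a weighted sum per cell once all variables share the same cell IDs. The criteria names, weights, and scores below are invented for illustration and are not taken from the study shown in the talk.

```python
# Hypothetical criteria weights (they sum to 1.0).
weights = {"road_access": 0.5, "population": 0.3, "competition": 0.2}

# Made-up, already normalised scores (0 to 1) per cell and criterion.
criteria = {
    "c001": {"road_access": 0.9, "population": 0.4, "competition": 0.7},
    "c002": {"road_access": 0.2, "population": 0.9, "competition": 0.5},
    "c003": {"road_access": 0.8, "population": 0.8, "competition": 0.9},
}


def suitability(cell_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the criterion scores for one cell."""
    return sum(weights[name] * cell_scores[name] for name in weights)


scores = {cell: suitability(cell_scores, weights) for cell, cell_scores in criteria.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

In a raster workflow the same step needs aligned grids; here the alignment is implied by the shared cell IDs.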
18:42
You can find this particular paper here, and there is pip install xdggs. That will, I think, by default only come with H3, so we are working on the DGGRID implementation,
19:01
and Ifremer is working on the HEALPix stuff, so some stuff is obviously still moving. And with that, I have done my 20 minutes.
19:21
We have time for some questions. I was wondering if you have compared the efficiency of compression algorithms
19:40
when you store a DGGS-indexed data set rather than a classical raster. Does it affect the classical compression methods, like deflate or whatever? And most certainly there are some options you cannot use, like algorithms that are designed for 2D data,
20:02
like JPEG compression or JPEG 2000, stuff like that. So what are your thoughts about compression and DGGS-friendly storage? Yeah, so that's a very good question. Thank you. So, it's the format option versus the data indexing type, so to say.
20:26
So, right now, we mostly work with two main types. One is Zarr, and Zarr compression is fairly good; you use the things that come with the Zarr package,
20:41
like Blosc and zlib or Zstd. Those, at a similar resolution, so that you have a similar number of pixels, so to say, and a similar number of values, behave fairly similarly. Then another one would be Parquet.
21:01
Parquet also has pretty good compression, but it is purely tabular; in Zarr you would maybe have additional dimensions like time, which you can't really model in Parquet. And last but not least, purely operational, so for ESA and in Testbed 16 with GeoServer, and also as an operational data store,
21:22
we use ClickHouse. The ClickHouse database is an OLAP database, and it also has really good compression. So, loading the data in, purely from a raster point of view, stays at a similar compression. But the concept is slightly different,
21:42
so it's not purely a binary block, let's say, that is compressed like a TIFF, for example, or something like this. And a more interesting question that I had the other day is computational efficiency. Certain algorithms that work on rasters
22:01
based on 2D access would not work for hexagons, obviously. They could work for quadrilateral grids, and there are also algorithms for hexagonal ones, but the data access, I think, is not as mature yet,
22:21
but with XDGGS, being able to use the general scientific ecosystem of Xarray, and the way Xarray shuffles stuff around, this is a good way right now to explore that a bit more in an operational space, so to say. So, we are very lucky that, since last year,
22:41
we are now actually moving on from a more scientific, theoretical point of view. I mean, Luis has followed the progress. We're now going to try to really do stuff with it. Let me just add to this that if you're still storing this data in the raster concept,
23:01
usually what happens is that you transform a zone into a one-dimensional array. And when you do that transformation, there are some methods from traditional 2D rasters that you can still use. And Jerome Saint-Louis from Gnosis is starting this team.
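One way to explore the compression question raised in this exchange is to flatten the same values into a one-dimensional array in two different orders and compress both with a general-purpose codec. The sketch below is an experiment setup under stated assumptions, not a result: the field is synthetic, zlib stands in for whatever codec a real store would use, and the Z-order (Morton) curve stands in for a nested cell index.

```python
import struct
import zlib

SIZE = 64  # side length of the synthetic square grid

# Smooth synthetic 2D field with small integer values.
grid = [[(row + col) // 8 for col in range(SIZE)] for row in range(SIZE)]


def row_major(grid: list[list[int]]) -> list[int]:
    """Flatten row by row, as a classical raster would be stored."""
    return [value for row in grid for value in row]


def morton_key(row: int, col: int, bits: int) -> int:
    """Interleave the bits of row and col to get a Z-order curve position."""
    key = 0
    for bit in range(bits):
        key |= ((row >> bit) & 1) << (2 * bit + 1)
        key |= ((col >> bit) & 1) << (2 * bit)
    return key


def z_order(grid: list[list[int]]) -> list[int]:
    """Flatten along a Z-order curve, a stand-in for a nested cell index."""
    n = len(grid)
    bits = n.bit_length()
    coords = sorted(
        ((row, col) for row in range(n) for col in range(n)),
        key=lambda rc: morton_key(rc[0], rc[1], bits),
    )
    return [grid[row][col] for row, col in coords]


def compressed_size(values: list[int]) -> int:
    """Pack values as unsigned 16-bit integers and return the zlib-compressed size."""
    raw = struct.pack(f"<{len(values)}H", *values)
    return len(zlib.compress(raw, 6))


print("row-major:", compressed_size(row_major(grid)), "bytes")
print("z-order:  ", compressed_size(z_order(grid)), "bytes")
```

Which ordering compresses better depends on the data and the codec, so the printed sizes should be read as one data point for one synthetic field.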
23:25
Any other questions? Yes, please. Let's wait for the microphone. Thank you for your presentation. I'd like to ask about HEALPix. Is HEALPix supported right now,
23:41
or is it going to be developed? So, you're asking if this can be used with HEALPix? Yeah, yeah. Yeah, so Ifremer is using it. I'm not quite sure; so, Justus, I'm not sure if Justus has done a couple of pull requests. He and Benoit are like the main maintainers
24:01
who make sure which code goes in. So, I'm not sure at which stage the operational stuff is. We have a little bit of a situation right now because we have some heavy dependencies. Each of these DGGS libraries, except maybe H3, comes with its own heavy dependencies. HEALPix, or cdshealpix, is a fairly heavy package.
24:22
DGGRID comes with its own challenges. So, right now, the way the Python packages are set up with the dependencies, we don't want to pull in all those things, so we have to modularize this a little bit. But in general, they are working with it and they do that. Okay, great. Thank you.