
State of GDAL

Formal Metadata

Title
State of GDAL
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year: 2022

Content Metadata

Abstract
We will give a status report on the GDAL software, focusing on recent developments and achievements in the GDAL 3.4 and 3.5 versions released during the last year, but also on the general health of the project. In particular, we will present new drivers such as the one handling Zarr datasets (a format for the storage of chunked, compressed, N-dimensional arrays) or the Spatio-Temporal Asset Catalog Items driver to create virtual mosaics from STAC items, and potential future additions such as a new JPEG 2000 driver based on the Grok library, a driver for the SAP HANA database, or drivers for columnar storage formats such as Apache Parquet and Arrow. The topic of coordinate epochs in geospatial datasets and how we have addressed it in various formats (GeoTIFF, GeoPackage, FlatGeobuf) will also be mentioned, as well as other improvements such as the JPEG-XL codec for the GeoTIFF format or support for 64-bit integer data types in rasters. We will present the new CMake build system, the roadmap for its implementation, and its advantages for users and developers.
Transcript: English (auto-generated)
Hi, so my name is Even Rouault, I'm an independent free and open source software developer, mostly focused on GDAL, MapServer, PROJ and QGIS. In this talk I'm going to go over the changes GDAL has received this past year, in the 3.4 and 3.5 releases, and I will also talk a bit about future directions.
So what's GDAL, in one slide? It stands for the Geospatial Data Abstraction Library, which is a black box you often use without realising it when you want to read and write geospatial formats in most C or C++ open source or closed source GIS software. As of today it handles roughly 250 different formats and, as a trend of recent years, also network protocols and services. GDAL is released under the MIT open source license, which is very permissive, and we release a new version with features every six months and bug-fix releases every two months. In GDAL 3.4 we have added a new driver to read and write datasets in the Zarr format.
So what is Zarr, very quickly? It's a cloud-oriented format for the storage of chunked, compressed, multi-dimensional arrays, and here I've tentatively tried to represent a 3D indexed grid. You have the concept of chunking, which is the process of grouping together data values along a number of samples in each dimension. Functionally, Zarr shares a lot of concepts with the NetCDF or HDF5 formats, and it's actually quite common to see Zarr datasets that have been converted from NetCDF.
Similarly to NetCDF 4, in Zarr you have a hierarchical organisation of arrays in groups. The values of arrays can be numeric data types, strings, compound data types and some more esoteric data types. Metadata can be attached to arrays and is defined in JSON files, and one big difference with NetCDF or HDF5 is that Zarr is a multiple-file format: each chunk of data is stored as a separate file. This has advantages for parallel updates of datasets, or if you need to extend the dataset in any of its dimensions, for example time. Zarr also supports many different lossless compression methods, with filters such as delta encoding which can increase the efficiency of compression.
One important point to have in mind is that Zarr doesn't come from the geo community, so you might encounter difficulties, as you might already have with NetCDF or HDF5, where there is no canonical way of encoding georeferencing. As I said, a number of datasets are actually converted from NetCDF, so the NetCDF CF conventions are often found in Zarr datasets. Also to be noted, the Zarr V2 specification, which is the one actually used in production currently, has been submitted as a candidate OGC community standard.
So what exactly is in the GDAL Zarr driver? It has read and write capabilities. It's natively written to use the new GDAL multidimensional API, which was an addition in GDAL 3.1, but it also supports exposing Zarr datasets as classic 2D rasters for easier consumption. The driver handles the most common Zarr data types, mostly the numeric ones as well as strings, and it works both with local datasets and with remote datasets stored on the usual commercial cloud storage you use nowadays. The driver supports the current Zarr V2 specification and also the experimental V3 one, and it has support for CRS encoding as specific metadata in the JSON files, where the CRS can be encoded as WKT or PROJJSON. In some scenarios you can use multithreading capabilities for parallel decoding of chunks.
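As a minimal sketch of consuming such a dataset through the GDAL Python bindings and the multidimensional API (the path example.zarr and the choice of the first array are placeholders, not something from the talk):

    from osgeo import gdal

    gdal.UseExceptions()

    # Open a hypothetical local Zarr store through the multidimensional API
    ds = gdal.OpenEx("example.zarr", gdal.OF_MULTIDIM_RASTER)
    root = ds.GetRootGroup()

    # Walk the hierarchy: groups contain sub-groups and arrays
    print(root.GetGroupNames())
    print(root.GetMDArrayNames())

    # Read one array (picked arbitrarily here) into a NumPy array
    array = root.OpenMDArray(root.GetMDArrayNames()[0])
    print([dim.GetName() for dim in array.GetDimensions()])
    data = array.ReadAsArray()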
Here I've put a few links to the Zarr specification itself, the documentation of its reference Python implementation, the documentation of the GDAL driver itself, and a link to the evaluation engineering report of the OGC Testbed where the driver has been developed.
Another driver that has been added in GDAL 3.4 is STACIT. STACIT stands for Spatio-Temporal Asset Catalog Items. You are probably already familiar with it, but STAC is a family of specifications and APIs for cataloguing and discovering dataset metadata, and STAC items are the last level of the catalog when browsing a STAC hierarchy. For the driver to be able to use STAC items in a useful way
from a GDAL perspective, it requires that each item, each image, is published with a bit of extra information giving the CRS, the size in pixels, the resolution and the extent. With all of that, the driver can build a virtual mosaic which can be used directly, of course, but which can also be serialized as a VRT file, a GDAL-specific format for virtual files and mosaics. When you store it as a VRT, it will freeze the list of items
and enable direct later reuse. Here I've presented an example of a query against the STAC API implementation of the Microsoft Planetary Computer. The query has a filter on the collection of interest, the bounding box and the datetime range you are interested in. As you can see here, it's actually reported as a VRT dataset, a virtual raster, you have the list of all the tiles that participate in this request, and you get a GDAL dataset.
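As a rough sketch of that kind of usage from the Python bindings (the search URL, collection and date range below are placeholders, not the exact query from the slide; check the STACIT driver documentation for the exact connection-string syntax):

    from osgeo import gdal

    gdal.UseExceptions()

    # Hypothetical STAC API search request; the STACIT: prefix selects the STACIT driver
    url = ('STACIT:"https://planetarycomputer.microsoft.com/api/stac/v1/search'
           '?collections=sentinel-2-l2a&bbox=2,49,3,50&datetime=2022-06-01/2022-06-30"')

    ds = gdal.Open(url)
    print(ds.RasterXSize, ds.RasterYSize, ds.RasterCount)

    # Freeze the list of items into a VRT for direct later reuse
    gdal.Translate("mosaic.vrt", ds, format="VRT")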
Now, a rather advanced topic about coordinate reference systems. If you store coordinates in a so-called dynamic CRS, things like WGS 84 or ITRF, you need to be aware that in those CRS the coordinates of ground points are not fixed: they move over time. This is a phenomenon of the order of a few centimetres per year, but over several decades it can amount to shifts of one or several metres. So if you want to do precise coordinate conversion to other CRS, you need to qualify each coordinate with a coordinate epoch. Most of the time we are lucky and the coordinates of the same dataset are referenced against the same epoch. So in GDAL, we have
decided to add an optional coordinate epoch attribute, which is stored in the OGR SpatialReference class. With that, we are able to propagate the information down to PROJ, so you can have accurate transformations between dynamic CRS and static CRS using time-dependent coordinate transformations.
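As a small sketch of what that looks like through the Python bindings (the epoch value and target CRS are arbitrary choices for illustration):

    from osgeo import osr

    # A dynamic CRS: plain WGS 84 is used here for illustration
    srs = osr.SpatialReference()
    srs.ImportFromEPSG(4326)

    # Attach the coordinate epoch (expressed as a decimal year) to the SRS
    srs.SetCoordinateEpoch(2021.3)
    print(srs.GetCoordinateEpoch())

    # The epoch is then taken into account when building transformations to other CRS
    target = osr.SpatialReference()
    target.ImportFromEPSG(7844)  # e.g. a static, plate-fixed CRS (GDA2020)
    ct = osr.CoordinateTransformation(srs, target)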
The main GDAL and OGR utilities have also been updated to be able to specify a source and target coordinate epoch. And we have worked with format specifications to be able to store that coordinate epoch in a standardized way in a few popular formats like GeoTIFF, GeoPackage or FlatGeobuf. Of course, we have also added that capability in GDAL's own formats, VRT and the .aux.xml sidecar file. That said, if you have the choice, I'd suggest not using dynamic CRS when you can, but rather using static, plate-fixed CRS, to avoid all the complications of coordinate epochs and time-dependent coordinate transformations. Now let's move on to GDAL 3.5, which was
released in May this year. One of the big work items was adding a CMake build system. For those who are not familiar with CMake, it's an open source, cross-platform tool to manage the software compilation process. Up to now GDAL had two different build systems, one for Unix-like systems and another one for Windows. Having two different build systems, where most of what they do is supposed to be common, was a maintenance and usage burden. Capabilities were not always identical and option naming was not consistent between the two systems. Parallel builds were possible but a bit limited too. And for GDAL developers, they were lacking important features such as tracking of header dependencies. So the choice of CMake was really obvious to solve those issues, particularly because it also has good support in IDEs such as Visual Studio, and users had been crying out for CMake support in GDAL for many years, maybe ten years or more.
The plan and schedule agreed with the community was to add CMake in GDAL 3.5 as an extra build system alongside the existing ones. The existing ones were deprecated but kept in GDAL 3.5, and in GDAL 3.6 we will remove those old autoconf and nmake build systems so that CMake is the only one. That's actually what was done in the GDAL master branch two weeks ago, so we are fully on track with the plan.
Huge credit belongs to Hiroshi Miura, who provided the initial material for the new build system. This was really an amazing contribution, given all the dependencies that GDAL must address, and other contributors have also helped a lot to improve it once it was merged into GDAL master. I should also say that, given the many-months-long effort this required to be polished and usable for production, it wouldn't have been possible without the funding provided by the GDAL sponsorship program.
Feature-wise, one big addition in GDAL 3.5 is the new GeoParquet and Arrow drivers. Both formats are open source, open-specification tabular file formats with data organized in a column-oriented fashion. What does that mean? The information for a given attribute or field is packed together in the file by groups of many rows, whereas most vector formats are row-oriented; you can think of a CSV file, for example. The columnar organization of Parquet and Arrow makes them better suited for data analysis or database systems. Parquet and Arrow have also grown outside of the geo community, so there was a need to add geo capabilities on top of them, and this is really what GeoParquet is: an extension on top of Parquet to define layer metadata like the CRS and the geometry type, as well as how geometries are actually encoded in
the file; in the case of GeoParquet they are encoded as WKB. So what about Arrow and GeoArrow? I'm not going to go into the details of the differences between Arrow and Parquet. They belong to the same ecosystem, but basically Parquet is more for long-term preservation and efficient compressed storage, whereas Arrow is more about in-memory processing and being able to transfer data between processes on the same node, although it can occasionally be serialized to disk too.
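As a hedged sketch of converting an existing vector dataset to GeoParquet with the Python bindings (file names are placeholders, and the Parquet driver is only present if GDAL was built against the Apache Arrow/Parquet libraries):

    from osgeo import gdal

    gdal.UseExceptions()

    # Convert any OGR-readable source (here a hypothetical GeoPackage) to GeoParquet
    gdal.VectorTranslate("countries.parquet", "countries.gpkg", format="Parquet")

    # Read it back like any other vector format
    ds = gdal.OpenEx("countries.parquet", gdal.OF_VECTOR)
    layer = ds.GetLayer(0)
    print(layer.GetFeatureCount(), layer.GetSpatialRef().GetName())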
Another notable feature of GDAL 3.5 is that the GeoTIFF driver has been extended with a new codec using JPEG XL compression. The L in JPEG XL stands for long term, and it's the intention of the JPEG XL authors for this format to last as long as the good old JPEG format, if that's ever possible. JPEG XL is a competitor to other modern image formats such as AVIF, HEIF and WebP; there are various online comparisons between those formats for those interested. JPEG XL has both lossless and lossy profiles. To be noted, lossy JPEG XL can be obtained by transcoding an existing JPEG file without any additional loss, and when doing that you will save approximately 20 percent of space. The GDAL driver doesn't implement that feature. JPEG XL can support thousands of channels and bands, and perhaps the most interesting feature compared to JPEG is that it's really good for high bit depth data, so if you want to compress 12- or 16-bit datasets, this is definitely where it will shine. It also has a reference optimized implementation, libjxl, which is the one used by GDAL of course. You should also be aware that this codec is only available if you build GDAL with its internal libtiff copy, because for now it lives only in GDAL and has not been upstreamed into libtiff yet.
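A quick, hedged example of producing such a file from Python (the creation option names below are from the GeoTIFF driver documentation as far as I know, but double-check them against your GDAL build; file names are placeholders):

    from osgeo import gdal

    gdal.UseExceptions()

    # Recompress an existing 16-bit raster as a GeoTIFF using the JXL codec, losslessly
    gdal.Translate(
        "out_jxl.tif",
        "input_16bit.tif",
        creationOptions=["COMPRESS=JXL", "JXL_LOSSLESS=YES"],
    )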
GDAL raster core has been extended to support 64-bit signed and unsigned integer raster values, and this is currently implemented in the GeoTIFF, netCDF and HDF5 drivers.
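A tiny sketch of creating such a raster from Python (names and sizes are arbitrary):

    from osgeo import gdal
    import numpy as np

    gdal.UseExceptions()

    # Create a 64-bit signed integer GeoTIFF (GDT_Int64 is available since GDAL 3.5)
    drv = gdal.GetDriverByName("GTiff")
    ds = drv.Create("int64_example.tif", 256, 256, 1, gdal.GDT_Int64)
    ds.GetRasterBand(1).WriteArray(
        np.arange(256 * 256, dtype=np.int64).reshape(256, 256)
    )
    ds = None  # flush to disk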
A new vector driver has also been contributed for the proprietary SAP HANA database; you also need their closed-source ODBC driver to have it fully working. And we have removed ten or so legacy and unmaintained drivers, to make some room for the new things we have added these past years. Now a quick glance at GDAL 3.6, which is planned for this November.
Slightly related to what I mentioned previously about the GeoParquet driver, we have added a new API, a new virtual method in the OGRLayer class, to be able to read an OGR vector layer as an Arrow in-memory compatible representation. This allows for higher throughput when you want to get data out of an OGR data source, and it also eases interoperability with a number of packages in the R and Python ecosystems, such as GeoPandas, among others.
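A hedged sketch of what this can look like from Python (requires GDAL >= 3.6 and pyarrow; the helper name used here is taken from the Python bindings of that version, so verify it against your install; the file name is a placeholder):

    from osgeo import ogr

    ogr.UseExceptions()

    ds = ogr.Open("countries.parquet")
    layer = ds.GetLayer(0)

    # Pull features in batches as Arrow record batches instead of feature by feature
    stream = layer.GetArrowStreamAsPyArrow()
    for record_batch in stream:
        # Each batch exposes whole columns at once, pyarrow-style
        print(record_batch)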
Another main work item is the enhancement of the OpenFileGDB driver, which is a fully open source driver that reads ESRI File Geodatabases. It now has creation and update capabilities, so you no longer need the proprietary SDK to do that, which will make interoperability with ESRI software much easier.
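A small, hedged example of writing a File Geodatabase purely with open source code (requires GDAL >= 3.6; file names are placeholders):

    from osgeo import gdal

    gdal.UseExceptions()

    # Convert a hypothetical GeoPackage to an ESRI File Geodatabase using the
    # fully open source OpenFileGDB driver, with no ESRI SDK needed
    gdal.VectorTranslate("output.gdb", "input.gpkg", format="OpenFileGDB")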
There's also a new driver for standalone JPEG XL files (what I mentioned previously was JPEG XL embedded in TIFF files), and new drivers for a few GPU texture formats. As I mentioned previously, the GDAL project has been running a sponsorship program for one year now, and it has been very successful in enabling the project to tackle a lot of smaller and bigger tasks that tend to lack contributions or regular funding. It has also benefited other dependencies of GDAL, such as PROJ and, more recently, GEOS. So big thanks to our sponsors for making this possible, and that's it.