
Geospatial and Apache Arrow: accelerating geospatial data exchange and compute


Formal Metadata

Title
Geospatial and Apache Arrow: accelerating geospatial data exchange and compute
Title of Series
Number of Parts
351
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2022

Content Metadata

Subject Area
Genre
Abstract
The Apache Arrow (arrow.apache.org/) project specifies a standardized language-independent columnar memory format. It enables shared computational libraries, zero-copy shared memory, streaming messaging and interprocess communication without serialization overhead, etc. Nowadays, Apache Arrow is supported by many programming languages. Geospatial data often comes in tabular format, with one (or multiple) columns with feature geometries and additional columns with feature attributes. This is a perfect match for Apache Arrow. Defining a standard and efficient way to store geospatial data in the Arrow memory layout (github.com/geopandas/geo-arrow-spec/) can help interoperability between different tools and enables us to tap into the full Apache Arrow ecosystem:
- Efficient, columnar data formats. Apache Arrow contains an implementation of the Apache Parquet file format, and thus gives us access to GeoParquet (github.com/opengeospatial/geoparquet) and functionalities to interact with this format in partitioned and/or cloud datasets.
- The Apache Arrow project includes several mechanisms for fast data exchange (the IPC message format and Arrow Flight for transferring data between processes and machines; the C Data Interface for zero-copy sharing of data between independent runtimes running in the same process). Those mechanisms can make it easier to efficiently share data between GIS tools such as GDAL and QGIS and bindings in Python, R, Rust, with web-based applications, etc.
- Several projects in the Apache Arrow community are working on high-performance query engines for computing on in-memory and bigger-than-memory data. Being able to store geospatial data in Arrow will make it possible to extend those engines with spatial queries.
Transcript: English (auto-generated)
Okay, so we have a somewhat impossible task: to explain why Apache Arrow is relevant for geospatial in four minutes. So let's start with what Apache Arrow is. At its core, it's a specification for how to represent tabular data in memory.
But on top of that, there are other specifications, and a wide variety of languages implementing this specification along with additional tools, so you get a multi-language toolbox to move data around in this format and to process data in this format. For example, if you want to move data from R to Python, or from any language to another, or load data from storage, you potentially have a lot of possible pairwise interactions. But if those languages can all speak Arrow, we can easily move data from one data system to another, or share an implementation of a Parquet reader, for example. And this is actually already a reality right now for many tools. Just to be clear, it's not only about moving data around; it's also about actual in-memory processing on this data.

Apache Parquet is something else that is often mentioned at the same time. To be clear, Apache Parquet is a file format that actually already existed before Apache Arrow. It is also focused on columnar data, and it has sophisticated compression techniques; it's a widely used file format. The reason it's often mentioned alongside Arrow is that many of the Arrow implementations also include a Parquet reader, given how closely the two are aligned. Historically, the main reason many people were using PyArrow was to read Parquet files.

So the things that Apache Arrow wants to enable, like better data access, easier data movement, efficient in-memory data structures, and shared implementations, are all relevant for geospatial as well. For example, GDAL can read a variety of file formats, but when GDAL reads something, the data still needs to be moved into the actual library that is using GDAL: a Python library, QGIS, PostGIS. And there is a new upcoming feature whereby GDAL can export a layer in the Arrow memory format.
And so that means that, at once, any data system that already understands Arrow gets relatively cheap access to everything that GDAL supports. For example, if you have a Rust data frame library that is built on top of Arrow, it can directly access data supported by GDAL as well. And this doesn't necessarily need to be limited to GDAL; we can imagine a future where Arrow is used more widely to share data around.

So we have Apache Parquet, a widely used file format, and we have Apache Arrow, which is becoming the de facto standard for representing data in memory, with an increasingly big ecosystem of tools that nowadays support it. The opportunity here for geospatial is that, by adopting those formats and standards, we get access to this whole ecosystem.

Do we need something more? Maybe a little bit, and that's what we did with GeoParquet and GeoArrow. GeoParquet is not a new file format; it's just an agreement on how to store geospatial data in Parquet. GeoArrow is likewise not a new memory format; it's just an agreement on how to store vector data in Arrow memory. There are some links on the slides if you want to know more: the slides themselves, the long version of this talk, and some interesting blog posts. Thank you.