Geospatial and Apache Arrow: accelerating geospatial data exchange and compute

Cite

Related Material

FOSS4G

Open Source Geospatial Foundation (OSGeo)

Bossche, Joris van den Dunnington, Dewey

Formal Metadata

Title

Geospatial and Apache Arrow: accelerating geospatial data exchange and compute

Title of Series

FOSS4G Firenze 2022

Number of Parts

351

Author

Bossche, Joris van den

Dunnington, Dewey

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/69118 (DOI)

Publisher

FOSS4G

Open Source Geospatial Foundation (OSGeo)

Release Date

2024

Language

English

Production Year

2022

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

The Apache Arrow (arrow.apache.org/) project specifies a standardized language-independent columnar memory format. It enables shared computational libraries, zero-copy shared memory, streaming messaging and interprocess communication without serialization overhead, etc. Nowadays, Apache Arrow is supported by many programming languages. Geospatial data often comes in tabular format, with one (or multiple) column with feature geometries and additional columns with feature attributes. This is a perfect match for Apache Arrow. Defining a standard and efficient way to store geospatial data in the Arrow memory layout (github.com/geopandas/geo-arrow-spec/) can help interoperability between different tools and enables us to tap into the full Apache Arrow ecosystem: - Efficient, columnar data formats. Apache Arrow contains an implementation of the Apache Parquet file format, and thus gives us access to GeoParquet (github.com/opengeospatial/geoparquet) and functionalities to interact with this format in partitioned and/or cloud datasets. - The Apache Arrow project includes several mechanisms for fast data exchange (the IPC message format and Arrow Flight for transferring data between processes and machines; the C Data Interface for zero-copy sharing of data between independent runtimes running in the same process). Those mechanisms can make it easier to efficiently share data between GIS tools such as GDAL and QGIS and bindings in Python, R, Rust, with web-based applications, etc. - Several projects in the Apache Arrow community are working on high-performance query engines for computing on in-memory and bigger-than-memory data. Being able to store geospatial data in Arrow will make it possible to extend those engines with spatial queries.

Keywords

foss4g2022

generaltrack

Stateofsoftware