Geospatial and Apache Arrow: accelerating geospatial data exchange and compute
Formal Metadata
Number of Parts | 351
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/69118 (DOI)
Production Year | 2022
Transcript: English (auto-generated)
00:02
Okay, so we have a somewhat impossible task: to explain why Apache Arrow is relevant for geospatial in four minutes. So let's start with what Apache Arrow is. At its core, it's a specification of how to represent tabular data in memory.
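To make "tabular data in memory" a bit more concrete, here is a minimal sketch of the difference between a row-oriented and a column-oriented layout. It uses only Python's standard library and is not Arrow itself: the table contents and column names are made up for illustration, and real Arrow arrays carry more structure (for example validity bitmaps for missing values).

```python
from array import array

# Row-oriented: one Python object per record (hypothetical example data).
rows = [
    {"id": 1, "population": 120_000, "area_km2": 45.5},
    {"id": 2, "population": 85_000, "area_km2": 30.25},
    {"id": 3, "population": 230_000, "area_km2": 112.0},
]

# Column-oriented: one contiguous, typed buffer per column.
# This is the general idea behind a columnar in-memory format.
columns = {
    "id": array("q", (r["id"] for r in rows)),                  # signed 64-bit integers
    "population": array("q", (r["population"] for r in rows)),  # signed 64-bit integers
    "area_km2": array("d", (r["area_km2"] for r in rows)),      # 64-bit floats
}

# An operation over a whole column only touches that column's buffer.
total_population = sum(columns["population"])

# Each column exposes its raw bytes through the buffer protocol without copying.
area_bytes = memoryview(columns["area_km2"])
```

Because every column is a plain typed buffer with a known layout, another tool that agrees on the same layout could read it directly instead of converting record by record.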
00:22
But on top of that, you have some other specifications, and a wide variety of languages implementing this specification with additional tools, so you get a multi-language toolbox to move data around in this format and to process data in this format. So for example, if you want to move data from R to Python
00:41
or from any language to another language, or load data from storage, you potentially have a lot of possible interactions. But if those languages can all speak Arrow, we can easily move data from one data system to another, or share an implementation of a Parquet reader
01:00
or something like that. So this is actually already reality right now for many tools. Just to be clear, it's not only about moving data around, but also about actual in-memory processing of this data. Apache Parquet is something else that is often mentioned at the same time. But just to be clear, Apache Parquet
01:22
is a file format that already existed before Apache Arrow, actually. But it's also focused on columnar data. And it has complex compression techniques, and it's a widely used file format. The reason that it's often mentioned at the same time
01:41
is that many of the Arrow implementations also have a Parquet reader, given that the two are very much aligned. And so historically, the reason that many people were using PyArrow was to read Parquet files. So the things that Apache Arrow wants to enable, things like better data access,
02:01
easier ways to move data around, efficient in-memory data structures, sharing implementations, those are all things that are relevant for geospatial as well. So for example, GDAL can read a variety of file formats, but when GDAL is reading something,
02:20
it still needs to be moved into the actual library that is using GDAL, such as a Python library, QGIS or PostGIS. And there is a new upcoming feature in GDAL: GDAL can export a layer in the Arrow memory format.
02:42
And so that means that at once, any data system that already understands Arrow gets relatively cheap access to everything that GDAL supports. For example, if you have a Rust data frame library that is built on top of Arrow, it can directly get access to data
03:01
supported by GDAL as well. This doesn't necessarily need to be limited to GDAL. We can imagine a potential future where Arrow is used more widely to share data around. So we have Apache Parquet, a widely used file format. We have Apache Arrow, which is becoming the de facto standard
03:21
to represent data in memory. There is an increasingly big ecosystem of tools that nowadays support Apache Arrow. And so the opportunity here for geospatial is that we can, by adopting those formats and standards,
03:40
get access to this ecosystem. Do we need something more? Maybe a little bit. And that's what we did with GeoParquet and GeoArrow. So GeoParquet is not a new file format. It's just agreeing on how we store geospatial data in Parquet. GeoArrow is also not a new memory format. It's just agreeing on how we are going to store
04:01
vector data in Arrow. And so for example, yeah. Here there are some links, if you want to know more, to the slides, to the long version, and to some interesting blog posts. Thank you, Joyce. I will have to catch you up. Thank you very much, Craig, please.
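As a rough illustration of what "agreeing on how to store geometry" can mean in practice, the sketch below shows two common options using only Python's standard library. The first encodes a point as well-known binary (WKB), a binary geometry encoding that GeoParquet can use for geometry columns. The second stores linestrings as flat coordinate buffers plus an offsets array, which is a simplification in the spirit of a columnar geometry layout and not the GeoArrow specification itself. All function and variable names here are hypothetical.

```python
import struct
from array import array


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # Byte order flag 1 = little-endian, geometry type 1 = Point,
    # followed by x and y as 64-bit floats.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(buf):
    """Decode a little-endian WKB 2D point back into (x, y)."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian 2D points are handled in this sketch")
    return (x, y)


# Columnar storage of linestrings (hypothetical data): all coordinates live in
# two flat buffers, and offsets[i]:offsets[i + 1] marks the vertices of line i.
lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)],
]
xs = array("d")
ys = array("d")
offsets = array("i", [0])
for line in lines:
    for x, y in line:
        xs.append(x)
        ys.append(y)
    offsets.append(len(xs))


def get_line(i):
    """Rebuild the i-th linestring from the flat buffers."""
    start, end = offsets[i], offsets[i + 1]
    return list(zip(xs[start:end], ys[start:end]))
```

With the second layout, the coordinates of a whole column of geometries sit in contiguous buffers, so a consumer can work on them without parsing each geometry individually.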