
Agile Geo-Analytics: Stream processing of raster- and vector data with dask-geomodeling

Formal Metadata

Title
Agile Geo-Analytics: Stream processing of raster- and vector data with dask-geomodeling
Series Title
Number of Parts
351
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Publication Year
Language
Production Year
2022

Content Metadata

Subject Area
Genre
Abstract
We present _dask-geomodeling_: an open source Python library for stream processing of GIS raster and vector data. The core idea is that data is only processed when required, thereby avoiding unnecessary computations. While setting up a dask-geomodeling computation, there is instant feedback on the result, giving the (geo) data scientist a fast feedback loop. Big datasets can be processed by parallelizing multiple data queries, both on a single machine and on a distributed system.

### Abstract

In geographical information systems (GIS), we often deal with data pipelines that derive map layers from various datasets. For instance, a water depth map is computed by subtracting the digital elevation map (DEM) from a water level map. These procedures are often carried out with open source products such as PostGIS and QGIS. However, for medium to large datasets (> 10 GB) such analyses become costly due to memory restrictions and computational cost. As a rule, these issues are tackled by manually cutting the dataset into smaller parts. This is a tedious and time-consuming task, and if it has to be done regularly it is not feasible.

We present the open source Python library _dask-geomodeling_ [1] to solve this issue. Instead of a script, dask-geomodeling requires a so-called "graph": the definition of all operations that are required to compute the derived dataset. This graph is generated by plain Python code, for instance:

```
plus_one = RasterFileSource('path/to/tiff') + 1
```

Note that these operations are lazy: no actual computation is done, so the line above executes fast. Only when actual data is requested:

```
plus_one.get_data(
    bbox=(155000, 463000, 156000, 464000),
    projection='epsg:28992',
    width=1000,
    height=1000
)
```

is an array containing the data computed. There is no need to load the whole TIFF file into memory if you only use a small part!

The computation occurs in two steps. First, a computational graph is generated containing the required functions. While generating the computational graph, the operations may be chunked into smaller parts. Second, this graph is evaluated by _dask_ [2], using any scheduler (single thread, multithreading, multiprocessing, distributed) that is provided to dask.

This library is open source under the name "dask-geomodeling" and is distributed on GitHub, PyPI, and Anaconda. A hosted cloud version is also available under the name Lizard Geoblocks [3]. Currently, we have implemented a range of operations for rasters, vectors, and combinations of both. The community is welcome to use our library, benefit from it, and expand it!
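The lazy behaviour described above can be sketched in plain Python. The following is a toy illustration of the idea only (the class names and one-dimensional data are invented for this sketch and are not the dask-geomodeling API): composing blocks is cheap, and work happens only for the window that `get_data` actually requests.

```python
# A minimal sketch of lazy block composition (NOT the dask-geomodeling API):
# building the expression reads no data; get_data() computes only the
# requested window and pushes the request down to the source.

class Block:
    """Lazy node in a computation graph."""
    def __add__(self, scalar):
        return Add(self, scalar)

    def get_data(self, start, stop):
        raise NotImplementedError


class Source(Block):
    """Wraps a full dataset but only reads the requested slice."""
    def __init__(self, data):
        self.data = data

    def get_data(self, start, stop):
        return self.data[start:stop]  # only this window is touched


class Add(Block):
    """Lazy elementwise addition of a block and a scalar."""
    def __init__(self, block, scalar):
        self.block, self.scalar = block, scalar

    def get_data(self, start, stop):
        # Computation happens only now, and only for the requested window.
        return [v + self.scalar for v in self.block.get_data(start, stop)]


# Building the graph is instant -- no data is read or computed here:
plus_one = Source(list(range(1_000_000))) + 1

# Work happens only when (and where) data is requested:
print(plus_one.get_data(10, 13))  # -> [11, 12, 13]
```

The same pattern generalizes to two-dimensional bounding-box requests: each operation forwards the requested extent to its sources, so only the relevant tiles of a large raster are ever read.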
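For the second step, dask represents a computation as a plain dict: each key maps either to a literal value or to a task tuple `(function, *arguments)`, where an argument that is itself a key refers to another entry in the graph. The tiny single-threaded evaluator below is a sketch of what a dask scheduler does (it is not dask itself, and the scalar "rasters" stand in for real arrays):

```python
# Minimal evaluator for a dask-style task graph (a sketch, not dask).
from operator import sub

def get(dsk, key):
    """Recursively evaluate `key` in the task graph `dsk`."""
    task = dsk[key]
    if isinstance(task, tuple):  # a task: (func, *args)
        func, *args = task
        # Arguments that are keys are evaluated first; literals pass through.
        return func(*(get(dsk, a) if a in dsk else a for a in args))
    return task  # a literal value

# The water depth example from the abstract, with dummy scalar values:
dsk = {
    'water_level': 5,
    'dem': 2,
    'depth': (sub, 'water_level', 'dem'),  # water_level - dem
}
print(get(dsk, 'depth'))  # -> 3
```

Because the graph is just data, independent tasks can be handed to any of dask's schedulers (threads, processes, or a distributed cluster) and evaluated in parallel.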
Keywords