
State of GeoPandas and friends


Formal Metadata

Title
State of GeoPandas and friends
Number of Parts
351
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year
2022

Content Metadata

Abstract
GeoPandas is one of the core packages in the Python ecosystem to work with geospatial vector data. By combining the power of several open source geo tools (GEOS/Shapely, GDAL/fiona, PROJ/pyproj) and extending the pandas data analysis library to work with geographic objects, it is designed to make working with geospatial data in Python easier. GeoPandas enables you to easily do operations in Python that would otherwise require desktop applications like QGIS or a spatial database such as PostGIS. This talk will give an overview of recent developments in the GeoPandas community, both in the project itself and in the broader ecosystem of packages on which GeoPandas depends or that extend GeoPandas. We will highlight some changes and new features in recent GeoPandas versions, such as the new interactive explore() visualisation method, improvements in joining based on proximity, better IO options for PostGIS and Apache Parquet and Feather files, and others. But some of the important improvements coming to GeoPandas are happening in other packages. The Shapely 2.0 release is nearing completion, and will provide fast vectorized versions of all its geospatial functionalities. This will help to substantially improve the performance of GeoPandas. In the area of reading and writing traditional GIS files using GDAL, the pyogrio package is being developed to provide a speed-up on that front. Another new project is dask-geopandas, which is merging the geospatial capabilities of GeoPandas with the scalability of Dask. This way, we can achieve parallel and distributed geospatial operations.
Transcript: English (auto-generated)
Thank you. So, quickly about me: I'm mostly active in the Python space, as a core developer of pandas and GeoPandas, and I'm currently working at Voltron Data on the Apache Arrow open source project, so that's something you can chat with me about as well. This talk will give an update about GeoPandas, which I subtitled here "easy, fast, and scalable geospatial analysis in Python": it focuses on tabular vector data and on making that more accessible in Python. There is a bit of history here, but given this is not a Python-specific conference, I will first give a very brief intro to GeoPandas as well. It is mostly about extending existing libraries: there is the pandas library to work with tabular data, and GeoPandas extends that to work with geospatial data,
building on top of the open source geospatial tools that people here are familiar with. So what does it look like? In Python, interactively or in a script, you can for example read some files, and your data then has a geometry column, in this case polygon data. You can use the existing pandas functionality, so if you want to work with the attribute data, filter on a column or calculate something on a column, that is still available. But in addition to pandas, it gives you access to a lot of geospatial operations, for example calculating a distance here, in a very wrong way in this case, because I have latitude/longitude data. That is a limitation of GeoPandas to be aware of: by being based on GEOS, it assumes a Cartesian plane, so you need to take care of re-projecting yourself if needed. We also implement things like a spatial join; here I added, for each of the bike stations in the example, the district it is located in. A lot of those features are built on other libraries. As a summary, you can use it both interactively and in scripts, and very much like PostGIS extends PostgreSQL, GeoPandas extends pandas.
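A minimal sketch of that workflow, under my own illustrative assumptions (the file names, column names, and the EPSG code below are hypothetical, not from the talk):

```python
import geopandas as gpd
from shapely.geometry import Point

# Read vector files into GeoDataFrames (any format GDAL can read)
districts = gpd.read_file("districts.gpkg")      # polygon geometries
stations = gpd.read_file("bike_stations.gpkg")   # point geometries

# Regular pandas functionality still works on the attribute columns
busy = stations[stations["daily_trips"] > 1000]

# Geospatial operations are available on the geometry column;
# re-project to a projected CRS first so distances are in metres
stations_m = stations.to_crs(epsg=31370)
stations_m["dist_to_centre"] = stations_m.distance(Point(150000, 170000))

# Spatial join: add the district each station falls in
stations_with_district = gpd.sjoin(stations, districts, predicate="within")
```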
I have a few highlights of new features in GeoPandas itself, although I will be very quick here because the more interesting part of the presentation comes afterwards. One nice thing, small but convenient, is that there is a new explore() method. I was told I would have a mouse to scroll so that I could actually demo it; apparently that is not the case, so I can't show that it is actually interactive. It is something new, based on folium and Leaflet, in addition to the traditional matplotlib-based static figures that GeoPandas already provided.
That's another example. We already had the spatial join; since the last release there is also an addition to that family of functions to join on the nearest geometry, so not an exact predicate but the nearest match, and it also gives you the ability to restrict the search to a maximum distance.
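A small sketch of that nearest join (the file names and distance threshold are illustrative, not from the talk):

```python
import geopandas as gpd

stations = gpd.read_file("bike_stations.gpkg").to_crs(epsg=31370)
shops = gpd.read_file("coffee_shops.gpkg").to_crs(epsg=31370)

# Join each station to its nearest coffee shop, but only within 500 m,
# and store the computed distance in a new column
joined = gpd.sjoin_nearest(
    stations, shops, max_distance=500, distance_col="distance_m"
)
```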
There is also better integration with PostGIS, and many more improvements in the last releases. But what I want to focus on for the rest of the talk is to explain some new developments at the level below GeoPandas. As I mentioned before, GeoPandas is really built upon other libraries: under the hood we have GEOS for the spatial operations, using the Shapely bindings to it; GDAL for reading and writing files, through the Fiona bindings; and PROJ for re-projections, through pyproj. There has been quite some active development to improve some of those bindings, which of course also directly benefits users of GeoPandas. It is mostly focused on improving the performance of those bindings, and I'm going to start with the Shapely bindings to GEOS, which is about Shapely 2.0, coming shortly.
So, taking one step back, what exactly is Shapely? It's a Python binding to the GEOS C++ library. GEOS is also used by PostGIS, by R, and by many other tools that provide geospatial vector operations, and Shapely provides a nice Python interface to work with geometry objects: points, lines, polygons, and all kinds of operations on them. But Shapely itself is limited to single objects, and it has no direct way to attach attribute information to them. That is how GeoPandas extends the functionality of Shapely: by putting those Shapely objects in a column of your data frame. At the moment GeoPandas does that by having a column where we literally store Shapely objects. So if you do a certain operation, for example calculating the distance of every geometry in your column to some other geometry, we iterate through those objects in Python. In pseudo-code it looks roughly like the snippet below: in a Python loop we go through the geometries, we call Shapely, and Shapely calls GEOS.
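A simplified sketch of that pattern (not the actual GeoPandas internals, just the shape of what happens):

```python
# Before Shapely 2.0: loop over individual shapely objects in Python
def distance_to(geometries, other):
    result = []
    for geom in geometries:                  # Python-level loop -> lots of overhead
        result.append(geom.distance(other))  # each call goes Shapely -> GEOS
    return result
```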
And this gives a performance bottleneck in GeoPandas: doing this loop in Python adds quite some Python overhead. So if you want to improve the performance here, you need to push the loop one level lower, to where we are calling GEOS, and that is in Shapely. That is what we have been doing over the last years. It started in a separate package, PyGEOS, created by Casper, who is somewhere in the room here, over there in the back. He created PyGEOS as a new way to work with those geometry objects based on arrays: assuming you have arrays of data, not just a single object, it provides vectorized functions that do this much faster. In the meantime, PyGEOS has been fully merged into Shapely to bring those improvements to Shapely users as well. Shapely has been refactored to use a C extension instead of ctypes to make it more robust, and it now has all the new functionality from PyGEOS while keeping the familiar interface for people who are already using Shapely.
To give you a small idea of what it looks like: as I mentioned before, at the moment you need to write a manual for loop yourself if you have lots of points, lots of geometries. In Shapely 2.0 there is, in this example, a contains function that does the array-based operation for you, and very similarly to NumPy functions, it broadcasts: if you have arrays of different shapes, it follows the NumPy broadcasting rules. So this is more convenient when you have arrays of data, and because the loop is no longer done in Python but in C, it is also much faster. A very simple benchmark: one million random points and a single polygon, checking for all the points whether they are contained in the polygon, shows a very nice speedup; similarly for distance, again a nice speedup. The exact speedup depends very much on the operation you're doing: if it's a very cheap operation in GEOS, the Python overhead is relatively larger, so you gain more.
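A minimal sketch of that vectorized API in Shapely 2.0 (the geometries here are made up; the benchmark numbers mentioned above come from the talk's slides, not from this snippet):

```python
import numpy as np
import shapely
from shapely.geometry import Polygon

# One million random points as an array of Shapely 2.0 geometries
coords = np.random.uniform(0, 1, size=(1_000_000, 2))
points = shapely.points(coords)

polygon = Polygon([(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)])

# Vectorized operations: the loop happens in C, not in Python
contained = shapely.contains(polygon, points)   # boolean array
distances = shapely.distance(points, polygon)   # float array
```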
Another example, more integrated into an actual application, is the spatial join in GeoPandas, which under the hood uses the STRtree from GEOS. There we also improved performance by doing queries in bulk against the STRtree, and you can see that the spatial join improves a lot as well.
There is a lot more as well: bindings for additional GEOS functions, the STRtree improvements I already mentioned, support for the fixed-precision model in overlay functions that GEOS has had for a few releases, and releasing the GIL so you can do multi-threading (we don't provide that ourselves, but it is possible now). The full release notes are available in the documentation for the upcoming release. There are a few API changes: geometry objects are now immutable, and we removed some features that made it more difficult to work with arrays of geometries. For example, geometries are no longer iterable, which means that if you have a multi-polygon and you want to loop through every polygon it is made of, you can no longer iterate over the multi-polygon itself; you need to iterate over its .geoms attribute instead. That attribute already existed before, so you can write code that is perfectly compatible with both the old and the new way, as sketched below. We also have a migration guide with all the details on how to update your code to be compatible with 2.0.
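A short sketch of that change for iterating over multi-part geometries (the geometries are made up):

```python
from shapely.geometry import MultiPolygon, Polygon

multi = MultiPolygon([
    Polygon([(0, 0), (1, 0), (1, 1)]),
    Polygon([(2, 2), (3, 2), (3, 3)]),
])

# Shapely 1.x allowed `for poly in multi:`; in 2.0 you iterate over .geoms,
# which also works in 1.8, so the same code runs on both versions
for poly in multi.geoms:
    print(poly.area)
```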
As for the status: it's fully integrated, we had the first alpha release of Shapely 2.0 in July, and we are planning a beta release in the near future. So how can you help? Try it out and test it with your code: install Shapely 1.8, fix all the deprecation warnings, then install 2.0, see if it still works, and give us feedback, so we can make Shapely 2.0 a reality in the coming months. So that's it for Shapely. Moving on to the next one, GDAL,
for reading and writing vector file formats. What GeoPandas currently uses in its read_file and to_file functions are the Fiona bindings to GDAL. But very similarly to Shapely, where we had an interface for single objects and needed to loop in Python, the same actually happens in Fiona: Fiona provides an interface that fetches record by record, and again that gives a lot of overhead, which is something we can improve. In this case it's not done in Fiona itself; instead, with some GeoPandas developers we created a new package, pyogrio, to also vectorize the IO on top of GDAL. It is very much focused on what GeoPandas needs: GeoPandas just wants to read a full layer as a table at once, or write one. By focusing on that aspect it can be faster than Fiona, but it is also much less general purpose, so it's certainly not meant as a full replacement of Fiona. We also have wheels for Windows, which makes installation a bit easier; Fiona actually has those now as well, but hopefully this will also improve the installation situation a bit. You can either use pyogrio directly, or you can use GeoPandas' read_file function.
(The slide should have said geopandas.read_file, sorry about that.) You can specify the engine pyogrio instead of Fiona, and the goal is to make that the default in the future because it gives some nice speedups. Here is an example of reading a large GeoPackage file with polygons, building outlines: it takes more than three minutes to read with GeoPandas using Fiona, and goes down to 25 seconds using pyogrio, while on the other side it also uses less memory while doing so. The exact speedup will depend very much on the characteristics of your file and the file format, but we generally see very nice speedups with pyogrio.
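A small sketch of both ways to use it (the file name is just an example):

```python
import geopandas as gpd
import pyogrio

# Use pyogrio directly: read a full layer into a GeoDataFrame in one call
buildings = pyogrio.read_dataframe("buildings.gpkg")

# Or keep using GeoPandas and just switch the IO engine
buildings = gpd.read_file("buildings.gpkg", engine="pyogrio")
```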
Again, this is a very new package, so you're very welcome to install it, test it with your workflows, and give us feedback if something is not yet working as it should. Taking one step back on how this works with the Python bindings and GDAL: GDAL reads data from (or writes data to, but let's focus on reading) all kinds of file formats, and then the Python bindings, and this is true for both Fiona and pyogrio, get the features row by row using GDAL's GetNextFeature API. Even though pyogrio can optimize this and call that function from compiled code rather than pure Python, it is still overhead to have to go feature by feature. So how this data is moved from GDAL, once it has been read, to bindings such as the Python ones is something we can improve,
and that's where Apache Arrow comes into the picture. Apache Arrow provides a standardized in-memory format for tabular data, with specifications around it. There was a proposal, already implemented and accepted, coming in GDAL 3.6, by Even Rouault, the main developer of GDAL, to have a column-oriented API: given a certain layer of a dataset and a certain selection, it gives you the full data for all columns and all rows at once, using the Arrow specification. Going back to our example from before with this same file, you can see that, on top of the improvement pyogrio already gave, this gives another two-times improvement for this specific file and for GeoPackage; again, it depends very much on the format you're reading. This is not yet something you can use directly; these numbers are with GDAL master, but it is coming in the future as well.
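Once that lands, the Arrow path is exposed through pyogrio; a hedged sketch of how that is expected to look (it assumes a GDAL build with the Arrow stream API, i.e. GDAL 3.6 or later, and pyarrow installed):

```python
import pyogrio

# Read the layer through GDAL's columnar Arrow stream instead of
# fetching feature by feature
buildings = pyogrio.read_dataframe("buildings.gpkg", use_arrow=True)
```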
What I showed here and on the previous slides used GeoPackage. We can improve the performance of reading the files that GDAL can read, but another way to improve your IO performance is to look at another file format, and that's where I quickly want to mention GeoParquet. I have too many slides about GeoParquet, so I will try to be very brief. Apache Parquet is an existing open source file format, available to read in many languages. It is column-oriented, which is an important aspect: it is very much made for tabular data and for efficient reads of tabular data, and it is a widely used format. What we want to do with GeoParquet is not create a new file format, but basically standardize how the geospatial community can use Parquet to store geospatial data: specifically, which data type to use (currently well-known binary, but we can improve on that in the future)
and how to store some metadata about the coordinate reference system, the geometry type, et cetera. You can already read and write it in Python, and comparing again with the same buildings file, you can see that Parquet is still a lot faster than the others. Just a note: this was tested with pyogrio without the Arrow integration, so we could make that path two times faster, but even then Parquet is still a bit faster, especially when reading only a single column, which you don't really see here. Because it is columnar, it is very efficient to read only a specific column. It is fast, and at the same time it gives you very small files, so it is very well optimized for compression while still being fast to read. There is some history about it; it's still new and evolving, so again, feedback is very welcome.
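A minimal sketch of the Parquet IO in GeoPandas (the file and column names are illustrative):

```python
import geopandas as gpd

buildings = gpd.read_file("buildings.gpkg")

# Write the GeoDataFrame to (Geo)Parquet; geometries are stored as WKB
buildings.to_parquet("buildings.parquet")

# Reading back is fast, and because the format is columnar
# you can efficiently read only a subset of the columns
subset = gpd.read_parquet("buildings.parquet", columns=["geometry", "height"])
```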
I want to mention one last part very quickly, because I don't have much time. Everything I've told you about pandas and GeoPandas works nicely, and we can improve the performance, but only as long as your data fits in memory, and it also uses only a single core. That is something that can be improved, and that's what dask-geopandas is trying to do. Dask is a project that provides parallelism: very briefly, it creates a task graph under the hood for the tasks that need to be done, and then a scheduler executes it, either on a single machine using all your cores, or on a distributed cluster with multiple nodes. For data frames, the way it works under the hood is that you have a chunked data frame where each chunk is a pandas DataFrame, or in our case a GeoPandas GeoDataFrame. That way it can build up a task graph where each task says: for this chunk, do that computation; for the other chunk, do that computation; and those tasks can be spread across cores or across the nodes of a distributed cluster. dask-geopandas provides this integration: very much like the Dask DataFrame already exists and extends pandas, dask-geopandas does the same for GeoPandas, so you get a very familiar, GeoPandas-like interface. There is already support for parallelizing IO, all the basic element-wise geospatial operations, and the spatial join can run in parallel or distributed as well. There is also some basic support for spatial partitioning; given the time, I'm going to skip the details of that.
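A small sketch of that familiar interface (the file name, partition count, and column are illustrative assumptions):

```python
import geopandas as gpd
import dask_geopandas

buildings = gpd.read_file("buildings.gpkg")

# Split the GeoDataFrame into partitions that Dask can process in parallel
dbuildings = dask_geopandas.from_geopandas(buildings, npartitions=8)

# Element-wise geospatial operations look just like GeoPandas, but are
# only executed (in parallel) when you call .compute()
areas = dbuildings.geometry.area.compute()
```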
It's a very young project, a bit of an alpha project, but it's certainly ready to be tried out. There are still things we need to improve, like having full coverage of what GeoPandas can do; for example, we don't yet have an overlay function, and we could make better use of the spatial partitioning information. That is certainly something to improve, but it's ready to be used for the basic use cases, and there are links to the documentation and a tutorial. Well, so that's it. I want to end by saying that a lot of what I told you is not just things that I did; it comes from the whole community around these projects, with many contributions from other people, so thanks a lot to them. Thank you very much.
We have a few questions.