GeoPandas 1.0 and beyond
Formal Metadata
Number of Parts | 131
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/69450 (DOI)
EuroPython 2024, 84 / 131
Transcript: English (auto-generated)
00:04
Good afternoon. I'm Martin. I'm actually based here at Charles University, which is down there under the hill. It wasn't a long trip for me. I'm one of the maintainers of the GeoPandas project. Before I start, can I ask quickly who knows GeoPandas at least a little bit?
00:25
Okay, most of you. That's good, that's good. I can be brief in my introductions. Thank you. Okay. I'm going to talk about GeoPandas 1.0, which is a release we have finished after a long, long time.
00:46
I'm going to talk about what has changed and what's coming after that. The title of the talk is GeoPandas 1.0 and Beyond. The whole story starts quite a while ago at a conference very similar to this one at SciPy 2013, which was in Texas, I believe.
01:09
And it started during a birds-of-a-feather session. People who were dealing with spatial data just sat together and were figuring out what to do next.
01:25
And someone asked a question at the time. Can we somehow link geometries to Pandas objects? Would it kind of work? And during the evening and the night and the day after, they actually hacked the very first version of GeoPandas.
01:48
And a year later at SciPy 2014, Kelsey Jordahl, who is the main father of the GeoPandas project, actually presented GeoPandas 0.1, the very first release, back at SciPy.
02:04
It took them a whole year to actually finish the work, to make it into something which is usable. And it's very interesting to see this presentation, which is roughly 10 years old, because the way Kelsey talks about GeoPandas and how it aims to
02:24
simplify working with spatial data within the Pandas ecosystem and all these things, I could essentially use exactly the same words today, which is really nice to see that even with Kelsey's departure from the project a few years after SciPy 2014,
02:42
we still kind of managed to do exactly what he envisaged in the very beginning. I will do a brief introduction about what GeoPandas is. As you can probably guess from the name, it has something to do with Pandas. GeoPandas essentially provides subclasses of Pandas objects, turning a Pandas Series into a GeoPandas GeoSeries,
03:06
and Pandas DataFrame into a GeoPandas GeoDataFrame. The reason why we have subclasses and not extensions of any other sort, accessors, is essentially the age of the project.
03:22
Back then, it was actually even complicated to subclass Pandas objects. There was a lot of coordination between developers on the Pandas side to ensure that we can actually do this. And since then, it stayed as it is.
03:40
But what does it mean, GeoSeries and GeoDataFrame? GeoSeries is simple. It's a series which contains geometries. It has a geometry dtype, and all the geometric methods we define in GeoPandas are tied to a GeoSeries. GeoDataFrame is a Pandas DataFrame which contains at least one GeoSeries,
04:03
of which one is usually marked as active GeoSeries. So when you apply a geometric operation on the whole DataFrame, it's applied to the active geometry column. If you are familiar with GIS, it's kind of very similar logic, but GeoPandas has a bit more flexibility.
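The idea of an active geometry column can be sketched as follows. This is a minimal illustration rather than anything shown in the talk: the data and the column name `buffered` are made up, and no CRS is set, so all values are in plain coordinate units.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical data: three named points (no CRS, plain coordinate units).
gdf = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(0, 0), Point(10, 0), Point(0, 10)],
)

# Add a second geometry column holding buffers around the points.
gdf["buffered"] = gdf.geometry.buffer(5)

# The original column is still the active one, so frame-level
# operations such as .area use the points (which have zero area).
active_before = gdf.geometry.name
point_areas = gdf.area

# Switch the active geometry; the same operation now uses the buffers.
gdf_buf = gdf.set_geometry("buffered")
active_after = gdf_buf.geometry.name
buffer_areas = gdf_buf.area

print(active_before, point_areas.tolist())
print(active_after, buffer_areas.round(1).tolist())
```

Keeping several geometry columns and switching between them with `set_geometry` is the extra flexibility compared to a typical GIS layer, which carries a single geometry per feature.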
04:24
In practice, it looks like this. If you open geospatial data, which is a tabular dataset with usually one geometric column, GeoPandas loads it in this way. It looks exactly like a Pandas DataFrame would.
04:41
It just has that very specific column. And GeoPandas understands what the column means and understands how to work with it. Which means that we can, for example, measure areas of polygons. We can get the geometric centroids. We can measure distances. We can do a lot of things related to geometric objects.
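A small sketch of these measurements, using two made-up squares instead of a real dataset (the names and coordinates are assumptions for illustration only):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical data: a 1 x 1 square and a 2 x 2 square.
squares = gpd.GeoDataFrame(
    {"name": ["small", "large"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),
    ],
)

# Each call works on the whole active geometry column at once.
areas = squares.area                      # area of each polygon
centroids = squares.centroid              # GeoSeries of centroid points
dist = squares.distance(Point(0, 0))      # distance of each polygon to the origin

print(areas.tolist())
print(centroids.tolist())
print(dist.tolist())
```

The results come back as ordinary pandas objects (or a GeoSeries for `centroid`), so they can be assigned to new columns like any other pandas result.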
05:05
We can plot them on the map, be it static with matplotlib, or interactive through folium, which wraps the JavaScript library called Leaflet. And all this support is built into GeoPandas. It has gradually appeared over the years, and it took a while.
05:25
So as I mentioned, we started at SciPy 2013. I wasn't part of the team back then. Actually, no one who is part of the team right now was part of the team back then. It kind of has switched completely. But we started with a workshop, and then the first release in 2014.
05:45
In 2016, Kelsey Jordahl, I think, was changing jobs. And Joris Van den Bossche became the lead maintainer of the project. He is still one of the maintainers of the project today.
06:01
And he's one of the key persons within the geospatial ecosystem, covering a lot of different things. And he's also one of the maintainers of Pandas. In 2017, there were discussions how to make this faster.
06:21
At the time, all geometries were Shapely objects, and each geometry was its own Shapely object. And if you wanted to measure the area of all of them, you had to essentially do a for loop. We did it internally, but it was a for loop, calling the underlying C++ library called GEOS one geometry at a time.
06:43
And it wasn't efficient. So a couple of people started figuring out whether we can rewrite this in Cython, whether we can build a vectorized interface. And that essentially led to this tag in the GeoPandas history,
07:02
which is seven years ago, we actually tagged a 1.0.dev version, and it took us a long time to actually get rid of the dev part of the tag. But the reason for this was to allow people to play with this Cython branch,
07:24
but in the end, it kind of all turned out completely differently, and the whole work on vectorization was kind of split apart from GeoPandas, and over the years, it kind of materialized in a package called PyGEOS,
07:43
which was then merged back into Shapely. It was kind of a long story, but right now we have a new version of Shapely, Shapely 2.0, which essentially contains a vectorized interface to GEOS, and there is no longer a need for a Python for loop if we want to get the area of all geometries, for example.
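The difference between the two styles can be sketched with Shapely 2.0 directly. The input here is a made-up array of buffered points, and the loop is only an approximation of what the older internals did, not the actual historical code:

```python
import numpy as np
import shapely

# Hypothetical input: 1,000 points buffered into (approximate) circles.
points = shapely.points(np.arange(1000.0), np.zeros(1000))
polygons = shapely.buffer(points, 1.0)

# Older style: a Python loop that calls into GEOS one geometry at a time.
areas_loop = np.array([geom.area for geom in polygons])

# Shapely 2.0 style: a single vectorized call over the whole array.
areas_vectorized = shapely.area(polygons)

print(np.allclose(areas_loop, areas_vectorized))
```

Both variants give the same numbers; the vectorized call simply moves the loop out of Python, which is where the speedup comes from.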
08:05
In 2019, we kind of refactored the way we dealt with geometries into a pandas extension array, because that became possible at the time, and over the years, we kind of added a lot of minor or major changes.
08:23
A big thing for us was the year 2020, when the project got affiliated with NumFOCUS, and we were able to ask for at least some small funding, and we are able to receive donations right now and use them in some way to support the project, but that doesn't change the fact that the whole project is still maintained by volunteers,
08:47
and some of us are able to work on it as part of their jobs, kind of, if we decide that it's a good thing to do, and we don't ask anyone. Version 0.8 included PyGEOS, which is the vectorized engine,
09:05
the original vectorized engine for geometries, and it allowed significant speedups. It was still optional. We had to support different engines. It was a bit complicated internally. A few years ago, we included interactive visualization of data,
09:24
and then we included another library, which is another interface to another C++ library called GDAL, which takes care of reading and writing specialized geospatial file formats. Again, we tried to ensure that we can do this in a vectorized way,
09:41
and pyogrio became supported two years ago, and right now, it's going to be fully embraced within GeoPandas. Last year, we reached the sponsored project level at NumFOCUS, which gives us the potential to actually raise money.
10:05
We have never managed to since we started looking into that, but we have the infrastructure, so if you want to donate something to GeoPandas, you can. Right now, we have the 1.0 release. I think it came out two weeks ago.
10:24
It took a long time, and some stuff has changed. It's a major release. It's the first major release. As happens with major releases, there are some breaking changes. We tried to minimize them, so there are not a lot of those, but some have happened.
10:42
To mention a few, when you were doing a spatial join between two geometries, we never preserved the index name of the one on the right, the one you're joining to your data frame.
11:00
Right now, we're doing that, which unintentionally breaks some downstream code, but there wasn't a good way of going around it. We've tried to ensure that GeoPandas has an easier dependency tree, so some packages which can be more challenging to install or build,
11:23
for example, on Windows, are no longer required, and we are allowing the use of GeoPandas even outside of the geospatial context, with some spatial operations. We know that people are using it for analysis of some microscopic data,
11:41
where the geo context doesn't really matter, so they don't need the geo libraries. We now require Shapely 2.0, and we have switched the default engine for reading and writing files from Fiona, which was there since 2013, to pyogrio, which is much faster.
12:03
It's fully maintained by the GeoPandas team, unlike Fiona, which is an external project, and it is designed to be used with GeoPandas, but it occasionally leads to slight differences in how the files are read. This is the key dependency tree of GeoPandas.
12:26
The middle row covers Python packages, so we obviously have Pandas, we have Shapely, which provides geometries, we have Fiona and pyogrio providing the tooling to read and write files, and we have pyproj, which provides us with the tooling to understand
12:44
where exactly on Earth the geometries are, but Shapely, pyogrio, and pyproj are all actually wrappers around C++ libraries, GDAL, GEOS, and PROJ, and especially with GDAL,
13:02
it can be a bit tricky to ensure that it's all compiled correctly and it's all working correctly, so these two are now optional. We still obviously require GEOS because GeoPandas without geometries is just Pandas. We have a lot of new stuff coming, or already arrived.
13:24
There was a big project which was funded by NumFOCUS, which we are thankful for, to get API parity with Shapely, which means that every single function from Shapely is now available directly in GeoPandas as a method,
13:42
and you, in most cases, don't need to leave the GeoPandas namespace to do essentially any supported operation. I will get to the reasoning why this is important a bit later when talking about the future plans. We have new, very powerful unions, so if you are joining geometries together,
14:05
that operation can take quite some time, especially if you have a lot of geometries. If you know that those geometries are forming a spatial polygonal coverage, which means that they are like, let's say, state boundaries,
14:21
you know, one polygon lies next to each other, there are no intersections, overlaps, or anything of this sort, you can use the new coverage union methods, and the performance will be about, I don't know, 8 to 15 times faster than it was before, depending on the use case.
14:41
This is exposed through all the relevant methods where any union is happening. Until GeoPandas 1.0, you were able to join two sets of geometries spatially based on intersection, whether one is within another,
15:01
whether they are overlapping, whether they are touching, but there had to be some relationship where coordinates were actually at the same place. Right now, that changes, and we have included a new spatial predicate called dwithin, which stands for distance within, and we are able to join points which are, for example,
15:22
within 100 meters from the other points. There is no need to do a manual buffer of 100 and then do the intersection join. You can go directly, and it's going to be much faster than using kind of a more complicated route which wasn't necessary before.
15:40
The same, obviously, is exposed within the underlying spatial index if you want to work with the spatial index yourself and don't want to rely on the spatial join. We spent quite some time on proper support of GeoParquet and GeoArrow. For those of you who don't know what that is,
16:00
GeoArrow is a specific in-memory layout for the representation of geometries, and Parquet is a way to store an Arrow table, GeoArrow or any other, efficiently in a file.
16:24
GeoParquet is a specification of how to do that with spatial data, how to store projections, how to encode geometries, and it has a new version, 1.1, which I'm not actually sure has come out already, but we already support it, which allows us to write a covering bounding box into the file,
16:45
allowing extremely fast spatial filtering on read, so we can read a very small subset of your data very quickly without touching anything else in the file. And we can finally use GeoArrow as the encoding of geometries
17:02
within the Parquet file, because up to the GeoParquet 1.0 specification, it was only well-known binary (WKB), so it was a binary blob for each geometry, which wasn't very efficient. We have removed a module called datasets.
17:22
It was there for illustrative purposes. We've used it in documentation, and a lot of people are using it in their teaching materials, but they were a bit troublesome. We have a replacement that's called geodatasets. It works very similarly.
17:41
Some data are there, some data are not. But importantly, there are no contested political boundaries in the whole geodatasets project because there were contested political boundaries in the datasets module of GeoPandas. The datasets had been there since 2014. They used geometries from the Natural Earth project,
18:04
but there were issues like this. Natural Earth has their own definition of how to deal with contested political boundaries, and it just doesn't end well.
18:21
At some point, there was a lot of anger on Twitter about the fact that Natural Earth data were showing Crimea as part of Russia, not part of Ukraine. We quickly fixed that and quickly decided, okay, we don't want to deal with any of these issues.
18:42
GeoPandas is a software project, not a data project. We don't want to deal with political issues, so the GeoPandas datasets are gone. I know that it broke a lot of tutorials, a lot of code on Stack Overflow. I'm sorry for that, but we just had to do this.
19:03
There are some deprecations, obviously, as part of the 1.0 release, which were announced two versions, three versions, or even, I think, 11 versions ago. So I can just quickly skip those. So is it any faster these days? It kind of depends on which versions we are comparing.
19:23
If you were already using the latest Shapely and, optionally, pyogrio before, it's not faster than 0.14 was, but it is faster if we take the whole milestone of the work which was targeted for 1.0
19:41
and compare it to what happened before. So right now, we require Shapely 2.0, whereas before you were able to use the older Shapely 1.8, and you can see for yourself how much faster the new implementation actually is. It depends on the geometries, the use case and the operation, but it can easily be 80 times faster, or even more,
20:04
than the original one. That obviously also translates into spatial joins, which are slightly refactored, so it's even faster kind of internally. Reading files with pyogrio is way faster than it was with Fiona, and it's more memory efficient.
20:23
Again, it depends on the file type and it depends on the complexity of geometries. But there is also one thing which is not on the slides. We now optionally allow using Arrow to ship the data from GDAL into GeoPandas,
20:42
which cuts these times to about half again. But for some reason, we weren't able to do that by default yet. So what's coming next after GeoPandas 1.0? A lot of stuff is happening on the side of GEOS and Shapely,
21:01
and we're hoping to get them into GeoPandas soon. A big thing is probably going to be coverage simplification, because right now, if you have geometries and you simplify them, they are simplified one by one. So states of Africa will look like this once you simplify them. But what we want is something like this.
21:23
And the image on the right is actually coming from a branch of Shapely which allows coverage simplification. Obviously, we hope to have some coverage validation for all those coverage-based operations. And it's possible, probably quite likely,
21:42
that in the future, we will also have support for curved geometry types within Shapely and within GeoPandas. So stuff like curved polygons and circular strings will make their way into GeoPandas for the people who might actually need those.
22:04
A big thing, hopefully, is the incoming spherical geometry engine. So right now, GeoPandas requires GEOS, which can do spatial operations only on a plane. It assumes that everything is flat.
22:20
The world is flat. You somehow have to project it into a flat space. Not every operation, not every analysis, especially if you're working at a global scale, can be done using GEOS. So we will provide support for a package called spherely, which uses Google's S2 library instead of GEOS,
22:43
which provides exactly that support for spherical geometries. And this is one of the reasons why we wanted to have the functionality of Shapely exposed as methods at the GeoPandas level,
23:00
because if we are relying on GeoPandas methods, GeoPandas itself can automatically decide whether we should use spherely or Shapely. If you have your geometries as spherical geometries, as spherely objects, and you apply a Shapely function, obviously it's going to error.
23:21
But if you are relying on GeoPandas methods, GeoPandas can make the decisions for you and just use whatever is most suitable for your use case. We're also hoping to provide a bit more support for GeoArrow. I think that the next talk in this room
23:41
will talk a bit more about what Arrow is, so if you're interested in what Arrow is overall, just stay here. This is probably not going to be useful for the majority of users, but if you are dealing with very large data, you may want to delay the conversion of geometries
24:02
when reading a GeoParquet file into Shapely until you actually need it, or in some cases you may not need it at all. We're hoping that in the coming years, probably four to five, we may be able to optionally replace Shapely with a Rust implementation of geospatial operations
24:28
within the geoarrow-rs package, which should hopefully provide some performance boost and multi-threading that GEOS is not able to provide right now, but this is kind of very open. It may happen, it may not.
24:45
We're trying to do a lot of stuff on scaling within the dask-geopandas project, which is kind of lagging behind, so we want to get parity and full compatibility with the expression engine, and interactive mapping based on Datashader. All of these kind of have different branches somewhere,
25:00
but it's not finished and tested. I'm obviously standing here as one of the maintainers of the project, but we couldn't do the whole project without a vast number of contributors. We counted over 200 contributors, code contributors over the years.
25:21
There are hundreds more who are publishing, just reporting bugs or writing tutorials and everything else. So we thank them all, and we thank NumFOCUS for their support, and I thank EuroPython for the space here. I think we have four minutes for questions, if you have some.
25:44
Thank you.