
Geospatial Analysis using Python and JupyterHub

Formal Metadata

Title
Geospatial Analysis using Python and JupyterHub
Subtitle
Processing, analyzing, and visualizing geospatial data on a high performance GPU server
Title of Series
Number of Parts
118
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Geospatial data is data containing a spatial component – describing objects with a reference to the planet's surface. This data usually consists of a spatial component, of various attributes, and sometimes of a time reference (where, what, and when). Efficient processing and visualization of small to large-scale spatial data is a challenging task. This talk describes how to process and visualize geospatial vector and raster data using Python and the Jupyter Notebook. To process the data, a high-performance computer with 4 GPUs (NVIDIA Tesla V100), 192 GB RAM, and 44 CPU cores is used to run JupyterHub. Numerous modules are available that help with using geospatial data through low- and high-level interfaces, which are shown in this presentation. In addition, it is shown how to use deep learning for raster analysis using the high-performance GPUs and several deep learning frameworks.
Transcript: English (auto-generated)
Thank you very much for the kind introduction. I'm talking about geospatial analysis, and the second thing you see in the title is JupyterHub. Has anyone of you already used JupyterHub before?
Oh, cool, so not too many, but more and more and more. That's great. Actually, I could have titled that talk with JupyterLab instead of JupyterHub, but I want to show you how cool JupyterHub is. It's basically JupyterLab where you can log in.
So if you go on a website and you have a JupyterHub installed, you just get this page here, and you can sign in, and after signing in, you see this. So it's perfect. So it's a multi-user JupyterLab, and that's pretty much all.
There is something I'd like to show you, too. This one you can do with a regular Jupyter notebook or JupyterLab installation, too. You see I have three kernels here. I have a Python kernel, I have a Markdown kernel, I have an R kernel, but there is another feature.
You can have kernels with different Python versions, and that's quite handy. You just create a virtual environment. You see that above, using Conda in our case, with an environment name, whatever you like, and then you specify Python 3.5, 3.6, 3.7, whatever.
Don't use 2. And the IPython kernel, ipykernel. Then you activate this environment and install all the cool packages you want to use, and after that, you can create a new kernel with the line above. Just ipykernel install --user --name and the name of the new kernel.
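To make the kernel registration less magical: a Jupyter kernel is described by a small folder containing a `kernel.json` file. The following is a minimal sketch of what such a file looks like, written with the standard library only. In practice `python -m ipykernel install --user --name NAME` creates it for you; the function name `write_kernelspec`, the kernel name `geo37`, and the temporary target folder are assumptions made for this illustration.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_kernelspec(kernels_dir, name, python_executable, display_name=None):
    """Write a minimal kernelspec: a folder <name>/ holding a kernel.json."""
    spec_dir = Path(kernels_dir) / name
    spec_dir.mkdir(parents=True, exist_ok=True)
    spec = {
        # Command Jupyter runs to start the kernel. The placeholder
        # {connection_file} is filled in by Jupyter at launch time.
        "argv": [python_executable, "-m", "ipykernel_launcher",
                 "-f", "{connection_file}"],
        "display_name": display_name or name,
        "language": "python",
    }
    kernel_json = spec_dir / "kernel.json"
    kernel_json.write_text(json.dumps(spec, indent=2))
    return kernel_json


# Hypothetical example: register the current interpreter under the name
# "geo37" in a throwaway folder instead of the real Jupyter data directory.
demo_dir = tempfile.mkdtemp()
kernel_json = write_kernelspec(demo_dir, "geo37", sys.executable,
                               display_name="Python 3.7 (geo)")
print(kernel_json.read_text())
```

Pointing `python_executable` at the interpreter of a different environment is what makes several Python versions show up side by side in the launcher.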
And the screen turns black. And then you can list all the kernels using jupyter kernelspec list, and you see all the kernels installed. So if you do this procedure for, let me say,
five different Python versions, you will see five different Python versions in your JupyterLab environment. That's really quite handy. And now we come back to the original title, geospatial: if you install geospatial modules, then you usually have to install
many C-based libraries, and for that, it's really recommended to have multiple Python versions and environments. And of course, if you are on JupyterHub, you will have your file system there, and you can access all your user files from JupyterLab or JupyterHub.
So what we are doing: we have an HP Apollo 6500 server, and on this server, we installed JupyterHub. We bought this machine with 48 cores, 192 gigs of RAM,
and attached it to our small storage system with 120 terabytes, which is actually quite fast storage, where we have one GB per second reading and writing speed, and that's also a very important fact. If you have terabytes of geo data, you want to have a really fast and reliable system.
We also have four Tesla V100 in it. Wow, that's high tech here. So I think the cable should be changed tomorrow. Okay, so what I wanted to say: we have the Tesla V100, the SXM2 model.
That's one of them here. It uses lots of power and has 900 GB per second of memory bandwidth, so it's quite fast. We use that to create our deep learning models. More about that maybe later.
So what is geodata? There are some standards, ISO standards, describing what geodata is, the Technical Committee 211 series, and so on. But the most important thing is: most data you have has a geospatial component. Most data you actually have has a location component,
or you can create a location component out of it. And mostly people use GIS software to load and manage this data. However, that's something I do not want to do personally. I use Python for that. So what I show you now
is everything I'm doing with geo data is done in a Jupyter notebook. And you can really uninstall all GIS software if you do that. And today I'm limiting myself to vector data and a little bit raster data.
There is other geospatial data like point clouds and 3D objects, and that's not what I'm going to talk about. Everything I'm showing today is open source. The two most important libraries are C++ based. It's GDAL/OGR, okay.
And the second library is GEOS. They have bindings in Python, but these are really not Pythonic. So therefore, some people created new Python modules
which are really Pythonic and use the same C++ libraries, and it's much, much nicer to work with them. I would not recommend using GDAL directly. I would use Rasterio for raster data processing, Fiona for vector processing,
and Shapely to do some vector data operations. I will show you in an instant. And if you know pandas, a really nice module in Python, there is also GeoPandas, which extends pandas for geospatial data. So I give you the links to the projects
we are looking at today. The most important thing is that we use Jupyter notebooks, and the first module I'm showing you is Folium. Folium is basically Leaflet.js, a JavaScript library to create maps. It's one of many JavaScript libraries to create maps,
and with three lines of code, you have a map in your Jupyter notebook or JupyterLab. So you import the Folium module, and you just create a map, you specify a location and a zoom level.
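The zoom level can be made concrete with a little arithmetic. In the tiling scheme commonly used by web maps, the world is cut into 2^zoom by 2^zoom square tiles, so each extra level doubles the detail. The sketch below uses only the standard library and is not Folium code; the function names are made up for this example, and it assumes the usual spherical Web Mercator tiling.

```python
import math


def tiles_per_side(zoom):
    """Number of tiles along one edge of the world map at a zoom level."""
    return 2 ** zoom


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the tile containing a lon/lat position.

    Assumes the common Web Mercator tiling with tile (0, 0) at the top left.
    """
    n = tiles_per_side(zoom)
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


# Roughly the location used in the talk: longitude 7.5, latitude 47.
for zoom in (0, 1, 10, 20):
    print(zoom, tiles_per_side(zoom), lonlat_to_tile(7.5, 47.0, zoom))
```

With about 20 levels, the finest level has more than a million tiles along each edge, which is why a single building can fill the screen.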
A zoom level is how far you are away from the ground. There are typically about 20 zoom levels. You know that from other mapping services like Google Maps, Bing Maps, OpenStreetMap, Yahoo Maps, and all the map services that exist today. Another thing is if you look at vector data,
there are some specifications like the OGC Simple Feature Access specification, where geodata, in this case vector data, is defined. This is used in many databases like PostGIS, PostgreSQL, and so on, and one of many representations is just using text.
So I use text to specify a point, I use text to specify a polygon, and so on. The reason for that is you can print it, and in 100 years you can still read it. So in the geo world, that's a very important topic. There's also WKB, a binary format,
but I'm not talking about that now. So here are some examples. If you specify a point in WKT, well-known text, it's just POINT, brackets, 10 20 in this example, or if you have a polygon, it would be POLYGON, then the coordinates, or there are some things
like multi-polygons, so you have multiple polygons. For example, if you have a country with islands, there are multiple polygons in that. There are also countries with holes, and then you have a hole in the polygon. This is all specified in WKT, so it's a nice thing, and we can use that directly.
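To show how plain the text form is, here is a small sketch that writes WKT for a point and a simple polygon from Python tuples. It is only an illustration with the standard library, not the Shapely implementation; `point_wkt` and `polygon_wkt` are names invented for this example, and holes and multi-polygons are left out.

```python
def point_wkt(x, y):
    """Format a single coordinate pair as a WKT point."""
    return f"POINT ({x} {y})"


def polygon_wkt(ring):
    """Format one outer ring (a list of (x, y) tuples) as a WKT polygon.

    WKT polygons are closed: the first and the last coordinate are the
    same, so the ring is closed here if the caller did not do it.
    """
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({coords}))"


print(point_wkt(10, 20))
print(polygon_wkt([(0, 0), (4, 0), (4, 3), (0, 3)]))
```

The double brackets in the polygon exist because a polygon is a list of rings: the first ring is the outline, and any further rings would be holes.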
So we can create something similar to the WKT, just using a Python list and tuples for the coordinates. You see, you import Polygon, import Point, and here you just specify your polygon.
And if you look at it, you see the first and the last point are the same. That's an important aspect of this standard. The first and the last point are the same, so we have a closed polygon. We can actually load it from text, too. We can create a string with the WKT definition
and load this using shapely.wkt and just loads, s for string, and then we have our polygon definition. Another format which is quite popular in the JavaScript world is GeoJSON, and there you also create your polygons
and specify the coordinates. That's another approach to define vector data. Of course, there are many other formats, too. I'm not going into details there now, but that's what you find if you go into the geo business. So let's just add such a GeoJSON in Folium.
You see it's a little bit more complicated, but basically you open the GeoJSON file, you load it, and you put it on the map. Again, same syntax. And then you use the GeoJSON class from Folium, it's just called GeoJson, and you add it to your map.
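For readers who have not seen GeoJSON before, the sketch below builds a tiny feature collection and reads it back with the standard `json` module. The square and its `name` property are made-up sample data, not the Switzerland file from the talk. Note that GeoJSON stores positions as longitude first, then latitude.

```python
import json

# A made-up square somewhere in western Switzerland, as one GeoJSON feature.
feature = {
    "type": "Feature",
    "properties": {"name": "example square"},
    "geometry": {
        "type": "Polygon",
        # One outer ring; positions are [longitude, latitude] and the
        # ring is closed (first position == last position).
        "coordinates": [[
            [7.0, 46.0], [8.0, 46.0], [8.0, 47.0], [7.0, 47.0], [7.0, 46.0],
        ]],
    },
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialize to text (what would be stored in a .geojson file) and load again.
text = json.dumps(collection)
loaded = json.loads(text)

ring = loaded["features"][0]["geometry"]["coordinates"][0]
print(loaded["features"][0]["properties"]["name"], len(ring), "positions")
```

Because the file is ordinary JSON, loading it needs no GIS library at all; the mapping library only has to interpret the `geometry` part.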
In this case, I loaded a GeoJSON of Switzerland. Okay, you see that? It's the shape of Switzerland. And now I do the same again, but I plot it directly using Shapely, and you see it's not the same, so there is a distortion.
So Switzerland is not that distorted, usually. And the reason for that is we have different coordinate systems. So let me show you the grid here on the sphere. You know that there is longitude and latitude, longitude along the equator, latitude toward the poles,
and you can project this to a map. The easiest way is to create a Cartesian coordinate system out of the sphere, so you just map the latitude and longitude onto it, and then you get this one,
and that's a completely distorted image of the world. It's not what you see in Google Maps, actually. There are even worse ones with more distortion. So there are some definitions. The Earth is an ellipsoid,
so the World Geodetic System 1984 defined some data on how the Earth is best fit by a rotational ellipsoid or spheroid, and out of that you can create different map projections.
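The difference between the two plots can be sketched numerically. The first function below maps longitude and latitude straight onto x and y (the "easiest way" described above); the second is the spherical Mercator formula used by most web maps. This is a simplified illustration on a sphere with the WGS 84 equatorial radius, not the exact ellipsoidal math; the function names are chosen for this example.

```python
import math

R = 6378137.0  # WGS 84 equatorial radius in metres, used as sphere radius


def plate_carree(lon, lat):
    """Map lon/lat directly onto x/y: simple, but shapes get stretched."""
    return R * math.radians(lon), R * math.radians(lat)


def mercator(lon, lat):
    """Spherical Mercator: y grows faster and faster toward the poles."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def mercator_scale(lat):
    """Factor by which Mercator enlarges lengths at a given latitude."""
    return 1.0 / math.cos(math.radians(lat))


for lat in (0, 47, 60, 80):
    print(lat, round(plate_carree(0, lat)[1]), round(mercator(0, lat)[1]),
          round(mercator_scale(lat), 2))
```

At 60 degrees latitude the Mercator scale factor is already 2, and it keeps growing toward the poles, which is why areas far from the equator look much too large.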
I took three here out of many, 10,000 different ones; actually you could invent your own map projection if you want to. Here I printed three of them, and you see they are all a little bit different. The Mercator projection is what you know from Google Maps, et cetera,
and you see Antarctica down here is bigger than most other continents, which is completely wrong, but it's an effect of projections. So we can look at these so-called coordinate reference systems or spatial reference systems,
and we can have two special cases. One is that we use a geocentric Cartesian system, that's just a Cartesian system with X, Y, Z, or we use projected coordinates, that's usually not 3D, it's actually flat, and actually every country
has its own representation. Switzerland has its Swiss grid, and other countries have their special coordinate systems too. I'm not going into details here, but you can look it up at epsg.io, you can look up the system of your country.
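As an illustration of the geocentric Cartesian case, the sketch below converts longitude, latitude and height on the WGS 84 ellipsoid into X, Y, Z coordinates measured from the Earth's centre. The two constants are the published WGS 84 parameters; the function name `geodetic_to_geocentric` is an assumption for this example, and in practice a projection library would do this conversion.

```python
import math

WGS84_A = 6378137.0            # semi-major axis in metres
WGS84_F = 1 / 298.257223563    # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)  # first eccentricity squared


def geodetic_to_geocentric(lon, lat, height=0.0):
    """Convert lon/lat in degrees and height in metres to X, Y, Z in metres."""
    lam = math.radians(lon)
    phi = math.radians(lat)
    # Radius of curvature in the prime vertical at this latitude.
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(phi) ** 2)
    x = (n + height) * math.cos(phi) * math.cos(lam)
    y = (n + height) * math.cos(phi) * math.sin(lam)
    z = (n * (1 - WGS84_E2) + height) * math.sin(phi)
    return x, y, z


print(geodetic_to_geocentric(0, 0))     # on the equator at Greenwich
print(geodetic_to_geocentric(7.5, 47))  # roughly the location of the talk
```

A point on the equator sits one semi-major axis from the centre, while a pole is about 21 km closer, which is the flattening of the ellipsoid in numbers.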
EPSG is the European Petroleum Survey Group; they catalog all these coordinate systems, and for example, EPSG 4326 is the World Geodetic System 1984. Okay, that was a little bit off topic, let's look at a real example,
we are located around here. So we can say we have a longitude of 7.5, so here Greenwich is zero, and we are seven degrees to the east,
and then 47 is the latitude, so here would be the equator, so we go 47 up here, and we are in Switzerland, at the Congress Center Basel. So that's how it works, maybe; the problem we will see in an instant.
So with Shapely we can do some nice operations. We can check if a point is inside a polygon, for example. That's a very complex operation, but with Shapely it's just a few lines of code, actually one line of code. So you create a point, 47, 7, that's our coordinate of the Congress Center Basel,
I can look at its WKT representation, I see a point at the coordinate, so everything is perfect, and then I check the operation, this EuroPython point is within Switzerland, and we get the result False.
So what did I do wrong? Lower case, wrong projection, all wrong? It's very simple, you see, I show you the result, how it is done correctly.
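The effect can be reproduced without any GIS library. The sketch below uses a simple ray-casting test in place of Shapely's `within`, and a rough bounding box in place of the real outline of Switzerland; both are stand-ins chosen for this illustration, and the box corners are approximate.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how often a ray to the right crosses the outline."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be hit.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Rough bounding box of Switzerland as (longitude, latitude) pairs.
# This is an approximation for the example, not the real border.
switzerland_box = [(5.9, 45.8), (10.5, 45.8), (10.5, 47.9), (5.9, 47.9),
                   (5.9, 45.8)]

lon, lat = 7.5, 47.0
print(point_in_polygon(lat, lon, switzerland_box))  # latitude first: wrong
print(point_in_polygon(lon, lat, switzerland_box))  # longitude first: right
```

With the values swapped, the point lands at longitude 47 and latitude 7.5, far outside the box, so the test fails without raising any error.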
So what was the difference? I flipped the latitude and longitude. Now I have the longitude first, and then it works. So the problem is, before we had the Folium module; Folium says first latitude, then longitude,
Shapely says first longitude, then latitude, and that's a common problem. Some people say lat long, others long lat, this is best, or no, that is better, and the confusion is perfect. So we always have to consider that
and know which module uses which representation. Personally I prefer this approach too, because it's something like x-axis first and y-axis second, but in geographic coordinates, you can't really say x-axis and y-axis, so that's the point that many people find
worth disputing. So I said before we have other vector formats. I'm not going into the details; I just recommend, if you want to read vector data, use the Fiona module. But as time is going on, I'm quickly showing GeoPandas, which is pandas with the ability to make
some geospatial queries. So let me load a dataset with all cities of the world with a population greater than 5,000. You can download this dataset at geonames.org.
It's very small here, so you don't see it that well, because it has a lot of data in it, so I reduced it to the most important data. I take the name, latitude, longitude, population; you see I take latitude first and then longitude, and that's the dataset. You can create a GeoPandas data frame out of it;
the trick is to make a column that's named geometry, and in this geometry column you have a Shapely representation of the geographic information. This could be a point like in this case, or a polygon, a multi-polygon, or whatever; you can create your geometry column just there.
GeoPandas can also plot, like we know from pandas: just make your GeoDataFrame and plot it, and if you plot all cities of the world, you recognize the shape of the continents, more or less. Europe is quite green in this case,
so there are many cities. I can do some queries; it's basically pandas, so it's the same, and you see if I make a query, name Basel, I get the Basel information. But more interesting are spatial queries, so let me get the distance from the Congress Center here
to all other cities in this dataset. I just create our point again, calculate the distance, make a new column named distance, and sort by this distance column. So I show you the result.
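One thing to keep in mind with such a distance column: if the geometries are plain longitude and latitude values, a planar distance comes out in degrees, not in kilometres. A minimal sketch of a distance in kilometres, using the haversine formula on a spherical Earth and the standard library only, could look like this. The city coordinates are approximate values typed in for the example, not the GeoNames dataset, and this is not what the notebook in the talk does.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; the Earth is treated as a sphere


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate (name, longitude, latitude) sample rows.
cities = [
    ("Zurich", 8.5417, 47.3769),
    ("Geneva", 6.1432, 46.2044),
    ("Bern", 7.4474, 46.9480),
]
here = (7.5886, 47.5596)  # approximate centre of Basel, longitude first

# Compute the distance for every row and sort, nearest city first.
ranked = sorted(
    ((name, haversine_km(here[0], here[1], lon, lat))
     for name, lon, lat in cities),
    key=lambda item: item[1],
)
for name, km in ranked:
    print(f"{name}: {km:.1f} km")
```

For higher accuracy one would project the data to a metric coordinate system or use an ellipsoidal formula; the sphere is good enough to rank cities.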
It's simple to understand: you have the name here, and the geometry, and here the distance. You see we have Birsfelden, it's just next to Basel, and Basel itself. It's a little bit strange, because it's always the center, so it's the distance to the center coordinates, so we are closer to Birsfelden than to Basel. Binningen, Weil am Rhein, that's in Germany,
Saint-Louis, France, and so on, so those are the names of these places with the distance. I can also query within a polygon, so I can use my polygon again and say I would like to have all the cities within the polygon Switzerland,
and then we see if I do that, and combine it with something else, like for example I would like to have all the cities with a population bigger than 20,000, and this is not sorted actually, but it doesn't matter,
so I get all the cities in this dataset within Switzerland and with a population greater than 20,000. So let's do one more thing, display the cities in a Folium map. That's quite easy, you can combine those modules, so you just use apply, for example,
you can specify a function which creates a marker for every city, and then you have that in Folium. So let me do a last example, before the session chair throws me out. There is for example a nice dataset
with live earthquakes, or the earthquakes of the last two weeks, so you can download that directly with this link. I do that for example with the requests module, and then I store it as a file, earthquakes GeoJSON; I did it about half an hour ago,
and that was the result. So I can use GeoPandas to open my GeoJSON directly and display the first five incidents, and again I simplify the dataset, I reduce it to four columns, time, magnitude, place, and geometry,
we see the first five, it's not sorted anyway, and we see a trend in California at the moment, there is a hotspot there at the moment, and we can create a histogram out of the data, that's a nice way, using a histogram with 16 bins in this case.
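What a histogram with 16 bins does can be sketched in a few lines: split the range between the smallest and the largest value into 16 equal intervals and count the values in each. The magnitudes below are made-up sample values, not the downloaded feed, and the `histogram` function is a plain-Python stand-in for the plotting call used in the notebook.

```python
def histogram(values, bins):
    """Count values in equally wide bins between min(values) and max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum itself belongs to the last bin, hence the min().
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges


# Made-up magnitudes for illustration only.
magnitudes = [2.5, 2.6, 2.8, 3.0, 3.1, 3.1, 3.3, 4.2, 4.5, 4.9, 5.4, 6.4, 7.1]
counts, edges = histogram(magnitudes, 16)

# Print a small text histogram, one line per bin.
for left, count in zip(edges, counts):
    print(f"{left:4.2f} {'#' * count}")
```

Most of the marks pile up in the low bins, with a few isolated ones at the high end, which is the shape described in the talk.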
We see, luckily, most earthquakes are around three, and there are higher ones in there unfortunately. You see in the first column here you have a timestamp, and to change the timestamp to a more readable representation,
you can use datetime and timezone from Python's datetime module and create a new column which is more readable, so we have the 10th of July, this is UTC timezone, maybe we hear something about timezones in the lightning talk, I don't know, tomorrow, tomorrow,
very nice talk about timezones, very important, Miroslav, Miroslav, so we can plot this, and we can also plot multiple geo datasets, and you see I read this GeoDataFrame, and I can combine this using just plots,
multiple plots using the axis, you can have multiple, so I can display the continents, and some earthquakes on it. You could do more, you could change the size of the dots depending on the magnitude, and I think the cable says it's time for questions,
so thank you very much for your attention, and are there any,
I think, actually there is a microphone on the table,
I think somewhere, I'm not quite sure. Hello? I'm not sure. Can you say something about what you use this very expensive computer for? Yeah, that's a good question. I wanted to say more about that, but unfortunately I got the timing wrong; after 35 slides I said,
oh, I have to stop soon. We do some projects, for example to detect solar panels on roofs. We have a dataset of orthophotos of the whole of Switzerland, that's about two terabytes of data, and there we try to detect solar panels, different kinds of solar panels,
and therefore we create models, deep learning models, and train them, and for training, and to improve it, we use the four GPUs. Oh, there, it's confusing, Mr. Microphone. And of course many other applications,
we do many deep learning projects at the moment. No, I didn't skip it; I actually didn't even put it into this presentation,
because I don't have it ready, actually. Are there any solutions for geodata, spatial queries in databases, in Django applications you would recommend, because we've seen Python now, but if I have to trim it down to SQL,
it becomes a bit more complex, especially when I have to do it from a Django direction. Of course you can. This is something I don't like to answer at a Python conference, but since you ask me now: for example, there is PostgreSQL with PostGIS, and PostGIS offers spatial queries too; you can do the same as I showed here,
and unfortunately you can do that with PostGIS much faster than using the JupyterLab solution I showed you, so what I showed you is actually slower than if you are using Postgres. But you can do actually the same. The disadvantage of course is
you don't have a nice Python environment, you can't program it nicely like this, you can just do queries. I'm aware of that, but it's a feature of a specific database, and if I want to do it from Django and the Django query should also work with SQLite, then I think I can't just use the Postgres features.
Are you aware of the product GeoDjango? There is GeoDjango, which takes care of these details, so you can directly access the features of PostGIS with GeoDjango.
Are there possibilities to use these libraries for planets other than Earth, so Mars or... Yes, it's actually no problem, you can do any planet.
The only problem is that you don't have high-resolution data of other planets, but it's basically the same. You just need the model; there are models for Mars, for example, there are models for most nearby planets. On Earth you have the WGS84 representation,
but Mars is basically also an ellipsoid, so you can do exactly the same calculations. You could even do distance calculations from one point to the other with GeoPandas and a Mars dataset, so it's no problem. Okay, thanks.
Yes, I think all the EuroPython slides will be on the website program, so all speakers will upload them
and you can just download it from the place where the schedule is, so just click on the topic and you will get the link to the slides.
That's a very good question. Don't use GeoPandas for very large datasets. It's the same as with pandas; you can't use pandas for very large datasets at the moment.
The developers are working on that, they try to do some memory voodoo, sorry for that, but it will not work unless you use other modules. I didn't show that because it's already too much detail. If you use Fiona, for example, you can take one row of the dataset at a time
and you have only that in memory, so you have to do the memory management yourself. For example, if you want to do some distance calculations, you would just do it on a per-row basis, and then you could take a multi-terabyte dataset and do your calculations with that.
For larger datasets, there is also PySpark and GeoPySpark. There is a trend of putting a Geo in front of classic Python modules, and with GeoPySpark or PySpark, you can do much bigger calculations.
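The per-row idea can be sketched with the standard library. The example below streams rows from a CSV source one at a time and keeps only a running result, so memory use does not grow with the size of the file. The CSV reader stands in for a row-by-row reader such as Fiona; the sample rows, the column names and the function names are assumptions for this illustration.

```python
import csv
import io
import math

# Tiny inline sample standing in for a very large file on disk.
SAMPLE_CSV = """name,longitude,latitude
Zurich,8.5417,47.3769
Geneva,6.1432,46.2044
Bern,7.4474,46.9480
"""


def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance on a spherical Earth in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_city(rows, lon, lat):
    """Scan rows one by one and remember only the closest one so far."""
    best_name, best_km = None, math.inf
    for row in rows:  # each row is read, used, and then discarded
        km = haversine_km(lon, lat,
                          float(row["longitude"]), float(row["latitude"]))
        if km < best_km:
            best_name, best_km = row["name"], km
    return best_name, best_km


# csv.DictReader yields one row at a time instead of loading everything.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
name, km = nearest_city(reader, 7.5886, 47.5596)
print(name, round(km, 1))
```

The trade-off is the one mentioned in the answer: nothing is held in memory for you, so sorting, joining or revisiting rows has to be organised by hand.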
Actually, there is almost no limit. It's a hardware issue. If you have enough money for the hardware, you can have unlimited amounts of data.
Okay, thank you very much again.