We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Mapchete - tile-based geodata processing

00:00

Formal Metadata

Title
Mapchete - tile-based geodata processing
Title of Series
Number of Parts
295
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Mapchete (https://github.com/ungarj/mapchete) is a tool written in Python which helps processing large amounts of geodata such as global high resolution datasets. It does so by executing a user-defined Python function on smaller chunks of data (tiles). The standard tiling schemes follow the well-known tile pyramid schemes used by WMTS which also enable mapchete to let the user easily preview process outputs using a built-in development server (Flask) hosting an OpenLayers page. By processing large areas through their much smaller tiles or metatiles, possible memory errors can be avoided. Furthermore, tiles can be processed on multiple CPU cores in parallel which speeds up the processing time. All geospatial data (i.e. raster and feature data) are internally handled and exposed to the user-defined process function either as NumPy arrays (raster) or GeoJSON-like dictionaries (features) which can easily be edited with well-known Python packages like shapely or scipy. For I/O operations mapchete makes heavy use of rasterio (https://github.com/mapbox/rasterio) and Fiona (https://github.com/Toblerity/Fiona). It can read data formats supported by these packages and can currently write outputs into WMTS-like tile directories of GeoTIF
Keywords
129
131
137
139
Thumbnail
28:17
PiService (economics)Thermodynamischer ProzessScale (map)Process (computing)Array data structureSatelliteoutputRaster graphicsFunction (mathematics)Simultaneous localization and mappingConfiguration spaceCodeParameter (computer programming)Directory serviceComputer fileSatelliteFunctional (mathematics)Directory serviceTesselationSoftware developerData structureProcess (computing)Projective planeMedical imagingFunction (mathematics)Library (computing)Scaling (geometry)Web browserArray data structureThermodynamischer ProzessWeb 2.0Raster graphicsMappingZoom lensNumberType theorySingle-precision floating-point formatTessellationCore dumpBitQuery languageDiagramPlug-in (computing)Computer fileoutputSoftware frameworkFormal languageDevice driverParameter (computer programming)Electronic mailing listSuite (music)Revision controlSoftware testingImage processingReading (process)File archiverMultiplicationUniform resource locatorWebsiteAreaMessage passingMatching (graph theory)Client (computing)Network topologyService (economics)Electronic data processingRight angleFirst-person shooterLink (knot theory)CASE <Informatik>Data compressionClosed set2 (number)Level (video gaming)Open setTape driveOffice suiteHeegaard splittingComplex (psychology)Numbering schemePiMultiplication signForcing (mathematics)Repository (publishing)FrequencyArchaeological field surveyComputer animation
Computer configurationArc (geometry)Interface (computing)Line (geometry)Revision controlFile formatoutputFunction (mathematics)Electronic mailing listMessage passingProcess (computing)Local ringPrice indexData conversionTerm (mathematics)Server (computing)Subject indexingProcess (computing)TessellationStandard deviationoutputCASE <Informatik>DivisorPlug-in (computing)Endliche ModelltheorieTesselationPoint cloudZoom lensModule (mathematics)Function (mathematics)Set (mathematics)BitLevel (video gaming)Sheaf (mathematics)Parameter (computer programming)Musical ensembleMultiplication signData structureComputer fileShape (magazine)File formatGeometryPhotographic mosaicWeb browserCuboidBlock (periodic table)Thermodynamischer ProzessOcean currentPersonal area networkScripting languageResultantOpen setAnnihilator (ring theory)Rational functionDevice driverException handlingObject (grammar)FrequencyNumbering schemeNumberFilter <Stochastik>Functional (mathematics)Latent heatLie groupTraffic reportingMereologyMappingQuicksortMenu (computing)Computer configurationMedianWindow functionLine (geometry)Configuration spaceVideo gameThumbnailHill differential equationCategory of beingSource code
Digital filterTerm (mathematics)CASE <Informatik>Computer fileTime seriesPhotographic mosaicArray data structureFile formatGAUSS (software)Process (computing)Endliche ModelltheorieFunction (mathematics)Field (computer science)Network topologyComplex (psychology)Line (geometry)Multiplication signWorkstation <Musikinstrument>NumberShape (magazine)Source code
Digital filterLine (geometry)Active contour modelProcess (computing)Computer fileRaster graphicsGeometryBit rateMappingGraph coloringDifferent (Kate Ryan album)Expected valueOffice suitePolygonTouchscreenEndliche ModelltheorieNatural numberWater vaporFunction (mathematics)Active contour modelSource code
Active contour modelLine (geometry)BildsegmentierungProcess (computing)Device driveroutputFunction (mathematics)Directory serviceArray data structureData structureHypothesisWeb browserParsingSource codeMatrix (mathematics)Client (computing)Raster graphicsMathematical analysisSatelliteLevel (video gaming)Visualization (computer graphics)Photographic mosaicMedical imagingDirectory serviceClient (computing)WritingOnline helpType theorySingle-precision floating-point formatOcean currentCASE <Informatik>Function (mathematics)Server (computing)Process (computing)HypothesisFile formatReading (process)Direction (geometry)Installation artDifferent (Kate Ryan album)Point (geometry)TessellationWeb browserOpen setSubject indexingPlug-in (computing)Parameter (computer programming)TesselationCore dumpFunctional (mathematics)outputInverse functionBitSet (mathematics)Data structureZoom lensArithmetic meanWorkstation <Musikinstrument>Decision theoryLine (geometry)QuicksortComputer chessMultiplication signKey (cryptography)State of matterRepetitionPhysical systemAreaCuboidDevice driverVideo gameLevel (video gaming)2 (number)Fingerprint
Client (computing)Directory serviceRaster graphicsWeb browserMathematical analysisSatelliteLevel (video gaming)Service (economics)BitThermodynamischer ProzessLecture/Conference
Source codeCore dumpElectronic visual displayBitDifferent (Kate Ryan album)Vector potentialMultiplication signType theoryOffice suiteInternet service providerTesselationMatrix (mathematics)TessellationProjective planeMeta elementWeb 2.0Numbering schemePixelBuffer solution
2 (number)InformationTessellationMeta elementLevel (video gaming)AlgorithmPresentation of a groupMultiplicationMultiplication signNumbering schemeTesselationPreprocessorOrder (biology)Array data structureImage warpingModal logicView (database)Time zoneProcess (computing)Scaling (geometry)RepetitionThermodynamischer ProzessPerformance appraisalGreatest elementBitAlgebraLibrary (computing)Control flowLecture/Conference
Control flowBit
Transcript: English(auto-generated)
Okay. It's 3.30. I think we can start now. So, hello everybody. My name is Joachim Unger. Daniel already introduced me together with him. We are working for EoX, a small company,
not so small anymore, from Austria, Vienna. We mainly do projects with ESA in the area of satellite imagery, so processing, writing clients, and today I will be talking about MapChatter. It's a tool developed by us, and it does tie-based geodata processing, and
let's see whether it can break the curse of raster sites server processing. So, the project is available in GitHub, so it's a Python package. You have here
the URLs to the GitHub repository, as well as some documentation on Readthedocs. It's been published under a MIT license, and at least our test suite tested against Python versions 3.5 to 3.7. We first created it four years ago, and it has been in development
since, serving us for various projects, so like creating web maps and doing some satellite imagery and processing. So the initial idea was to use Python code, because Python is
a very nice language to learn, so it's not as complex as other ones, so that's why Python is so popular, and to use this and build a framework around it to facilitate global scale processing, and the core idea is simply that you write your Python function
and then MapChatter takes it and reads the data and applies this function on your data, and then writes an output. That's basically it. So here we created a short
diagram that should explain it a little bit better, so as in every type of processing, you have inputs, so data inputs and data outputs, and there's something happening in between. So what's happening in between is, as I mentioned before, the Python process, it could also be Cython, if you wrap it under using Python. The nice thing is you
have access to NumPy or SciPy or any image processing libraries like Pillow, Scikit, and whatnot. And as inputs, we are heavily using Rasterio and Fiona to read Raster and
vector inputs, so MapChatter takes, for example, a Raster image and converts it into an NumPy array, which then is available within the process function. Likewise, with vector data, we use Fiona to open it and basically use the GeoJSON-like dictionary
and pass it on to the process function. It's also modular, which means you can write plugins. So internally, we created a satellite archive reader, which basically reads the Sentinel-2 archives and Sentinel-1 archives from AWS and Mundi, and converts it into a four-dimensional array, which then we will
have available within the process function again. And as outputs, likewise, we use Rasterio and Fiona to store either single GeoTips or so-called tile directories. I will go into a little bit more detail afterwards, or use
or write in GeoJson if you have a vector output. And there are also plugins available, and can be extended by plugins. So, for example, we also have an X-ray writer, which allows to write a tile directory where the tiles are not GeoTips, but X-ray dumps, so, for example, NetCDF or SAR files.
So what you need to create a process is have a so-called MapChatter file. It's basically a YAML, so it follows the YAML syntax. And here you
should be stored which driver to use, which compression, whatever. You define your pyramid or grid, because, as I told you before, we are processing in tiles, so every input data has to be brought to the same grid, or to the same pyramid. So you have to define it here, and, of course,
you can also add some custom parameters for your process function. The process function is being stored in a Python file, so we just need a Python file containing an execute function, and this execute function can open the data and do whatever you want with it, and then you return
either a numpy array if you write the raster output, or a GeoJSON-like feature list for vector outputs. The tile directory output is something we borrowed the concept from WMTS.
Ivan was mentioning the web maps, so web maps are split up, as you know, in tiles and process levels, and so when you access, for example, OpenStreetMap, your browser would query PNGs or JPEGs according to a certain tile directory structure. So the first number you see here
is the zoom level, the second you see here is the column, and then there's zero, and we simply use geotiffs or x-ray dumps or whatever to store the data. It has some big advantages,
so, for example, multiple processes can write into the same output because it's not a single file, it's based on many files, so you can do a great parallel processing, and you can store really, really large data, so we are processing Sentinel-2, cloudless mosaics, so we have a worldwide 10-meter coverage,
and we store it in this format, so it's not one big tiff, but it's thousands of tiffs, and if we know the structure and the tiling scheme, we can easily query the tiles we need for a given bounding box. So you could say it's a little bit like a cog, but split up into multiple files because we have
internal tiling and overviews, but it might be an exaggeration as well. And, yeah, as I mentioned before, we are not bound to geotiffs, we can also do, of course, PNGs, as already possible, and geochasings, NetCDFs, whatever. And natively, Mobjetec can also read and write from
and to S3 buckets, which also helps with the cloud processing use case. So you could have various workers, and they all write into the same bucket, so that's very handy. It's being run via the command line.
So the command line interface has a couple of commands I will quickly pass through it. There's a command called formats, which simply lists the already installed formats. I mentioned the plugins before, here you can see whether your plugin already installed or not. Likewise, with processes, we can also register processes, so you don't have to provide the path to a Python file,
but you can also point it to a Python module path, and if you register it into the main package, then it gets listed here. The create command simply creates a new Mobjetec and dummy Python files, so it's been there to make it easier
to create a new process. There's a serve command, so the serve command is very interesting for debugging. It starts a flask server on your local host, and then you can open the browser and open layers,
a map would show up, and as you zoom and pan through the map, it would actually process the tiles and show you your current process outputs. So it's quite handy when you're developing your process and playing around with scripts and parameters that you immediately see the results.
Then there's the main command, it's the execute command, it simply executes the whole process on your given AOI. Because T-Dog cannot read tile directories, of course, there's an index command, and index would simply create either a VRT file, or if you later on use it with map server,
you can also create a shape index file. Because the VRT is not really optimized for a large mosaic of data containing thousands and millions of files, so map server has this special format, it's a shape file,
or can be a geo package, where just the bounding boxes of your tiles are stored, and then it can easily read it. And with this index command, you can create it in one line of command. And there's a convert command, it heavily borrows from the syntax of rio-convert, so the ristiro command line tool.
It simply extends it by being able to read and convert to tile directories, and vice versa. So you can use it if you have a global dataset, so for example when we get a customer request, they want a certain country or AOI from our global mosaic, we use this command to extract a subset.
I'm now going to show you some examples, so the first, oh, it's barely legible. Okay. So the first example I wanted to show you is the hillshade, so it's mainly, or I think you could say it's the hello world or ndvi example of cartography.
So Ivan already mentioned it. Unfortunately, yeah, it's very, yeah, it's very dark. Yeah. So on the left side you see the yaml configuration, and you see that you can just provide the path to the python file. Then the next block should be the pyramid,
so here we have one of our standard pyramids, it's called the geodetic pyramid. You can set performance settings like meta-tiling, so it would not process the 256 times 256 tiles, but two by two of these tiles, which makes much of the processes a lot faster.
Then you have an input section, and input can be anything, so it can be a path to a TIF, either remote or local, or some special input from one of your plugins. And at the output, likewise, you just specify which of the output formats you want to have and which bands and some driver-specific settings.
Then you have optional parameters, so you can limit the zoom levels you want to process, because we are always talking about tile pyramids. And the last section is the process parameters, so in this case you have the hillshade parameters, so like the azimuth, the altitude, and the exaggeration factor, the set factor.
And what you probably don't see here, we can have zoom level dependent parameters, so what I did here below is to have the exaggeration factor of the elevation model being, it changes between the zoom levels, so the more you zoomed out, the more exaggeration we want to have,
because we want to have the mountains shine out. And on the right side you have the execute function, it's quite straightforward, so you have this magic MP object, which opens data. Then you have the hillshade function, which just takes this numpy array,
and then you return the numpy array and you're done. So my initial idea was to do a live demo, but it didn't really work out, so here you have a browser window, and on the top you see the command I would have to run. It's the map chat to serve, and then the path to the file. And then in the background,
it would process the tiles that you're looking at, so whether you pan or zoom in or zoom out. The second example is, the first example but with filters, so it also calculates the hillshade, but you can also import pillow and scipy
and do some gauss filters, median filters, to play around to generalize the elevation model a little bit, so to optimize for your visualization, for your hillshade output. It's really bad that you cannot see it. So I will have to explain.
So it just has more lines, so what it does is to convert the numpy array to the format that pillow requires, and then applies this filter, and then you go to the next step. So you see, the more complex your use case will be,
your process file would grow with it, but you can do a lot of stuff here. So this would be the output of the filtered mosaic. I have the time series. Oh, what I forgot to mention is you can also do clipping. So the reason you won't see any pathometry here is because we also provided a shapefile
and told the process file to clip the raster via the geometries. So we were using this a couple of years ago already to create the hillshadings of our WMS maps, and it was quite fun and quite handy to work with it.
What we also did was to extract contour lines out of elevation models. I will skip through this a little bit faster. What we did in our maps was to, we wanted to show the contour lines in different colors depending on the underground they're in. So, for example, on water, that should have been blue. On land, brown.
And we wanted to have light blue contour lines on glaciers. So what this process does is simply reads them, extract the contour lines, intersect it with a different geometry. So in this example, it's simply the natural earth land polygon, and returns the geochasen output,
which then we could, so this is a screenshot from QGIS. So I visualized the contour lines depending on whether they are over land or over sea. So you can do different visualizations here.
This is a more recent example because we are doing a lot of mosaics, so we are trying out different approaches, and here we tried out some image segmentation. So we are using one of our existing mosaics. One of the reasons I'm showing you the example is if you see on the top left corner,
you see a path to an S3 bucket, but it's not a TIF, it's just a directory. It's because it's a tile directory. So you can, out of the box, just point it to this input, and it would recognize it and the tiling scheme, and then read the data accordingly. So this would have been very cool to show live,
but we have to do it via screenshots here. So what we did with this example was to take the RGB mosaic and use a scikit-learn or a scikit function that does image segmentation and use the output segments
and visualize it in a nice way. So just draw the red outlines over the... Is it visible? Okay, a little bit. And here again, the nice thing was you can zoom and pan, and in the background, it would load the TIFs
and extract the lumpy arrays, run the segmentation, and then renders the PNGs, which then can be visualized in the browser. And here I've assumed inversion of book harvest. So this makes it really nice to fiddle around with the process parameters and how many segments to use and how big should they be, and does it even work,
or is the output the way I want it to be? And yeah, stuff like that. So wrapping it up, we switched to a plug-in-based system, so we try to keep the core package as small as possible, but using plug-ins to extend it
via processes or input or output formats. Therefore, we use the Python setup tools entry points, which is quite easy to use. And there's also a... It's also in GitHub, it's our X-Array driver, so you can try it out there. Simply put, pip install mapjetter and mapjetter xarray,
and you can read and write multi-dimensional data as well. There was already a PostGIS plug-in, but it's been written like four years ago, and I didn't manage to keep it updated since then. But it could be extended via this way as well. So you see, we can be very flexible
towards different types of data and data structures. Because Ivan mentioned GeotiffJS, we already... So last year, more than one year ago, we supervised the master thesis
that would... Well, the topic was first, can we read those tile directives in open layers, and how would it work, and then under which circumstances would it be efficient or not.
So I mean, it was already discussed, but for the use case that the user just wants to have one single visualization, it's probably better to have it processed on the server, or rendered on the server and just visualized on the client, but if the user wants to interact with the data, then it's better. But then you have to weigh the pros and cons,
because if you transfer the data, it takes time, so it depends really on the download speed. And if you want to interact a lot with the data, then you pay the price. If you don't want to, then the server simply gets slower. What we are now working on
is to make GeotiffJS able to read tile directories, because GeotiffJS now can just read single TIFs, but the idea is to point GeotiffJS to a tile directory, and it reads the tile structure, and while rendering the data, or while being provided with the bounding box, it should figure out which TIF to read,
and which subsets, and how to warp it to put it into the correct output. But there we still have a big chunk of conceptual work in front of us, because the current tile pyramid definition is very minimal.
Let's put it this way. It worked for our use cases the past years, but to make it interoperable with other tools, we might have to change here and there something of the concept. Yeah, and the target of this would be to have... So the OpenLayers interface I showed you,
it just works with PNGs, so the target would be to have GeotiffJS there so that your process output could be any type of TIF, and you could interact with and query it while your Geotiff tiles get rendered, or get processed by the server, and then you can directly look into the data,
because the other possibility would just have to be that you create a bunch of TIFs, run the vrt index command, and then open it in QGIS, and having it directly in the browser would be much more nicer and efficient. So next step and some follow-up talks I recommend.
So the next step would be to work on the GeotiffJS client. Before that, we have to figure out the tile directory spec. So if you think this data structure will be useful or would be useful for any of your use cases, then feel free to contact either me or Fabian.
We'll be happy to get some help there, because more brains working on the same problem is usually better than having just two. And what's also on my agenda is to implement the possibility to write single geotiffs with custom overviews,
meaning that the overviews in the geotiff are not being generated from the base level, but you can write each overview individually with different settings. So for the hillshade example, this would really be nice, because then you could store the hillshade in one TIF, and depending on the zoom level, it's a little bit different, at least for this use case.
It would be great. Yeah, follow-up talks. Fabian is the creator of GeotiffJS. He will talk tomorrow afternoon. So if you found these talks in this session interesting, you should go there. And also tomorrow in the same session, there's my colleague Peter,
going a little bit more into detail on our cloudless processing. And yeah, it's also, I think, in the same session tomorrow afternoon in the Oprah room. So thank you for your attention. And yeah, feel free to get in contact if you have any questions or suggestions.
All right, so I think we managed to catch up a little bit, a little delay in the display at the beginning. So we're in time. We don't need this five minutes of changing rooms now. So you have five minutes for questions.
So yeah, I think for everyone that has been working with different sources and bringing things together and larger data sets, this can have or has some potential or you see some potential in this tool.
So maybe give it a look. And so if you have some questions for Joachim. Yeah, sure. So the time you are going to work on this presentation, but right now is it like a similar approach like when you should like to type? Like you started as a data item and then write type?
Yeah, currently we follow the WMTS simple approach, which just defines the Web Mercator and the Geodetic Tiling Scheme. And yeah, we added on top, you know, the possibility to store the tiles in meta tiles
and even using the paddings, the pixel buffers. But we, yeah, we just wanted to define it a little bit more explicit. So there's a Python package in the core of MapChatter called tile matrix. And the possible type of definitions there
are very, very basic. So not at all the whole scope that WMTS, for example, provides. But it can do whatever projection you want as well. It's just not that obvious to use a non-default tiling scheme. And we want to improve on that.
All right, so many more questions? Actually, all right, yeah. I wonder if you can give us some, for example, how long it takes in general?
Well, it takes just a couple of seconds. Okay, sorry. The question was whether I could provide some performance information on, for example, how long the hillshade would take. Yeah, it's just a couple of seconds, yeah.
I don't know for the quick view. I mean, it strongly depends on what the underlying geotiff is, so I usually create the geotiffs with overviews. Because otherwise, if there are no overviews and you try to assume level three, it will take a while.
But then, it's super fast. I mean, for the heavy lifting, we are using Rasterio, which has GEDAL on the bottom, which uses C. So the warping is fast. Then, we have access to NumPy arrays. So if the algorithms are fast, the thing works, yeah.
But as I mentioned, the tiling scheme or the meta tile setting also plays a big role. So usually, it's better to have bigger tiles than to have many, many smaller ones.
So the question was, if there is a big tiff, whether there should be some pre-processing
in order to be used with this library. Short answer, no. So everything that can be read with Rasterio should work with MapChatter as well. Long answer, yes. You might get some more performance if you use internal tiling and overviews.
So the question is whether we use multi-processing to process multiple tiles, yeah. So we use either the multi-processing package, or we can alternatively use billiard, which is the multi-processing fork used by Celery so that we can have a Celery worker use
a spawn multiple process as well. So the question is whether it's possible to distribute it
on multiple workers. Well, we have a wrapper around it, but it's not yet open-sourced, which we might do in the next months. Maybe, maybe not. I don't know. But basically, yeah. So you have your MapChatter workers, and you send them a process, and they will do the same.
We don't distribute the tile processes themselves, because on the large-scale processing, we found out that it's better to cache some of the data, and therefore it makes more sense to have one zone with multiple tiles being processed by one worker
that has cached data. But yeah, there's something in the pipeline, yeah. All right, so I think one last question, depending on your necessity for coffee. All right, so great.
I think thank you very much all for your time. Thank you for all the great presentations. And yes, I guess I see some of you in a little bit after the coffee break. So yeah, thank you all.