
Scalable geospatial processing using dask and mapchete


Formal Metadata

Title
Scalable geospatial processing using dask and mapchete
Number of Parts
156
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Dask is a flexible parallel computing library that seamlessly integrates with popular Python data science tools. With its task graph and parallel computation capabilities, Dask excels at managing large-scale computations both on the local machine and on a computing cluster. Mapchete, an open-source Python library, specializes in parallelizing geospatial raster and vector processing tasks. Its strengths lie in its ability to efficiently tile and process geospatial data, making it a valuable asset for handling vast datasets such as satellite imagery, elevation models, and land cover classifications. This talk delves into the integration of these two technologies, showcasing how their combined capabilities can be used to conduct large-scale processing of geospatial data. It will also show how we at EOX are currently deploying our infrastructure and which challenges we face when using it to process the cloudless satellite mosaics under the EOxCloudless product umbrella.
Transcript: English (auto-generated)
Who of you has already used Dask somewhere? Tried it out, yeah? And two of you have used mapchete? OK, thanks, Tina. Good. So, a short introduction. I am from a company called EOX. We are based in Vienna. That's a larger town next to the Danube river.
My team mainly works on processing large archives of satellite imagery, mainly Sentinel-2. So we are producing exploration-ready products out of it, and cloudless mosaics. Who of you has already heard of the Sentinel-2 cloudless layer?
OK, yeah, thanks. For those of you who didn't know it, you can go to this website, s2maps.eu. There you can view the global 10-meter cloudless mosaic of Sentinel-2.
And you can also use the WMS under non-commercial terms. So you can load it into QGIS and use it as an additional layer, for example. So, what is mapchete? mapchete is an open-source tool written in Python. You can find it on GitHub.
But what is it exactly? Mainly, it's a processing engine, first and foremost. So it runs custom processes on potentially very large geospatial data. And it's also a set of command line tools that help you to achieve this. How does it work? It's very simple. We are using the WMTS tile matrix system
to chop up the space into smaller chunks and process each of the chunks individually. So there's not much magic in there. But yeah, it's a very powerful system. And what's also important, it allows you to save the processing recipe. So once you have a processing output,
and you have the configuration and the code next to it, you can always reproduce what you have processed, which can be quite useful at times. Why is this useful? Well, large or even global data cannot be processed at once. I mean, those of you who deal with raster imagery have probably already had the situation,
more than once, that your laptop feels too small for the data you're going to convert or process. So mapchete should help in achieving large-scale processing. And yeah, all the processing steps and recipes are preserved.
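To make the tiling idea above concrete, here is a small, self-contained sketch of such a tile pyramid. It assumes the global geodetic grid (two tiles by one tile at zoom 0), which is one of the grids mapchete supports; the function names are illustrative and not mapchete's API.

```python
# WMTS-style pyramid: the world is chopped into independently processable
# chunks. Assumes the global geodetic grid (2 x 1 tiles at zoom 0).

def grid_shape(zoom):
    """(rows, cols) of the geodetic tile pyramid at a given zoom level."""
    return 2 ** zoom, 2 ** (zoom + 1)

def tile_count(zoom):
    rows, cols = grid_shape(zoom)
    return rows * cols

def tile_bounds(zoom, row, col):
    """(west, south, east, north) in degrees for one tile."""
    rows, cols = grid_shape(zoom)
    tile_size = 360.0 / cols  # tiles are square in degrees on this grid
    west = -180.0 + col * tile_size
    north = 90.0 - row * tile_size
    return west, north - tile_size, west + tile_size, north
```

Each tile can then be handed to a worker on its own, which is exactly what makes the approach scale.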
So what do we need for this? First and foremost, we need, of course, a process. For example, a hillshade would be a process. And we need a configuration for it. That's basically what points to the process, to the inputs, and to the output format. Very simple. How does a process look? I hope you can see it.
I wasn't able to make it any larger. It should be written in Python. And it can be either a Python file or a module, with the only constraint that it has to have a function called execute. And this execute function should have some input arguments and some keyword arguments.
With that, mapchete will use the configuration and map whatever you define in the configuration to your processing function. And within the function, you can do whatever you want. So you can use familiar tools like NumPy if you're dealing with raster data, or Shapely if you're dealing with vector data.
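As an illustration of the execute convention just described, here is a schematic, self-contained process file. A real mapchete process receives mapchete objects and reads its inputs according to the configuration; plain nested lists stand in for raster data here, and the z_factor parameter is a made-up example.

```python
# Schematic mapchete-style process file. mapchete looks for a function
# named `execute`; input arguments and keyword arguments are mapped from
# the configuration. Nested lists stand in for the raster data a real
# process would read via mapchete.

def execute(dem, z_factor=1.0):
    """Toy process: vertically exaggerate an elevation tile."""
    return [[value * z_factor for value in row] for row in dem]
```

Inside the function you are free to use NumPy, Shapely, or whatever else, as the talk says.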
So it's completely up to you what happens inside. The configuration is also very simple. We are using the YAML syntax. And the only things you have to define are, of course, the process itself, the inputs, the outputs, then some additional data if you want,
and the processing parameters. And with these two things, the process output is perfectly reproducible. In addition to this, what do we need? We need a set of commands, of course. So mapchete ships with a set of command line tools which will help you.
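A configuration in roughly the shape described above might look like this. The key names follow the general structure mentioned in the talk but should be checked against the mapchete documentation; paths and values are placeholders.

```yaml
# Illustrative mapchete configuration; check the docs for the exact schema.
process: hillshade.py          # the file containing execute()
pyramid:
  grid: geodetic               # the WMTS-style tile pyramid to use
zoom_levels:
  min: 0
  max: 12
input:
  dem: path/to/elevation.tif
output:
  format: GTiff
  path: output/
  bands: 1
  dtype: uint8
z_factor: 2.0                  # custom parameter, mapped to execute()
```

Keeping this file next to the process code is what makes the output reproducible.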
The most important one, of course, is execute. If you're finished with your process, you can simply run mapchete execute and point it to the process file. And then you have some additional flags. So if you just want to do a subset of what you're intending to do, then you can add the bounds parameter, for example,
or only process certain zoom levels. Data formats: internally, we are using Rasterio and Fiona. They're both based on GDAL and OGR. So basically everything that can be read by them can be read by mapchete. But we also have some special extensions.
For example, we can read and write Zarr archives. And we have an extension called mapchete EO which allows us to read from satellite archives. The default format is called tile directory. It's a bit of a special one,
because when you're producing a large output that wouldn't fit into one GeoTIFF anymore, you have to find a solution for how to achieve this. And again, here the WMTS system comes to the rescue. So it's basically a tile directory. It's very much like a map cache you would have, but with the one exception
that you're not restricted to using PNGs or JPEGs; you can use GeoTIFFs, for example, or also a vector format like FlatGeobuf. And with GeoTIFF, you have the advantage that you're not restricted to the eight-bit data range; you can use whatever you want, so floats as well.
And you can store global high-resolution output with this. It's also really nice because you can do regional updates. And it allows you, of course, to write and update in parallel, which is important for scalability. So if you have a lot of workers writing to the same output, then you also want the output format
to be able to handle this. If you had a single file, you would have to find some locking mechanism and then have the workers write in sequence, which would not be that efficient. For this tile directory, we worked on a STAC extension.
It's called the tiled-assets extension. And basically, it replaces the URL to each single TIFF with a schema. So you find it like here: you have zooms, rows, and columns.
And since GDAL 3.6, I believe, we also have a GDAL driver. So you can imagine that this STAC JSON, which is already a STAC item, is like a VRT on steroids. If you were using VRTs: a VRT is an XML GDAL format which points to every TIFF file
you want to add to your VRT. The STAC file is more efficient because it doesn't store the single parts, but only the schema to the parts. And because it's a GDAL driver, it can be loaded into QGIS, and even from S3.
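The core idea of the tiled-assets extension can be shown in a few lines: one href template stands in for thousands of per-tile URLs. The template variable names below follow the WMTS URL-template convention; treat the exact spelling as illustrative and check the extension specification.

```python
# One href *template* replaces the list of every single tile URL; a reader
# expands it per tile. Variable names follow the WMTS URL-template style.

def resolve_tile_href(template, tile_matrix, row, col):
    return (template
            .replace("{TileMatrix}", str(tile_matrix))
            .replace("{TileRow}", str(row))
            .replace("{TileCol}", str(col)))

template = "s3://my-bucket/mosaic/{TileMatrix}/{TileRow}/{TileCol}.tif"
```

This is why the STAC file stays small no matter how many tiles the output has.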
So, for example, you would see it like this in QGIS. Let's move on to parallelization. mapchete internally creates tasks and then decides on how to process these tasks. The most simple case would be
no parallelization, of course, so it runs sequentially. It can also do it in parallel using Python's concurrent.futures and its multiprocessing capability. And in addition, it can use Dask. So, what is Dask?
For those of you who don't know it yet, Dask is a Python library for parallel and distributed computing. It's very nice because it replicates the concurrent.futures API, so you can have both of them in your code in parallel, and it abstracts things really nicely.
Also, a nice feature of Dask is that you can have task graphs. So, for example, your tasks can depend on each other, which also helps if you're processing a lot of things. Dask will then help you to efficiently decide which tasks should be done when and where.
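A Dask task graph is, at its core, a mapping from keys to values or to (function, *dependencies) tuples. This toy evaluator, written without Dask itself, shows the idea with a pyramid-style dependency where two base tiles feed one overview tile; a real Dask scheduler additionally decides where and when each task runs.

```python
# Toy Dask-style task graph: each key maps to either a plain value or a
# (function, *dependency_keys) tuple. This mini-evaluator just resolves
# dependencies recursively.

def get(graph, key):
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *deps = task
        return func(*(get(graph, dep) for dep in deps))
    return task

# Tile-pyramid flavour: two base tiles feed one overview tile.
graph = {
    "tile-a": [[1, 1], [1, 1]],
    "tile-b": [[2, 2], [2, 2]],
    "overview": (lambda a, b: sum(map(sum, a + b)) / 8, "tile-a", "tile-b"),
}
```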
And if you look at this Dask graph, it should remind you of something. Yes, exactly, it's a tile pyramid again. So, if you're building a large tile pyramid with a lot of overviews, then you can use the task graphs, for example. That comes in really handy. So, how do we integrate it? As I told you before, we have these three versions:
sequential, concurrent.futures, and Dask. Internally in the code, we of course abstract it away using executor classes. To go into detail, the Dask executor wraps around Dask and handles all the task execution.
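The executor abstraction can be sketched with only the standard library, since Dask deliberately mirrors the concurrent.futures API; a Dask-backed executor would look the same from the caller's side. Class and method names here are illustrative, not mapchete's actual classes.

```python
# One interface, interchangeable backends: the caller does not care whether
# tiles are processed sequentially or in parallel. Only stdlib
# concurrent.futures is used; a Dask executor would slot in the same way.

from concurrent.futures import ThreadPoolExecutor

class SequentialExecutor:
    def map(self, func, items):
        return [func(item) for item in items]

class ConcurrentExecutor:
    def map(self, func, items):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(func, items))

def process_tiles(executor, tiles):
    # Stand-in "process": double each tile value.
    return executor.map(lambda tile: tile * 2, tiles)
```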
Internally, Dask will then connect to a service called the Dask scheduler. It can be on your local machine, in another process, or it can be an external service. We are using mapchete internally, deployed as a service,
and we call this mapchete Hub. So, mapchete Hub is basically the mapchete processing engine at the core, and we are adding a deployment infrastructure around it. This allows us to do asynchronous processing, because you can imagine, if you do a large-scale thing, it will not finish early, so you would need to wait for some time,
for a couple of hours, for example. So, this is why asynchronicity is very important. We use it for EOxCloudless and EOxMaps, and also for some ugly use cases. And the interface we implemented is OGC API Processes; I say that with a caveat, because we did it like two years ago,
and the specification evolved, so we have to review that eventually. But I would say it's like 95% OGC API Processes. So, this is how it works. We have mapchete Hub; at the core,
it has the processing engine, and then a Dask client. It's connected via a REST interface, we're using FastAPI for this, and we deploy it on Kubernetes. In Kubernetes, we have a Dask Gateway, a package maintained by the Dask community,
and through the Dask Gateway, the client can request a Dask cluster. And it works this way: the user, or we ourselves, posts the job. mapchete Hub then looks at the job configuration, the inputs and the outputs, and determines the tasks
it has to do. It internally builds either a Dask graph, if dependencies are required, or applies task streaming. And then, when it's finished with that, it requests a Dask cluster, and Kubernetes will handle all the autoscaling. So, if you have a small job, it will just request one worker; if you have a large job,
it will request whatever you set as the maximum. And of course, it then sends the tasks to the cluster and waits until they're ready. In the meanwhile, it tracks the progress and stores it in a MongoDB in the background. So basically, you post a job, you get an ID back, and then you can poll the job,
and it will tell you the percentage of progress. And in the end, it cleans up and is finished, in the best case. But of course, the best case doesn't always happen. We have been using Dask for three years now; before that, we had a system using Celery.
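The post-a-job, poll-by-ID flow described above reduces to a simple pattern: posting returns an ID immediately, and progress is polled against a status store (MongoDB in the talk, a plain dict here). This is purely illustrative and not mapchete Hub's actual interface.

```python
# Asynchronous job pattern: post a job, get an ID back right away, poll
# for progress while workers report completed tasks into a status store.

import uuid

class JobStore:
    def __init__(self):
        self.jobs = {}

    def post_job(self, total_tasks):
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"done": 0, "total": total_tasks}
        return job_id  # returned immediately; the work happens elsewhere

    def report_done(self, job_id, count=1):
        self.jobs[job_id]["done"] += count

    def progress(self, job_id):
        job = self.jobs[job_id]
        return 100.0 * job["done"] / job["total"]
```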
We really had a steep learning curve and went through a lot of pain in order to get the output ready, because we struggled with a lot of random exceptions. Dask is really not great at providing you useful information about what went wrong. So, it sometimes took us a long time
to figure out what went wrong. What we found out: we had a lot of connection errors, things we cannot do much about, because it's deployed on AWS. And we also got random workers killed; eventually we found out, oops, they were running out of memory. So, yeah, we learned through the years
how we could massage everything so it goes as smoothly as possible. We also found out that large task graphs would cause the cluster to fail. And, very importantly, the memory consumption. So, when you write a mapchete process,
you should keep in mind that it runs on some machine, and if you're initializing a large array in, I don't know, float32, then this machine may break, and Dask may know nothing about it and throw you a random error. So, it's not intelligent enough to point you to the fact that, hey, you're using too much memory. It just says: hey, I've killed the worker because something went wrong.
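The speaker's warning suggests a back-of-envelope check before allocating big arrays in a process: estimate the memory up front rather than letting Dask kill the worker. A minimal sketch, with dtype sizes hard-coded:

```python
# Estimate array memory before allocating it in a worker. A float32 raster
# covering the whole world at 10 m resolution is tens of TiB, which is
# exactly why the data has to be tiled in the first place.

DTYPE_BYTES = {"uint8": 1, "uint16": 2, "float32": 4, "float64": 8}

def array_gib(width, height, bands=1, dtype="float32"):
    return width * height * bands * DTYPE_BYTES[dtype] / 2 ** 30
```

A single 1024x1024 float32 tile is a harmless 4 MiB; the global 10 m raster (roughly 4,000,000 x 2,000,000 pixels) is not.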
It's really annoying, I can tell you. But if everything runs smoothly, then you really have a nice system where you can do large things and great things. So, here you see a Grafana dashboard
with the hardware consumption for our cloudless mosaics. And a nice feature of Dask is also that every Dask scheduler ships with a dashboard. So, if you have the URL of the dashboard, you can log into it and see what's happening on the cluster.
And in the best case, you end up with a cloudless mosaic, in this case of Estonia. So, I was faster than expected because I'm already finished. I would like to point out that we are also having a talk about MuServer later in the afternoon in this very room.
So, thank you very much. All right, thank you for your talk. Now we have the chance to ask some questions.
Thank you very much. I'm interested in these connection errors. We have similar problems as well. Can you give some insight into what they're due to?
Is it due to connections between the file server and the processing node, or what is it about in your case? If only we knew. I mean, in some cases it was, for example, that the task graph was too large
and the specs of the scheduler were too narrow; then the scheduler would simply become unresponsive and you get a connection error. The same in some cases with the workers. As you can imagine, we are reading a lot of data. So, like Andrea showed you before, a lot of Sentinel products at the same time.
So, we have a lot of I/O strain on the workers. This could also cause the workers to become unresponsive. The task scheduler, of course, always checks on the workers and sends out pings: hey, are you alive? And if they don't respond, then yeah. There are some internal retry mechanisms, and we played around with that, but eventually it could always happen.
So, if I understand well, it's rather between the scheduler and the processing nodes that this connection is lost, right? Yeah, or from the client to the scheduler as well, if the scheduler dies. So, basically every part of the whole system can fail,
and if one essential part of the system fails, like the scheduler, then your job is gone and there's nothing you can do about it except retry. But we also managed to max out the AWS S3 request rate sometimes, because when you have like 100 or 200 workers
writing at the same time to S3, then even Amazon crumbles.
Yeah, thanks for the great presentation. You mentioned at the beginning that with mapchete you can do raster and vector data. I'm just wondering if you could give an example of how you used it for a large vector data set. You mentioned Shapely, so would it be things like intersections, buffers, generalizations? Yeah, for our maps, I mean, it's already a couple of years back,
we used it to generate global contour lines, for example. So, we had global, what was it, 30-meter elevation data, and we used this tile-based approach to extract the contour lines and write them into a PostGIS database afterwards. But yeah, of course, the vector capabilities
are a little bit limited due to the fact that you're processing tiles. So, if you have features larger than a tile, they get split up, and this is not always what you want.
Yeah, thank you for this. At the moment, I use, for example, hillshading with gdaldem hillshade, but I saw your example. Do you basically support all the parameters that we can use in gdaldem hillshade? Well, we don't have access
to the internal gdaldem hillshade implementation, so I used a Python-based approach. Are you referring to the three parameters you need? Yeah, of course, there are the ordinary parameters like azimuth, but I also use some other parameters,
like multidirectional and compute edges. Well, you can call the hillshade function multiple times with multiple parameters. Okay. You would have to write a custom process for this, but yeah, it's definitely possible. We do the same; we have a custom hillshade on our maps as well,
and I was using the same approach. Thank you, we'll try it out. Yeah.
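The workaround suggested in the answer, calling a single-direction hillshade several times and averaging the results, can be sketched as follows. The hillshade function here is a stand-in stub so the combination logic is runnable; a real process would plug in an actual hillshade implementation (e.g. in NumPy).

```python
# Approximate a multidirectional hillshade by averaging several
# single-direction hillshades, as suggested in the Q&A.

def hillshade(dem, azimuth, altitude=45.0):
    # Stub: a real implementation would compute illumination from slope
    # and aspect; here we just scale by azimuth so the combination is
    # testable without any raster libraries.
    return [[cell * (azimuth / 360.0) for cell in row] for row in dem]

def multidirectional_hillshade(dem, azimuths=(225, 270, 315, 360)):
    shades = [hillshade(dem, az) for az in azimuths]
    n = len(shades)
    return [
        [sum(shade[r][c] for shade in shades) / n for c in range(len(dem[0]))]
        for r in range(len(dem))
    ]
```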
This tile-based approach for data processing seems really interesting, but for me, there's always the question: is there anything to facilitate the development of algorithms and data processing pipelines when you have something that requires
not only local pixel information, but also the neighborhood around it, or perhaps some global things that need to be calculated on the whole image or some larger region, so that you have spillover of data from one tile to another just to get the result in a single tile? Yeah, that's an excellent question.
I mean, the hillshade example is a good example for this. So yeah, we have, I cannot show you, a parameter called pixelbuffer that you can define when you define your process pyramid, and it would add a buffer of that size to every process tile and clip it away afterwards.
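The pixelbuffer mechanism in miniature: run the neighborhood-dependent operation on the tile plus a margin of extra pixels, then strip the margin so only the tile itself is written. Names are illustrative; in mapchete you set pixelbuffer on the process pyramid as described above.

```python
# Read the tile plus a margin of neighbouring pixels, apply the operation,
# then clip the margin away so only the tile itself is kept.

def with_pixelbuffer(padded_tile, buffer, operation):
    """Apply `operation` to a padded tile, then strip `buffer` px per edge."""
    result = operation(padded_tile)
    return [row[buffer:-buffer] for row in result[buffer:-buffer]]
```

This is enough for neighborhood operations like hillshading; truly global statistics still need a separate pass, as the answer notes.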
I mean, you can use it within reason. Getting the whole image information? Theoretically, you have the path of the image somewhere, and you would have to open it in Rasterio and then get the information out of it. But the near neighborhood, yeah, it's there. Creating the hillshade
was the first use case for this tool, so having the pixelbuffer was also one of the first features. I would have a question. I found it interesting that you used OGC API Processes.
So just out of interest, are you satisfied with the standard? And the things you did differently, would you suggest adding them to the standard? We added one thing for us, which was the posting of processes. The ideas for that were there like two years ago,
but there was no specification for it. I heard that, meanwhile, there is one; I have to review this. But this was the major deviation we had from the standard. All right.
Okay, then thanks again, Joachim. A warm applause for him. Thanks. Thanks.