Distributed tile processing with GeoTrellis and Spark

Video in TIB AV-Portal: Distributed tile processing with GeoTrellis and Spark

Formal Metadata

Title
Distributed tile processing with GeoTrellis and Spark
Alternative Title
Geospatial - Geotrellis and Spark
Title of Series
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2016
Language
English
Production Year
2015

Content Metadata

Subject Area
Confidence interval Java applet Scaling (geometry) Zoom lens Range (statistics) Numbering scheme Database Price index Open set Parallel port Dimensional analysis Data model Preprocessor Computer cluster Different (Kate Ryan album) Single-precision floating-point format Core dump Set (mathematics) Cumulative distribution function Arm Zirkulation <Strömungsmechanik> Mapping Digitizing Software developer Maxima and minima Instance (computer science) Category of being Process (computing) Raster graphics Endliche Modelltheorie Quicksort Spacetime Point (geometry) Three-dimensional space Computer file Maxima and minima Control flow Similarity (geometry) Frequency Cache (computing) Term (mathematics) Subject indexing Authorization Energy level Implementation Electronic data processing Focus (optics) Matching (graph theory) Key (cryptography) Tesselation Weight Code Basis <Mathematik> Group action Symbol table Mathematics Uniform resource locator Explosion Query language Musical ensemble Table (information) Library (computing) Zirkulation <Strömungsmechanik> Code State of matter Multiplication sign Mathematical singularity Combinational logic Set (mathematics) Function (mathematics) Mereology Dressing (medical) Web 2.0 Mathematics Strategy game Semiconductor memory Feldrechner Cuboid Software framework Process (computing) Endliche Modelltheorie Category of being Resource allocation Algebra Library (computing) Area Curve Electric generator Common Language Infrastructure File format Temporal logic Data storage device Numbering scheme Measurement Open set Tessellation Type theory Self-organization Energy level Asynchronous Transfer Mode Functional (mathematics) Implementation Service (economics) Table (information) Image resolution Virtual machine Discrete element method Metadata Read-only memory Average Operator (mathematics) Analytic continuation Mathematical optimization Operations research Projective plane Java applet Ripping Euler angles Software maintenance Subject indexing Hybrid computer Key (cryptography)
Word Mapping Server (computing) Network switching subsystem Zoom lens Tape drive Energy level Library catalog Endliche Modelltheorie Number
Group action Context awareness User interface State of matter Decision theory Analogy Set (mathematics) Function (mathematics) Counting Cumulant Different (Kate Ryan album) Endliche Modelltheorie Multiplication Partition (number theory) Social class Area Beta function Mapping Structural load Token ring Tessellation Cognition Root Befehlsprozessor Hill differential equation Right angle Summierbarkeit Energy level Task (computing) Sinc function Resultant Spacetime Functional (mathematics) Histogram Token ring Computer-generated imagery Virtual machine Maxima and minima Password Mass Graph coloring Number Operator (mathematics) Energy level Cellular automaton Memory management Core dump Total S.A. Library catalog Pell's equation Fermat's Last Theorem Personal digital assistant Calculation Password Bellman equation Video game Musical ensemble Table (information) Identity management Local ring
Suite (music) Service (economics) User interface Zoom lens Mereology Special unitary group Graph coloring Writing Different (Kate Ryan album) Operator (mathematics) Green's function Entropie <Informationstheorie> Source code Absolute value Key (cryptography) Mapping Tesselation Menu (computing) Maxima and minima Total S.A. Tessellation Type theory Arithmetic mean Raster graphics Calculation Order (biology) Summierbarkeit Spacetime
Mapping Vector space Tesselation Different (Kate Ryan album) Source code Special unitary group
Functional (mathematics) Context awareness Mapping Key (cryptography) Computer file Tesselation Computer file Polygon Java applet Special unitary group Metadata Web 2.0 Duality (mathematics) Logic String (computer science) Set (mathematics) Source code Utility software Extension (kinesiology) Form (programming)
Polygon Histogram Mapping Computer file Tesselation Multiplication sign Weight Java applet Discrete element method Geometry Array data structure Web service Video game Extension (kinesiology) Integer
Googol
great and somebody's robber manual I'm the maintainer of a project called you trolls and I'm here to talk to today about a distributed tile processing with due Charles and spark so presented as the motivating questions sort of how do we had we work in very large rested and was a very large I mean I'm rested data that you need to work on a cluster with it will fit into the memory of 1 machine even be finishing arm and also you know if we wanted you could a process this data quickly and usually with this talk at a couple different datasets but under specifically talk today about the NASA next downsampled climate projections open set I saw how we work with this this data set so what is it what is the status of the work there were talking about on these things called global circulation models which are models of a world wide temperature and precipitation and it predicts out had those data points over about a hundred years and to try to measure you know like what climate changes it is happening so there's about an organization called the the intergovernmental panel on climate change and they publish a assess report and about every 5 years that's a sort of a scientific curation of at the output to these models and the data that we have available and by collaborated on by more than a hunter authors were worldwide some of the best minds are working on climate change are working on this on this on this data and we can organize the model data into 3 key categories and there's the different models that are on that come from all over the world different scientific it's institutions and then there's the model ensemble is we take averages of the models I it outputs 3 different datasets there's a temperature max temps remain imprisoned precipitation and this areas that they run over I 1 set is like a historical run from 1950 until 2006 and then and there's these things called our CPs which a carbon emissions scenarios that basis say OK how how's carbon going to be it how is the human race going to deal with carbon emissions over the future of for example here's 3 of the 4 of our cities that are included in the next is that see the R C P 4 . 5 says I carbon emissions will peak in 2000 are 20 40 around 2040 2050 and then go down to a lower than current levels of which would be pretty awesome and P A . 5 says who were just I continue business as usual and I keep up and carbon in the year so the next i downsampled data is the monthly data over the United States I stand symbols is pretty high resolution confidence that the dataset and there's for RBC are included from 2006 to 2009 is available on S 3 publicly as a over 8 thousand that CIA files at about a giga so each of them and we as part of our preprocessing we exported due tiles and so I measure of how how big that was is 15 point 3 terrabytes I mean even for 1 combination of 1 model 1 dataset and in 1 or CP scenario I like a match for 1 of those like 9 . 9 2 gigabytes even if you working by 1 with 1 dataset across that time period of time it's pretty pretty large so what are the workflow for working with this this dataset so the tools that we use like is that on the maintainer of due Charles would you Charles is is it's a scholar library for doing I have a say anything geospatial we have library functionality for doing the vector processing are re projection and duties and stuff but our main focus has been on Rasta processing doing really really fast and optimize um single tile dislike was single-threaded chunking Theressa data doing complex you special operations on metadata and then i it's also framework for doing distributed reciprocity of tile dresses I we have k Our main framework that we've had historically is based on actor and works really well for if you can fit like a roster in a single machine and has a a Corsican paralyze over that I'll rested data and on a single machine but most our current development is focused on I'm integrating with sparked helps swimmers workers later I we have that the people who know what map algebra terms are we have all these local zonal local and global operations and implemented on roster and were currently an incubation allocation Tech so for those you don't spark is i it's fast and it's a really fast I Cluster Compute Engine and its could could be described as again next generation Hadoop research if you do sort of like a larger data processing you know a workers than it on the ball i it's it's really it's a really seen these days I train scholars nice for us because we integrate really well that are but also has binding for Python and Java and for back and we use Accumulo which is a um big table implementation on top of HDFS so we use that and use that to store tiles and you and some indexing like you gives us a really good precision about where we wanna store tiles and be able to do range queries on how do I get tiles out of course store I another great point about cumulo is that it is by Dumais another location that project and you mentioned i which is a skull project and we have to be collaborating with them in the future even more the as so is some strategies working these big rosters I 1 thing is to just Tyler answers given very giant Rasta you just I break it up into the individual tiles and work on the individual tiles that's 1 strategy so those a spatial tiles but you could also have special temporo piles where you would look at the tells as sort of a a a a a stack of tiles in 3 dimensions with mentioned being time and then you can index the style sort of project the two-dimensional or three-dimensional index schemes onto a single dimension by using is what's called a z curve Our hashing in the 2 the sense of on the left we have the two-dimensional z curve that allows us standard of suppose my way and then there the harder to look at a three-dimensional z curve and to Ernesta they allow us to the range queries of continuous and spaces in the three-dimensional space judicial space and not have to agree individual tile ideas I we can actually turn them into a set of arranges the sees this and so spark uh provides this arm idea called of resilient distributed dataset the sort of the core the type and SPARQL and it allows you to look at a time a large dataset across a cluster as sort of a single collection let you do functional transformations on that the so what we provide the spark is a roster RTD and that I arrested decay with key in the key type is is generic and based on the tiling so are the indexing so we had we support does spatial only tile Rastas at war wrasses with armor spatio-temporal key for the use of state space and key was the 1st thing we have to do to get to this said next dataset of into trellis is to these and data loading and so 1st of net CDF is sort of a a tough to work with because it's just all packed into from these large files as the 1st that did which is used similar rest stereo Python code and to tired to take the net CDF of format entirely each of the bands of that a roster into vital if I told you to tiles and that's a really embarrassingly parallel problem you can have it is far up some of the E C 2 instances that the reform as a symbol q service either that takes than its year file downloads a chunks it out in the present another 3 bucket so we don't really need them complex indexing for that sort of and that sort of steps so that occurs on my get hub if you wanna take a look at that of that code and the Mexicans the just the data into cumulo using judicial box so that process is um a little more complicated because we are indexing the tiles I wish to take the digits of S 3 re-project them into Weber characters in 1 of your money map but we mosaic that files into a RTMS tiling scheme so we determine what is your level would the level should fit to and then take the tiles and kind of make up the Web tiles out of the the tiles the incoming and we can cure made up the zoom levels and calculate the index splits to give to accumulate so that the it stores of very efficiently and what we end up with hybrid
little catalog you're and that looks
like this and you can see that I have I C C S M 4 is a the model tape and CP 45 is the carbon emissions scenario where there's we curb carbon emissions i in the future and we see we have a number of zoom levels and the dates of for each of the months for just I just 4 years we wouldn't go and take a look at them on a map I and then use or ICP 85 with which is car emissions that are I'm actually a lot higher and you could see that there's that blue words is colder so but you can see that the ICPD 5 is less cold always take a look at i in 2006 see in August's with that looks like YCP 45 there in this case that that more more hot it is some it looks like this and then the RCP 85 with a higher carbon emissions you see a lot more red 7 this should be enough to convince you to curb your carbon emissions I mean you will live in this world the this world so array so
now I do some life coatings you chose to Ochoa show what was her looks like 2 work with this data not on a map but actually in a in a consul I right so this is on scholar so I apologize if you are familiar scholar hopefully you will follow along and forcing reduces to some some imports of us in the functionality of the Rasta package for you shows that the package In spark by poor Accumulo and of new password token and and then make this spot context implicitly available so 1st oranges connects to Accumulo was season zookeeper output zookeeper kind of keeps track of of working those doing as an hour connected and there is a the catalog attached to the cumulants since cumulant that catalog and in Diageo trolls the idea the catalog is where you load data out of receded into so I can start loading data out of this by doing catalog that load I have to strongly typed the m the key type so that I get RTD actually typed on the space time t and then knows that modeling the CSM for Dutch or CPU 45 analysts work with sea level 5 and well you know I need to put in the later class and that's it the great so we can do the same thing with or CP 85 and now we have these rats oddities that represent all the tiles for that dataset and I by import some of operations here like local arrester operations I can do things like take a difference of those 2 brasses so I take or CP 85 minus RCP 45 great so that this gives me a rat's RTD that represents the work that be the difference between 2 Rastas I have actually done anything exam massive for any results but let's do something easy income and take the minimax of that rested you'll see that spot kicks off and I and I get some results I and we see that it's negative 14 and 18 so this would mean that some in some other cases the RCP 45 the lower cognition scenario but it has cells that are warmer than the corresponding and Rasta that's a B or C P 85 which should be surprising because the different model calculations but what what would be surprising if there were more warmer cells in our CPU 45 then or CP 85 right to work and expecting the more carbon there the hotter than the highest get so this was that I'm currently working on my local machine is like 30 30 gigs I think the without and the the and a 1 band is a it's like it's each band so 1 month state over and is a set of tiles each 2 thousand 512 by 512 but the total I areas is thing is like 7 thousand by 5 thousand roughly had great so what I can do here's just do some like functional calculation the difference to say I if I take a lawful global if operation which takes um a function that says for each value let i if the value is data we were ignored no data values and the it's above 0 which means that the 85 his lecture warmer then assign it 1 or else assign it 0 by making can store that off temporarily great a lot of coding create and and so now I have another otherwise rats RTD where all the tiles are either transformed to 1 or 0 better off of it is warmer or colder as so I can do this I can map into the tiles do it a little temple packing and then take the tylenol I wanna do tiles of 2 2 or a double and take the sum of it so now this gives me a RTD of doubles at just represent the sums of each the tiles and then I could just use some all of them that RTD to get the total sum they'll see that markets often action calculates all that stuff as so we get the hotter value in the hotter cell values is that number and analyst sheep and take the colour values can we get another number and if everything works out then Potamitis colder should be a large number so there are a lot more hotter cells in the RCP 85 scenario then the R 44 the it was yes a local the you always have to you yes so spark those like some spillover stuff so just loads different partitions of the tiles works on them and then throws the like is it does all the memory management for me which is really key because I can't that's a that's a tough problem the basal that's that allow the solver a binary another thing that we could do is actually unstable so instead of looking at just the difference with the negative values of a look at the absolute difference between the values so I can use a local ABS solutions are are actually value function which takes that value each cell and then I could do something like catalog that save I call it given the ularity color death it's at sea level 5 unmitigated the alarm the table to store the tiles Accumulo and then give it adds stiff whom I so this is actually shocking 1st and then the survived to this right and you should be able
see yes see if this is the spot you INEC it actually saving off the the value in so this this is the 36 . 1 gigabytes is where like how big that RTD is I had right now it's also calculating a histogram for the whole entire as because it has been a really useful for them visualizing so decide countless Instagram states that often to metadata and
so now if I restart my service going I We should be picking up that new that layer and be able to look at it which we can also
I saw eye color this a little differently just as a this is easier the lighter green is lower values in the higher green is a larger differences in its z didn't do any pure meaning to the stressor aware so is always in mobile 5 of us of a tragedy zoom out all above great I mentioned I left them out it with defined as a k greater stereo in this part of it it so I would do something a little more complicated say I wanna find the title in the difference and absolute difference that has the most variance so that the the greatest total difference um for that I'm a do on sort of as the sum calculation again but it is something a little different take and we a map of the tile in that Atelidae entirely again I take the sum but keep the tile Aidid and how some of data because I'm actually look at that later so we just space that and so the type great and then what I could do now like and that's called this sums and then I can do the spark already the max operation on any the given an ordering the does not order um that the that this setup all types do the double impact again so this is the idea I we have the tile in the sum and then we say disorder by the sum then and this yeah Suite they give up but and then so this is uh this should be the uh tile idea like was the tile and the the so the spot kicks off is there does the max operation it does the map and the max operation so I get the sum and I get the tiles tile and I get the key of so if we look at that the keys says it's uh 2058 0 3 so March of 2058 we can go ahead and take a look at that
we 58 of 3 whom and you can see that here's a pretty large difference I can come to guess where the tile would be but instead of that I'm actually put it on the map and this will show some vector and
capabilities of Geotrellis have for some important stuff to work with vectors a
and the to to say the great
so I'm importing the re-projection logic added to the judicial logic and then also proj four-second specifically named these are uh the the seriousness I have a utility function just reduce into a file that'll painted on the map of that's that that's there I yeah so the who so yes I wanna find the extent of that key and if we look at the key gives me the daytime but it also gives me this like tile quarter for the keys so we have some metadata that can allow you to turn that into a map extent I and that'll be Weber cater so so-called the McKenna extents and and and so would be death that metadata a that map transform this form the year key so she get aware Mercator stand but I actually want the i in lat long so do w extents I had that reproject from web cater to last long great so now I can just write so that Jason out just turn it to a polygon and then to judges and now that should actually show up on
my map offended all things correctly yet so now we get the extent of the tile in lat long that we can put through web service and is pain on a on a leaflet maps on I think that's all I could I wanted to show you this uh more stuff about calculating isochrones at time for time for that so
thanks be of life could something of the my colleague on is that talk about some the reader that we use to read this you test that we put into S 3 from the nets yeah files
Feedback