Distributed tile processing with GeoTrellis and Spark
Video in TIB AVPortal:
Formal Metadata
Title 
Distributed tile processing with GeoTrellis and Spark

Alternative Title 
Geospatial  Geotrellis and Spark

Title of Series  
Author 

License 
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2016

Language 
English

Production Year 
2015

00:06
Great. My name is Rob Emanuele. I'm the maintainer of a project called GeoTrellis, and I'm here to talk to you today about distributed tile processing with GeoTrellis and Spark. So I'll present some motivating questions: how do we work with very large raster data? By very large I mean raster data that you need to work on with a cluster, because it won't fit into the memory of one machine. And also, how do we process this data quickly?

Usually with this talk I look at a couple of different datasets, but I'm going to talk specifically today about the NASA NEX downscaled climate projections dataset and how we work with it. So what is this dataset? What we're talking about are these things called global circulation models, which are models of worldwide temperature and precipitation. They predict those data points out over about a hundred years, to try to measure what climate change is happening. There's an organization called the Intergovernmental Panel on Climate Change, and they publish an assessment report about every five years that's sort of a scientific curation of the output of these models and the data that we have available. It's collaborated on by more than a hundred authors worldwide; some of the best minds working on climate change are working on this data.

We can organize the model data into three key categories. There are the different models, which come from all over the world, from different scientific institutions, and then there's the model ensemble, where we take averages of the models. It outputs three different datasets: temperature max, temperature min and precipitation. And then there are the scenarios that they run over. One set is a historical run from 1950 until 2006, and then there are these things called RCPs, which are carbon emissions scenarios that basically say, OK, how is the
human race going to deal with carbon emissions in the future? For example, here are three of the four RCPs that are included in the NEX dataset. RCP 4.5 says carbon emissions will peak around 2040 to 2050 and then go down to lower than current levels, which would be pretty awesome, and RCP 8.5 says we just continue business as usual and keep pumping carbon into the air.

The NEX downscaled data is monthly data over the United States. It's downscaled, so it's a pretty high-resolution dataset, and there are four RCPs included, from 2006 to 2099. It's available publicly on S3 as over 8,000 NetCDF files at about a gigabyte or so each. As part of our preprocessing we exported GeoTIFFs, and a measure of how big that was is 15.3 terabytes. Even one combination of one model, one dataset and one RCP scenario is like 9.92 gigabytes, so even if you're working with one dataset across that period of time, it's pretty large.

So what is the workflow for working with this dataset? The tools that we use: as I said, I'm the maintainer of GeoTrellis. GeoTrellis is a Scala library for doing, I'd say, anything geospatial. We have library functionality for doing vector processing, reprojection and things like that, but our main focus has been on raster processing: doing really fast and optimized single-tile, like single-threaded, crunching of raster data, doing complex geospatial operations on that data. It's also a framework for doing distributed processing of tiled rasters. Our main framework historically is based on Akka, and it works really well if you can fit a raster on a single machine that has a lot of cores and can parallelize over that raster data on a single machine. But most of our current development is focused on integrating with Spark, which I'll show more of later.
For the people who know map algebra terms, we have all these local, focal, zonal and global operations implemented on rasters. And we're currently in incubation at LocationTech.

For those who don't know, Spark is a really fast cluster compute engine; it could be described as a next-generation Hadoop MapReduce. If you do larger data processing workflows, it's really popular these days. It's written in Scala, which is nice for us because we integrate really well with it, but it also has bindings for Python and Java.

For the backend we use Accumulo, which is a BigTable implementation on top of HDFS. We use that to store tiles, and using some indexing gives us really good precision about where we want to store tiles and lets us do range queries to get tiles out of storage. Another great point about Accumulo is that it is used by GeoMesa, another LocationTech project, which is a Scala project, and we hope to be collaborating with them even more in the future.

So, some strategies for working with these big rasters. One is to just tile the rasters: given a giant raster, you break it up into individual tiles and work on the individual tiles. Those are spatial tiles, but you could also have spatiotemporal tiles, where you look at the tiles as a stack of tiles in three dimensions, with the third dimension being time. And then you can index these tiles, projecting the two-dimensional or three-dimensional index schemes onto a single dimension, by using what's called a Z-curve. On the left we have the two-dimensional Z-curve, and then there's the harder-to-look-at three-dimensional Z-curve. These allow us to do range queries over contiguous spaces in the two-dimensional or three-dimensional space.
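The Z-curve indexing described here can be sketched in a few lines. This is a minimal illustration in plain Python, not GeoTrellis code: the function names, the 16-bit coordinate width and the choice of which coordinate occupies the lowest bit are all assumptions made for the example. It interleaves the bits of a tile's column, row and (optionally) time step into a single integer, which is the usual way a Z-order (Morton) index is built.

```python
def interleave_bits(coords, bits=16):
    """Interleave the low `bits` bits of each coordinate into one integer.

    With coords = (col, row) this is a 2D Z-order index; with
    coords = (col, row, t) it is the 3D, spatiotemporal variant.
    """
    dims = len(coords)
    index = 0
    for bit in range(bits):
        for d, value in enumerate(coords):
            # Bit `bit` of coordinate `d` lands at position bit * dims + d.
            index |= ((value >> bit) & 1) << (bit * dims + d)
    return index


def z2(col, row):
    """Hypothetical helper: 2D Z-order index of a spatial tile key."""
    return interleave_bits((col, row))


def z3(col, row, t):
    """Hypothetical helper: 3D Z-order index of a space-time tile key."""
    return interleave_bits((col, row, t))


# The four tiles of an aligned 2x2 block map to one contiguous index range,
# so a single range scan retrieves them instead of four point lookups.
block = sorted(z2(c, r) for c in (2, 3) for r in (4, 5))

# The same holds for an aligned 2x2x2 cube of space-time keys.
cube = sorted(z3(c, r, t) for c in (0, 1) for r in (0, 1) for t in (0, 1))
```

Blocks that are not aligned to the curve break into several ranges, which is why a bounding-box query is generally turned into a set of index ranges rather than one.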
We don't have to query individual tile IDs; we can actually turn them into a set of ranges.

Spark provides this idea called a resilient distributed dataset, or RDD, which is sort of the core type in Spark. It allows you to look at a large dataset across a cluster as a single collection, and lets you do functional transformations on it. What we provide on top of Spark is a RasterRDD: an RDD keyed with a key, where the key type is generic and based on the tiling, or the indexing. We support spatial-only tiled rasters, or rasters with a spatiotemporal key, the SpaceTimeKey.

The first thing we have to do to get this NEX dataset into GeoTrellis is the data loading. First off, NetCDF is sort of tough to work with because it's all packed into these large files. So the first thing I did was use some rasterio Python code to take the NetCDF format and tile each of the bands of that raster into GeoTIFFs. That's a really embarrassingly parallel problem: you can just fire up some EC2 instances that read from a Simple Queue Service queue, and each one takes a NetCDF file, downloads it, chunks it out and puts the result in another S3 bucket. We don't really need complex indexing for that sort of step. That code is on my GitHub if you want to take a look at it.

The next step is to ingest the data into Accumulo using GeoTrellis. That process is a little more complicated because we are indexing the tiles. We take the GeoTIFFs off S3 and reproject them into Web Mercator so they can go on a web map, and then we mosaic the files into a TMS tiling scheme: we determine what zoom level the layer should fit to, and then make up the web tiles out of the incoming tiles. Then we can pyramid up the zoom levels and calculate the index splits to give to Accumulo.
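As a rough illustration of the zoom-level and pyramid steps in the ingest, here is a small Python sketch. The talk does not say which rule GeoTrellis uses to choose the zoom level, so the rule below (the first zoom whose pixels are at least as fine as the source cells) is an assumption, as are the 256-pixel tile size and all function names.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)
WORLD_WIDTH_M = 2 * math.pi * EARTH_RADIUS_M


def resolution_at_zoom(zoom, tile_size=256):
    """Metres per pixel at the equator for a given zoom level."""
    return WORLD_WIDTH_M / (tile_size * 2 ** zoom)


def zoom_for_resolution(cell_size_m, tile_size=256, max_zoom=30):
    """Smallest zoom whose pixels are at least as fine as the source cells."""
    for zoom in range(max_zoom + 1):
        if resolution_at_zoom(zoom, tile_size) <= cell_size_m:
            return zoom
    return max_zoom


def parent_key(zoom, col, row):
    """Key of the tile one level up the pyramid that contains this tile."""
    return zoom - 1, col // 2, row // 2


# A source raster with roughly 1 km cells fits zoom level 8 under this rule.
zoom = zoom_for_resolution(1000.0)

# Pyramiding: four neighbouring tiles at zoom 5 share one parent at zoom 4.
parents = {parent_key(5, c, r) for c in (8, 9) for r in (12, 13)}
```

Building the pyramid then amounts to grouping tiles by `parent_key` and resampling each group of four into one tile at the coarser level.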
That way it stores the tiles very efficiently. And what we end up with is a
09:28
little catalog here, and that looks
09:32
like this. You can see that CCSM4 is the model type, and RCP 4.5 is the carbon emissions scenario where we curb carbon emissions in the future. We see we have a number of zoom levels and the dates for each of the months, for just four years. We can go and take a look at them on a map. And then there's RCP 8.5, which is carbon emissions that are actually a lot higher. You can see that blue is colder, and you can see that RCP 8.5 is less cold. Let's take a look at August 2006. With RCP 4.5 (in this case, the more red, the hotter it is) it looks like this, and then with RCP 8.5, with higher carbon emissions, you see a lot more red. So this should be enough to convince you to curb your carbon emissions, if you want to live in this world and not this world. All right.
10:46
Now I'll do some live coding to show what it looks like to work with this data, not on a map but actually in a console. All right, so this is in Scala, and I apologize if you aren't familiar with Scala; hopefully you can follow along. The first thing I do is some imports: the functionality of the raster package from GeoTrellis, the Spark package, Accumulo and a new password token. Then I make the SparkContext implicitly available.

The first thing to do is connect to Accumulo, which uses ZooKeeper; ZooKeeper keeps track of what the workers are doing. Now we're connected, and there is a catalog attached to that Accumulo instance. In GeoTrellis, the catalog is where you load data out of and save data into. So I can start loading data out of it by doing catalog.load. I have to strongly type the key type, so that I get an RDD typed on the SpaceTimeKey, and then it needs to know that the model is ccsm4-rcp45, and let's work with zoom level 5, and, well, I need to put in the layer class. And that's it. Great. We can do the same thing with RCP 8.5, and now we have these RasterRDDs that represent all the tiles for that dataset.

If I import some operations here, like local raster operations, I can do things like take the difference of those two rasters, so I take RCP 8.5 minus RCP 4.5. Great. This gives me a RasterRDD that represents the difference between the two rasters. I haven't actually done anything yet or asked for any results, so let's do something easy and take the min and max of that raster. You'll see that Spark kicks off and I get some results: we see that it's negative 14 and 18. This means that in some cases RCP 4.5, the lower carbon emissions scenario, has cells that are warmer than the corresponding cells in the raster for RCP 8.5, which shouldn't be surprising, because they're different model calculations.
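The local subtraction and min/max steps can be mimicked on a toy scale without Spark. In this sketch a layer is just a Python dictionary from a hypothetical (col, row, month) key to a small grid of values; the keys, the temperatures and the function names are invented for illustration and are not the GeoTrellis API or real NEX data.

```python
# Toy stand-ins for two keyed tile layers: {(col, row, "YYYY-MM"): 2D grid}.
rcp45 = {
    (0, 0, "2006-08"): [[20.0, 21.5], [19.0, 22.0]],
    (1, 0, "2006-08"): [[18.0, 18.5], [17.5, 19.0]],
}
rcp85 = {
    (0, 0, "2006-08"): [[21.0, 21.0], [20.0, 24.0]],
    (1, 0, "2006-08"): [[18.0, 20.5], [17.0, 19.0]],
}


def local_subtract(left, right):
    """Cell-wise left - right for every key present in both layers."""
    result = {}
    for key in left.keys() & right.keys():
        result[key] = [
            [a - b for a, b in zip(row_l, row_r)]
            for row_l, row_r in zip(left[key], right[key])
        ]
    return result


def min_max(layer):
    """Smallest and largest cell value across all tiles of a layer."""
    cells = [v for tile in layer.values() for row in tile for v in row]
    return min(cells), max(cells)


diff = local_subtract(rcp85, rcp45)
lowest, highest = min_max(diff)
```

A negative minimum, as in the talk, means that at least one cell is warmer under the lower-emissions layer than under the higher-emissions one.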
But what would be surprising is if there were more warmer cells in RCP 4.5 than in RCP 8.5, right? We're expecting that the more carbon there is, the hotter it gets. I'm currently working on my local machine with this, and it's like 30 gigs, I think. One band, so one month's data, is a set of tiles, each 512 by 512, and the total area is something like 7,000 by 5,000, roughly.

What I can do here is some functional calculation on the difference. I take a local if operation, which takes a function that says, for each value: if the value is data (we ignore no-data values) and it's above zero, which means that RCP 8.5 is actually warmer, then assign it 1, or else assign it 0. Let me store that off temporarily. Great; it's a lot of live coding. So now I have another RasterRDD where all the tiles are transformed to either 1 or 0 based on whether it is warmer or colder. I can then map over the tiles, do a little tuple unpacking, convert the tile to doubles and take the sum of it. This gives me an RDD of doubles that just represent the sums of each of the tiles, and then I can sum all of that RDD to get the total sum. You'll see that Spark kicks off an action and calculates all that stuff. So the count of hotter cells is that number, and in the same way we can take the colder values, and we get another number. If everything works out, then hotter minus colder should be a large number. So there are a lot more hotter cells in the RCP 8.5 scenario than in RCP 4.5.

Yes, so Spark does some spillover stuff: it just loads different partitions of the tiles, works on them, and then throws them away. It does all the memory management for me, which is really key, because that's a tough problem.
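The counting trick just described (map each cell to 1 or 0, sum per tile, then sum the per-tile sums) can be sketched as follows. This is a plain-Python illustration under stated assumptions: no-data cells are represented as `None`, the difference values are made up, and `local_if` and `total` are hypothetical names rather than GeoTrellis functions.

```python
NODATA = None  # assumption: missing cells are represented as None here

# Toy difference layer (higher-emissions minus lower-emissions), keyed by tile.
diff = {
    (0, 0): [[1.0, -0.5], [NODATA, 2.0]],
    (1, 0): [[0.0, 2.0], [-0.5, 3.0]],
}


def local_if(layer, predicate, if_true, if_false):
    """Replace each cell by if_true or if_false depending on the predicate."""
    return {
        key: [[if_true if predicate(v) else if_false for v in row] for row in tile]
        for key, tile in layer.items()
    }


def is_warmer(value):
    return value is not NODATA and value > 0


def is_colder(value):
    return value is not NODATA and value < 0


def total(layer):
    """Sum every tile first, then sum the per-tile sums."""
    per_tile = [sum(sum(row) for row in tile) for tile in layer.values()]
    return sum(per_tile)


hotter = total(local_if(diff, is_warmer, 1, 0))
colder = total(local_if(diff, is_colder, 1, 0))
```

Summing per tile before combining mirrors how the work splits across a cluster: each partition reduces its own tiles, and only the small per-tile numbers need to be brought together.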
Another thing that we could do: instead of looking at just the difference, with the negative values, let's look at the absolute difference between the values. I can use a local abs, the absolute value function, which takes the absolute value of each cell. And then I could do something like catalog.save: I give it a layer name, it's at zoom level 5, I give it the Accumulo table to store the tiles in, and then give it the abs diff. So this is actually calculating first and then saving the result, and you should be able to
16:50
see, yes, this is the Spark UI, and it's actually saving off the values. This 36.1 gigabytes is how big that RDD is. Right now it's also calculating a histogram for the whole entire raster, because that has been really useful for visualizing. So it calculates the histogram and saves that off to the metadata.
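A small sketch of the two steps mentioned here, the cell-wise absolute value and a histogram over the whole layer. The talk does not describe how the histogram is binned, so the fixed bin width below is an assumption, and the data and function names are illustrative only.

```python
from collections import Counter

# Toy difference layer with negative and positive cells, keyed by tile.
diff = {
    (0, 0): [[1.0, -0.5], [-2.5, 2.0]],
    (1, 0): [[0.0, 2.0], [-0.5, 3.0]],
}


def local_abs(layer):
    """Cell-wise absolute value of every tile in the layer."""
    return {key: [[abs(v) for v in row] for row in tile] for key, tile in layer.items()}


def histogram(layer, bin_width=1.0):
    """Count cells per bin, where bin n covers [n * width, (n + 1) * width)."""
    counts = Counter()
    for tile in layer.values():
        for row in tile:
            for value in row:
                counts[int(value // bin_width)] += 1
    return dict(sorted(counts.items()))


abs_diff = local_abs(diff)
hist = histogram(abs_diff)
```

Storing such counts alongside the layer lets a renderer pick colour breaks without rescanning every tile.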
17:15
So now if I restart my service, we should be picking up that new layer and be able to look at it, which we can.
17:30
I colored this a little differently, just because it's easier: the lighter green is lower values and the darker green is larger differences. I didn't do any pyramiding of this raster, so it's only at zoom level 5; if I try to zoom out, it'll go away. Great.

So now I want to do something a little more complicated. Say I want to find the tile in the absolute difference that has the most variance, that is, the greatest total difference. For that I'm going to do sort of the same sum calculation again, but a little differently. I map over the tiles and unpack the key and the tile again, and I take the sum, but I keep the tile key along with the sum of the data, because I actually want to look at that later. So we just specify that and the type. Great. Let's call this sums, and then I can do the Spark RDD max operation. I need to give it an ordering, since it doesn't know how to order that tuple type, so I do the tuple unpacking again: we have the tile key and the sum, and we say, order by the sum. And so this should give us the tile key and the sum. So Spark kicks off; it does the map and the max operation, and I get the sum and I get the key of the tile. If we look at the key, it says it's 2058-03, so March of 2058. We can go ahead and take a look at that.
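The "keep the key next to the sum, then take the max ordered by the sum" step looks like this in miniature. The layer below is invented for illustration: the month in the first key echoes the one found in the talk, but the column and row numbers and all cell values are made up.

```python
# Toy absolute-difference layer keyed by (col, row, "YYYY-MM").
abs_diff = {
    (3, 7, "2058-03"): [[4.0, 5.5], [6.0, 3.5]],
    (3, 8, "2058-03"): [[1.0, 0.5], [2.0, 1.5]],
    (4, 7, "2041-11"): [[2.0, 2.5], [3.0, 0.5]],
}

# Map each (key, tile) pair to (key, sum of the tile's cells).
sums = [(key, sum(sum(row) for row in tile)) for key, tile in abs_diff.items()]

# Take the maximum, ordering the pairs by the sum rather than by the key.
best_key, best_sum = max(sums, key=lambda pair: pair[1])
```

Supplying the ordering explicitly is what makes this work on (key, value) pairs: without it the comparison would fall back to ordering by key.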
19:38
So, 2058-03: you can see that there's a pretty large difference here. I could try to guess where the tile would be, but instead of that I'm actually going to put it on the map, and this will show some vector
19:53
capabilities of GeoTrellis. I have to import some stuff to work with vectors.
19:59
OK, great.
20:04
So I'm importing the reprojection logic and the GeoJSON logic, and then also, from proj4, specifically the named CRSs. I have a utility function that just writes to a file that'll be painted on the map that's there. So I want to find the extent of that key. If we look at the key, it gives me the date-time, but it also gives me this tile coordinate for the key, and we have some metadata that allows you to turn that into a map extent, and that'll be in Web Mercator. So I'll call that the Mercator extent, and it's the metadata's map transform applied to the key. So I get a Web Mercator extent, but I actually want it in lat/long, so I do the lat/long extent: I take that and reproject from Web Mercator to lat/long. Great. So now I can just write that GeoJSON out: I turn it into a polygon and then to GeoJSON, and that should actually show up on
21:24
my map, if I did all things correctly. Yes. So now we get the extent of the tile in lat/long, which we can put through a web service and paint on a Leaflet map. I think that's all I wanted to show you. There's more stuff about calculating isochrones, but there's no time for that.
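The chain from tile key to Web Mercator extent to lat/long polygon to GeoJSON can be sketched with the standard spherical Mercator formulas. In the talk the extent comes from the layer metadata's map transform; this sketch instead assumes a standard web tile layout with row 0 at the northern edge, and all function names are hypothetical.

```python
import json
import math

EARTH_RADIUS_M = 6378137.0
HALF_WORLD_M = math.pi * EARTH_RADIUS_M  # Web Mercator spans +/- this value


def tile_extent_mercator(zoom, col, row):
    """(xmin, ymin, xmax, ymax) in Web Mercator; row 0 is the northern edge."""
    size = 2 * HALF_WORLD_M / 2 ** zoom
    xmin = -HALF_WORLD_M + col * size
    ymax = HALF_WORLD_M - row * size
    return xmin, ymax - size, xmin + size, ymax


def mercator_to_lonlat(x, y):
    """Inverse spherical Mercator: metres to degrees of longitude/latitude."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat


def extent_to_geojson(extent):
    """Reproject the extent's corners and emit a closed GeoJSON Polygon."""
    xmin, ymin, xmax, ymax = extent
    corners = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    ring = [list(mercator_to_lonlat(x, y)) for x, y in corners]
    return json.dumps({"type": "Polygon", "coordinates": [ring]})


# The north-west quadrant of the world at zoom level 1.
geojson = extent_to_geojson(tile_extent_mercator(1, 0, 0))
```

The resulting string can be handed to a web map as an overlay; GeoJSON expects longitude first, which is why each corner is emitted as [lon, lat].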
21:47
Thanks. My colleague is going to talk about the reader that we use to read the GeoTIFFs that we put into S3 from the NetCDF files.