Bestand wählen
Merken

Distributed tile processing with GeoTrellis and Spark

Zitierlink des Filmsegments
Embed Code

Automatisierte Medienanalyse

Beta
Erkannte Entitäten
Sprachtranskript
great and somebody's robber manual I'm the maintainer of a project called you trolls and I'm here to talk to today about a distributed tile processing with due Charles and spark so presented as the motivating questions sort of how do we had we work in very large rested and was a very large I mean I'm rested data that you need to work on a cluster with it will fit into the memory of 1 machine even be finishing arm and also you know if we wanted you could a process this data quickly and usually with this talk at a couple different datasets but under specifically talk today about the NASA next downsampled climate projections open set I saw how we work with this this data set so what is it what is the status of the work there were talking about on these things called global circulation models which are models of a world wide temperature and precipitation and it predicts out had those data points over about a hundred years and to try to measure you know like what climate changes it is happening so there's about an organization called the the intergovernmental panel on climate change and they publish a assess report and about every 5 years that's a sort of a scientific curation of at the output to these models and the data that we have available and by collaborated on by more than a hunter authors were worldwide some of the best minds are working on climate change are working on this on this on this data and we can organize the model data into 3 key categories and there's the different models that are on that come from all over the world different scientific it's institutions and then there's the model ensemble is we take averages of the models I it outputs 3 different datasets there's a temperature max temps remain imprisoned precipitation and this areas that they run over I 1 set is like a historical run from 1950 until 2006 and then and there's these things called our CPs which a carbon emissions scenarios that basis say OK how how's carbon going to be it how is the human race going to deal with carbon emissions over the future of for example here's 3 of the 4 of our cities that are included in the next is that see the R C P 4 . 5 says I carbon emissions will peak in 2000 are 20 40 around 2040 2050 and then go down to a lower than current levels of which would be pretty awesome and P A . 5 says who were just I continue business as usual and I keep up and carbon in the year so the next i downsampled data is the monthly data over the United States I stand symbols is pretty high resolution confidence that the dataset and there's for RBC are included from 2006 to 2009 is available on S 3 publicly as a over 8 thousand that CIA files at about a giga so each of them and we as part of our preprocessing we exported due tiles and so I measure of how how big that was is 15 point 3 terrabytes I mean even for 1 combination of 1 model 1 dataset and in 1 or CP scenario I like a match for 1 of those like 9 . 9 2 gigabytes even if you working by 1 with 1 dataset across that time period of time it's pretty pretty large so what are the workflow for working with this this dataset so the tools that we use like is that on the maintainer of due Charles would you Charles is is it's a scholar library for doing I have a say anything geospatial we have library functionality for doing the vector processing are re projection and duties and stuff but our main focus has been on Rasta processing doing really really fast and optimize um single tile dislike was single-threaded chunking Theressa data doing complex you special operations on metadata and then i it's also framework for doing distributed reciprocity of tile dresses I we have k Our main framework that we've had historically is based on actor and works really well for if you can fit like a roster in a single machine and has a a Corsican paralyze over that I'll rested data and on a single machine but most our current development is focused on I'm integrating with sparked helps swimmers workers later I we have that the people who know what map algebra terms are we have all these local zonal local and global operations and implemented on roster and were currently an incubation allocation Tech so for those you don't spark is i it's fast and it's a really fast I Cluster Compute Engine and its could could be described as again next generation Hadoop research if you do sort of like a larger data processing you know a workers than it on the ball i it's it's really it's a really seen these days I train scholars nice for us because we integrate really well that are but also has binding for Python and Java and for back and we use Accumulo which is a um big table implementation on top of HDFS so we use that and use that to store tiles and you and some indexing like you gives us a really good precision about where we wanna store tiles and be able to do range queries on how do I get tiles out of course store I another great point about cumulo is that it is by Dumais another location that project and you mentioned i which is a skull project and we have to be collaborating with them in the future even more the as so is some strategies working these big rosters I 1 thing is to just Tyler answers given very giant Rasta you just I break it up into the individual tiles and work on the individual tiles that's 1 strategy so those a spatial tiles but you could also have special temporo piles where you would look at the tells as sort of a a a a a stack of tiles in 3 dimensions with mentioned being time and then you can index the style sort of project the two-dimensional or three-dimensional index schemes onto a single dimension by using is what's called a z curve Our hashing in the 2 the sense of on the left we have the two-dimensional z curve that allows us standard of suppose my way and then there the harder to look at a three-dimensional z curve and to Ernesta they allow us to the range queries of continuous and spaces in the three-dimensional space judicial space and not have to agree individual tile ideas I we can actually turn them into a set of arranges the sees this and so spark uh provides this arm idea called of resilient distributed dataset the sort of the core the type and SPARQL and it allows you to look at a time a large dataset across a cluster as sort of a single collection let you do functional transformations on that the so what we provide the spark is a roster RTD and that I arrested decay with key in the key type is is generic and based on the tiling so are the indexing so we had we support does spatial only tile Rastas at war wrasses with armor spatio-temporal key for the use of state space and key was the 1st thing we have to do to get to this said next dataset of into trellis is to these and data loading and so 1st of net CDF is sort of a a tough to work with because it's just all packed into from these large files as the 1st that did which is used similar rest stereo Python code and to tired to take the net CDF of format entirely each of the bands of that a roster into vital if I told you to tiles and that's a really embarrassingly parallel problem you can have it is far up some of the E C 2 instances that the reform as a symbol q service either that takes than its year file downloads a chunks it out in the present another 3 bucket so we don't really need them complex indexing for that sort of and that sort of steps so that occurs on my get hub if you wanna take a look at that of that code and the Mexicans the just the data into cumulo using judicial box so that process is um a little more complicated because we are indexing the tiles I wish to take the digits of S 3 re-project them into Weber characters in 1 of your money map but we mosaic that files into a RTMS tiling scheme so we determine what is your level would the level should fit to and then take the tiles and kind of make up the Web tiles out of the the tiles the incoming and we can cure made up the zoom levels and calculate the index splits to give to accumulate so that the it stores of very efficiently and what we end up with hybrid
little catalog you're and that looks
like this and you can see that I have I C C S M 4 is a the model tape and CP 45 is the carbon emissions scenario where there's we curb carbon emissions i in the future and we see we have a number of zoom levels and the dates of for each of the months for just I just 4 years we wouldn't go and take a look at them on a map I and then use or ICP 85 with which is car emissions that are I'm actually a lot higher and you could see that there's that blue words is colder so but you can see that the ICPD 5 is less cold always take a look at i in 2006 see in August's with that looks like YCP 45 there in this case that that more more hot it is some it looks like this and then the RCP 85 with a higher carbon emissions you see a lot more red 7 this should be enough to convince you to curb your carbon emissions I mean you will live in this world the this world so array so
now I do some life coatings you chose to Ochoa show what was her looks like 2 work with this data not on a map but actually in a in a consul I right so this is on scholar so I apologize if you are familiar scholar hopefully you will follow along and forcing reduces to some some imports of us in the functionality of the Rasta package for you shows that the package In spark by poor Accumulo and of new password token and and then make this spot context implicitly available so 1st oranges connects to Accumulo was season zookeeper output zookeeper kind of keeps track of of working those doing as an hour connected and there is a the catalog attached to the cumulants since cumulant that catalog and in Diageo trolls the idea the catalog is where you load data out of receded into so I can start loading data out of this by doing catalog that load I have to strongly typed the m the key type so that I get RTD actually typed on the space time t and then knows that modeling the CSM for Dutch or CPU 45 analysts work with sea level 5 and well you know I need to put in the later class and that's it the great so we can do the same thing with or CP 85 and now we have these rats oddities that represent all the tiles for that dataset and I by import some of operations here like local arrester operations I can do things like take a difference of those 2 brasses so I take or CP 85 minus RCP 45 great so that this gives me a rat's RTD that represents the work that be the difference between 2 Rastas I have actually done anything exam massive for any results but let's do something easy income and take the minimax of that rested you'll see that spot kicks off and I and I get some results I and we see that it's negative 14 and 18 so this would mean that some in some other cases the RCP 45 the lower cognition scenario but it has cells that are warmer than the corresponding and Rasta that's a B or C P 85 which should be surprising because the different model calculations but what what would be surprising if there were more warmer cells in our CPU 45 then or CP 85 right to work and expecting the more carbon there the hotter than the highest get so this was that I'm currently working on my local machine is like 30 30 gigs I think the without and the the and a 1 band is a it's like it's each band so 1 month state over and is a set of tiles each 2 thousand 512 by 512 but the total I areas is thing is like 7 thousand by 5 thousand roughly had great so what I can do here's just do some like functional calculation the difference to say I if I take a lawful global if operation which takes um a function that says for each value let i if the value is data we were ignored no data values and the it's above 0 which means that the 85 his lecture warmer then assign it 1 or else assign it 0 by making can store that off temporarily great a lot of coding create and and so now I have another otherwise rats RTD where all the tiles are either transformed to 1 or 0 better off of it is warmer or colder as so I can do this I can map into the tiles do it a little temple packing and then take the tylenol I wanna do tiles of 2 2 or a double and take the sum of it so now this gives me a RTD of doubles at just represent the sums of each the tiles and then I could just use some all of them that RTD to get the total sum they'll see that markets often action calculates all that stuff as so we get the hotter value in the hotter cell values is that number and analyst sheep and take the colour values can we get another number and if everything works out then Potamitis colder should be a large number so there are a lot more hotter cells in the RCP 85 scenario then the R 44 the it was yes a local the you always have to you yes so spark those like some spillover stuff so just loads different partitions of the tiles works on them and then throws the like is it does all the memory management for me which is really key because I can't that's a that's a tough problem the basal that's that allow the solver a binary another thing that we could do is actually unstable so instead of looking at just the difference with the negative values of a look at the absolute difference between the values so I can use a local ABS solutions are are actually value function which takes that value each cell and then I could do something like catalog that save I call it given the ularity color death it's at sea level 5 unmitigated the alarm the table to store the tiles Accumulo and then give it adds stiff whom I so this is actually shocking 1st and then the survived to this right and you should be able
see yes see if this is the spot you INEC it actually saving off the the value in so this this is the 36 . 1 gigabytes is where like how big that RTD is I had right now it's also calculating a histogram for the whole entire as because it has been a really useful for them visualizing so decide countless Instagram states that often to metadata and
so now if I restart my service going I We should be picking up that new that layer and be able to look at it which we can also
I saw eye color this a little differently just as a this is easier the lighter green is lower values in the higher green is a larger differences in its z didn't do any pure meaning to the stressor aware so is always in mobile 5 of us of a tragedy zoom out all above great I mentioned I left them out it with defined as a k greater stereo in this part of it it so I would do something a little more complicated say I wanna find the title in the difference and absolute difference that has the most variance so that the the greatest total difference um for that I'm a do on sort of as the sum calculation again but it is something a little different take and we a map of the tile in that Atelidae entirely again I take the sum but keep the tile Aidid and how some of data because I'm actually look at that later so we just space that and so the type great and then what I could do now like and that's called this sums and then I can do the spark already the max operation on any the given an ordering the does not order um that the that this setup all types do the double impact again so this is the idea I we have the tile in the sum and then we say disorder by the sum then and this yeah Suite they give up but and then so this is uh this should be the uh tile idea like was the tile and the the so the spot kicks off is there does the max operation it does the map and the max operation so I get the sum and I get the tiles tile and I get the key of so if we look at that the keys says it's uh 2058 0 3 so March of 2058 we can go ahead and take a look at that
we 58 of 3 whom and you can see that here's a pretty large difference I can come to guess where the tile would be but instead of that I'm actually put it on the map and this will show some vector and
capabilities of Geotrellis have for some important stuff to work with vectors a
and the to to say the great
so I'm importing the re-projection logic added to the judicial logic and then also proj four-second specifically named these are uh the the seriousness I have a utility function just reduce into a file that'll painted on the map of that's that that's there I yeah so the who so yes I wanna find the extent of that key and if we look at the key gives me the daytime but it also gives me this like tile quarter for the keys so we have some metadata that can allow you to turn that into a map extent I and that'll be Weber cater so so-called the McKenna extents and and and so would be death that metadata a that map transform this form the year key so she get aware Mercator stand but I actually want the i in lat long so do w extents I had that reproject from web cater to last long great so now I can just write so that Jason out just turn it to a polygon and then to judges and now that should actually show up on
my map offended all things correctly yet so now we get the extent of the tile in lat long that we can put through web service and is pain on a on a leaflet maps on I think that's all I could I wanted to show you this uh more stuff about calculating isochrones at time for time for that so
thanks be of life could something of the my colleague on is that talk about some the reader that we use to read this you test that we put into S 3 from the nets yeah files
Betriebsmittelverwaltung
Offene Menge
Extrempunkt
Extrempunkt
Raum-Zeit
Metadaten
Code
Gruppe <Mathematik>
Temporale Logik
Offene Abbildung
Zirkulation <Strömungsmechanik>
Analytische Fortsetzung
Caching
Automatische Indexierung
Schlüsselverwaltung
Kategorie <Mathematik>
Singularität <Mathematik>
Dienst <Informatik>
Generator <Informatik>
Menge
Digitalisierer
Festspeicher
Dimension 3
Programmbibliothek
Instantiierung
Tabelle <Informatik>
Algebraisches Modell
Subtraktion
Selbst organisierendes System
Nummerung
Datenhaltung
Virtuelle Maschine
CLI
Spannweite <Stochastik>
Informationsmodellierung
Bereichsschätzung
Datentyp
Programmbibliothek
Verband <Mathematik>
Rippen <Informatik>
Parallele Schnittstelle
Tabelle <Informatik>
Prozess <Physik>
Datenmodell
Indexberechnung
Symboltabelle
Elektronische Publikation
Menge
Zirkulation <Strömungsmechanik>
Parkettierung
Bitmap-Graphik
Punkt
Prozess <Physik>
Euler-Winkel
Minimierung
Applet
Übergang
Vektorrechner
Mehrrechnersystem
Gruppentheorie
Datenverarbeitung
Kurvenanpassung
Einflussgröße
Bildauflösung
Funktion <Mathematik>
Nichtlinearer Operator
Lineares Funktional
Präprozessor
Prozess <Informatik>
Applet
Abfrage
Übergang
Ähnlichkeitsgeometrie
Nummerung
Zoom
Frequenz
Softwarewartung
Diskrete-Elemente-Methode
Automatische Indexierung
ATM
Strategisches Spiel
Dateiformat
Kategorie <Mathematik>
Projektive Ebene
URL
Schlüsselverwaltung
Aggregatzustand
Explosion <Stochastik>
Gewicht <Mathematik>
Quader
Hausdorff-Dimension
Schaltnetz
Tesselation
Implementierung
ROM <Informatik>
Term
Code
Framework <Informatik>
Benutzerbeteiligung
Mittelwert
Operations Research
Speicher <Informatik>
Hybridrechner
Softwareentwickler
Implementierung
Autorisierung
Matching <Graphentheorie>
Einfache Genauigkeit
Mathematisierung
Fokalpunkt
Quick-Sort
Mapping <Computergraphik>
Flächeninhalt
Mereologie
Basisvektor
Speicherabzug
Modelltheorie
Mapping <Computergraphik>
Home location register
Server
Informationsmodellierung
Magnetbandlaufwerk
Zahlenbereich
Wort <Informatik>
Online-Katalog
Zoom
Übergang
Resultante
Gewichtete Summe
Kumulante
Benutzeroberfläche
Extrempunkt
Spezialrechner
Gruppe <Mathematik>
Speicherabzug
Funktion <Mathematik>
Lineares Funktional
Nichtlinearer Operator
Stellenring
Ruhmasse
Übergang
Kontextbezogenes System
Rechnen
Entscheidungstheorie
Token-Ring
Histogramm
Informationsverarbeitung
Menge
Rechter Winkel
Wurzel <Mathematik>
Tabelle <Informatik>
Aggregatzustand
Subtraktion
Multiplikation
Total <Mathematik>
Klasse <Mathematik>
Gruppenoperation
Tesselation
Zellularer Automat
Zahlenbereich
Online-Katalog
Identitätsverwaltung
Zentraleinheit
Virtuelle Maschine
Informationsmodellierung
Task
Fermatsche Vermutung
Zählen
Passwort
Passwort
Minkowski-Metrik
Videospiel
Pell-Gleichung
sinc-Funktion
Token-Ring
Partitionsfunktion
Mapping <Computergraphik>
Bellmansches Optimalitätsprinzip
Flächeninhalt
Last
Betafunktion
Analogieschluss
Hill-Differentialgleichung
Speicherverwaltung
Kantenfärbung
Bitmap-Graphik
Subtraktion
Gewichtete Summe
Total <Mathematik>
Extrempunkt
Tesselation
Benutzeroberfläche
Raum-Zeit
Datentyp
Nichtlinearer Operator
Suite <Programmpaket>
Schreiben <Datenverarbeitung>
Green-Funktion
Programm/Quellcode
Zoom
Rechnen
Mapping <Computergraphik>
Arithmetisches Mittel
Dienst <Informatik>
Betrag <Mathematik>
Parkettierung
Mereologie
Entropie
Kantenfärbung
Ordnung <Mathematik>
Schlüsselverwaltung
Mapping <Computergraphik>
Subtraktion
Parkettierung
Programm/Quellcode
Vektorraum
Lineares Funktional
Dualitätstheorie
Elektronische Publikation
Programm/Quellcode
Softwarewerkzeug
Applet
Elektronische Publikation
Kontextbezogenes System
Polygon
Mathematische Logik
Menge
Zeichenkette
Mapping <Computergraphik>
Metadaten
Benutzerbeteiligung
Bildschirmmaske
Parkettierung
Maßerweiterung
Schlüsselverwaltung
Videospiel
Gewicht <Mathematik>
Applet
Elektronische Publikation
Mapping <Computergraphik>
Histogramm
Diskrete-Elemente-Methode
Web Services
Ganze Zahl
Array <Informatik>
Polygon
Parkettierung
Räumliche Anordnung
Maßerweiterung

Metadaten

Formale Metadaten

Titel Distributed tile processing with GeoTrellis and Spark
Alternativer Titel Geospatial - Geotrellis and Spark
Serientitel FOSDEM 2015
Autor Emanuele, Rob
Lizenz CC-Namensnennung 2.0 Belgien:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.
DOI 10.5446/34389
Herausgeber FOSDEM VZW
Erscheinungsjahr 2016
Sprache Englisch
Produktionsjahr 2015

Inhaltliche Metadaten

Fachgebiet Informatik

Ähnliche Filme

Loading...
Feedback