
Big (enough) data and strategies for distributed geoprocessing


Transcript
Hi everybody, my name is Robin Kraft and I work in the Data Lab at the World Resources Institute. A couple of caveats before I start talking about big data and all that: I'm not an engineer — no software engineering background — and I'm not especially interested in crazy tuning. The last talk was impressive, but I don't do that kind of thing; I don't use tools where you have to think about data access patterns on your hard drive to squeeze out performance. I just need stuff to work quickly enough to get my job done, and I think that's a pretty useful frame.

WRI is an environmental think tank based in Washington, DC. We do a lot of policy work in developing countries as well as in the United States, and it's a policy research shop that generates a lot of geospatial data in the countries where we operate. It's not a company like Twitter, where you have massive amounts of data, but on occasion we do have pretty substantial amounts of data, and that's what I want to talk about. Sometimes we end up in this place between big data and small — or normal — data, the kind you can handle on your laptop with your standard tools, and there's a point where those stop being enough. Excuse me — let me get the demo going over here so you can see it.
Alright. So, "big enough" data: it's data that's big enough to be a pain in the ass. I know it when I see it; there's no line at a gigabyte, a terabyte, or a petabyte where you can cleanly draw it. It's the point where the tools you typically use start breaking down: you run out of RAM on your laptop, your server is crashing, you run out of disk space, your processes run for weeks — or potentially years, if you let them run to completion. Stuff just doesn't quite work anymore. That's the point where your data isn't small data anymore, but the big data toolkit might be overkill. You don't necessarily have time to learn HBase and Hive and everything else in the Hadoop ecosystem, plus Spark and Cassandra and all these really amazing tools that most of us don't really need — unless you do, but in most cases you don't. So you're in this awkward middle ground where you need tools that support the operations you need to do, but that aren't necessarily the standard tools you'd have on your laptop.

I want to talk about this in the context of GlobalForestWatch.org, which is an initiative of WRI and a bunch of partners to put the best available data about forests and deforestation on the web, through nice web maps. Before I go on, I want to show you what that looks like.
I'm going to talk about two datasets in particular. This one is a concession area, and this one is a Landsat-based global dataset that tracks forest loss and gain over the last 12 years. Look at all the pink over here — this is global, 30-meter, annual data. There's some deforestation in pink here near Mount Hood — I don't know what's going on there, but if we switch to the satellite basemap you can see something going on — and there's blue regrowth in various places. So this is a pretty amazing dataset, generated by the University of Maryland and Google, and it's global — that's a first for anything like this.
The other dataset is our own. I'll talk about FORMA — Forest Monitoring for Action — which is what I've been working on. It's a MODIS-based system for tracking deforestation hot spots, or rather forest-loss hot spots: how you define deforestation is up to you, but where there were trees and then there are no more, that's what we identify. I'm going to zoom in here to Indonesia, one of the major hot spots. What's interesting about FORMA is that it's updated every 16 days, so you can see the sort of viral spread of deforestation across the landscape, as you can hopefully see here. The idea is that we want people on the ground to be able to react to forest loss as quickly as possible. For forestry, this is considered near real time: in the past, for a country like Indonesia, you'd get a new map of deforestation every couple of years. So with the 30-meter, annual, Landsat-based data from the University of Maryland and our 500-meter, 16-day-resolution dataset, you have some pretty cool new tools built into Global Forest Watch that you can use
if you do stuff with international forestry. So — back to the presentation. Check it out at globalforestwatch.org. That was the demo.
Now for the nuts and bolts of how FORMA works. But first — I don't know how many of you saw the interesting talk by the guy who does Leaflet, where he talked about how simplicity was a guiding goal, or should be one of the goals — and humility came up too, I think. A phrase I heard at another conference stuck with me: simplicity, in some cases, is better than optimal. Basically, an hour of human time — priced against one of our salaries — buys you something like 400 hours on Amazon to crunch whatever you need to crunch. So there's a real question about how much time you want to put into optimizing the hell out of a process when, if your process can just scale out, you can save a lot of money and time instead. That won't work in every case, but it's something to keep in mind.

FORMA is basically an image processing algorithm. It takes in a lot of satellite imagery from NASA's MODIS vegetation index dataset, and then we do statistics. What you see here is one pixel, with its vegetation intensity shown over time. This is the EVI — basically a measure of vegetation intensity, of greenness — and it has seasonal fluctuations, even in the tropics, which have seasons we would recognize here. The thing is that even with seasonal fluctuations and cloud cover, pixels have fairly predictable behavior. So if something goes from green, intense vegetation to brown, not-so-intense vegetation, and fires happen around the same time — which you see at the end of that time series — that is something that might be deforestation. Exactly how we classify it as deforestation depends on a model built around historical deforestation; we're looking for patterns in the EVI signal that are indicative of deforestation.

The point, for the purposes of this talk, is that we have these pixels and we need to build pixel time series so that we can run regressions on them (a rough sketch of that kind of trend test follows a bit further down). To do that we need spatial joins, so we can bring in other datasets like rainfall or fires that aren't in the same format as the raw EVI we're using. We need spatial filters, because we only care about deforestation over land, not over the ocean, so we need to filter that out. And we need to be able to do statistics — the standard statistics you'd want to do, the kinds of things from econometrics that aren't necessarily designed for working with images.

When we first started doing this, we were using one or two desktop machines that were both hitting the same hard drive — we didn't really know what we were doing at that point. We were using ArcGIS and Python, with some Stata and NumPy for the actual math, and it worked — but that was for a very small number of pixels, just to show that the algorithm had legs, and it worked in Brazil and Indonesia on some postage-stamp areas. We then struggled with how to scale that up, from 10,000 pixels — 100 kilometers on a side at 1 km resolution — to billions of pixels at 500-meter resolution covering all of the tropics. The insight we arrived at was that if we treat everything as a raster, that helps us in
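The per-pixel trend test mentioned above isn't spelled out in the talk, so here is a rough, hedged illustration of the idea — an ordinary-least-squares slope over one pixel's EVI series as a crude "is greenness trending down?" check. This is not FORMA's actual model, which is fit against historical deforestation, and the numbers are made up.

(defn ols-slope
  "Least-squares slope of a series ys against time steps 0..n-1."
  [ys]
  (let [xs    (range (count ys))
        n     (double (count ys))
        mean  (fn [coll] (/ (reduce + coll) n))
        x-bar (mean xs)
        y-bar (mean ys)
        num   (reduce + (map (fn [x y] (* (- x x-bar) (- y y-bar))) xs ys))
        den   (reduce + (map (fn [x] (let [d (- x x-bar)] (* d d))) xs))]
    (/ num den)))

(ols-slope [0.61 0.63 0.60 0.58 0.41 0.32 0.25])
;; => a clearly negative slope, i.e. a sustained drop in greenness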
certain ways and causes problems in others. I'm actually — unlike this guy on the slide — a fan of rasters; I think they're a really amazing data type. And the thing is, just as you can treat everything as a raster, you can treat everything as text, because at the end of the day you can take a raster, pick it apart, and turn it into rows, columns, and values. Then you have something you can put in a database, or just write out as text files. And since everything we care about — points, polygons, lines, and so on — can also be converted to a raster, you can convert those same pieces of data into text too. That's great, because Hadoop plus text is where we get to the bigger-data questions.

The problem is that Hadoop is not simple. I don't know how many of you are familiar with writing MapReduce jobs, but it's not a very intuitive way of thinking about how to process data, especially geospatial data. So what we ended up doing is using a technology stack of Clojure, Cascading, and Cascalog to take away a lot of the pain of working with Hadoop, and we run it on Elastic MapReduce on Amazon, which is also convenient. Clojure is nice because it's a very elegant language to work with — it's a Lisp, and if you're into Lisps you'll appreciate that about Clojure; if you don't know Clojure or Lisp, it will feel weird at first, but stick with it. Cascading is a really cool library that basically writes MapReduce jobs for you: you tell it what to do, and it writes the jobs and runs them on your cluster. Cascalog is a Clojure wrapper for that library. So you get the benefits of MapReduce — a lot of scalability, if you can express your problem in terms of MapReduce — without having to think so much about mappers and reducers.

Here's a little bit of what the basic code looks like — it's kind of tiny, I know. All this is doing is taking in a data source that has rows, columns, and values, multiplying the value by 5 on the last line there, and spitting out the results: the row, the column, and the new value. This can run on your laptop, and it can also run on a massive cluster; at the bottom you see what you get. One of the interesting things about Cascalog is that you can do implicit joins. In this case we have a pixel source that has a row, a column, and a value, and then we have a dataset representing the countries those pixels fall into, which we've generated previously somehow. By taking the pixel source and the country source and naming the fields the same — the row and column are named the same in each source, as you can see where my pointer is — and by specifying row and column in the output vector here, we get an implicit join. And there we've done what amounts to a spatial join in three lines of code, on your laptop or on a hundred servers.
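The slides with the code aren't captured in the transcript, so here is a minimal Cascalog sketch in the spirit of what is described — the in-memory sources, the field names, and the multiply-by-5 step are illustrative stand-ins, not the FORMA code.

(use 'cascalog.api)

;; a raster flattened into text-style [row col value] tuples; an in-memory
;; vector stands in for a real tap reading from S3 or HDFS
(def pixel-src   [[0 0 1] [0 1 2] [1 0 3] [1 1 4]])
;; which country each pixel falls into, generated ahead of time
(def country-src [[0 0 "IDN"] [0 1 "IDN"] [1 0 "BRA"] [1 1 "BRA"]])

;; multiply each pixel's value by 5 and emit [row col new-val]
(?<- (stdout)
     [?row ?col ?new-val]
     (pixel-src ?row ?col ?val)
     (* ?val 5 :> ?new-val))

;; implicit join: because ?row and ?col are named the same in both sources,
;; Cascalog joins on them -- effectively a spatial join in three lines
(?<- (stdout)
     [?row ?col ?val ?country]
     (pixel-src ?row ?col ?val)
     (country-src ?row ?col ?country))

The same queries run unchanged against a local in-memory source or against taps on a Hadoop cluster, which is the point being made here.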
This next one is getting more complicated, but not really, because all I'm trying to do is join fires — which happen at a certain latitude and longitude — with a country, and count how many fires happened in each country. So we've got a fire source here that has a latitude, longitude, date, and brightness, and we have the country source again, which is in rows and columns. We have a function that converts from latitude/longitude to rows and columns, we filter on brightness — we only want really hot fires — and then we count up how many fires happened in each particular country. Because country is what's in the output vector, it does the implicit join again and gives us the result at the bottom, where Indonesia has however many hot fires.
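Here is a hedged sketch of that fires query. The source layouts, the brightness threshold, and the half-degree lat/lon-to-grid conversion are invented for illustration (FORMA's real conversion targets the 500 m MODIS grid), and the ops namespace shown is the Cascalog 1.x one.

(use 'cascalog.api)
(require '[cascalog.ops :as c])   ;; Cascalog 1.x ops namespace

;; hypothetical sources: fires as [lat lon date brightness],
;; countries as [row col iso-code] on the same grid as the pixels
(def fire-src    [[-2.5 112.5 "2012-08-01" 345.0]
                  [-2.6 112.6 "2012-08-01" 310.0]])
(def country-src [[185 585 "IDN"]])

;; toy lat/lon -> row/col conversion on a 0.5-degree grid
(defmapop latlon->rowcol [lat lon]
  [(int (quot (- 90.0 lat) 0.5)) (int (quot (+ lon 180.0) 0.5))])

;; hot fires per country: filter on brightness, convert coordinates,
;; implicit join on ?row/?col, then aggregate
(?<- (stdout)
     [?country ?num-fires]
     (fire-src ?lat ?lon _ ?brightness)
     (> ?brightness 330.0)
     (latlon->rowcol ?lat ?lon :> ?row ?col)
     (country-src ?row ?col ?country)
     (c/count ?num-fires))
;; => IDN  1   (the cooler fire is filtered out before the join)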
Here we build pixel time series, which is essential for doing the regression analysis. You can imagine this having originally been 44 rasters — a couple of pixels observed across 44 rasters over time. All we have to do, for this to scale from one laptop to a hundred servers, is have a function called build-series that takes in the date and the value and spits out a time series, which is just a vector of values. This is what we get at the end: a nice clean vector of values for each pixel. We can then use those values to, for example, pass a regression line through and see if the vegetation signal is changing in a statistically significant way over time.
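The build-series step is essentially "collect this pixel's (date, value) pairs in order and keep the values." Here's that idea in plain Clojure — a minimal sketch, not the FORMA implementation; in Cascalog this logic would be wrapped in a custom aggregation operator (e.g. defbufferop) so it runs once per pixel group across the cluster.

;; collapse one pixel's (date, value) observations into a time-ordered
;; vector of values -- the per-pixel series the regressions run on
(defn build-series [date-val-pairs]
  (->> date-val-pairs
       (sort-by first)    ;; order by date; ISO-8601 strings sort correctly
       (mapv second)))    ;; keep just the values

(build-series [["2005-12-19" 0.62] ["2005-12-03" 0.65] ["2006-01-04" 0.55]])
;; => [0.65 0.62 0.55]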
Just standard stuff. So that's how we do a lot of the data manipulation in the FORMA algorithm — most of the code is about moving data around, doing joins, bringing in different data sources, and making sure everything lines up correctly. Then we have to do standard statistics on top of that, and there are software libraries that take care of it. But once we have our dataset of all the deforestation we've detected, we need to put it onto the map I showed you. The folks at CartoDB were nice enough to cook up this crazy data type for us — it's sort of like vector tiles, except it's just text, not a binary format like the one Mapbox is working on. Basically you have these x and y fields that tell your browser where to paint a pixel on the screen, and that's how you get the really smooth animation you saw in the demo: it's not swapping out tiles at all — that would be really inefficient — it's just redrawing pixels as time moves forward. The SQL here is what we were using to generate those values at the different zoom levels. They were kind enough to write this code so I didn't have to think about it, which was really nice, but the problem was that it got really slow, and we sometimes saw server timeouts on large tables — it just wasn't efficient. Unfortunately, we launched using this, so I was up before dawn on launch day trying to make sure the SQL queries finished, because we kept updating the data, the tables were large, the queries kept failing, and we were also getting a lot of traffic. It was turning into a nightmare — and because it's hard to test, it kept breaking.
So what we do now instead is use Cascalog to generate the values that go into that table. We have this very simple calculation that takes an x/y/z coordinate and calculates the values at the different zoom levels, and then updates the x, y, and z values, and this is the Cascalog query that actually does it. This part generates x/y/z coordinates from lat/long — or rather from row and column, going through lat/long on the way. Then, as a result, you're transposing the time series into long format, which we can then use to count up how many deforestation events happened in a particular area, and that's how we paint the change in forest cover on the website. The query that actually generates all the zoom levels just relies on this three-line thing: it says where the data comes from, generates the tiles using the function above, and counts how many deforestation events happen in each of those tiles. And again, that scales from your laptop to a hundred servers pretty simply.

The nice thing about that is that instead of having to babysit SQL queries, hoping they finish and that the servers aren't under too much load to handle the update, we get something that's basically infinitely scalable. We can test every bit of the code before we deploy it to production, it's a very reliable process, and it's fast enough. We're not giving up that much: instead of something super-optimized that would take a few minutes, this might take an hour — or, if I throw a few more machines at it, fifteen minutes. The idea is that we don't have to optimize everything down to the last millisecond; if you can horizontally scale a process, you can just throw more machines at it until you get a quick enough turnaround for your purposes.

So, just to wrap up, the lessons for big enough geospatial data. The first one — echoing Mike Bostock's great talk yesterday — is to find the right tools and actually use them. Don't get stuck using the same old tools you used in the past if they're not quite right for what you're trying to do now. There are a lot of great tools out there for distributed processing — this is just a sampling: Hadoop, StarCluster, Spark, GeoTrellis, and others — and depending on your use case and your application, each of these can have a role in processing datasets you wouldn't otherwise be able to handle with your normal tools. It's useful to keep in mind that simple is, or can be, better than optimal: you are very, very expensive, and if you can get computers to do your work, you'll save yourself money and time. If you're creative about data formats — if you don't worry so much about strict geospatial data types and spatial indices, and you think about what "geospatial" means versus just plain text — you can use tools that would otherwise be unavailable to you. And the last thing is that Hadoop can be really great — it's very powerful — but it can also be really painful, so keeping things simple with a library like Cascading, which does the MapReduce work for you, is a really nice thing. So that's it — I'm Robin — any questions?
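The exact tiling query isn't in the transcript, but the tile-pyramid arithmetic it leans on is standard: a slippy-map tile at zoom z sits inside a parent at zoom z-1 whose x and y are the child's halved. A minimal sketch, with made-up tile addresses:

;; walk an x/y/z tile address up the pyramid: halve x and y once per zoom
;; step until we reach the minimum zoom level we want to serve
(defn zoom-out [x y z min-zoom]
  (for [dz (range 1 (inc (- z min-zoom)))]
    [(bit-shift-right x dz) (bit-shift-right y dz) (- z dz)]))

(zoom-out 1605 2017 12 10)
;; => ([802 1008 11] [401 504 10])

In a Cascalog query this would typically be a map-cat style operation emitting one tuple per zoom level, with a count aggregated per x/y/z tile — which is the three-line query being described.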
Audience: That was really cool. How are you distributing and managing the data? Your datasets are really big and you need to scale up to a hundred nodes on AWS — what are you actually using to manage that job and distribute the data?

Robin: We work completely on AWS, so we never have the data locally. We just keep it on S3, and from there it's natively available to the Hadoop system — it's all there, and it's available to every node in the cluster.

Audience: I'm trying to wrap my head around the raster-to-text idea — thinking about your spatial data as text. How do you assign a persistent ID, so you know what part of the world a little piece of text is talking about?

Robin: Sure, give me thirty seconds more — good question. The data is split up into tiles, which I think are 10 degrees across. A given latitude/longitude can be converted into a tile coordinate — you can figure out which tile it falls into, and then which pixel within that tile it falls into. So we have this mapper from lat/long to tile and image coordinates, and we can go back and forth. MODIS is great for how incredibly consistent and systematic its grid is, so that part works out well for us.

Audience: How is the data broken down when you process it — does the Hadoop cluster process it in chunks of time?

Robin: At the beginning we have to start by processing each image file individually, because Hadoop can't read those natively. So we read in one file, basically split it up into chunks, and spit it back out into sequence files — Hadoop's binary format for storing data. It just spits out rows and rows and rows of values, and after that Hadoop handles the splitting for us, so we never really have to think about it again.

Audience: I've started playing with this quite a bit too. Is the geographic-to-text conversion always going to be the way it is, or will people write geographic wrappers on top of Hadoop so you can work with geographic data more natively?

Robin: Good question — there was a talk yesterday about a project, GeoMesa I believe, that does just that, and it sounds like a very high-performance way to handle geospatial data natively in that context. The more you get into doing real geospatial operations, the more complex it gets, so you have to ask whether the return on investment is worth it; for us it wasn't. Then again, we're also not computer scientists, and the learning curve was bad enough just getting up to speed, so going that much further was too much. But there are people working on it, and I'm hoping there will eventually be a set of distributed geospatial tools that works out of the box the way PostGIS does today. I haven't come across it yet, but I'd love to hear if somebody else has. Thank you.
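For the tile and pixel addressing described in that answer, here's a simplified sketch of the idea. It deliberately ignores the MODIS sinusoidal projection and just bins lat/lon into a naive 10-degree grid, assuming 2400 x 2400 pixels per tile; the real mapping is more involved.

;; naive lat/lon -> [tile, pixel-within-tile] addressing on a 10-degree grid;
;; illustrative only -- the real MODIS grid uses a sinusoidal projection
(defn latlon->tile-pixel [lat lon]
  (let [px-per-tile 2400                      ;; ~500 m pixels per 10-degree tile
        h   (int (quot (+ lon 180.0) 10.0))   ;; horizontal tile index
        v   (int (quot (- 90.0 lat) 10.0))    ;; vertical tile index
        col (int (* px-per-tile (/ (mod (+ lon 180.0) 10.0) 10.0)))
        row (int (* px-per-tile (/ (mod (- 90.0 lat) 10.0) 10.0)))]
    {:tile [h v] :pixel [row col]}))

(latlon->tile-pixel -2.5 112.5)
;; => {:tile [29 9], :pixel [600 600]}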

Metadata

Formal Metadata

Title Big (enough) data and strategies for distributed geoprocessing
Series Title FOSS4G 2014 Portland
Author Kraft, Robin
License CC Attribution 3.0 Germany:
You may use, adapt, and copy, distribute, and make the work or its contents publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
DOI 10.5446/31665
Publisher FOSS4G, Open Source Geospatial Foundation (OSGeo)
Publication Year 2014
Language English
Producer FOSS4G
Open Source Geospatial Foundation (OSGeo)
Production Year 2014
Production Place Portland, Oregon, United States of America

Content Metadata

Subject Area Computer Science
Abstract Big data gets a lot of press these days, but even if you're not geocoding the Twitter firehose, "big enough" data can be a pain - whether you're crashing your database server or simply running out of RAM. Distributed geoprocessing can be even more painful, but for the right job it's a revelation! This session will explore strategies you can use to unlock the power of distributed geoprocessing for the "big enough" datasets that make your life difficult. Granted, geospatial data doesn't always fit cleanly into Hadoop's MapReduce framework. But with a bit of creativity - think in-memory joins, hyper-optimized data schemas, and offloading work to API services or PostGIS - you too can get Hadoop MapReduce working on your geospatial data! Real-world examples will be taken from work on GlobalForestWatch.org, a new platform for exploring and analyzing global data on deforestation. I'll be demoing key concepts using Cascalog, a Clojure wrapper for the Cascading Java library that makes Hadoop and Map/Reduce a lot more palatable. If you prefer Python or Scala, there are wrappers for you too. Hadoop is no silver bullet, but for the right geoprocessing job it's a powerful tool.
Keywords big data
hadoop
deforestation
geoprocessing
