
"Fast Big Data?" A High-Performance System for Creating Global Satellite Image Time Series

Speech Transcript
Good afternoon. I come from the Remote Sensing research unit at the Council for Scientific and Industrial Research (CSIR) in South Africa, and that's a bit of a mouthful. The title of my talk is "Fast Big Data? A High-Performance System for Creating Global Satellite Image Time Series", and in the interest of time I'll skip the outline and go straight to some background.
Many research and operational geospatial applications analyze data through time rather than space; some examples are climate change applications, land cover change, crop monitoring and fire history analysis. These applications need rapid access to extended, hyper-temporal time series data. The primary focus of our research is on MODIS, which is the most widely used satellite sensor for time series data, but this is really applicable to any gridded stack of datasets, and it could even be useful for other sensors.
Just for those who don't know: if we take a stack of images of the same area, a time series is simply the set of values for a particular pixel through that stack, and it can be represented as a graph like this.
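To make that concrete, here is a minimal sketch, with made-up numbers, of what a per-pixel time series looks like once the images are stacked into a single array:

```python
import numpy as np

# Hypothetical toy stack: 10 acquisition dates of a 4x5-pixel image (time, y, x).
stack = np.random.randint(0, 10000, size=(10, 4, 5), dtype=np.int16)

# The time series for one location is simply that pixel's value at every date.
y, x = 2, 3
series = stack[:, y, x]   # shape (10,): one value per date -- the "graph" shown above
print(series)
```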
A concrete example of an application that depends on time series data is a mobile app that we've developed. Among other things, it provides MODIS-based vegetation curing and fire history for any point on Earth, and this data is instantly accessible to the user, who just clicks or taps on a point on a map. Rapid access like that is simply impossible if we use the raw data in its original spatial form, and in this presentation I'll show how we transform it into a time-optimized form that's usable by applications like these.
So we're really facing the challenge of global time series data: we're talking about 175 MODIS tiles across three or four different products, comprising nearly 176 thousand images and more than 4 terabytes of raw data.
This is a global view of our collection of MODIS tiles, with the different products indicated by different colours. That's really a lot of data, so I think we can tick the Big Data box here.
Converting the spatial data to time series is quite a resource-intensive task. Previous research in my group looked at how to do this in a resource-constrained environment, on roughly desktop-class hardware with on the order of a gigabyte of RAM. I'm going to present how we did it in an unconstrained environment, where we've got enough RAM and enough processing power to do this without resorting to strange tricks to get all of the data into time-series form. It's a fully automated, high-performance system built on commodity hardware; no fancy supercomputers or clusters required.
To give a little bit of a system overview: we've got a component that downloads data automatically every 8 days, or whenever new MODIS data is available, and verifies its integrity. It then processes the MODIS tiles and works out where the land bounds are, so we can trim away the ocean data, which we don't care about. From these collections of data we build data cubes, which I'll describe in a moment, and from the cubes we can produce maps or serve the data up as time series to various applications. The whole system is constantly monitored at every step of the way, and if something goes wrong it is reported to us by email or similar.
As for the individual components in the system: nearly all of our applications are built on top of GDAL and HDF5. The heart of the system is the datacube API, a library written in C++11. Our cube-building infrastructure runs on top of that, and our various cube applications that use the data and process it further are also built on top of it. Then we've also got a Python-based cube server, which serves the data over HTTP through a JSON API to the web apps and mobile apps. All of this sits on a storage area network that stores all our data for us.
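As an illustration of how a client might consume such a cube server, here is a hypothetical request against a JSON time-series API; the endpoint, parameters and field names are invented for this sketch and are not the actual interface described in the talk.

```python
import requests

# Hypothetical endpoint and parameters -- purely illustrative.
resp = requests.get("http://cubeserver.example/timeseries",
                    params={"product": "ndvi", "lat": -25.75, "lon": 28.23})
data = resp.json()   # e.g. {"dates": [...], "values": [...]}

for date, value in zip(data["dates"], data["values"]):
    print(date, value)
```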
For those of you not familiar with what a storage area network is: a SAN is essentially a high-speed, large-capacity storage infrastructure built on top of a dedicated network, and it looks a little bit like this. We've got storage servers connected to processing servers over 10 gigabit Ethernet, using iSCSI as the transfer protocol. This is actually quite cheap, depending on where you are of course; we built ours for about 50 thousand dollars, but that's because we're far away from the suppliers, and if you did it here you'd probably spend about 40 thousand. We make good use of this infrastructure to provide time series at this magnitude.
The key issue is that accessing a pixel's values down through an image stack is really, really slow. Reading one pixel with the Python GDAL API from, say, 660 MODIS images takes around 2 minutes, and that's because the data locality is extremely poor: even though the data we actually want is just over a kilobyte, we're picking it out of 34 gigabytes of raw data and opening 660 different images with GDAL, which is a massive amount of overhead. What we need is a time-optimized data structure that stores each pixel's time series together. To get there we transpose the data from its original spatial form into a temporal form, and that gives us optimal data locality for time-series purposes.
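Roughly, the naive approach looks like the following sketch with the GDAL Python bindings; the file list, band number and coordinates are placeholders. Every date means opening another file and issuing another tiny read, which is where the minutes go.

```python
from osgeo import gdal

def pixel_time_series(paths, x, y, band=1):
    """Read one pixel from every image in `paths` -- the slow, spatial way."""
    values = []
    for path in paths:                                       # e.g. 660 MODIS files
        ds = gdal.Open(path)                                 # open and parse each file...
        px = ds.GetRasterBand(band).ReadAsArray(x, y, 1, 1)  # ...for two bytes of data
        values.append(int(px[0, 0]))
        ds = None                                            # close the dataset
    return values

# series = pixel_time_series(modis_paths, x=1200, y=800)    # takes minutes, not milliseconds
```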
We use a data structure that we call the hyper-temporal data cube. That sounds very fancy, but it's actually quite simple. We start with a stack of images: T images, each of width W and height H, and each with N bands. This is not the data cube yet; it's simply all the data we're going to use. We start with, say, pixel 0 and transpose it, putting that pixel from all the images together, and we get the time series of pixel 0. Then we move on to pixel 1 and get its time series, and so it goes through the whole image until we've transposed all of the time series into little blocks that lie close together.
Now, as I described, reading one pixel through all of these images is so slow that it's infeasible to build the cube that way. What we actually do is take one image at a time and spread its pixels into a large output buffer: for the first image we put all of its pixels into position t0, the next image's pixels all go to t1, the next one's to t2, and so on until we've processed all the images. The result is a data structure in which we can read each time series rapidly and sequentially, without having to skip over all of this massive data. That is really what we call the data cube: all of the data reorganized into temporal form.
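A minimal sketch of that transposition in numpy terms; the array shapes are illustrative, and the real implementation is the C++11 library described earlier.

```python
import numpy as np

def build_cube(images):
    """Transpose a sequence of (H, W) arrays into a (H*W, T) time-major buffer."""
    t_count = len(images)
    h, w = images[0].shape
    cube = np.empty((h * w, t_count), dtype=images[0].dtype)  # allocated up front
    for t, img in enumerate(images):
        # Spread the whole image into column t: pixel p's value lands at cube[p, t].
        cube[:, t] = img.reshape(-1)
    return cube

# Afterwards the time series of pixel p is the contiguous row cube[p, :].
```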
Let's talk about the data volume for a bit. For MODIS 500-metre data, every tile is about 5.8 million pixels per band; MCD43 has 7 bands, so that's about 40 million pixels per image. We've got 667 of these images to date, so that's almost 27 billion pixels per cube, and for 16-bit data we're talking about 50 gigabytes for one cube. Does this all have to be in RAM at once? Not exactly, but a memory-constrained transposition is extremely inefficient and slow. If you've got a server with a limited amount of RAM, say 4, 8 or 16 gigs, you have to read through the data multiple times to do the transposition, and in doing so you get no cache benefit from the operating system, because by the time you reach the end of the dataset you've evicted everything you just read; you're just thrashing those caches. So to do it in a single pass over the data you pretty much have to get the whole thing into RAM, and that's 50 gigs. But RAM is so cheap these days that this is really not expensive to solve, and 256 gigs of RAM, like we have in our server, is not a big deal anymore. The cube does grow by a few gigabytes per year for the 8-day MODIS data, but it will be a long time before that becomes an issue.
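The arithmetic behind those figures, roughly, using the tile dimensions quoted in the talk:

```python
pixels_per_band = 2400 * 2400       # one MODIS 500 m tile: ~5.76 million pixels
bands = 7                           # MCD43-style product
images = 667                        # acquisitions to date

pixels_per_image = pixels_per_band * bands      # ~40 million
pixels_per_cube = pixels_per_image * images     # ~27 billion
bytes_per_cube = pixels_per_cube * 2            # 16-bit samples

print(f"{bytes_per_cube / 1e9:.0f} GB per cube")   # ~54 GB, i.e. "about 50 GB"
```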
So this is basically the cube-building process: we've got a block that allocates all the memory we need up front, and then three threads start processing various bits in parallel. One caches the files, which makes it faster for GDAL to read them; one reads all of the band data into buffers; and the last one transposes it into the eventual output buffer that gets written to an HDF5 cube.
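A rough sketch of that producer/consumer structure; warm_cache and read_bands are hypothetical helpers standing in for the real GDAL-based code, and the actual system is the C++ implementation described above.

```python
import queue
import threading

file_q = queue.Queue(maxsize=4)   # paths whose bytes have been pulled into the OS cache
band_q = queue.Queue(maxsize=4)   # decoded band arrays waiting to be transposed

def prefetcher(paths):
    for path in paths:
        warm_cache(path)          # hypothetical: touch the file so GDAL reads it quickly
        file_q.put(path)
    file_q.put(None)              # sentinel: no more files

def reader():
    while (path := file_q.get()) is not None:
        band_q.put(read_bands(path))   # hypothetical: GDAL read of all bands into arrays
    band_q.put(None)

def transposer(cube):
    # cube: one (H*W, T) buffer per band, allocated up front before the threads start.
    t = 0
    while (bands := band_q.get()) is not None:
        for b, img in enumerate(bands):
            cube[b][:, t] = img.reshape(-1)   # spread image t into the time-major buffer
        t += 1

# One thread per stage:
# threading.Thread(target=prefetcher, args=(paths,)).start(), and so on for the others.
```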
An interesting aspect of time-series cubes is that updating them requires rebuilding them from scratch. A reasonable question to ask is: why not just leave space at the end of each time series and drop the new bits of data in there? Well, it can be done, but it turns out that every chunk in the file gets modified if you append data like this, and if you use compression, which we always do, the entire file changes anyway, so you end up rewriting the whole thing and it's no faster than rebuilding from scratch. But with enough memory, like we have, we can do a bit of an optimization: we build the new cube from the existing one. We read the whole existing cube into memory, put that RAM to good use, and transpose only the new data, which might be one, two or three images. That way we get about a 35% performance improvement over building the cube from scratch.
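A minimal sketch of that update strategy; the dataset name, chunking and compression settings are illustrative rather than the talk's actual schema.

```python
import h5py
import numpy as np

def update_cube(cube_path, new_images):
    """Rebuild a cube, reusing the existing one instead of re-transposing everything."""
    with h5py.File(cube_path, "r") as f:
        old = f["time_series"][...]            # (pixels, T_old), read fully into RAM

    extra = np.empty((old.shape[0], len(new_images)), dtype=old.dtype)
    for t, img in enumerate(new_images):       # transpose only the new acquisitions
        extra[:, t] = img.reshape(-1)

    merged = np.hstack([old, extra])           # (pixels, T_old + T_new)
    with h5py.File(cube_path, "w") as f:       # the compressed file is rewritten anyway
        f.create_dataset("time_series", data=merged,
                         chunks=(min(4096, merged.shape[0]), merged.shape[1]),
                         compression="gzip")
```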
Just to look at some concrete examples of this improvement. The baseline is our old way of processing the data, with a limited amount of memory in a constrained environment. For a tile covering part of northern Africa, with 100% land coverage so that nothing can be cropped away, it used to take about 42 minutes to build a time-series-optimized cube. In an unconstrained environment that drops to almost 12 minutes, and updating squeezes out a little more performance, so ultimately we can build a tile like this about 4 times faster. For a smaller tile that can be cropped to about 33%, one with a lot of ocean, the constrained baseline would take about 27 minutes; with enough RAM that drops to about 8 minutes, and cropping lowers it again. A tile like this also updates really quickly, because there's not a lot of transposition of no-data to take care of, so ultimately we see a 4 to 10 times improvement. The important point is that these speed improvements allow us to process all of the MODIS tiles in about a day or a day and a half, which lets us provide time series at this scale without tying up lots of computing resources for a week or more.
We did face a couple of challenges in this process. We discovered that zlib decompression, or DEFLATE as it's otherwise known, is a serious bottleneck: we can read from the SAN at around a thousand megabytes per second, but the CPU can only decompress the data at a small fraction of that rate, so we're not getting the benefit of all that fancy storage; we just sit there decompressing data. We also wanted to do a multithreaded transposition of the data, but that turned out to be an absolute nightmare for the poor CPU. It's really the worst way you can access the CPU cache, and if you've got multiple threads trying to write to the same bits of memory you end up with a terrible situation where effectively only one thread can make progress at a time, so a multithreaded transposition ends up being no faster than a single-threaded one. A further massive problem is that HDF4, which the raw data is stored in, is not thread-safe, so no concurrency is possible: you can't read multiple MODIS bands with multiple threads in the same program. We also discovered that although HDF5 can be built thread-safe, so you can run multiple threads, it uses a global lock, so once again no concurrency is possible. That is really disappointing for a supposedly high-performance scientific data format: not being able to use multithreaded code to access data stored in it. Furthermore, it only supports a C API, not C++, so that is a bit of a setback as well. We thought about using parallel HDF5, but that comes with its own set of problems: again no C++ support, no thread safety there either, and although it can access data in parallel, you can't write compressed data in parallel, which is pretty much what we want to do. It has a number of other limitations too, which made it somewhat unsuitable for what we wanted.
That leads into some future work we'd like to do. On the performance side, we want to investigate the recent improvements Intel made to the zlib library, which pretty much all of the DEFLATE compression is based on; they've made SIMD-based improvements that can really speed up decompression. We could also pre-compress the bands in memory before we write them; there is HDF5 functionality for doing this. Ultimately we would like to implement a proper multithreaded C++ API for HDF5, and hopefully report back next year that it actually worked. In terms of accessibility of our data, we'd like to investigate creating three-dimensional datasets, time by height by width, instead of the current two-dimensional layout where one axis is time and the other is all the pixels flattened together. The current layout doesn't make the data very accessible to things like QGIS or your average GIS software. We'd also like to look at NetCDF, which would open the data up to more tools; at the moment we get at the data through our own Python code, which is obviously not as efficient or as accessible.
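A sketch of what such a three-dimensional (time, y, x) HDF5 layout might look like; the dataset name, sizes and chunk shape are illustrative. Chunking along the full time axis keeps per-pixel reads fast, while the y/x dimensions make the file look more like a normal raster stack to GIS-style tools.

```python
import h5py

with h5py.File("ndvi_cube_3d.h5", "w") as f:
    f.create_dataset("ndvi",
                     shape=(667, 2400, 2400),   # t, y, x for one MODIS tile
                     dtype="int16",
                     chunks=(667, 32, 32),      # a whole time series per chunk
                     compression="gzip")
```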
To get back to some concrete examples of how this data is used: there's the mobile app we've developed, which provides fire data for the whole world and also some MODIS-based vegetation data. Along that cross-section, the white line there, it shows the curing history, an indication of how hot and dry the vegetation is along that line, plus all the occurrences of fire with their dates. These views rely absolutely on speedy access to the data, because you can't expect a user to wait minutes just to draw a graph like this.
We've got a web UI as well. It's essentially just a Google Maps interface where you can click on a point and it gives you the various vegetation indices and other MODIS-based indices. Users can enter their own lat/longs, and hopefully in the near future they'll be able to upload shapefiles and get averages for whole areas.
And finally we've got a QGIS plugin, which is very much the same thing: for any data you can display, you just click on a point and it gives you the time series values back for that point. Thank you.
Thank you. Are there any questions?

Question: I just wanted to ask, what were the other options for file formats that you considered?

Answer: We experimented with a number of file formats. We tried flat files, storing each time series as its own file and using the file system itself as a sort of database, but all of those options were quite slow. The nice thing about HDF is that you can include geospatial metadata in it, and you can store not just two-dimensional data but any number of dimensions, which is quite valuable for some of the data we work with, especially the multi-band data.

Question: This is really interesting; I also do MODIS time-series generation, so it's good to hear how other folks go about it. Two questions. The first is how much time you spend on actually doing the transposition into time series versus saving the HDF5 to disk, and I wonder whether using a Hadoop cluster, for example, to do the transposition and then something else to do the HDF5 writing might save you some time. The other question was whether you can do aggregate statistics over a region with your setup, or is it really designed for specific pixel lookups? If I want to know how many fires happened in California, can I get that?

Answer: In terms of the transposition, it really depends on the type of MODIS tile, because no-data values tend to transpose a lot faster, and if we crop a tile and trim away the no-data, things go a lot faster than for the fully land-covered tiles. The transposition takes about the same amount of time as the reading, so say about 3 minutes each for reading and transposition. Writing out the file is actually quite slow because of the single-threaded compression; we use both SZIP and zlib compression depending on the type of data, since some data compresses better with zlib and some with SZIP, and that compression can take up to about 3 minutes. The writing itself is a lot faster now that we have the SAN, because we can write several hundred megabytes a second onto it, so that does chop off some of the time. In terms of spatial aggregation, it is possible, depending on what you use to read the data. If you use the GDAL C++ API you can read millions of pixels in a very short time, so you can definitely request time series for many, many points across space and then do the statistics on top of that. But if you want a slice of the data to produce an image, this structure is once again completely the opposite of what you need, so we don't produce images from the cubes.

Question: I thought it was pretty interesting that you mentioned trying to solve a certain problem with multiple threads; presumably it was transposing into the same cube you were building. Could you possibly break your data down into many more cubes, so the data is more decoupled? I'm not used to MODIS, so it may be a stupid question.

Answer: It is possible, but it makes handling the source data tricky, because MODIS tiles are 2400 by 2400 pixels, so we would have to split up the input data and read sections of it in
order to build smaller cubes. There are other ways to get around the problem of not being able to access the raw data with multiple threads; one is MPI, where different processes all write into shared memory. That would be a quick win, although we'd prefer to address the root of the problem, which is really just that the HDF4 library was written 20 years ago, when doing this kind of thing wasn't feasible, so being able to read data from multiple threads wasn't really a concern. In fact GDAL has a global lock over all HDF4 operations; that's how GDAL was made thread-safe for HDF4, but obviously at the expense of not being able to read the data concurrently.

Question: Where does all this code live?

Answer: I was waiting for that question. It's still in the research phase, so I'm afraid there's no GitHub URL like many of the other presentations, but I do hope to be able to make it available in the near future. Unfortunately that also depends on our research budget to get the code into a publicly accessible form, but I hope to be back at FOSS4G with a URL you can download it from.

Any more questions? Thank you very much.

Metadata

Formal Metadata

Titel "Fast Big Data?" A High-Performance System for Creating Global Satellite Image Time Series
Serientitel FOSS4G 2014 Portland
Autor Swanepoel, Derick
Lizenz CC-Namensnennung 3.0 Deutschland:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.
DOI 10.5446/31627
Publisher FOSS4G, Open Source Geospatial Foundation (OSGeo)
Publication Year 2014
Language English
Producer FOSS4G
Open Source Geospatial Foundation (OSGeo)
Production Year 2014
Production Location Portland, Oregon, United States of America

Content Metadata

Subject Area Computer Science
Abstract Description: We describe a system that transforms sequences of MODIS images covering the entire Earth into time-optimized data cubes to provide rapid access to time series data for various applications.
Abstract: Satellite time series data are key to global change monitoring related to climate and land cover change. Various research and operational applications such as crop monitoring and fire history analysis rely on rapid access to extended, hyper-temporal time series data. However, converting large volumes of spatial data into time series and storing it efficiently is a challenging task. In order to solve this Big Data problem, CSIR has developed a system which is capable of automated downloading and processing of several terabytes of MODIS data into time-optimized "data cubes." This time series data is instantly accessible via a variety of applications, including a mobile app that analyzes and displays 14 years of vegetation activity and fire time series data for any location in the world. In this presentation we will describe the implementation of this system on a high-performance Storage Area Network (SAN) using open source software including GDAL and HDF5. We discuss how to optimally store time series data within HDF cubes, the hardware requirements of working with data at this scale as well as several challenges encountered. These include writing high-performance processing code, updating data cubes efficiently and working with HDF data in a multi-threaded environment. We conclude by showing visualizations of our vegetation and burned area time series data in QGIS, web apps, and mobile apps.
Keywords Big data
time series
MODIS
GDAL
HDF
visualization
Storage Area Network
mobile app
