
Using Spark in Weather Applications


I'm with the professional services division of the company. You probably already know us, even if you don't realize it: we are The Weather Channel, Weather Underground, and other brands, and if you have an app on your phone you have probably touched one of our servers. Our main dealing is with forecasts: we serve on average 17 billion requests per day for forecasts, sometimes peaking at 30 billion.

As far as the professional services division, we are engaged in the aviation space: we help airlines optimize fuel efficiency based on weather, we do congestion modeling at airports, and network propagation of delays from things like runway configuration changes. We have models on the energy side: products to model energy usage and the output of renewable resources such as wind power. We even have a model that will forecast the forecast that comes out of the European model, since those forecasts are market movers; there is a market for predicting the forecast before it comes out. On the insurance side, insurance companies want to understand their loss risk from impending weather events, like a Hurricane Katrina: you had better have enough cash in reserve for the potential impact. We have models to detect hail from radar. On the retail side, we work with the big-box stores in the United States: based on the weather forecast near a store, merchandising decisions are turned on and off, and we have models to measure the impact of weather on large enclosed spaces.

As far as Apache Spark, The Weather Company uses it for feature extraction, taking these large datasets and slicing, dicing, and merging them to develop features for training models; for the modeling itself; and for operational forecasting, where Spark is used to produce the forecasts that are then consumed downstream.
The goals of the presentation are: an overview of Apache Spark; following that, a quick overview of gridded weather data formats and how we get that data into resilient distributed datasets; and then some simple Spark operations on the data. Thanks, Tom.

So, as Tom introduced, we will be using Apache Spark. What exactly is Apache Spark? It is a cluster computing framework: the goal of the framework is to distribute computation over a network of computers, generally over a cloud. The platform was started in 2009 at Berkeley, at the AMPLab, and it was open-sourced in 2010 with one of the first releases. It has been constantly updated since, and we are now getting to version 1.5. There are more than 400 developers contributing; it is open source, and every release brings a lot of new functionality.

What is the framework exactly? It is a generalization of MapReduce. For those who don't know, MapReduce was developed by Google and used at companies like Yahoo for distributed computing, for web search and a lot of other needs they had. Basically, the idea was to separate every computation into two operations, map and reduce, and essentially everything can be expressed with these two operations. Spark is a generalization of that, which means it can do way more than just mapping and reducing, and it is also optimized for speed. Here is a graph from the Spark project comparing it to Hadoop over a number of iterations, and we see that the running time is very different.
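To make the "everything can be expressed with map and reduce" claim concrete, here is the classic word-count example sketched in plain Python (a sketch of the MapReduce idea only, not the distributed implementation):

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # reduce: combine all values that share a key (here: sum the counts)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["spark is fast", "spark is lazy"]
counts = reduce_phase(map_phase(lines))
```

In a real MapReduce system the map output would be shuffled across the network so that all pairs for one key land on the same reducer; that shuffle is exactly the cost Spark tries to minimize.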
It's about ten times faster. So what makes Spark so fast? Basically, one of the key ideas is that the code is moved to the data, rather than the data being shuffled around, and that is very fast. The second thing is the lazy evaluation mechanism: operations are stacked, and they are not executed until it is really necessary, until we actually need the results of these computations. That allows a lot of optimization over the stack of operations, to run them faster. They also wanted to make it fast to write, not only fast to run: most of the platform's APIs, in the languages it supports, are very compact and allow you to do a lot with a very limited set of instructions.

At the heart of Spark lies the main storage unit, which is called the resilient distributed dataset; throughout the presentation we will talk about these RDDs. The resilient distributed dataset is basically a way to partition data: you have a big dataset and you want to do computation on it, so the first thing you do is put it into a resilient distributed dataset, which will be spread out to the different worker nodes. You can see here, in general terms, how Spark works: there is a driver program that runs the code someone writes, that code is ultimately shipped out to the worker nodes, however many there are, and the data are automatically spread out across the worker nodes in partitions. What is interesting with the RDD, as I was saying before, is that the evaluation is lazy: when we create an RDD nothing really happens; it just tracks all the instructions that lead to its creation.
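The lazy-evaluation idea described above can be illustrated without a Spark cluster: Python generators, like RDD transformations, record work instead of doing it, and nothing runs until the result is materialized. A minimal sketch:

```python
produced = []

def source():
    # stand-in for reading partitions: records which elements actually got read
    for i in range(5):
        produced.append(i)
        yield i

# "transformation": building the pipeline does no work at all
squares = (x * x for x in source())
nothing_ran = (produced == [])

# "action": materializing the result triggers the whole chain at once
result = list(squares)
```

This mirrors the RDD behavior: `squares` is just a description of a computation, and only the final `list(...)` (the analogue of a Spark action) causes the source to be read.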
In particular, this is what makes it fault tolerant, in the sense that if we lose a worker, then, because the RDD stores all the operations that produced it, it is easy to reconstruct the lost partitions, basically from scratch, starting from the original data. That lineage is how RDDs get their fault-tolerance property, which is great. As we will see later, there are two types of operations we can do on an RDD once it is created. The first are transformations: these are the lazy ones, not executed unless necessary. Then there are actions, and that is what triggers the actual evaluation: the code is distributed, computed at the nodes, and the results come back to the driver at the end. Finally, an RDD is also easy to tune: we can always repartition an RDD onto fewer or more nodes if we want to optimize, so we can play with the partitioning. We can also play with how we cache: an RDD can be cached on the workers, so that if you run multiple operations on the same RDD it does not recompute the data; it just keeps it where it is. There are different storage levels we can use, going from memory only, to disk, to memory and disk, and we can choose the level depending on the need.
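The three ideas just described — lazy transformations, an action that replays the lineage, and caching on the worker — fit in a small toy class. This is an illustrative sketch of the RDD model only, not how Spark is implemented:

```python
class MiniRDD:
    """Toy model of an RDD: transformations only append to a lineage;
    an action (collect) replays the lineage; cache() memoizes the result."""

    def __init__(self, data, lineage=()):
        self._data = list(data)        # the "original" partitioned data
        self._lineage = tuple(lineage) # recorded, not-yet-executed operations
        self._persist = False
        self._cached = None

    def map(self, f):                  # transformation: lazy
        return MiniRDD(self._data, self._lineage + (("map", f),))

    def filter(self, p):               # transformation: lazy
        return MiniRDD(self._data, self._lineage + (("filter", p),))

    def cache(self):                   # mark for memoization on the "worker"
        self._persist = True
        return self

    def collect(self):                 # action: actually runs the lineage
        if self._cached is not None:
            return self._cached
        out = self._data
        for kind, f in self._lineage:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        if self._persist:
            self._cached = out
        return out

rdd = MiniRDD(range(10)).map(lambda x: 2 * x).filter(lambda x: x % 3 == 0).cache()
```

Because the lineage is kept, a lost partition could always be recomputed from `_data` — which is exactly the fault-tolerance argument made above.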
Compared to MapReduce, Spark has a rich set of operations that can be done on an RDD, and here is a very small sample. One that is very basic is filtering, which means removing the elements of the dataset that do not match a condition. The map is the same as in MapReduce, in the sense that it is a one-to-one mapping: we take the original data and transform them one to one; every single data element is transformed into a new element. Spark also does more than that. There is flatMap, which is something that was not so easy to do with MapReduce and that Spark does very well: for every element of the data, flatMap can create a set of elements, so in a sense you can increase the size of the dataset; if you work with key-value data, you can emit a set of keyed records, creating a one-to-many mapping for every single sample in your data. And finally there is reduceByKey, which is like the reduce of MapReduce, but again with much more flexibility: it will put together the different data items in your RDD that share a key and aggregate them through a function that you supply, such as max, sum, or average. It controls how the data are aggregated inside each partition on the worker nodes, and also how the partial results are aggregated when they are combined at the end.
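A small sketch of flatMap and reduceByKey semantics in plain Python, on weather-flavored data (the station codes and values here are made up for illustration):

```python
# flatMap: one record fans out to many -- here one (station, hourly-temps)
# record becomes one (station, temp) pair per hour
observations = [("SEL", [21.0, 24.5, 23.0]), ("PUS", [19.0, 20.5])]
pairs = [(stn, t) for stn, temps in observations for t in temps]

def reduce_by_key(pairs, f):
    # reduceByKey: combine all values that share a key with a binary function;
    # Spark does this per partition first, then merges the partial results
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

warmest = reduce_by_key(pairs, max)   # warmest reading per station
```

Swapping `max` for `lambda a, b: a + b` would give per-station sums instead — the aggregation function is the only thing that changes, which is the flexibility being described.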
As you can imagine, if you have multiple RDDs you can also do multi-RDD operations: union, intersection, join, all the great classical ones. One that is pretty powerful is cogroup, which allows you, if you want to put together several RDDs, to combine them into a single one. Thanks.

So at this point we have a description of RDDs, and the question for us now is how we get the data into them. Looking at weather data, you are dealing with primarily two types of data: observational and forecast, and there is varying dimensionality, even within forecasts and within observations. Text formats are sometimes used, and if you have ever had to write custom code to parse a dataset exchanged as text, you know it is really not fun to deal with most of the time, so the standard is binary formats: GRIB and HDF/NetCDF are the ones used. The good news is that there is a tool called netCDF-Java with something called the Common Data Model, and this is becoming a prized abstraction over the dimensionality and the different formats; it is very advantageous to use it as a canonical way of reading the data. The problem with data in these formats is that there are many large files: at The Weather Company, just the gridded data comes in as a steady load, on the order of terabytes over the course of a day. So there is a lot of data, and to use it to train models or for operational forecasting, you have to scale.

To turn these data into RDDs: if you are using Spark, HDFS is typically what underpins a Spark installation. The problem we have is the design of that file system: it is a distributed file system with replication across nodes and very large block sizes, designed for streaming reads to make the best use of disk, reading sequentially.
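The cogroup operation mentioned above — gathering, for every key, one list of values per input dataset — can be sketched in plain Python (a semantics sketch only, not Spark's distributed implementation):

```python
def cogroup(*rdds):
    # for every key, build one list of values per input dataset
    out = {}
    for i, rdd in enumerate(rdds):
        for k, v in rdd:
            out.setdefault(k, tuple([] for _ in rdds))[i].append(v)
    return out

a = [("x", 1), ("y", 2)]
b = [("x", 10)]
c = [("y", 20), ("x", 30)]
grouped = cogroup(a, b, c)
```

Unlike a join, keys missing from one input still appear in the result with an empty list for that input, which is what makes cogroup useful for combining several datasets in one pass.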
It is not built for random access: the blocks are on the order of 128 megabytes, and you don't want to be reading a whole block just to get a small chunk out of it. The standard formats supported out of the box in Hadoop and Spark are basically text and sequence files, and you could convert to those, but GRIB is one of the worst stories: it is a message-oriented format, really diverse, with a lot of embedded assumptions, and the reality is that reading it requires an index into the file and random access within the file, which HDFS is not optimized for.

So, our options. We want to keep netCDF-Java, so users get that really rich Common Data Model abstraction, but it assumes fast random access. One option is a shared file system across the cluster, but that often does not scale. Another is to store the files as key-value pairs in a tool like S3, or something like Riak, where objects are similarly stored. We have chosen to use S3: we treat the data as immutable objects and lean heavily on Amazon. Files are stored in S3 using a compound key that has all the information I mentioned: the date, the product, the variables, whatever is necessary. So when we want to read data into Spark, we first produce a list of keys: we read the key listing, generate an RDD from the list of keys the users want from us, and then distribute the work. One important point: sometimes this list of keys is not big enough to be partitioned across all of the nodes of the cluster by default, so you need to explicitly repartition this list of keys so it spreads out to all the Spark worker nodes. Then a flatMap is run: on each worker, the necessary keys are read, the files are pulled down locally, netCDF-Java opens them from the local file system or an in-memory file system, and the flatMap turns them into records with a compound key carrying all the variables and the various dimensions, plus the value. At that point the data is distributed across the nodes, ready to work
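The read pattern just described — list keys, repartition the key list, flatMap each key into decoded records — can be sketched in plain Python. Everything here is a stand-in: `read_object` is a hypothetical placeholder for pulling one S3 object and decoding it with netCDF-Java, and the key layout is invented for illustration:

```python
def partition(keys, n):
    # spread the (often short) key list round-robin over n partitions,
    # mirroring the explicit repartition step described above
    parts = [[] for _ in range(n)]
    for i, k in enumerate(keys):
        parts[i % n].append(k)
    return parts

def read_object(key):
    # hypothetical stand-in: fetch one object and decode it into
    # (compound_key, value) records -- here three forecast hours per object
    run, var = key.split("/")
    return [((run, var, hour), 0.0) for hour in (0, 6, 12)]

keys = ["20150914T06/t2m", "20150914T06/u10", "20150914T12/t2m"]
# the "flatMap": every key in every partition fans out to its records
records = [rec
           for part in partition(keys, 2)
           for key in part
           for rec in read_object(key)]
```

The point of the explicit partition step is that three keys would otherwise land in very few default partitions, leaving most of the cluster idle while each object fans out into many records.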
on. Now, as far as using S3 or something similar: it really does influence your Spark cluster design. Typically with HDFS you go with higher CPU density per node, and you want to read from local storage on the node. When you are doing this S3 strategy, you need to decrease your CPU density per node: you want to have as many nodes as possible, for aggregate network bandwidth. With S3 we have seen per-node reads of around 300 megabytes per second, so you can certainly pull data off S3 very quickly if you have a large number of nodes on the job. The next thing to think about is cost: you definitely need to keep your EC2 instances in the same region as the S3 buckets where the data is stored, otherwise you get hurt on transfer costs when you are moving terabytes of data. Another part of the story is that this plays really well with Amazon's Elastic MapReduce, which is like an on-demand cluster engine: if you have your data stored in S3, you can spin up a Spark cluster on demand when you have a job, and shut it down when you have the results; this strategy works very well with that model. Also, once you have data pulled from S3 and read into your RDDs, you can then persist it to the underlying HDFS from Spark in whatever format you want; S3 does not have to be the end of the story, the question is what you do then.
So, Tom did not necessarily introduce exactly what we are doing, but mostly what I am doing with this Spark platform and these RDDs is machine learning algorithms for predictive modeling and statistics, and here we are trying to make it more concrete what can actually be done with it. It is mostly data preparation steps, and the first dimension of the problem is the volume of the data we begin with. If you look at the key that we present here, it basically contains every piece of information that the forecast content carries: which variable we are forecasting; the run time at which the forecast was issued; the actual ensemble member, when it is an ensemble forecast (an ensemble forecast contains multiple forecasts for the same time, so we can have probability information); the valid time, which is the time the forecast is made for; and then the latitude, longitude, and height, the three-dimensional position. When we put all these together we get the compound keys for the data points that we generate every day, and that is only one possible model: we have many that we use. The example you see here is the ECMWF, the European Centre for Medium-Range Weather Forecasts; we also use GFS, and many other models.

So we need feature extraction, and the first step is obviously filtering: filtering out and reducing the amount of data, keeping only the data that are of interest for the problem. With the small example code you see that we are, for example, extracting the two-meter temperature only (that is the only variable we are interested in), from the run issued at 06 hours, and we only take the first 24 hours of the forecast; and then we can also subset spatially by slicing along longitude and latitude.
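The slide's filtering step can be sketched in plain Python over compound-key records. The field names and values below are illustrative, not the production schema:

```python
from collections import namedtuple

# Compound key as described: variable, run time, forecast hour,
# ensemble member, and position (illustrative field names)
Key = namedtuple("Key", "var run fh member lat lon")

records = [
    (Key("t2m", "06", 0, 1, 37.5, 127.0), 293.1),   # 2 m temperature, 06 run
    (Key("t2m", "06", 30, 1, 37.5, 127.0), 291.4),  # beyond the 24 h window
    (Key("u10", "06", 0, 1, 37.5, 127.0), 3.2),     # wrong variable
    (Key("t2m", "12", 6, 1, 37.5, 127.0), 294.0),   # wrong run
]

# keep 2 m temperature from the 06 run, first 24 forecast hours only --
# in Spark this would be a single filter() over the RDD
subset = [(k, v) for k, v in records
          if k.var == "t2m" and k.run == "06" and k.fh < 24]
```

Because every attribute lives in the key, the whole selection is one predicate over keys; an additional clause on `k.lat` / `k.lon` would give the spatial subsetting mentioned above.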
What is great is that it is a single-line instruction that will run and extract the data, using what came before. Another example: if we want to do, say, translations of the data, it is actually very simple, by just shifting the key. By shifting the key we are able to generate the new data necessary for machine learning. What I display here is an example for a model that would take a certain number of past values as variables, from x(t-1) to x(t-i) — the values of a variable, for example temperature, for the past hours that we want to include in the model. The key shift is a good way for us to achieve that. Another nice feature which makes Spark very easy to use is resampling the data: it is really easy for us to coarsen the key, to map to a key such that all the points that belong to the same cell end up under the same key, so that by using aggregation functions with reduceByKey we get the aggregate of the data. Resampling is made pretty easy and can be done in a very flexible way, which is very useful. And here is a more complex version, a moving average, which requires a sliding window: that is where Spark is pretty machine-learning oriented; it has a sliding function that creates a sliding window over the data, over which we can use any type of aggregation — averages, medians, and so on. So in a very compact way we can prepare data and run machine learning models on it, using MLlib.
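The key-shift trick for lagged features, and the sliding-window moving average, can both be sketched in plain Python (a semantics sketch; in Spark the shift would be a map over keys followed by a join, and the window Spark's sliding function):

```python
# A keyed series: time step -> value (made-up temperatures)
series = {0: 10.0, 1: 11.0, 2: 13.0, 3: 12.0}

def shift(series, lag):
    # re-key each (t, x_t) record as (t + lag, x_t): after joining on the
    # new key, x_{t-lag} lines up with x_t as a lagged feature
    return {t + lag: x for t, x in series.items()}

lag1 = shift(series, 1)                                   # x_{t-1} keyed by t
rows = {t: (x, lag1[t]) for t, x in series.items() if t in lag1}

def moving_average(series, w):
    # sliding window of width w over the time-ordered keys
    ts = sorted(series)
    return {ts[i + w - 1]: sum(series[t] for t in ts[i:i + w]) / w
            for i in range(len(ts) - w + 1)}

smoothed = moving_average(series, 2)
```

Repeating `shift` with lags 1..i and joining each result builds the (x_t, x_{t-1}, ..., x_{t-i}) training rows described above.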
MLlib is the machine learning library that Spark has. It has a lot of different models: for clustering the data, for reducing the dimensionality, all the linear models that we need, up to decision trees and random forests, and, on the neural side, the multilayer perceptron — so MLlib also has neural network capability, which we are actually experimenting with right now. And it all works with RDDs: every one of these functions will take an RDD as input. So that basically concludes the presentation.
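Bridging the preparation steps to MLlib: the trainers in the RDD-based API consume (label, feature-vector) records, so the lagged rows built earlier only need reshaping. A plain-Python sketch of that last assembly step (the row values are the made-up ones from the lag example):

```python
# t -> (x_t, x_{t-1}) rows, as produced by the key-shift step
rows = {1: (11.0, 10.0), 2: (13.0, 11.0), 3: (12.0, 13.0)}

# reshape to (label, features): predict x_t from its lagged value --
# the shape MLlib's regression trainers expect (LabeledPoint in the
# RDD-based API)
training = [(x_t, [x_lag]) for t, (x_t, x_lag) in sorted(rows.items())]
```

In Spark this reshape would itself be a map over the RDD, and the resulting RDD would be passed straight into the chosen MLlib trainer.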
Thank you for your attention. We have time for questions.

[Question about storage choices and cost — S3, Glacier.] As far as choosing S3 for storage: the company was in the process of decommissioning its data centers, so the decision was made at the corporate level; there were significant savings in moving to the cloud. As far as the cost of compute and processing, I don't have exact numbers at hand, but given the volume it works for us. We have a kind of tiered storage system: in the product we are now working on, we are vending all the gridded data for all forecasts, so it first goes into an in-memory key-value store on the cluster we operate, and after that it is archived to S3. Then, given a strategic partnership with IBM, the data also goes to their side for analytics purposes. Having historical forecast data available is very important for model training: you capture the forecast you had when you actually ran the model, so you can train against what the forecast was at the time.

[Question about stress points — network versus CPU.] Aside from CPU, the network is one of the stress points in Spark: you want to keep your data local to each node as much as you can. Something like a shuffle operation incurs network traffic across the nodes, and that is the stress point, so you design your processing to limit that, and you actually try to avoid those operations where you can.

[Question about Glacier.] In the archival case, you store data in Glacier, and then if you find you need it, a retrieval request takes on the order of five hours. So Glacier is for archival: with S3 as your primary store you have the Glacier option, but you would not want to use it for real-time or near-real-time work.


Formal Metadata

Title Using Spark in Weather Applications
Series Title FOSS4G Seoul 2015
Author Kunicki, Tom
License CC Attribution - NonCommercial - ShareAlike 3.0 Germany:
You may use, modify, and reproduce the work or its content for any legal, non-commercial purpose, and distribute it and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify and that you pass on the work or its content, including in modified form, only under the terms of this license.
DOI 10.5446/32074
Publisher FOSS4G, Open Source Geospatial Foundation (OSGeo)
Publication Year 2015
Language English
Producer FOSS4G KOREA
Production Year 2015
Production Location Seoul, South Korea

Content Metadata

Subject Area Computer Science
Abstract "Many important weather related questions require looking at weather models (NWP) and the distribution of model parameters derived by ensembles of models. The size of these datasets has restricted their use to event analysis. The ECMWF ensemble has 51 members. Using all these members for statistical analysis over a long period of time requires very expensive computational resources and a large amount of processing time. While event analysis using these ensembles is invaluable, detailed quantitative analysis is essential for assessing the physical uncertainty in weather models. Even more important is to potentially detect different weather regimes and other interesting phenomena buried in the distribution of NWP parameters that could not be discovered using a deterministic (control) model. Existing tools, even distributed computing tools, scale very poorly to handle this type of statistical analysis and exploration - making it impossible to analyze all members of the ensemble over large periods of time. The goal of this research project is to develop a scalable framework based on the Apache Spark project and its resilient dataset structures proposed for parallel, distributed, real time weather ensemble analysis. This distributed framework performs parsing and reading GRIB files from disk, cleaning and pre-processing model data, and training statistical models on each ensemble enabling exploration and uncertainty assessment of current weather conditions for many different applications. Depending on the success of this project, I will also try to tie in Spark's streaming functionality to stream data as they become ready from source, eliminating a lot of code that manages live streams of (near) real-time data."
