Using Spark in Weather Applications

FOSS4G

Open Source Geospatial Foundation (OSGeo)

Kunicki, Tom

Formal Metadata

Title

Using Spark in Weather Applications

Title of Series

FOSS4G Seoul 2015

Number of Parts

183

Author

Kunicki, Tom

License

CC Attribution - NonCommercial - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this

Identifiers

10.5446/32074 (DOI)

Publisher

FOSS4G

Open Source Geospatial Foundation (OSGeo)

Release Date

2015

Language

English

Producer

FOSS4G KOREA

Production Year

2015

Production Place

Seoul, South Korea

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

"Many important weather-related questions require looking at numerical weather prediction (NWP) models and the distribution of model parameters derived from ensembles of models. The size of these datasets has restricted their use to event analysis. The ECMWF ensemble has 51 members. Using all of these members for statistical analysis over a long period of time requires very expensive computational resources and a large amount of processing time. While event analysis using these ensembles is invaluable, detailed quantitative analysis is essential for assessing the physical uncertainty in weather models. Even more important is the potential to detect different weather regimes and other interesting phenomena buried in the distribution of NWP parameters that could not be discovered using a deterministic (control) model. Existing tools, even distributed computing tools, scale very poorly to this type of statistical analysis and exploration, making it impossible to analyze all members of the ensemble over long periods of time. The goal of this research project is to develop a scalable framework, based on the Apache Spark project and its resilient distributed dataset (RDD) structures, for parallel, distributed, real-time weather ensemble analysis. This distributed framework parses and reads GRIB files from disk, cleans and pre-processes model data, and trains statistical models on each ensemble, enabling exploration and uncertainty assessment of current weather conditions for many different applications. Depending on the success of this project, I will also try to tie in Spark's streaming functionality to stream data as they become ready at the source, eliminating a lot of code that manages live streams of (near) real-time data."
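The per-grid-point ensemble statistics the abstract refers to can be illustrated with a small sketch. This is not the talk's framework and does not use Spark: it is a plain-Python stand-in, and the record layout `(grid_point, member_id, value)` and the function name `ensemble_stats` are assumptions made for illustration. It keeps one mergeable accumulator (count, sum, sum of squares) per grid point, which is the same shape of computation that a keyed reduction such as Spark's `reduceByKey` would distribute across a cluster.

```python
from collections import defaultdict
from math import sqrt


def ensemble_stats(records):
    """Return {grid_point: (mean, spread)} across ensemble members.

    `records` is an iterable of (grid_point, member_id, value) tuples;
    this layout is a hypothetical one chosen for the example.
    Spread is the population standard deviation over the members.
    """
    # One accumulator per grid point: (count, sum, sum of squares).
    # Accumulators of this form can be merged in any order, which is
    # what makes the computation easy to parallelize by key.
    acc = defaultdict(lambda: (0, 0.0, 0.0))
    for point, _member, value in records:
        n, s, ss = acc[point]
        acc[point] = (n + 1, s + value, ss + value * value)

    stats = {}
    for point, (n, s, ss) in acc.items():
        mean = s / n
        # Clamp at zero to guard against tiny negative values caused
        # by floating-point rounding.
        variance = max(ss / n - mean * mean, 0.0)
        stats[point] = (mean, sqrt(variance))
    return stats


# Toy input: three ensemble members at two grid points (made-up values).
records = [
    ((0, 0), 1, 1.0), ((0, 0), 2, 2.0), ((0, 0), 3, 3.0),
    ((0, 1), 1, 5.0), ((0, 1), 2, 5.0), ((0, 1), 3, 5.0),
]
result = ensemble_stats(records)
```

In this toy input the members disagree at grid point `(0, 0)` and agree exactly at `(0, 1)`, so the second point has zero spread. With a real 51-member ensemble the same accumulator would simply see 51 values per grid point and forecast time.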