"Many important weather-related questions require looking at numerical weather prediction (NWP) models and at the distribution of model parameters derived from ensembles of those models. The size of these datasets has restricted their use to event analysis. The ECMWF ensemble, for example, has 51 members, and using all of them for statistical analysis over a long period requires very expensive computational resources and a large amount of processing time. While event analysis using these ensembles is invaluable, detailed quantitative analysis is essential for assessing the physical uncertainty in weather models. Even more important is the potential to detect different weather regimes and other interesting phenomena buried in the distribution of NWP parameters that could not be discovered using a deterministic (control) model. Existing tools, including distributed computing tools, scale very poorly for this type of statistical analysis and exploration, which makes it impractical to analyze all members of the ensemble over long periods. The goal of this research project is to develop a scalable framework, based on the Apache Spark project and its resilient distributed datasets (RDDs), for parallel, distributed, real-time weather ensemble analysis. This distributed framework reads and parses GRIB files from disk, cleans and pre-processes the model data, and trains statistical models on each ensemble, enabling exploration and uncertainty assessment of current weather conditions for many different applications. Depending on the success of this project, I will also try to tie in Spark's streaming functionality to stream data from the source as they become ready, eliminating a lot of code that manages live streams of (near) real-time data." |
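
As a rough illustration of the kind of per-grid-point ensemble statistic the framework is meant to compute, the sketch below uses plain Python to mimic a map step followed by a reduce-by-key step, the pattern a Spark job would typically distribute across workers. It is not the project's implementation and it does not use Spark or read GRIB files: the records, the grid-point labels, and the helper names (`to_pair`, `merge`, `reduce_by_key`, `ensemble_stats`) are all hypothetical, and the values are made up for the example.

```python
from math import sqrt

# Hypothetical records: (member_id, grid_point, value), e.g. temperature in K.
# In the proposed framework these would come from decoded GRIB messages;
# here they are invented values for three ensemble members at two points.
records = [
    (0, "p0", 280.0), (1, "p0", 282.0), (2, "p0", 284.0),
    (0, "p1", 290.0), (1, "p1", 290.0), (2, "p1", 290.0),
]


def to_pair(record):
    """Map step: key by grid point, carry (count, sum, sum of squares)."""
    _member, point, value = record
    return point, (1, value, value * value)


def merge(a, b):
    """Reduce step: associative, commutative merge of two partial sums."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def reduce_by_key(pairs, fn):
    """Single-process stand-in for a distributed reduce-by-key."""
    acc = {}
    for key, val in pairs:
        acc[key] = fn(acc[key], val) if key in acc else val
    return acc


def ensemble_stats(records):
    """Return {grid_point: (ensemble mean, ensemble spread)}."""
    partial = reduce_by_key(map(to_pair, records), merge)
    stats = {}
    for point, (n, s, ss) in partial.items():
        mean = s / n
        # Population variance; clamp tiny negative values from rounding.
        var = max(ss / n - mean * mean, 0.0)
        stats[point] = (mean, sqrt(var))
    return stats


if __name__ == "__main__":
    for point, (mean, spread) in sorted(ensemble_stats(records).items()):
        print(point, round(mean, 3), round(spread, 3))
```

Because `merge` is associative and commutative, the partial sums can be combined in any order, which is what lets this kind of statistic be computed across all ensemble members in parallel. The sum-of-squares formula is used only for brevity; a numerically more stable variance update would be preferable for real data.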