Serving high-resolution spatiotemporal climate data is hard, let's go shopping


Formal Metadata

Title
Serving high-resolution spatiotemporal climate data is hard, let's go shopping
Title of Series
Author
Hiebert, James
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
FOSS4G, Open Source Geospatial Foundation (OSGeo)
Release Date
2014
Language
English
Producer
FOSS4G
Open Source Geospatial Foundation (OSGeo)
Production Year
2014
Production Place
Portland, Oregon, United States of America

Content Metadata

Subject Area
Abstract
The world is a big place and time is infinite. Scientists who study any aspect of the Earth's climate are immediately faced with the exponentially growing amount of data that are required to represent properties of the climate in both time and space. The bulk of these data is a substantial barrier to extracting meaningful information from their contents. This barrier can be prohibitive to smaller-scale researchers and communities that want to study and understand the impact of the climate on their localities. Fortunately, a substantial amount of free and open source software (FOSS) exists upon which one can build a great geospatial data application.

The Pacific Climate Impacts Consortium (PCIC), a regional climate services provider in British Columbia, Canada, has been making a concerted effort to use geospatial FOSS in order to expand the availability, comprehensibility and transparency of big climate data sets from the Coupled Model Intercomparison Project (CMIP5) experiment. With a full stack of geospatial FOSS and open protocols we have built and deployed a web platform capable of visualizing and distributing high-resolution spatiotemporal raster climate data.

Our web application consists of:
+ back-end storage with raw NetCDF4/HDF5 files
+ a PostgreSQL/PostGIS database for indexed metadata
+ ncWMS for maps and visualization
+ the PyDAP OPeNDAP server for data requests
+ a web user interface to tie it all together

This presentation will provide a case study for enabling scientific collaboration using FOSS and open standards. We will describe our application architecture, present praise for and critique of the components we used, and provide a detailed discussion of the components that we had to improve or write ourselves. Finally, though our use case is specific to climate model output, we will provide some commentary as to how this use case relates to other applications of spatiotemporal data.
Keywords
spatiotemporal
opendap
big data
netcdf
hdf
postgis
climate
OK, it looks like we're ready to start. My name is James Hiebert, and I'll be talking today about serving high-resolution spatiotemporal data on the web. Despite the subtitle, there isn't actually going to be any shopping involved in this talk, though I suppose it perpetuates the stereotype that Canadians come down to the US to go shopping.
I'm here from the Pacific Climate Impacts Consortium, or PCIC, a non-profit based out of the University of Victoria in Victoria, British Columbia. Our mission is to bridge the gap between academic climate science and applied policy regarding the impacts of climate change. The study of global climate change requires large spatio-temporal climate simulations, simulations that provide coverages over both time and space, and what I'll discuss today is how we built a system to serve these large datasets to our stakeholders. But let me start with a little bit about myself. We've got a little map here; once I zoom in you'll hopefully be able to see something, if the projector cooperates. I've said that our stakeholders work to address vulnerability to climate change, but what does that really mean?
I live on an island in an archipelago called Haida Gwaii, in a small coastal community. Our grocery trucks come in once a week by ferry, and they're easily disrupted by poor weather. Much of my community's population is First Nations, and everyone relies very heavily on the ocean for subsistence, so meteorological disruptions very directly cause socio-economic disruption. My house, along with many others, is about 100 metres from the high tide line, and there's a coastal marsh next to us; only about one to two metres of tide separates low tide from high tide across a couple of kilometres of marsh, so you can imagine what the effects of another metre of sea level rise might be in the community. In communities like mine, the potential effects of climate change are very first-order and very directly experienced, or at least easy to imagine. But larger communities have vulnerabilities as well. Victoria, where PCIC is based, and Vancouver are both susceptible to the effects of sea level rise. This photo here is from my plane on the way down here, landing in Vancouver, and you can see both the edge of the runway and tidewater in the same picture frame. Likewise, you'll see this in other larger, seemingly invincible cities like San Francisco and New York, or like the house in this photo in the Chesapeake Bay, not far from Washington DC; that house isn't there anymore. So in 2005 the province of BC recognized its vulnerability to climate change by creating PCIC, via an endowment to the University of Victoria, and charged us with the mission of bridging the gap between the academic climate science research community and regional users of climate information.
We conduct credible, peer-reviewed research on the effects of climate change in BC and work with stakeholders such as provincial, regional and local governments to make use of that research.
Our stakeholders use climate projections to answer a wide array of pertinent questions and to make policy decisions, and engineering decisions for incredibly expensive infrastructure, based on the results of our impacts models. Examples include whether future river flows can support hydropower; whether future storm intensity will necessitate larger culverts, storm drains or bridges; to what degree sea level rise might inundate our homes and farmland (this photo is a few minutes' walk from my office); how loss of glacier mass might affect our streams and rivers, for both tidewater and alpine glaciers; or whether forests might become more susceptible to fire or outbreaks of disease.
Answering these kinds of questions requires us to have information at a very small scale, at high resolution. As applied to climate data, high resolution is difficult to define; for impacts models, users typically require landscape-level information or better, so "high" is a relative term that usually means higher than what's readily available. But allow me a brief interlude to first explain where climate data comes from, and how one gets from global physics to landscape-level information and local effects.
It's a long and data-intensive process, and it requires the expertise of planners, climate scientists, statisticians, domain experts such as foresters and hydrologists, and computer scientists such as myself. Just for the record, I personally don't have expertise in climate science, and this talk isn't about that particular part; I'm discussing it only to motivate our use case and to help explain the semantics of our data. A snapshot of the process looks like this: a large coalition of international climate scientists run global climate models, which represent the climate at a large scale. These models are driven by fundamental physical laws and include a variety of features that affect the climate system. Sometimes, to get information at a lower scale, the same models are run with smaller grid boxes over a particular region, and that's referred to as a regional climate model. Then climate statisticians, like those at PCIC, downscale the GCM or RCM output to a regional or local scale, and finally domain experts may run local impacts models at the local scale.
If you think about that whole process as a data pipeline, PCIC has expertise in the areas on the right: downscaling and some impacts modelling. We rely on open data and open protocols from the organizations upstream, which are mostly federal governments, and we want to provide our downscaled products as open data over open protocols downstream. That's one reason why we want to release the data: assessing climate impacts doesn't end at downscaling, and there are innumerable other domain experts who can use these downscaled data to assess climate impacts in their areas of expertise, for example hydro engineers, highway engineers and hydropower engineers. In fact, the province of BC recently began requiring proponents of natural resource projects to include the impacts of climate change as part of the environmental impact assessment process. It's still in its infancy, but it means there are a lot of consultants out there who really need data on climate change. And finally, we're confident our work holds up to peer-reviewed scientific rigour, so we want to provide as much transparency as possible, and having the data publicly accompany our journal articles is a great way to do that. Nearly all of our climate data are spatio-temporal raster data: three- or four-dimensional coverages with x, y and time dimensions, sometimes including a z dimension. What we wanted to do is create a general-purpose framework for serving spatio-temporal raster data and then put a bit of a web UI on top. It turns out that's hard to do.
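To make the shape of these coverages concrete, here is a minimal sketch of inspecting such a file with the netCDF4 Python library. The file name and variable name are hypothetical, and the dimension layout simply follows common CF conventions.

    # Minimal sketch, assuming a hypothetical downscaled-precipitation file.
    # Variable and dimension names follow common CF conventions and are
    # illustrative only.
    import netCDF4

    with netCDF4.Dataset("pr_day_downscaled_bc.nc") as nc:
        # Typical spatio-temporal raster layout: time x lat x lon (sometimes plus a z level)
        for name, dim in nc.dimensions.items():
            print(name, len(dim))
        pr = nc.variables["pr"]        # e.g. daily precipitation, shape (time, lat, lon)
        print(pr.dimensions, pr.shape)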
If you consider that climate scientists are modelling some number of future scenarios for human emissions, and multiply that by all the different global climate models, by all the different regional climate models, by all the different types of downscaling, by some number of measured quantities, by time and by space, all of a sudden you've got a lot of data. Unlike observations, we're not limited by the sensors or satellites that we can afford to deploy; we can just create data out of thin air, or at least a model that's churning away simulating the Earth and the ocean can. And we're not even limited by what's happened in the past: we're projecting the future, and not just one future but many, many realizations of the future as modelled by many people at many modelling centres. Now, we could have just dropped all of our data, the whole mass of NetCDF files, on an FTP site or a static web server and called it good. But most of the researchers with whom we collaborate don't have the ability to download and store hundreds of terabytes of data like we do. Plus, it turns out that most impacts modelling is very location-specific, so people shouldn't have to download data for all of Canada just to select out Vancouver's data and throw the rest away. We want to give people the data they want, and only the data that they want.
This is the architecture we designed for the system, represented by this diagram. You'll notice the data itself is down at the bottom as the foundation of everything, and then there are several streams of services built on top of that. The NetCDF box at the lower left is the only thing that's just data sitting on disk, and then we've got PostgreSQL/PostGIS, ncWMS, PyDAP and the PDP, which is the package that we wrote; they're all different services that we have running which respond to incoming web requests. All the metadata regarding our available data is organized in Postgres, ncWMS provides the climate visualization layers, PyDAP responds to all the requests for the actual data itself, and the PDP responds to all the requests that build up the user interface. I'll also go through a brief demo that you may or may not be able to see, but I'll start the download ahead of time.
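As a rough illustration of the role the Postgres metadata index plays, the services can resolve a user's selection to files on disk with a query along these lines. The table and column names here are hypothetical, not PCIC's actual schema.

    # Hypothetical sketch: resolve a user's selection to NetCDF files on disk
    # via a metadata index in Postgres. Table and column names are illustrative,
    # not PCIC's actual schema.
    import psycopg2

    conn = psycopg2.connect("dbname=pcic_meta")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT filepath, time_start, time_end
            FROM data_file
            WHERE variable = %s AND emissions_scenario = %s
            ORDER BY time_start
            """,
            ("pr", "RCP8.5"),
        )
        for filepath, t0, t1 in cur.fetchall():
            print(filepath, t0, t1)
    conn.close()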
So let me introduce a hypothetical user of our climate services.
This is Alice. She's an engineer with BC's Ministry of Transportation, and she's working on one of BC's remote coastal communities, Bella Coola, which is attached to the outside world by ferry and a single road. She's assessing its vulnerability to extreme precipitation and its effects on roads, culverts, bridges and other critical infrastructure. Alice wants plausible future climate scenarios for the watersheds around her highway. We have a map for selecting the area of interest, and she can bring up an overlay of the climate rasters to see where information is available. On the right-hand side there's a tree of all the different scenarios of greenhouse gas concentration pathways, the different GCMs and the different downscaling methods.
There's also a time selector: maybe she's only interested in a subset of the future, or she wants to analyze the past plus the next forty years to correspond to the projected life of some bridge. Then she selects an output format; we offer NetCDF and a couple of others for convenience, an ASCII grid format and the plain-text representation I mentioned. She just selects download and the data starts streaming right away; this is the download we started earlier. Now, the datasets that a user could download are potentially very large: each scenario's full spatiotemporal domain is around 150 gigabytes, which is ridiculous to serve up as a static file over HTTP, and as soon as you want to write a web app around it to allow dynamic responses, subset requests and so on, all of a sudden you're talking about a lot of I/O-bound computation that has to happen before the HTTP response goes out, which ideally should take less than a second. So, while we're waiting for that download, I'll go through all of the software components we used, which ones were off the shelf and which ones we had to modify for this purpose.
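Because the portal speaks OPeNDAP, the same subsetting Alice does through the web UI can also be done programmatically. Here is a minimal sketch using the pydap client; the dataset URL, variable name and index ranges are made up for illustration.

    # Minimal sketch of an OPeNDAP subset request with the pydap client.
    # The dataset URL, variable name and index ranges are hypothetical.
    from pydap.client import open_url

    ds = open_url("https://example.org/data/pr_day_downscaled_bc.nc")
    print(list(ds.keys()))              # discover which variables the server exposes
    pr = ds["pr"]                       # e.g. daily precipitation, (time, lat, lon)
    # The constraint is evaluated server-side, so only this hyperslab of the
    # multi-gigabyte coverage crosses the network.
    subset = pr[0:14600, 100:120, 200:230]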
We've written a full web application back end in Python which does all the file format translation and all the database communication, and passes all the metadata on to the web UI to be interpreted for the user. That consists of about 2,800 lines of Python code, plus about 1,500 lines of just testing code, and then there's another 3,000 or so lines which make up PyDAP, which we've modified; I'll talk in a moment about to what extent.
PyDAP is the component of the data portal that actually provides the data download services. It's an implementation of OPeNDAP, which is designed to be a discipline-neutral means of transferring data across the web. The protocol is open, and it's designed to be application-independent, such that you can get the data into whatever software you want to use for your data analysis. It's supported mostly by US scientific agencies such as NOAA and NASA. There are a number of different OPeNDAP servers out there, but PyDAP is the one we use; its architecture is quite a bit more flexible than some of the others. This is a rough layout: at the bottom it has a number of different handlers, which are written to interpret different data formats and translate them into the DAP structure in the middle, and on top there are numerous responders, which translate the DAP structure into the output format the user wants. That gives quite a bit of flexibility, which is important for us, especially given the wide range of abilities our users have.

To give you an idea of to what degree PyDAP was off the shelf, I ran "hg churn" on all of our PyDAP repositories, which measures the changes in a repository by lines of code; the fractions shown are the churn from the staff in my group divided by the total churn of all contributors. You can see that we wrote one handler entirely ourselves, and that the HDF/NetCDF handler we're using is mostly ours; that's primarily work that needed to be done to make the server able to stream its responses. For the rest we only had to make minimal changes.

We're using a modified version of ncWMS to provide visualization of the climate rasters. We really like it; it gives us a lot of stuff for free: a full-featured Web Map Service server that converts NetCDF files into tiled images usable on the web, with support for the time dimension, which is very important to us given that most of our rasters have thousands of time steps. Unfortunately, it has a few limitations that make it non-ideal for use with big data. To configure a layer you have to go through the files one by one, add them to a list, and then configure five to ten different attributes. Additionally, whenever you want to start or restart the server, it goes through every single file in order and scans the whole file to determine the ranges it can assign to a colour bar; that can take many minutes, possibly hours, and it only gets slower the more layers you add. So we've made some modifications to ncWMS to run it off of a Postgres database, from which it gets its whole list of layers, variable ranges and everything. That decouples image generation from the I/O-intensive part of configuring layers, and has made it possible to scale up the deployment.

Finally, the last piece of the software stack is the JavaScript front end that ties everything together for the user. Ultimately it doesn't provide much functionality in and of itself, but it's key to providing a good user experience: it's aware of all the various services that are available, asynchronously makes requests, processes the results, and displays things to the user. All the mapping is provided by OpenLayers, and we use jQuery for a bit of convenience. Now, this is the free and open-source software conference, so I should mention to what extent our components are free and open-source software.
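To illustrate the handler/responder split described above, here is a rough sketch of what a minimal pydap handler can look like: it translates some source format into the DAP data model, and pydap's stock responders then render that model in whatever output format the client requests. The file format and class here are hypothetical, and the handler API differs somewhat between pydap versions.

    # Rough sketch of a pydap handler: translate a (hypothetical) source format
    # into the DAP data model; pydap's stock responders then render DAS/DDS,
    # ASCII, NetCDF, etc. from that model. The handler API varies by pydap version.
    import re
    import numpy as np
    from pydap.model import DatasetType, BaseType
    from pydap.handlers.lib import BaseHandler

    class ToyHandler(BaseHandler):
        # Which files this handler claims; handlers are normally registered
        # with the server via a setuptools entry point.
        extensions = re.compile(r".*\.toy$", re.IGNORECASE)

        def __init__(self, filepath):
            BaseHandler.__init__(self)
            self.dataset = DatasetType("toy_dataset")
            # A real handler would read (ideally lazily) from `filepath`.
            self.dataset["tasmax"] = BaseType("tasmax", np.arange(10, dtype="f4"))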
For the most part of this project, unfortunately, we're mostly just open-source users rather than huge contributors. But all of our code, for PyDAP, the PDP and the web UI, is on our homepage, which is linked from the talk, and is released under various free software licences. It's technically free software, but not so much in the sense of having a vibrant community effort driving it; to our knowledge we're the only contributors and users for all the parts that we wrote outright. Much of the code is still pretty specific to our needs, mostly because we just haven't had the need to generalize it, so if anyone's interested in using any of the components, please just get in touch. Anyway, let's go back to the demo and see whether our data has finished downloading.
And what do you know, it looks like it has. Our engineer has now quickly downloaded gigabytes of custom-selected climate scenario output with just a few clicks, and the data is fully attributed with metadata: units, references and citations to the methods used to perform the downscaling. In the spirit of open science, having all the metadata directly attached to the data is a pretty big deal, because it ensures the data provenance stays traceable even if further operations are performed on the data later on, which, from our perspective, is the ultimate goal. From here it's relatively easy for Alice to plug the numbers into whatever impacts model she wants to run.
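As an aside on what "fully attributed" looks like in practice, the provenance travels inside the downloaded file itself and can be read back with any NetCDF tool. The attribute names below follow common CF conventions and are illustrative rather than PCIC's exact set.

    # Sketch: read back the global and per-variable attributes carried inside a
    # downloaded NetCDF file. Names follow common CF conventions and are
    # illustrative rather than PCIC's exact attribute set.
    import netCDF4

    with netCDF4.Dataset("alice_download.nc") as nc:
        for attr in nc.ncattrs():                  # global attributes: institution,
            print(attr, "=", nc.getncattr(attr))   # history, references, ...
        pr = nc.variables["pr"]
        print(pr.ncattrs())                        # e.g. units, standard_name, cell_methods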
So with that, I'll leave you with my simple conclusions. Governments use downscaled climate model output to plan for the effects of climate change on their infrastructure. There is so much model output that data delivery is a non-trivial problem. We've tried to make it as easy as possible for users to narrow the data down to what they actually need, and we stream it to them right away. Our work is available, and we're happy to work with others to make it more general.

[Moderator] I think we've got five minutes for questions.

Q: I'm just curious what you do for hosting your large amounts of data: your own servers, or do you use cloud services?

A: We do run our own servers. We're fortunate enough to be at a university where we're subsidized to some extent, so we build a lot of our own stuff locally.

Q: You mentioned the downloaded data cubes are fully attributed; are those the attributes carried in the NetCDF itself?

A: Yes. I should probably add that that's not the case for some of the other output formats; those are messier.

Q: I work with some similar model data as well, and as far as tiling it out and serving it with map tiles, the data explodes really quickly. How large is the cluster that you're working with to process this data?

A: We don't actually use any cluster; it's a single server that everything runs on. One of the tricks is organizing the data cubes so they can be read quickly. If you lay out the data in time-major order (I may have major and minor mixed up), you can make a map really quickly, but you can't drill down along the time dimension, because you have to read the whole way through. So if you lay it out one way you can easily make a map, and if you lay it out the other way you can easily read a time series.

Q: Do you have any plans for adding analytic capabilities, statistical derivations of the data, as services?

A: We want to; we're working on funding for that.

Q: You almost answered my question: you're storing large files and you want to serve them up quickly; what sort of clustering or compression do you use?

A: I don't think we use any clustering. We use filesystem-level compression, so the decompression gets spread out across all of our cores. The files are stored uncompressed at the NetCDF level but compressed at the filesystem level. Because the data is spatiotemporal there are a lot of no-data values, over the oceans and so on, so you get really good compression; all the no-data values essentially go away.

Q: What do you store in the database?

A: Off the top of my head, the important things from a performance perspective are the ranges that are used to generate the colour bars. We also store which GCM and RCM each file came from, and the full time dimensions, so if you want the file for a specific time step you can look that up in the database first. But the ranges are probably the most important thing.

[Question inaudible]

A: That's a good question; no, I think that goes straight to the files themselves.
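Regarding the chunk-layout trade-off mentioned above: with NetCDF-4/HDF5 files this is typically controlled through per-variable chunking. A hedged sketch with netCDF4-python follows; the chunk shapes and grid sizes are illustrative, not PCIC's actual settings.

    # Sketch of the layout trade-off discussed above, using netCDF4-python
    # chunking. Chunk shapes and grid sizes are illustrative only.
    import netCDF4

    nc = netCDF4.Dataset("layout_demo.nc", "w")
    nc.createDimension("time", None)
    nc.createDimension("lat", 510)
    nc.createDimension("lon", 1068)

    # Map-friendly layout: one time step per chunk -> fast tile rendering,
    # but a full time series at one point touches every chunk in the file.
    nc.createVariable("pr_map", "f4", ("time", "lat", "lon"),
                      chunksizes=(1, 510, 1068))

    # Time-series-friendly layout: long runs of time over a small spatial
    # window -> fast point/area extraction for impacts modellers.
    nc.createVariable("pr_ts", "f4", ("time", "lat", "lon"),
                      chunksizes=(365, 16, 16))
    nc.close()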
Q: What's the spatial domain?

A: It depends. We mostly work in our mandated territory, the BC, Pacific and Yukon region, but we also have global climate models involved, so some of it is global, and we do some projects for Environment Canada at a national level. So it just depends on the data, but typically that's one of the big things we do that helps connect a global climate model to the local scale: doing that downscaling. We have a number of people who have built up expertise in downscaling over the years.

[Moderator] Great, thank you very much.