
A high performing data retrieval system for large and frequently updated geospatial datasets


Formal Metadata

Title: A high performing data retrieval system for large and frequently updated geospatial datasets
Number of Parts: 351
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year: 2022

Content Metadata

Abstract
ECMWF is a research institute and a 24/7 operational service, producing global numerical weather predictions and other data for a broad community of users. To achieve this, the centre operates one of the largest supercomputer facilities and data archives within the meteorological community. ECMWF also operates several services for the EU Copernicus programme, providing data for climate change, atmosphere monitoring and emergency services. As part of ECMWF's open data initiative, more and more meteorological data and web services are freely available to a wider community. ECMWF's web services include an interactive web application to explore and visualize its forecast data, a Web Map Service (WMS) server and many graphical products, including geospatial weather diagrams such as the so-called Ensemble (ENS) meteograms and vertical profiles. ENS meteograms and vertical profile diagrams are among ECMWF's most popular web products and present ECMWF's multi-dimensional real-time ensemble forecast data for any given position globally. They are freely available through various ECMWF web services and are integrated into ECMWF's GIS-based interactive web application. The datasets powering the dynamically generated diagrams form a rolling archive of 10 days of data, updated twice a day, with each update consisting of around half a terabyte of data. An upcoming upgrade of ECMWF's forecasting system will increase the data size by a factor of 3-4 in the near future. In addition to ECMWF's forecast data, similar services are requested as part of various Copernicus projects producing different datasets. This talk presents the migration of the legacy data structure used for the ENS meteogram datasets to a more flexible, extensible and high-performing one, fit to be used by GIS systems, using Free and Open Source Software (FOSS). The new data structure is built on the Python ecosystem. The data preparation workflow, as well as the challenges encountered and the solutions adopted when dealing with large and frequently updated geospatial datasets, are presented. The talk also includes early experiments and experiences in offering these datasets as part of OGC's Environmental Data Retrieval (EDR) API.
Transcript: English (auto-generated)
Thank you very much, and thanks for coming. My name is Gian. I work at ECMWF, the European weather centre in Reading. I will be talking about a high-performing data retrieval system, really a storage system, for large and frequently updated datasets. It is mainly weather data, and I will focus in particular on weather data at points.
So if you would like to get the weather data at a given point anywhere on the globe, or at many points, that is exactly what we designed this for. The content of my talk: I will talk a bit about ECMWF, the European weather centre, and what we do. Then I will talk a bit about our data and also our graphical products, which is where the system feeds the data to. Then I will talk a bit about the point-based products, that is, the weather data at a given point or many points, and the challenges and issues we face in doing that.
And I will talk about the data structure, how we hold the weather data to be able to deliver this, and the retrieval mechanism, as well as the future challenges. So, ECMWF: we have already had four presentations at FOSS4G this year, three of them yesterday and one on the first day, and we all share more or less the same slide, so if you have seen it already this will be a repeat. ECMWF is the European Centre for Medium-Range Weather Forecasts, established in 1975.
It is an intergovernmental organisation, with 23 member states and 12 co-operating states and nearly 400 staff distributed over three sites in Europe. Our headquarters is in Reading, where most of the staff work. Then we have the Bologna office, which hosts our supercomputer.
It is one of the largest supercomputers in the world, because you need big computers to run weather models or climate models. And then we have an office in Bonn, which mainly hosts our staff working on European projects. We provide a 24/7 operational service: we run an operational numerical weather prediction model, we produce data, and we deliver that data to our member states, co-operating states and customers. It is a lot of data. Our main customers are, of course, the national weather services of our member states, but we have private customers as well. Our primary core function is to deliver that binary data every day when we run the model. We are also a research institute at the same time, because the weather models need to be developed continuously. We have a big research group working on the model itself, and a big research group working on the initial conditions, because you need an initial condition to run your model. So we have a very strong research community at ECMWF as well.
We also operate EU Copernicus services: the Climate Change Service, C3S, and the Atmosphere Monitoring Service, known as CAMS. These also operationally provide data as well as graphical products. We also support the Copernicus Emergency Management Service, CEMS, through the EFAS project, the European Flood Awareness System; there was a talk about it yesterday as well. And as a new initiative, as part of Destination Earth, ECMWF is also developing a digital twin of the Earth together with ESA. So we have a lot on our hands.
As I said earlier, we run our model four times a day and provide the data to our customers, and it is a large amount of data. But as a complementary service we also provide lots of graphical products, for our customers but also for the general public. When it comes to weather, graphical products usually take one of two forms. Either you have 2D maps, a coverage over part of the world or the whole globe, showing how a weather parameter is distributed and how it changes over the forecast time steps; or the products are point-based, so you have a point somewhere on the globe, the user clicks, and they receive a diagram showing either the time evolution of the parameter they are interested in or the vertical structure of the atmosphere at that point. For the rest of this talk I will look mostly at the data structures for the point-based data, and I will also talk about the size of the data, to give an idea of why it is a challenge for us.
We provide those graphical products through various web applications, and the point-based products are part of those web applications too. To quickly mention the applications: we have an application called ecCharts, which is an interactive one designed for expert users. It runs on OpenLayers, and users can zoom and pan, add different layers together, overlay them and change the overlays. They can even do computations on demand based on our data; that is usually aimed at forecasters. On the same infrastructure we provide WMS and WFS services, with nearly 300 different layers, all from ECMWF data of course. Most of them, in fact almost all of them, are available to our registered users, and a portion is available for general public use as well. Then we have a dashboard, which is simply a portal to collect different products, because users can generate many, many products by combining different parameters, and the dashboard is an easy way to see them all together.
Recently ECMWF started a new initiative, Open Data. Up until two years ago our data was very restricted, but now much of it is quite open: you can access many of ECMWF's parameters as binary data, but also graphically. We have another application called OpenCharts, which is accessible to anyone, so you can go to ECMWF's website, go to the Charts section, and you will see hundreds and hundreds of different products. This one does not let you interact, because of the load we are worried about; instead it has predefined products and predefined areas for those products. So there are plenty of ways to see the graphical products from ECMWF, and those point-based diagrams are all integrated in those applications, so users can click on the maps and generate the diagrams.
Now some data concepts, to understand the data sizes we are dealing with. ECMWF is one of the pioneers of running ensemble forecasts.
So what is ensemble data? Instead of running one model from a single initial condition, we actually run 50 models, four times a day, and generate 50 different outcomes. That is because the initial conditions are generated from observations from all over the world, and they can be quite wrong; if you start with a wrong initial condition, it does not matter how good your model is, you will end up with wrong data. To accommodate that, the ECMWF ensemble model perturbs the initial conditions slightly and produces a probabilistic outcome, so that different scenarios can be seen from slightly different initial conditions. But what that means for developers like us is that there is 50 times more data to deal with.
Then of course meteorological data has time steps, and we have two time dimensions. Because we generate four forecasts a day, we have a base time, which is what we call the initialisation time of the forecast. Then we have what we call the valid time: every forecast goes out to day 10 or 15 or to six weeks, depending on the range, and we provide weather for six-hourly time steps. So that is an extra 80 to 90 times the single data field. Also, when we produce the weather data we produce it on grid points, discrete points in x-y-z space, so the whole world is covered with those grid points and any user can access the data at a given grid point. What we call the model resolution is simply how far apart those grid points are, and in our ensemble system it is currently around 18 kilometres. We also need to look at the atmosphere vertically, and we have 137 vertical levels, so the same grid points are repeated 137 times. This is more or less the size of the data we deal with every day, four times a day. Just to sum it up: the time steps, nearly 80 of them; the grid points, 18 kilometres apart, which adds up to 1.6 million grid points globally; ensemble simulations, 50 of them, and we offer two of the four daily runs as a product. So this is more or less the size we are talking about.
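To put those numbers together, here is a rough back-of-envelope calculation in Python; the 32-bit precision and the resulting byte count are illustrative assumptions, not figures from the talk.

```python
# Back-of-envelope size of one surface parameter for a single forecast run,
# using the rough numbers quoted above (illustrative only).
ensemble_members = 50
time_steps = 80                 # six-hourly steps out to roughly day 15
grid_points = 1_600_000         # global ensemble grid at ~18 km spacing

fields = ensemble_members * time_steps      # 4,000 fields per parameter
values = fields * grid_points               # 6.4 billion values

print(f"{fields:,} fields, {values:,} values, "
      f"~{values * 4 / 1e9:.0f} GB uncompressed at 32-bit precision")
# -> 4,000 fields, 6,400,000,000 values, ~26 GB uncompressed at 32-bit precision
```

The single-parameter point-database file mentioned later in the talk is around 15 GB, so precision and compression choices bring the stored size down from this raw figure.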
Of course, we need a system that offers our users those diagrams for any grid point over the globe, a system from which we can extract any given grid point, or set of grid points, out of this jungle of data very quickly. Point-based data comes in two types. As I mentioned earlier, we have the surface parameters, for example, with time steps going out to day 10 or day 15; that is what you see on the left, and that is what we call the meteograms. They show the time evolution of those 50 different forecasts, usually summarised as percentiles and other statistics so that users can read them in a digestible way. Then we have the ensemble vertical profiles, the one on the right, which shows the vertical structure of the atmosphere, with shaded areas around the lines showing the distribution of the 50 profiles. So how do we do that? How do we serve that point data out of this big dataset?
It is probably much like anyone else would do it. We get the data after every run, and we post-process it because it is not ready as-is. Even after post-processing it is often not quite ready to show, so we still do some processing on the fly as well. But the standard post-processing, such as using the two wind components to compute wind speed and wind direction, we can do in advance, and that is what the post-processing step does (a small sketch of that kind of derivation follows below). Then we generate the point databases, which I will go into in detail, and we sync and activate those database files on the web machines.
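As an aside, this is a minimal sketch of that kind of "standard" post-processing, deriving wind speed and meteorological wind direction from the two wind components with NumPy; the function name and the example values are illustrative, not ECMWF code.

```python
import numpy as np

def wind_speed_and_direction(u, v):
    """Derive wind speed (m/s) and meteorological wind direction (degrees,
    the direction the wind blows FROM) from u/v wind components on the grid."""
    speed = np.hypot(u, v)
    # Meteorological convention: 0 deg = wind from the north, increasing clockwise.
    direction = (180.0 + np.degrees(np.arctan2(u, v))) % 360.0
    return speed, direction

u = np.array([5.0, 0.0, -3.0])   # zonal component
v = np.array([0.0, 5.0, -4.0])   # meridional component
print(wind_speed_and_direction(u, v))
# speeds (5, 5, 5) m/s and directions (270, 180, ~37) degrees
```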
When the activation time comes they become available, and the backend of our web services can access those files and extract the data. Just to give some figures about those point databases.
Per run we generate about 400 gigabytes of point databases. Per day, because we have two runs, one in the morning and one in the evening, that is about 800 gigabytes of data. And we only keep a seven- or ten-day archive of this data, depending on the product. That adds up to around seven terabytes of data, with 800 gigabytes written every day and 800 gigabytes removed, because otherwise we would fill up the disks very quickly. The users' expectation, of course, is to access those diagrams within a second or two.
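A quick sanity check of those figures; retention and exact sizes vary by product, so this is only indicative.

```python
# Rough storage turnover of the rolling point-database archive.
per_run_gb = 400
runs_per_day = 2
written_per_day_gb = per_run_gb * runs_per_day          # ~800 GB written (and removed) daily

for days_kept in (7, 10):                               # retention depends on the product
    print(f"{days_kept} days kept -> ~{written_per_day_gb * days_kept / 1000:.1f} TB on disk")
# 7 days kept -> ~5.6 TB; 10 days kept -> ~8.0 TB, in line with the ~7 TB quoted above
```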
One example is this diagram here. It shows the distribution of 15 days of data out of the ensemble, for various parameters: cloud cover, total precipitation, wind speed, wind direction and the minimum and maximum temperatures, for example. The boxes and whiskers represent the 50 ensemble members; they simply show how the values are distributed. And for each parameter we have shaded areas showing the climate. When you look at the one at the bottom, for example, that is the maximum temperature: the boxes show the forecast from today's forecast, actually from yesterday's forecast, and the shaded area shows the climate at that time of year, based on the last 20 years of data. We also generate and store that climate every week. To generate this diagram we need to access ten of those point databases, which is about 50 gigabytes just for this specific product, and it requires around 1,500 data values to draw.
Until very recently we were using a rather legacy system to store this data for point access, but more and more requirements built up that meant we needed to change it. Some of those requirements: our production now needs to be migrated to ECMWF's new HPC system, so we have a new supercomputer, and all the legacy code needs to be recompiled. The old system is not easy to maintain, because not many people still know exactly what the code is doing; it does ancient things and has ancient dependencies, so it is time to change it. We need something more flexible, a more standard data structure, because what we were doing was very ad hoc and very specific to ECMWF. We needed better performance, because as the data sizes grow it gets more and more difficult to extract the data fast enough. We would also like to retrieve not just the nearest point to the requested point but many points, say the nearest 500 points, with the same efficiency and performance. We would like compression options, because the data sizes are growing, so it is better if we can compress and keep the same efficiency. It should be extensible as well, because with growing data sizes we have limits on the disks and other things. Perhaps more importantly, especially in an operational centre like ECMWF, we like simple, reproducible workflows and simple data containers, basically files. We have a group of people maintaining the tasks, and when the task that generates a file fails, they just rerun it; every step of the workflow has to be reproducible. There is no room for corrupted data, and anything has to be recreatable. A simple file structure seemed to fit this purpose best in our case. So we had a look at some different systems for storing this data, including some databases that were extremely good in performance.
NetCDF, Zarr and PyTables, which is based on HDF5, were on that list as well. We ended up using PyTables, which is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. It is built on top of the HDF5 format, it is completely free and open source, it has very high-performance file reads and writes, it can do SQL-like queries, and it supports data chunking, which is very important. It has compression capabilities and extra tools out of the box to inspect the files we generate, and it simplifies our workflows a lot. But the data needs to be prepared, and for that purpose we had a look at how to put our data into a table.
Our data comes out of GRIB files, so we need to extract the data from those GRIB fields. And we have many data fields: we have the ensemble members and the time steps, so the number of fields can grow a lot, and then we have the grid points, which are the same for all those fields. We decided to put the fields as the columns of the table and the grid points as the rows, and we keep the latitude-longitude pairs in a separate table so that we can do some very fast processing (a small sketch of this layout follows below).
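A minimal PyTables sketch of that layout, with toy dimensions and made-up file, table and column names (the production files have about 4,000 field columns and 1.6 million rows); the chunk shape argument anticipates the chunking discussion a little further on.

```python
import numpy as np
import tables  # PyTables, built on HDF5

N_FIELDS = 8        # ensemble member x time step combinations (toy value)
N_POINTS = 10_000   # grid points (toy value)

# One Float32 column per field; grid points become the table rows.
data_desc = {f"field_{i:04d}": tables.Float32Col(pos=i) for i in range(N_FIELDS)}
coord_desc = {"lat": tables.Float32Col(pos=0), "lon": tables.Float32Col(pos=1)}

filters = tables.Filters(complevel=5, complib="blosc")  # optional compression

with tables.open_file("t2m_ens.h5", mode="w") as h5:
    data = h5.create_table("/", "data", data_desc, filters=filters,
                           expectedrows=N_POINTS,
                           chunkshape=(1_000,))  # 500-1,000 rows per chunk; a Table chunk
                                                 # always holds whole rows, i.e. all columns
    coords = h5.create_table("/", "coords", coord_desc, expectedrows=N_POINTS)

    # Random values just to make the sketch runnable end to end.
    rec = np.zeros(N_POINTS, dtype=data.dtype)
    for i in range(N_FIELDS):
        rec[f"field_{i:04d}"] = np.random.rand(N_POINTS).astype("float32")
    data.append(rec)

    crec = np.zeros(N_POINTS, dtype=coords.dtype)
    crec["lat"] = np.random.uniform(-90, 90, N_POINTS)
    crec["lon"] = np.random.uniform(-180, 180, N_POINTS)
    coords.append(crec)
```

Storing the grid points as rows means a nearest-point lookup reduces to a single row index, and keeping the latitude-longitude pairs in a small separate table lets that index be computed without touching the big data table.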
Looking at some ensemble temperature data as a simple example: it is a single file of 15 gigabytes with three tables. The first table has all the data, so 50 ensemble simulations times 80 forecast time steps, which adds up to 4,000 fields; we put them as columns in the table, so 4,000 columns. At the ensemble resolution we have 1.6 million grid points, which gives 1.6 million rows. The second table is just for supporting data that we would like to have, and the third one is the coordinates table. So in the end we have a 15-gigabyte file for a single parameter, which we can move around and even query on the command line. It takes about 5 to 10 minutes to generate this from our original dataset.
The chunking is maybe the most important thing when it comes to performance. It defines the size and shape of the section of data that is stored together on disk, so it is very important to choose the size and shape of your chunks, because that directly determines the performance of your data retrieval. In our case we put all the columns in a chunk, because we want all of that data together, and we set the number of rows per chunk to 500 or 1,000, which ends up with a very small memory footprint.
[Session chair] Can I ask you to shorten it a little bit, so we have time for questions as well?
OK, sorry about that. The retrieval process, on the other hand: we get the latitude, longitude, parameter, date, and the number of nearest points, and we locate the correct database files.
We find the nearest point and extract the data in a single read, basically, because we know the index of that row, and that is our nearest grid point. Finding the nearest grid point is where most of the work goes: we get the coordinates from the coordinates table, compute the distances from the neighbouring points to the requested location, and that gives us the index that we can use in the other tables (a small sketch of this lookup follows below). As for future work, our resolution is going up by a factor of four, which will mean 6.4 million grid points.
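A rough sketch of that lookup against the toy file from the earlier sketch, using a brute-force haversine distance over all coordinates; the real system first locates the right file from parameter and date, and the names here are again made up.

```python
import numpy as np
import tables

def nearest_point_rows(h5_path, lat, lon, n_nearest=1):
    """Return the indices of the n nearest grid points to (lat, lon) and
    their rows from the data table, read in one indexed access."""
    with tables.open_file(h5_path, mode="r") as h5:
        coords = h5.root.coords.read()       # small table, fits comfortably in memory
        lats = np.radians(coords["lat"])
        lons = np.radians(coords["lon"])
        la, lo = np.radians(lat), np.radians(lon)

        # Great-circle (haversine) distance from every grid point to the request.
        a = (np.sin((lats - la) / 2) ** 2
             + np.cos(la) * np.cos(lats) * np.sin((lons - lo) / 2) ** 2)
        dist = 2 * np.arcsin(np.sqrt(a))

        idx = np.argsort(dist)[:n_nearest]   # row numbers in the data table
        rows = h5.root.data.read_coordinates(np.sort(idx))
    return idx, rows

# e.g. the 4 nearest grid points to a location near Reading:
# idx, rows = nearest_point_rows("t2m_ens.h5", 51.44, -0.94, n_nearest=4)
```

For a handful of rows the read cost is dominated by the chunks touched, which is why the 500 to 1,000-row chunk shape described above keeps the memory footprint small.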
That resolution increase will be challenging. We would also like to offer not only the graphical products but also some form of data access out of those databases, and at the moment we are investigating data delivery using the OGC Environmental Data Retrieval (EDR) API.
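For context, an OGC API - Environmental Data Retrieval "position" request for point data typically looks like the following; the host, collection name, parameter names and dates are placeholders, not a real ECMWF endpoint.

```python
# Shape of an OGC API - EDR 'position' query for point data.
from urllib.parse import urlencode

base = "https://example.org/edr/collections/ens-meteograms/position"
params = {
    "coords": "POINT(-0.94 51.44)",                          # WKT point: lon lat
    "parameter-name": "2t,10u,10v",                          # placeholder parameter list
    "datetime": "2022-08-26T00:00:00Z/2022-09-05T00:00:00Z", # placeholder time range
    "f": "CoverageJSON",
}
print(f"{base}?{urlencode(params)}")
```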
Yeah, that's it. Thank you very much, and sorry for running over.