
Creating Geo-Harmonized PM2.5 maps over Europe using machine learning


Formal Metadata

Title
Creating Geo-Harmonized PM2.5 maps over Europe using machine learning
Title of Series
Number of Parts
57
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Place
Wageningen

Content Metadata

Subject Area
Genre
Abstract
Saleem Ibrahim, researcher at the CTU in Prague, presented high-resolution (1 km), full-coverage maps of inhalable particulate matter (PM2.5) over the whole of Europe for the years 2018–2020, built from open-source data. This was a major outcome of his study, which aims at gaining a better understanding of these small particles, among the most harmful air pollutants to all living things. The biggest challenge lies in ground measurement: the highly expensive ground stations limit the spatial coverage of the estimates. To accelerate the process and reduce costs, Saleem's work explored the application of machine learning and deep learning algorithms to estimate PM2.5 from multiple sources such as satellite retrievals of Aerosol Optical Depth (AOD) and auxiliary data including meteorological data, land cover, and land use.
Keywords
Transcript: English (auto-generated)
Hi, my name is Saleem Ibrahim. I'm a PhD student, as Tom mentioned, at the Czech Technical University in Prague, under the supervision of Professor Lena Halounová.
I specialize in GIS and remote sensing, and today I'll be talking about creating PM2.5 maps over Europe. I will share the methodology, the ideas, and the obstacles we faced. So, what is PM2.5? It is particulate matter with a diameter of 2.5 microns or less, a mixture of solid and liquid particles.
As you can see when it is compared with a human hair, it is very small. We can also look at the air quality index for it: under 12 it is good, from 12 to 35 it is moderate, and it becomes more dangerous after that.
How is it measured? Traditionally, it is measured by ground stations. These provide good, accurate observations of these particles, but the problem with ground stations is their spatial coverage.
So we cannot rely only on ground stations: it is highly costly to establish and maintain them, and their coverage is not sufficient.
Many approaches have been developed to estimate PM2.5: basic interpolation methods like inverse distance weighting, more sophisticated methods like kriging and splines, and satellite-based PM2.5 exposure assessment using aerosol optical depth.
AOD is used to predict PM2.5. Aerosol optical depth is defined as a measure of the columnar atmospheric aerosol content, but it is not enough to rely on AOD alone to predict PM2.5, since the relationship is not a simple linear one.
That is why we include more data, to make the relationship between AOD and PM2.5 more powerful.
I'll share with you the data. For the open-source data, we get the ground-based PM2.5 measurements from OpenAQ, an organization that collects this data from governmental and research institutions and provides it to users.
MODIS has been providing daily AOD observations since 2000 from the Terra satellite and since 2002 from the Aqua satellite. There are several algorithms used by MODIS to retrieve AOD, starting with Dark Target, Deep Blue, and the combined one.
And in recent years the MAIAC algorithm has become widely used. We also use other kinds of data: NDVI, meteorological data like wind speed, relative humidity, and surface temperature, land cover data from Copernicus, and elevation from JAXA.
So we have all this data with different spatial resolutions and different projections, and we want to bring it all into our machine learning models. How do we do it? First of all, this is the study area, and you can see the locations of the PM2.5 ground stations.
We created a one-by-one-kilometer grid, giving us more than 13 million pixels, of which almost 5.5 million are located over land.
First of all, we want to pre-process the data and bring it into our grid. We will start with AOD. MODIS provides a quality assurance layer in its recent products.
We used this sub-dataset to create masks, and we applied the masks to both Terra and Aqua. After applying the masks, we did a simple average of the two, so that we get the maximum coverage.
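As a rough illustration of this masking and averaging step (a minimal sketch, not the actual pipeline: the array names, the QA convention, and the fill value are assumptions), it could look like this in Python:

```python
import numpy as np

def qa_mask(aod, qa, fill_value=-9999):
    """Keep only pixels whose QA flag marks the best quality (0 is assumed here)."""
    aod = aod.astype(float)
    aod[(qa != 0) | (aod == fill_value)] = np.nan
    return aod

def combine_terra_aqua(terra_aod, terra_qa, aqua_aod, aqua_qa):
    """Average the masked Terra and Aqua AOD arrays on the common 1 km grid."""
    terra = qa_mask(terra_aod, terra_qa)
    aqua = qa_mask(aqua_aod, aqua_qa)
    # nanmean keeps a value wherever at least one satellite has a valid retrieval,
    # which is how the simple averaging maximizes coverage
    return np.nanmean(np.stack([terra, aqua]), axis=0)
```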
For the rest of the data, such as the meteorological data, we did the reprojection and resampling to our grid using bilinear interpolation. For PM2.5 we had some issues: since we rely on open-source data, we cannot guarantee that we have hourly observations.
Since Terra passes locally around 10:30 a.m. and Aqua around 1:30 p.m., we took the observations between 10 a.m. and 2 p.m. Later, if we have hourly observations, we can include them in the study. We removed the unrealistic values, which were less than 1% of the data, and we averaged each station per day. We put all this data together, projected it to EPSG:3035, and we have this one-by-one-kilometer grid.
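A hedged sketch of that station-side preprocessing with pandas; the column names and the outlier threshold are assumptions, not taken from the talk (the reprojection to EPSG:3035 would be handled separately, for example with pyproj):

```python
import pandas as pd

def daily_station_average(obs: pd.DataFrame) -> pd.DataFrame:
    """obs has one row per measurement with columns station_id, timestamp, pm25 (assumed names)."""
    obs = obs.copy()
    obs["timestamp"] = pd.to_datetime(obs["timestamp"])

    # keep observations around the Terra/Aqua overpasses (10:00 to 14:00 local time)
    hours = obs["timestamp"].dt.hour
    obs = obs[(hours >= 10) & (hours < 14)]

    # drop unrealistic values (the thresholds here are illustrative assumptions)
    obs = obs[(obs["pm25"] > 0) & (obs["pm25"] < 1000)]

    # one average value per station per day
    obs["date"] = obs["timestamp"].dt.date
    return obs.groupby(["station_id", "date"], as_index=False)["pm25"].mean()
```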
Let's imagine this is the grid, and here we have some stations. Let's go more in depth: we have the observations from the ground stations, and we have all this input data, or we can say all these independent variables, with the PM2.5 value as the dependent variable. It is well known that the relationship between AOD and PM2.5 is not a simple one, so we started testing some algorithms, and we were satisfied with the Extra Trees algorithm.
As its name suggests, it is a tree-based algorithm, a kind of ensemble learning, since each tree provides one estimation and the way they are combined depends on the problem. For classification problems, the final decision is based on the majority vote of the trees in the forest. In our case, since we are predicting continuous values, it is the average over all the estimators, all the trees in the forest.
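To make this concrete, here is a minimal scikit-learn sketch of an Extra Trees regressor; the placeholder data and the hyperparameter values are illustrative assumptions, not the matched samples or tuned values from the study:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the matched samples: each row holds AOD plus
# auxiliary predictors (meteorology, NDVI, land cover, elevation, space/time),
# and y is the station PM2.5 value (the dependent variable).
rng = np.random.default_rng(0)
X = rng.random((1000, 15))
y = rng.random(1000) * 50

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = ExtraTreesRegressor(
    n_estimators=400,  # number of trees (illustrative, not the tuned value)
    n_jobs=-1,
    random_state=0,
)
model.fit(X_train, y_train)
pm25_pred = model.predict(X_test)  # for regression, the average over all trees
```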
Here you can imagine that we have the inputs on our grid; for each day we have these inputs, and we included both the spatial and the temporal information in our model.
Here we can see the methodology. We have the data and we want to split it into training and test sets, and usually several techniques are used to obtain the best model that describes the relationship and gives the best estimations.
First of all, we can tune our hyperparameters. Each model has parameters inside it, like the number of estimators (the number of trees) or the maximum depth; it depends on the algorithm. Scikit-learn provides grid search, where you supply sets of candidate values and it tests each combination of them. How far you can go depends on how powerful your hardware is, on the server, and sometimes on whether the people sharing the server are going to kill you or not. We can also run some quick tests to see approximately where to start.
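A small sketch of how such a grid search can be set up with scikit-learn's GridSearchCV, reusing the placeholder X_train and y_train from the sketch above; the parameter ranges are illustrative, not the ones used in the study:

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values only; the real search space would come from the quick tests described next.
param_grid = {
    "n_estimators": [100, 200, 400, 500],  # number of trees
    "max_features": [12, 13, 14, 15],      # features considered per split
    "max_depth": [None, 20, 40],
}

search = GridSearchCV(
    ExtraTreesRegressor(n_jobs=-1, random_state=0),
    param_grid,
    cv=5,            # five-fold cross-validation, as in the talk
    scoring="r2",
)
search.fit(X_train, y_train)  # placeholder training data from the sketch above
print(search.best_params_, search.best_score_)
```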
Here I'm showing the number of trees, the number of estimators, against the accuracy from five-fold cross-validation. From 50 to 100 it starts rising, and by the time it reaches 400 to 500 the value is fine, so we can start from this range.
Another test we did was to explore the maximum number of features; at around 13 the curve starts to become horizontal, so we know the value should be roughly between 12 and 15.
Tests like these we can feed into our grid search or randomized cross-validation search, and then we can continue with our validation. There are many techniques for validation. In the well-known K-fold cross-validation you split the data into K folds, keep one fold aside, train on the other K minus one, and validate on the fold you kept. Leave-one-station-out cross-validation is also widely used, because you want to see how well the model will do when it has to predict values at a new location: you keep one station aside, train the model on all available data from the rest of the stations, validate on that station, and repeat this for every station.
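Leave-one-station-out cross-validation maps onto scikit-learn's LeaveOneGroupOut when the station identifier is used as the group label; a sketch, again with the placeholder X and y from above (the station ids here are random stand-ins):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# One station identifier per sample, so each fold leaves a whole station out
# (random stand-ins here; in practice, the id of the station the sample came from).
rng = np.random.default_rng(0)
groups = rng.integers(0, 30, size=len(X))

scores = cross_val_score(
    ExtraTreesRegressor(n_estimators=400, n_jobs=-1, random_state=0),
    X, y,
    groups=groups,
    cv=LeaveOneGroupOut(),
    scoring="r2",
)
print(scores.mean())  # average skill at predicting stations the model never saw
```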
As I said, these can sometimes be rather costly tests, but when you know where to start you can minimize that cost. At the end, once we know that these parameters are fine, we want to test the model on unseen data, which is the data we kept aside at the beginning when we split the original dataset. So we ran our test, evaluated the model on the test set, and we can say that our model was able to predict with this accuracy.
At the end, after we have trained the model and know the accuracy we got, we retrain on all the available data with the hyperparameters we found, and we estimate new values for each pixel of our grid.
And that is what we got in the first pre-beta version of the PM2.5 maps. The thing is, MODIS observations have shown that roughly 65 to 67% of the Earth's surface is covered by cloud every day.
So we had a lot of gaps in our data, and it did not make sense for us to produce these daily maps. We aggregated them and calculated monthly maps, monthly averages, but that is not our main goal, because when we started we said we want to produce daily maps at one-kilometer resolution for Europe.
And here we had the idea: why don't we gap-fill the MODIS AOD data, since it is the main predictor, the most important feature of the input data?
So we looked for other sources of AOD, and we found that the Copernicus Atmosphere Monitoring Service (CAMS) provides a modelled AOD with a low spatial resolution of around 80 kilometers, but with a high temporal resolution of every three hours.
And since we are using MODIS, which passes, as I said, between 10:30 and 13:30, we used the CAMS modelled AOD at three times, at 9:00, 12:00, and 15:00, and we also included the spatial and temporal information and the elevation of the points.
Here we had another issue, because we are training a model for each year, and we found that each year has between 400 and 500 million points, and it does not make sense to train on all of this data.
So we had to find a way to train our models while making sure they can predict new values on our grid as well as possible and still take advantage of our data.
What we did is take only 10% of the data we have, and we used the Kolmogorov-Smirnov (KS) test to make sure we had a representative sample of the whole population. It is a bit of statistics, but it worked for us. We applied five-fold cross-validation, and after that we tested our models on the rest of the data. So, approximately, we trained on 50 million points and tested the model on 450 million points, and the good thing is that our models reached very high accuracy.
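One plausible way to run that representativeness check is the two-sample Kolmogorov-Smirnov test from SciPy; the talk does not describe the exact procedure, so the distribution and sizes below are only illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Stand-in for one year of values (hundreds of millions of points in the real case)
population = rng.gamma(shape=2.0, scale=0.2, size=1_000_000)

# Draw a 10% random sample and check that it follows the same distribution
sample = rng.choice(population, size=population.size // 10, replace=False)

stat, p_value = ks_2samp(sample, population)
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.3f}")
# A small KS statistic (and a p-value that does not reject) suggests the
# sample is representative of the whole population.
```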
They were able to estimate AOD with an accuracy ranging between 92% and 95%, with very small relative errors in terms of RMSE and MAE. Here we can see the errors of the models. Later, because we had created daily AOD maps for Europe at one-kilometer spatial resolution, we wanted to make sure that these AOD maps hold up under other validation methods. So we used the AERONET stations, which provide ground observations of aerosol optical depth, and we found that the generated dataset has really good accuracy and can be relied on for other air quality studies later. So the output was a harmonized aerosol optical depth dataset for all of Europe. And here I will show you.
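The metrics quoted above (R², RMSE, MAE) can be computed against station observations roughly as follows; the arrays here are placeholders, and pairing the reconstructed AOD with the AERONET observations in space and time is assumed to have been done already:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# y_obs: AERONET AOD observations; y_est: reconstructed AOD at the matching
# locations and times (placeholder arrays below).
rng = np.random.default_rng(1)
y_obs = rng.random(500)
y_est = y_obs + rng.normal(scale=0.02, size=500)

r2 = r2_score(y_obs, y_est)
rmse = np.sqrt(mean_squared_error(y_obs, y_est))
mae = mean_absolute_error(y_obs, y_est)
print(f"R2 = {r2:.3f}  RMSE = {rmse:.3f}  MAE = {mae:.3f}")
```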
Yeah, so we utilized our reconstructed dataset in studying the effect of the coronavirus period on aerosol optical depth, and we found that when you have full coverage you can provide more accurate results and more accurate conclusions, because you no longer need to deal with the gap problem. This data should be publicly available as soon as possible, and it can be the basis for multiple aerosol and air quality studies, as it will be for us.
So we will continue with our main goal of producing PM2.5 maps, and since we now have full-coverage aerosol optical depth, it will be much easier for us. So, yep, that's it.