
Exploring Copernicus products and machine learning for health applications


Formal Metadata

Title
Exploring Copernicus products and machine learning for health applications
Number of Parts
57
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place
Wageningen

Content Metadata

Abstract
Rochelle Schneider is a Research Fellow in Artificial Intelligence for Earth Observation at the European Space Agency (ESA). She presented the development of a multi-stage, satellite-based machine learning (ML) model to estimate daily PM2.5 levels across Great Britain during 2008-2018. Stage-1 estimated PM2.5 concentrations at monitors with only PM10 records. Stage-2 imputed satellite aerosol optical depth (AOD) values missing due to cloudiness and poor-quality retrievals. Stage-3 applied the Random Forest algorithm to estimate PM2.5 concentrations using the combined dataset from Stage-1, Stage-2, and a list of spatiotemporally synchronised predictors. Stage-4 used the Stage-3 model to estimate daily PM2.5. The relatively high precision allowed these estimates (approximately 950 million points) to be used in epidemiological analyses to assess health risks associated with both short- and long-term exposure to PM2.5.
Transcript: English (auto-generated)
OK, so hello, everyone. Thanks for having me here today, virtually, unfortunately. And what a great presentation before mine, because it is exactly what I'm going to talk about. I want to show an example of an application that uses Copernicus data, which is the European Union's Earth Observation Program, combined with machine learning for public health applications. As Tom said, I work at ESA, the European Space Agency, more specifically at the Φ-lab, ESA's innovation center. In this center we work on the combination called AI4EO, artificial intelligence for Earth observation solutions, across a variety of topics, and my topic specifically is to link this information with public health applications. At the moment we are working with a lot of collaborators, such as UNICEF, and also using this data for health applications in Europe and in other countries, such as Brazil.

Before I actually talk about the application of this data, I'm going to talk a little bit about the data itself, because I think it's a great opportunity to show you how amazing this data is to work with across a variety of fields, and also to show you some pictures and the spatial resolutions of this data, so that you can better understand what is possible with it.
Basically, Copernicus has the Land Monitoring Service, which provides land cover, imperviousness density, and elevation, and this is quite a high resolution that you can use. As an example, on the right side is a map of London, where the gray features are the buildings in a specific area of London, next to Hyde Park, the famous park in London. There is a blue grid overlaid on London that represents one kilometer by one kilometer. So when I say high resolution, it's because you can see that this data is available at 100 or 20 meters, which is quite small compared with the one-kilometer modeling systems we usually deal with. Meteorological data and air pollution data are also available, from the Copernicus Climate Change Service and the Atmosphere Monitoring Service.
Here is an example of two important datasets from those services: ERA5 and ERA5-Land. As you can see, they provide temperature data; one covers the whole globe, including the oceans, and the other is just for land, which is the reason for the ERA5-Land name. And there is also the CAMS data, which covers air pollution. These products are usually called satellite-based products because a lot of satellite observations are assimilated into the underlying models; as an example, here you see some of the ESA and NASA satellites that feed into them.
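If you want to try these datasets yourself, here is a minimal sketch of retrieving ERA5 2 m temperature through the Climate Data Store's cdsapi client; the dataset name and request fields follow the CDS catalogue, but treat the exact request syntax as something to verify against the current documentation.

```python
# Minimal sketch: download one month of hourly ERA5 2 m temperature
# over Great Britain from the Climate Data Store (needs ~/.cdsapirc).
import cdsapi

client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": "2m_temperature",
        "year": "2018",
        "month": "01",
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "area": [61, -8, 49, 2],  # North, West, South, East: roughly Great Britain
        "format": "netcdf",
    },
    "era5_t2m_gb_201801.nc",
)
```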
Here is an example of the resolution of this data, if you were interested in using it. ERA5 is around 25 to 30 kilometers: in the image, on the right, is London overlaid with the blue one-kilometer grid, and at the top, in red, is the 25-kilometer cell. But you also have the nine kilometers of ERA5-Land, which is a higher resolution. The importance of this data, mostly in my case because of the health applications, is that air temperature, for example, is available at different heights of the atmosphere. My interest here is to extract the surface level, which is where we live and where we want to explore human exposure to a specific stressor, in this case temperature.
The data they provide is continuous, hour by hour, so you can compute the daily mean, the monthly average, or the minimum and maximum. That's the importance of this data.
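As an illustration, here is a minimal sketch of that hourly-to-daily aggregation with xarray, assuming the file retrieved above and ERA5's standard t2m variable name:

```python
# Minimal sketch: collapse hourly ERA5 2 m temperature into the daily
# and monthly statistics typically used in exposure studies.
import xarray as xr

ds = xr.open_dataset("era5_t2m_gb_201801.nc")
t2m_c = ds["t2m"] - 273.15  # Kelvin -> Celsius

daily_mean = t2m_c.resample(time="1D").mean()
daily_min = t2m_c.resample(time="1D").min()
daily_max = t2m_c.resample(time="1D").max()
monthly_mean = t2m_c.resample(time="1MS").mean()
```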
There are also the atmospheric products. As the previous colleague talked about, the total column of aerosol optical depth is also used, and it is provided by CAMS, along with the surface level of air pollution. As we know, some satellites measure NO2 directly, but they don't measure PM2.5 directly; that's the reason we use AOD as an input to the machine learning models. Models of this type, however, do provide the surface level, and not only for NO2 and PM2.5: as you can see, I listed here some examples of what else you can explore with them.
The resolution of this data is 10 kilometers across Europe, but globally it is around 40 kilometers. In some cases, when someone is studying Europe as a whole and not just a specific country, this is an interesting source of monitoring data, because it combines all the air quality monitors available across Europe. If you need data from another country, or for the whole area, you don't actually have to go to every country's website, in a different language, to get it. So this is a source you can explore. OK, so this was a very quick overview of what is available; of course there is plenty more than that. But now let's look at an application that goes beyond the products themselves.
Here is the example I'm going to show you. It's work that was developed before I joined ESA, at the London School of Hygiene & Tropical Medicine. This project was focused exactly on giving people their exposure to PM, particulate matter, in areas where we just don't have data. Why is this important? Because, as you see in the picture on the left, the red dots are the monitors across Great Britain, and the blue crosses are a specific cohort study: specific residences for which you have health data you want to explore. The problem is that if you take only the observed data from the ground monitors, there are a lot of people who simply don't have an exposure assigned to them. What is usually done is to assign each person the closest monitor, but in some areas the closest monitor is 100 kilometers away, so that is not really your exposure anymore. That's the reason we developed this project: we aimed to reconstruct PM2.5 exposure at one kilometer, to be able to assign the appropriate exposure to every single postcode.
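To make the problem concrete, here is a small sketch of the naive nearest-monitor assignment being criticized; all coordinates and values are hypothetical.

```python
# Hypothetical sketch: each residence inherits the PM2.5 of its nearest
# monitor, no matter how far away that monitor actually is.
import numpy as np
from scipy.spatial import cKDTree

monitor_xy = np.array([[530000, 180000], [325000, 673000]])  # meters (British National Grid)
monitor_pm25 = np.array([14.2, 6.8])  # daily means, ug/m3
residence_xy = np.array([[430000, 260000]])

tree = cKDTree(monitor_xy)
dist, idx = tree.query(residence_xy)
print(dist / 1000, monitor_pm25[idx])  # nearest monitor is ~128 km away: a poor exposure proxy
```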
So now I'll go into the project itself and how we developed it. Basically, we covered Great Britain, which is Scotland, England, and Wales, from 2008 to 2018, at a resolution of one kilometer, and the pollutant was particulate matter 2.5. The picture on the left shows the one-kilometer grid cells just inside London; the whole of Great Britain is composed of 234,429 such cells. To achieve this we faced some limitations, and that's the reason we had to create a multi-stage machine learning approach.
In stage one, the issue is that there are a lot of places that only had monitors measuring PM10, not PM2.5, so we had to create a process to estimate what the PM2.5 would be at those monitors where we only have PM10. Regarding the AOD, we use data from satellites, and as you might know, the UK is quite cloudy. That is a limitation for AOD, because it's a product that is sensitive to cloud: we have no information where there is cloud, and if there are some clouds nearby, the neighboring pixels start to be less reliable than in a very clear area. So we had to add another step for that. And stage three is where we had a huge data collection process, because we gathered a lot of information from satellites,
satellite-based data, and also geospatial features that were available from the government, to then reconstruct the information for every single day across 11 years.

Here is how we did it. In stage one, as I said, the blue dots on the left are the locations, just for 2010, of the monitors measuring PM2.5. If you check the other map, in orange, you see a considerably larger number of monitors measuring PM10 across Great Britain. So basically, for each monitor measuring PM10, we estimate what the expected PM2.5 would be at that location. This was done to increase our ground truth, so to speak: the number of monitors we have to train our machine learning model.
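As a hedged sketch of this stage-one idea: train a model on monitors that record both pollutants, then predict PM2.5 at the PM10-only sites. The learner and feature set here are illustrative assumptions, not necessarily the study's exact setup.

```python
# Sketch: learn PM2.5 from PM10 (plus simple space-time features) at
# co-located monitors, then fill in PM2.5 at PM10-only monitors.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

both = pd.read_csv("monitors_with_pm10_and_pm25.csv")  # hypothetical files
pm10_only = pd.read_csv("monitors_with_pm10_only.csv")

features = ["pm10", "day_of_year", "easting", "northing"]
model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
model.fit(both[features], both["pm25"])

pm10_only["pm25_est"] = model.predict(pm10_only[features])
```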
Stage two is the step where we fill the gaps in the satellite AOD. The important thing is that the satellites provide a resolution of one kilometer, while the AOD from CAMS has a much coarser resolution, but with the benefit of a continuous time series for that product. CAMS also provides it at multiple wavelengths, basically every three hours, so it is much more robust in that sense. What we aimed to do in this stage was to downscale the AOD from CAMS to the resolution of the satellites, since we wanted this at one kilometer and not at the very coarse CAMS resolution.
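A hedged sketch of this gap-filling step, assuming a daily one-kilometer table with a partly missing satellite AOD column and a gap-free CAMS AOD column; the file, columns, and learner are illustrative.

```python
# Sketch: where 1 km satellite AOD exists, learn its relation to the
# coarser but continuous CAMS AOD plus covariates; predict the cloudy gaps.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

grid = pd.read_parquet("aod_grid_1km.parquet")  # hypothetical daily 1 km grid
predictors = ["cams_aod", "elevation", "day_of_year", "easting", "northing"]

observed = grid[grid["satellite_aod"].notna()]
missing = grid[grid["satellite_aod"].isna()]

gapfill = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=0)
gapfill.fit(observed[predictors], observed["satellite_aod"])
grid.loc[missing.index, "satellite_aod"] = gapfill.predict(missing[predictors])
```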
Stage three is the combination of everything. We combine stage one and stage two, and we use all the other data, such as the total length of roads, the land cover, and the population density, among a variety of datasets. This is usually the part of the project that takes the most time: synchronizing all this data onto the one-by-one-kilometer matrix.

Here you see the results. Of course we did a 10-fold cross-validation, but we removed the entire time series of each held-out monitor to use for validation, to avoid having some representation of that monitor in training, which could happen if we just randomly selected observations for validation. And we did something else, which is the spatial and temporal R-squared. We validated our models using the normal cross-validation approach, but because this is a public health application, we had to go further: in epidemiological studies you can use daily data, but you can also use annual data, such as the annual number of deaths in a specific place or the annual number of hospital admissions. So we needed to make sure that if we aggregate the daily data from our model to annual values, the model is still as accurate as it is at the daily scale. That is what the spatial and temporal cross-validation express. As you can see, at the beginning, since we didn't have much data available for those early years, the performance was not as strong in those early stages.
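A minimal sketch of this validation scheme: hold out entire monitors per fold, then decompose R-squared into a spatial part (per-monitor long-term means) and a temporal part (daily deviations from those means). The file and column names are assumptions.

```python
# Sketch: monitor-wise 10-fold cross-validation with spatial and temporal R^2.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold

df = pd.read_parquet("stage3_training.parquet")  # hypothetical training table
features = ["aod", "temperature", "road_length", "population_density"]
X, y, groups = df[features], df["pm25"], df["monitor_id"]

pred = pd.Series(np.nan, index=df.index)
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
    m = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
    m.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred.iloc[test_idx] = m.predict(X.iloc[test_idx])

print("overall R2:", r2_score(y, pred))

# Spatial R2: how well per-monitor long-term means are reproduced.
means = df.assign(pred=pred).groupby("monitor_id")[["pm25", "pred"]].mean()
print("spatial R2:", r2_score(means["pm25"], means["pred"]))

# Temporal R2: how well daily deviations from each monitor's mean are reproduced.
dev = df.assign(pred=pred).groupby("monitor_id")[["pm25", "pred"]].transform(lambda c: c - c.mean())
print("temporal R2:", r2_score(dev["pm25"], dev["pred"]))
```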
And just to finish, and to give you an idea of how we combine AI for health applications with policy implementation, here is an example of our next step. Based on this dataset of 11 years of daily PM2.5 information, we indeed detected a reduction of PM2.5 across the years, in line with the policies put in place by the European Commission. What interests us here is to see whether this reduction is actually detected by our models, and also to link this information with the guidelines from the European Commission and from the World Health Organization, which set limits for the European area, in the European Commission's case. We also wanted to estimate the excess deaths caused by exceeding these limits. So, for example, we wanted to analyze, across the whole of Great Britain, the areas where PM2.5 was above 25 micrograms per cubic meter, the European Commission limit, and where it was above 10, the World Health Organization's stricter recommendation. And based on this surplus, we estimate the number of deaths caused by PM2.5 that could be avoided.
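The arithmetic behind such an estimate can be sketched with a standard log-linear concentration-response function; the relative risk, baseline deaths, and concentrations below are illustrative assumptions, not the study's numbers.

```python
# Sketch: attributable deaths from PM2.5 above a guideline, using a
# log-linear concentration-response function with an assumed relative risk.
import numpy as np

rr_per_10 = 1.08  # assumed relative risk per 10 ug/m3 of long-term PM2.5
beta = np.log(rr_per_10) / 10.0

pm25_annual = np.array([12.5, 9.1, 26.3])         # hypothetical grid-cell means, ug/m3
baseline_deaths = np.array([120.0, 80.0, 150.0])  # hypothetical deaths per cell
guideline = 10.0  # WHO (2005) annual guideline, ug/m3

excess = np.clip(pm25_annual - guideline, 0.0, None)
attributable_fraction = 1.0 - np.exp(-beta * excess)
avoidable_deaths = (baseline_deaths * attributable_fraction).sum()
print(round(avoidable_deaths, 1))
```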
So, thank you.

Thank you, Rochelle. This really looks like you did a lot of modeling, highly appropriate for our conference; we're very happy to have you with us. We have some questions for you, and I will start. In this example you had a stage one and a stage two, and stage one is basically doing data imputation.

Yes.

So there are techniques where you could do it without stage one; you could just go directly to modeling. One way would be to consider that the values are simply PM values, and then whether a record is 2.5 or 10 becomes one extra covariate, the size of the particles, and you model everything at once. If you use tree-based machine learning, you don't have to worry about imputing:
you basically throw all the data in together. Have you looked at something like that, or considered doing something like that?

No, not at the moment. What we did consider was including other interesting variables that are available, for example PM2.5 from CAMS at the surface level as a predictor, but not, as you are suggesting, a model that provides two outcomes.

Yes: you say, I want to predict 2.5 or 10, but it's a single model, so you have all the values at once.
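For concreteness, here is a sketch of the questioner's suggestion: stack the PM10 and PM2.5 records into one table, add particle size as a covariate, and fit a single tree-based model; files and columns are hypothetical.

```python
# Sketch: one model over stacked PM10 and PM2.5 records, with particle
# size as an extra covariate, so no separate imputation stage is needed.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

pm25 = pd.read_csv("pm25_records.csv").assign(size_um=2.5)   # hypothetical
pm10 = pd.read_csv("pm10_records.csv").assign(size_um=10.0)  # hypothetical
stacked = pd.concat([pm25, pm10], ignore_index=True)

features = ["size_um", "day_of_year", "easting", "northing", "aod"]
joint = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
joint.fit(stacked[features], stacked["value"])
# Predicting with size_um=2.5 everywhere then yields PM2.5 fields directly.
```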
And the other thing I want to ask: you showed the R-squared per year. Does that mean you fit a separate model every year?

Yes, we have a random forest algorithm trained for every year in this case; we didn't consider all of them together.

In our training sessions, with climate data for example, we would fit one model for the whole space-time cube, taking the whole time series of data. So maybe that could also be interesting for you, although the data becomes much bigger and it's more computationally demanding, but...

Yeah, definitely; one of the limitations is that if you run the entire set, you need a supercomputer. So at that time we set it up by year rather than with the whole dataset.
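A minimal sketch of the per-year strategy described in the answer, with the data layout assumed:

```python
# Sketch: one random forest per calendar year, trading a single space-time
# model for a tractable memory footprint on ordinary hardware.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_parquet("stage3_training.parquet")  # hypothetical
features = ["aod", "temperature", "road_length", "population_density"]

models = {}
for year, chunk in df.groupby(df["date"].dt.year):
    m = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
    m.fit(chunk[features], chunk["pm25"])
    models[year] = m  # at prediction time, route each day to its year's model
```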
But it's super interesting work, and it reflects the reality of data science: the gaps, the differences in resolution, the gaps in measurements. It's really something we experience ourselves, so we recognize each other.