
Air Quality & Python: Developing Online Analysis Tools


Formal Metadata

Title
Air Quality & Python: Developing Online Analysis Tools
Title of Series
EuroPython 2018
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language
English

Content Metadata

Subject Area
Genre
Abstract
Poor surface air quality has a range of implications for human health and the economy. Without concerted mitigation efforts, trends in urbanisation and aspirations for progressive economic growth will result in poorer levels of air quality. Analysing and interpreting the incoming data streams from heterogeneous air quality measurement stations is critical for tackling the problem and for developing early warning systems. I am using Python to develop a set of online analysis tools (ukatmos.org) to enable the public to quickly and easily plot air quality data in many ways, effectively freeing up information that is already publicly available but sits in awkward formats and often requires writing code to access. We anticipate these tools will also support data science classes at school, and can speed up scientific research by minimizing effort in repeating analyses. This talk will cover how the tools integrate numerous Python libraries (e.g. Pandas and NumPy), the Django web framework, the Plot.ly tools for creating interactive graphs, and SQL to address the large data volumes. Developing these Python tools in an adaptive and scalable way allows them to grow as more data become available, e.g. satellite observations. Adaptability also covers evolving user requirements. This project will also be developed into a Python library allowing the user to easily use the online analysis tools from an offline Python environment.
Transcript: English (auto-generated)
Thank you very much for the introduction. I work just down south of Edinburgh at the King's Buildings, so this is a brilliant opportunity for me to come and talk; thanks for sticking around on a sunny Friday afternoon. In this talk I'll introduce myself and what I'm doing, then talk through a case study of how I've been using Python in my work to combine scientific data analysis and web development, to create some online tools that let people access air quality analysis easily and quickly. I'll finish with a few lessons I've learned during this process and where I hope this project is going to go in the future.

So, my official title is postdoctoral researcher at the university, but I don't actually do that much research anymore. I have a background in atmospheric chemistry, and everything I've learned coding-wise has been self-taught because I needed it. I started off in Fortran, developing and running atmospheric models. They're basically weather forecasting models like you'd get from the Met Office, but you stick some atmospheric chemistry in and you get a pollution forecast. Fortran is great for that, but it isn't great for data analysis, so I needed something nice to process the output from it, and that's where Python came in. I did use IDL for a while, and then I ran away from it like wildfire; Python's been much nicer to me. So now my work is mainly being my research group's code and data wrangler. I heard the term "research data engineer" the other day, and maybe I fall into that category, but I don't know; it's just a title, isn't it? So I'm somewhere in the middle of all these things. Hopefully not jack-of-all-trades, master of none, but you could argue that.
Right, so a brief introduction to air quality. If you saw the keynote this morning, this was touched on. All it is is a measure of how polluted the air we breathe is, so "air quality" and "air pollution" are used synonymously and interchangeably. In this case, I'm specifically talking about pollution with direct health effects: nitrogen dioxide, ozone, and particulate matter, which is basically just soot. This is not, in this case, greenhouse gases like carbon dioxide and methane, because those affect climate rather than health directly. All of these are generally emitted from traffic, but you also get natural sources such as fires. Interestingly, there was actually a fire yesterday on Blackford Hill, just near the observatory in Edinburgh, luckily put out by the fire brigade before it got too far. Maybe it was a stray cigarette or something, but you could smell the smoke in central Edinburgh, so that was having an impact on the air quality.

Just to bring the tone down a bit: air quality has been in the news an awful lot recently. A quick Google brings up lots of stories about how it affects your health and the impacts it's having, and it's pretty horrible. It's coming more and more to the forefront of people's minds, which is generally a good thing. However, we need to monitor this, because just saying it's there, or smelling it, doesn't give us much information. What we need is some way to get this monitored air pollution into an accessible form, which is where Python comes in. There's a little quote up in the top corner, from a talk I heard yesterday by someone called Alex Jacob, which I thought was perfect: data only has value when it's relevant. And that's true. You get a number from a monitoring station and it is meaningless to most people. Who cares if ozone down the road is 300? What does that mean? To make that measurement into something useful, you need to spend time and energy gathering the data, knowing where it is, processing it, and putting it into a form you understand. To most people this is daunting, because they don't have the right skill set; even for people with the skill set, it's a waste of time. The reason I kind of started this job is that my boss wanted to free up some of his time when he's writing grants. He doesn't want to spend a day trying to plot air quality somewhere; he wants a quick, easy tool so he can do it in half a minute and get on with the rest of his day. And with this hurdle in the way, for most people it's too much: out of sight, out of mind, I don't want to bother with that. So what we need is something that combines the data collection, the gathering of it, with the analysis, and then visualizes it in a way people can understand. Ideally, a set of tools that anybody can use, accessible and understandable by anybody.
Ideally, you want a tool that can be used by anyone, from school children to academics. That is a broad range of people, but it's an ambition.

So the first step is getting the data, and this has been not too bad in this case. For this case study I'm using air quality data from Defra, the UK government department currently run by Michael Gove; I don't know if that's good or bad. There are over 150 of these sites currently working in the UK, and maybe another 200 that were previously working but have been shut down for various reasons. Each takes hourly measurements of all sorts of pollutants, so there's an awful lot to deal with, especially since some have been going since 1975. So there are a lot of measurements, but in the grand scheme of things, all the measurements ever taken only add up to a few gigabytes. It's not big data, but it's messy and annoying and hidden away. It's there, but not many people use it. This is a little plot of where all the stations are spread across the UK. The nearest one to us here is just by Arthur's Seat; this is a picture of the one in Edinburgh, the green box there. And you'll be glad to know that Edinburgh is generally pretty good for air pollution according to this. However, you've got to consider, as was alluded to in the keynote this morning, that this air quality station is right next to a park: it's set away from the road and it's not a busy area at all. Put one of these by the road outside and you'd get a completely different picture.

So there are all these stations scattered around the UK, and annoyingly, Defra doesn't have a nice, neat spreadsheet of where they all are, like a CSV file listing them. So I need to find every bit of information about these sites so I can start using them properly and usefully: coordinates, how long they've been going, what pollutants they measure, the European site codes they have, and all that sort of thing. And this is where Python finally comes in: data scraping with Beautiful Soup, which has been a great module for parsing HTML. You basically request a web page using Python requests, put it into Beautiful Soup, and it parses it all out for you, and you can search for the bits you want. So you say, all right, for this site, Aberdeen say, give me all the bits of information you've got about it, and you get a nice table out. And although Defra have been great and you could email them and ask for this sort of information, this is a very quick method of getting a lot of the information you need. On the website, each one of these sites has its own web page, so you'd have to go on and look at each in turn, but you can do that in a loop, no problem, and just click through them all.
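As a flavour of that scraping step, here is a minimal sketch. The URL pattern and the two-column table layout are my illustrative assumptions, not necessarily Defra's actual page structure:

```python
# Minimal sketch of the scraping step described above. The URL and
# the HTML structure are illustrative assumptions, not Defra's
# actual page layout.
import requests
from bs4 import BeautifulSoup

def scrape_site_info(site_url):
    """Fetch one monitoring-site page and pull out its info table."""
    response = requests.get(site_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    info = {}
    # Assume the page holds a simple two-column table of properties,
    # e.g. coordinates, start date, pollutants measured, site codes.
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:
            key = cells[0].get_text(strip=True)
            value = cells[1].get_text(strip=True)
            info[key] = value
    return info

# Loop over every site page rather than clicking through by hand.
site_urls = [
    "https://uk-air.defra.gov.uk/networks/site-info?site_id=ED3",  # hypothetical
]
all_sites = {url: scrape_site_info(url) for url in site_urls}
```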
So now I've got all these sites, I need to get the pollution data from them. Again, this is another thing that is not made easy by the government, but every site has its data available as a CSV, so you can just go to a particular URL and it's there. However, you need to know that URL, and it's not advertised. I managed to get it by finding the code of someone who did some work with this data a couple of years ago and going through their R code to find the URL they use. It's a simple task if you know the URL; the problem is you need to know the site code and the year. For instance, for Edinburgh the site code is ED3, and if you want 2018, you'd have to use both of those. This data is not in any useful structure. You want data from 2018 in Edinburgh? Great, it's all there. You want specifically carbon monoxide from the past five years from Edinburgh, Aberdeen and Glasgow, say? You're talking about 15 web pages there, each with its own information, a lot of which is useless to you because you're only after carbon monoxide. But it's there and it's available, and that's good: we have some data to play with.

The next step is analysis, which is the fun bit that I enjoy, and of course I use pandas. However, I'm ashamed to say I came to pandas quite late in the game. I was quite stubborn: everything I'd done just used NumPy and that worked, so why bother changing anything? But a quick Google of "I want to read this CSV from a web page, what's a quick way of doing that?" said pandas is easy. I'll try that. One line. Oh, that was easy. Oh, that's a nice data frame with a time series. Oh, this is really nice. I wish I'd spent a couple of hours, a couple of years ago, teaching myself pandas; I don't even want to think about how much time I'd have saved. It's a bit of a lesson in not being so stubborn about the code you use. Things like filtering or resampling are such powerful tools in pandas; it makes things so quick. It is fantastic. And there are also great tutorials and documentation out there, and Stack Overflow, to which I basically owe my PhD, is full of pandas. Want to do anything? Pandas, pandas, pandas. So we've got this data in pandas.
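For a taste of why that one-liner won me over, here is a sketch. The CSV URL follows the shape described above (site code plus year), but the exact path and column names are assumptions for illustration:

```python
# Reading a remote CSV straight into a time-series DataFrame.
# The URL pattern and column names are illustrative assumptions.
import pandas as pd

site_code, year = "ED3", 2018
url = (
    "https://uk-air.defra.gov.uk/data_files/site_data/"
    f"{site_code}_{year}.csv"
)  # hypothetical path

df = pd.read_csv(url, index_col="Date", parse_dates=True)

# With a DatetimeIndex, the powerful bits are one-liners:
no2 = df["Nitrogen dioxide"]                  # assumed column name
weekly = no2.resample("W").mean()             # hourly -> weekly means
daytime = no2.between_time("07:00", "19:00")  # filter to daytime hours
```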
We can do all sorts to it. I guess the next step is visualizing it, and I use Plotly. For a long time I used Matplotlib, which has been great, but then I discovered Plotly, and it provides a very simple way of doing this. The slide shows a very small snippet of code that will make the graph on the side; that's all you need, and it makes you an interactive plot with features like hover and zoom, and you can change colours really easily. If you're thinking about interacting with people, having something they can manipulate, not manipulating the data itself but choosing how they want to see it, becomes a lot more interactive and a lot more personal, instead of just having a static graph that shows whatever you want to show.
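The snippet from the slide isn't reproduced here, but a Plotly plot of that sort really is only a few lines; this sketch uses made-up data:

```python
# A small interactive Plotly chart: hover, zoom and pan come for free.
# The data here is made up purely for illustration.
import pandas as pd
import plotly.graph_objs as go
import plotly.offline as pyo

times = pd.date_range("2018-01-01", periods=168, freq="H")
values = [(h % 24) * 2.0 for h in range(168)]  # fake hourly pollutant values

fig = go.Figure(
    data=[go.Scatter(x=times, y=values, mode="lines", name="NO2")],
    layout=go.Layout(title="Nitrogen dioxide (illustrative data)",
                     yaxis={"title": "Concentration"}),
)
pyo.plot(fig, filename="no2.html")  # writes an interactive HTML file
```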
And it's great because it makes things incredibly simple. What these particular plots show isn't important, but you can make very simple plots: subplots, bar plots, wind roses. To do a wind rose, there are a few modules out there now that do it, but Plotly just seems miles above the rest.

So far, I've been in comfortable territory; this is what I've done for the best part of six years, some sort of data analysis. However, the next step, putting it online, goes very much into the unknown for me, and doing something like this really highlights how you can think you know Python and then find you really don't. After a little Googling and searching around, I went with Django. It seemed like a good framework.
It's a huge framework, with lots of documentation and lots of tutorials, which is great but also a little daunting for somebody who's never used it before: it's like, oh, God. But there are lots of tutorials, and although I know it's not aimed at me, I'd say the Django Girls tutorial on how to set up a website using Django is great for anybody starting off with this. I know there are other frameworks out there, and a lot of this has been very much "I'll try it and see": I don't know if this is the right thing for me, but I'm going to go with it and see how far it goes. And especially with Django, it's very popular and there's lots of documentation and tutorials, but it's not really designed, as far as I could tell, for the sort of website I wanted to make.
It's mainly focused on blogs and that sort of thing, but give it a go. And I might be preaching to the choir here, but basically Django will create a lot of template files for you. These include things like urls.py, which is a list of the website URLs you want to respond to: your actual website path goes in there, and it calls views.py, which processes things and renders web pages. It just sets this structure out for you, which, as a beginner in Pythonic websites, is ideal. Click and play, basically, and then you can spend the next couple of weeks breaking it day after day. So you type in your website address; views.py says, hey, this person wants to visit this page, do something; that then says to this other module, models.py, they want something from this website, process some data, get something from a database, fine, here you go; it goes back to views.py, and it makes you a pretty website. And hey, the website is born. In outline, it looks something like this.
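A minimal sketch of that shape, with route and model names that are mine rather than the actual ukatmos.org code:

```python
# urls.py -- the list of URLs the site should answer to.
from django.urls import path
from . import views

urlpatterns = [
    path("", views.home, name="home"),
    path("sites/<str:site_code>/", views.site_detail, name="site-detail"),
]
```

```python
# views.py -- called by urls.py; asks models.py for data, renders a page.
from django.shortcuts import render
from .models import Measurement  # hypothetical model

def home(request):
    return render(request, "home.html")

def site_detail(request, site_code):
    readings = Measurement.objects.filter(site_code=site_code)[:100]
    return render(request, "site.html", {"readings": readings})
```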
This is a very simple, static website at the moment, with some buttons you can click, but it was fairly simple to do, and the amount of Python tutorials out there really helps. But as I say, while Django is a great framework, for what I wanted to do, which was lots of people interacting with the website, changing graphs, playing with them, I didn't find it the easiest thing for creating multiple instances or interactive pages, especially without reading scary words like JavaScript. So I discovered that Plotly, who do the nice graphs, have introduced something called Dash, which is another framework. This is taken straight from their website: it says you can build analytical web applications with no JavaScript required, so that's two thumbs up from me. It's built on React and Flask, and it ties them together so you can have interactive things like drop-downs, sliders and graphs; whack that together with your analytical code and you can make something that looks good pretty easily. I thought, this is ideal. So Dash creates these apps, which could be standalone websites by themselves; in my case it's not, as I'll explain a little later. Every time the website is loaded, a new app instance is created, so you get one per user; they do what they want and it doesn't affect anybody else. Each app has a layout, which you declare in Python: you say, I want this, then this; I want a drop-down menu, then some descriptive statistics, then a plot, then a selection menu. And when you click on them, it calls these callbacks, which are Python decorators on functions: you click, and the decorated function goes, oh, someone said they want this colour bar to be yellow, changes it, sends it back, and updates the page. It's brilliant.
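A toy Dash app in that shape, with a layout declared in Python and a callback that redraws the graph when the drop-down changes. Component names and data are illustrative, and the separate dash_core_components / dash_html_components packages are as they were around 2018 (newer Dash bundles them):

```python
# Toy Dash app: a layout plus one callback. Illustrative only.
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.graph_objs as go

app = dash.Dash(__name__)

app.layout = html.Div([
    dcc.Dropdown(
        id="pollutant",
        options=[{"label": p, "value": p} for p in ["NO2", "O3", "PM2.5"]],
        value="NO2",
    ),
    dcc.Graph(id="timeseries"),
])

@app.callback(Output("timeseries", "figure"),
              [Input("pollutant", "value")])
def update_graph(pollutant):
    # The real app would query a database here; this is dummy data.
    return go.Figure(
        data=[go.Scatter(x=[1, 2, 3], y=[10, 12, 9], name=pollutant)],
        layout=go.Layout(title=pollutant),
    )

if __name__ == "__main__":
    app.run_server(debug=True)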
So with a bit of wrangling, and a lot of help from people on forums, I managed to put my Dash app inside my original Django framework, so the Django framework holds everything together and the Dash app sits within it. And that was all the hard work, basically: getting the data, processing it, and displaying it quickly on the website.
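I don't know exactly how the speaker wired the two together, but one packaged route for this is the django-plotly-dash package, which registers a Dash app by name so a Django template can embed it; a sketch under that assumption:

```python
# dash_apps.py -- a Dash app registered with Django via the
# django-plotly-dash package. Illustrative only; not necessarily
# the wiring used for ukatmos.org.
from django_plotly_dash import DjangoDash
import dash_core_components as dcc
import dash_html_components as html

app = DjangoDash("AirQuality")  # name the Django template refers to

app.layout = html.Div([
    dcc.Dropdown(id="pollutant",
                 options=[{"label": "NO2", "value": "NO2"}],
                 value="NO2"),
    dcc.Graph(id="timeseries"),
])
```

A Django template can then load the package's plotly_dash tag library and drop the app in by name, so Django keeps serving the page while Dash handles the interactivity.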
Except, if it wants to load... great, I'll just type it in. So you end up with something that's, well, not showing on this screen... that's great, there we go. So this is a simple website. It's not that pretty to look at, it's a work in progress, but Dash provides things like this selection tool to get the data you might want. So we can define it by region, let's say central Scotland, since we're here, and I want all urban sites. It's thinking; this is using a quite cheap server at the moment, so you have to wait for it to finish spinning. Eventually it selects all the sites in central Scotland that count as urban, so you can click Edinburgh St Leonards, which is the nearest one. We'll stick with the time series, select any of the variables it measures, let's look at nitrogen dioxide, click submit, and now it's calling the data and plotting it up. And this is what's good about Dash: you can hover over the data to get different points, you can zoom in to look at more points, you can download it if you want, let's reset the axes. With Dash you can also make these interactive things: you click "weekly", and it says, all right, someone wants to resample this data every week, and there's a pandas method that easily does that, just resample by week. Or you can stick it in a line graph instead. And you just add more plots onto these: we have, say, a histogram, where you can change the number of bins to 50 if you'd like, and this one is the average concentration over one day, so it's taking the whole time series and collapsing it onto a single day. We can split that into weekdays, Monday, Tuesday, etc., and you can see peaks at rush hour. So it has all these really useful tools, and makes a nice website you can play with.
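Those demo aggregations map directly onto pandas operations; here is a sketch with dummy data and variable names of my own choosing:

```python
# The aggregations behind the demo buttons, on a dummy hourly series.
import pandas as pd

times = pd.date_range("2018-01-01", periods=24 * 90, freq="H")
no2 = pd.Series([h % 40 for h in range(24 * 90)], index=times)

weekly = no2.resample("W").mean()             # the "weekly" button
diurnal = no2.groupby(no2.index.hour).mean()  # average day, 24 values
by_weekday = no2.groupby([no2.index.dayofweek, no2.index.hour]).mean()
# by_weekday.loc[0] is Monday's average day; rush-hour peaks show up here.
```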
The problem is there's too much data, really; it's time to use the database. Previously that website was calling the Defra website every time someone requested data, and with the way they've structured their data, as I said before, this is just not feasible. Calling it every time is fine for a small amount of data, it won't really take more than a second maybe, but as soon as you start requesting decent amounts it takes a very long time, and eventually it's going to crash. So better data management is needed, and that's where Django comes back into its own. Using Django, it's really simple to integrate a SQL database, and I basically just copied all the data Defra had and whacked it into this database, which Django now queries. It leaves Defra alone, sort of; because the data needs constant updates (Defra updates every day), I just have a worker process in the background saying, oh, it's morning, go and collect some new data and write it down. And now any combination of millions of data points is available. You want every three o'clock on a Wednesday outside Aberdeen? Brilliant, it'll do it for you, no problem.
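A hypothetical shape for that table, as a Django model the ORM turns into SQL, plus the sort of query the "three o'clock on a Wednesday" example implies; none of this is the actual ukatmos.org schema:

```python
# models.py -- one row per site, pollutant and hour (illustrative schema).
from django.db import models

class Measurement(models.Model):
    site_code = models.CharField(max_length=10)
    pollutant = models.CharField(max_length=30)
    timestamp = models.DateTimeField(db_index=True)
    value = models.FloatField(null=True)

    class Meta:
        unique_together = ("site_code", "pollutant", "timestamp")
```

```python
# "Every three o'clock on a Wednesday outside Aberdeen", roughly:
rows = Measurement.objects.filter(
    site_code="ABD",        # hypothetical Aberdeen site code
    pollutant="NO2",
    timestamp__week_day=4,  # Django counts Sunday=1, so Wednesday=4
    timestamp__hour=15,
)
```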
So that's where it's at at the moment. It's still early days, but it's been a good learning curve, and there are developments I'd like to make. There are many, many bug fixes that need to be done: it's quite easy to go on that web page and break it, it doesn't work on a mobile, for instance, and it doesn't really work on Internet Explorer, but it's a work in progress. I'd like to integrate more data with it, so more stations, a lot of European stations, a lot of council stations. This picture here is a new sensor for CO2, even though I said I wasn't talking about that, but we could include it, on top of Blackford Hill, by the observatory, so that data could be available soon. We're also talking about satellite data and models, although then you're going from gigabytes for entire decades to terabytes per day, so data management starts getting a lot more intricate, I suppose. I'd also like to get more feedback from users, the people for whom this is actually useful. I've made some plots, but I've shown you simple ones; what would be really useful is comparisons against different things.

So that's where we're at at the moment, and I'll finish with lessons I've learned from doing this and going into the unknown. The first one is just jump in. I spent a long time thinking, that doesn't quite fit what I'm doing, but you'll never find the perfect tutorial, and it's better to start with something that's very imperfect and build it up than to waste your time trying to find something better. In that sense, be adaptable. I started with Django, it didn't quite work for what I wanted, I went with Dash, I looked at some other things, I went back to Django; there's no point belligerently sticking with things. And so don't be scared to make the wrong choices. I started a lot of websites that were just not right, not what I wanted, but I'm very prone to just sitting there twiddling my thumbs, agonising over what I'm going to do. But take your time to learn new things; in my case, pandas is what I should have learned, though I suppose that's the case in any walk of life, not just Python. Don't get bogged down by the little things. Writing this website, I found it a lot better to quickly do something that makes you feel like you've achieved a lot, and then say, all right, I'll play with the colours of the bar plot later, or the spacing, that doesn't really matter right now; what I want is to get something going and get excited about it. But with that in mind, keep an eye on what you're trying to do, because you end up looking at these small things and wondering, what is it I'm trying to do? Is spending a week going over this bit of resampling code actually useful for anyone, or is it just something I want to do? Also, don't reinvent the wheel, and this might be a case for academics especially, because I know people, myself included, are always hesitant to use other people's code; it's always a bit scary putting your faith in results from what is almost a black box. You put data into a Python module or a website and it comes out the other side: what's it going to show? If you know what it does step by step, that's good, but you're going to waste a lot of time doing that, and at some point you've got to trust people; you can't redo everything. And lastly, go for a walk. If I get stuck, I've found that's the best way to deal with it, just go out. And especially with air quality: if I go outside and it smells, I'd better do something about it. So that's me, thanks for listening.
I'll just repeat the question back: he asked if I'll be at the sprints. I've not planned on being, but I live in Edinburgh, so I could be; I've not actually looked at what the sprints are. And that next question, if you didn't hear it, was about using data from the Scottish government in Edinburgh; Friends of the Earth have lots, and there's so much data out there to be used. Defra is just a starting block, but I have a few friends who work at the Scottish Environment Protection Agency, specifically looking at air quality, and they have a lot more stations available, and I'd love to use them; the more the better in terms of educating people. The local monitoring stations in Scotland, lots of them are placed along roads, and Scotland has been breaking legal limits along these, so that's the motivation behind trying to better monitor, check and visualize these results. I'll speak to you later, but the more data the better, basically; I'll put you in touch, thanks. Thanks for a great talk.
So if you're using data from various different sources, like you were just saying, do you think that what you collect from different types of stations will be comparable with each other, or will there be technical variations in those? There is a problem with that. For things to be directly comparable, people argue: different types of instruments might have different calibrations, and sites are classified by different environments. Edinburgh is considered urban background, but you might get urban traffic, which is on a road, and while you can say one is more polluted than the other, they're not directly comparable. You can't do it easily, but it's doable; there are ways around it. It's not just this number versus this number, basically.
There was another question, yes, this is the last one: it is open source, and yes, I would accept a pull request; it's a mess right now.
Yeah, one last quick question, and this is a question from friends of friends, so it's a real case. The parents at a primary school are convinced that the air around the school is bad for the children, but they don't have a way of convincing the local council that the air quality there really is that bad. Do you have any suggestions, or any toolkit, that citizens or the parents of the school could use to collect data and convince the authorities that this is a problem? It's difficult, because there are a lot of people who think that, and I would argue rightly so. There are lots of groups, I know from the university side, and I imagine from commercial sides as well, that are actually looking for ways to test their instruments and gather data. There was one recently from the University of Birmingham: they did a study around schools within Birmingham, where they brought in some monitoring stations. It didn't cost the school anything, didn't cost the parents anything, it was a research project done by Birmingham, but they fed it back into the community and got the community involved, and their results actually led to a no-traffic zone around the local school. So they are out there. Unfortunately I don't do anything directly measurement-wise, but I could write down a few places you might be able to look afterwards. Okay, that's all the time we have allotted,
so let's thank the speaker.