Citizen Science with Python
Formal Metadata
Title: Citizen Science with Python
Series: EuroPython 2018 (talk 61 of 132)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifier: 10.5446/44944 (DOI)
Transcript: English (auto-generated)
00:00
So welcome, and I'm really happy to be here. This is a huge honor, to be given a chance to provide a keynote. I'm always pushing data science on everyone, so my goal this morning is to try to educate you and convert you into the field of data science, and then bring you into my meetup and the PyData,
00:21
well, the general data science world. So I'm an engineering data scientist. Even if you're outside of the data science world, you'll have heard a lot about data science. It's one of the sexiest things at the moment, and it gets all of the mind share. Many data scientists come into this world via PhDs
00:40
and postdocs. That's not me. I came in via a theoretical computer science background 15 years ago, so I've taken the other route into data science. And I'm talking to you today more as engineers coming into this world, rather than academics coming into the data science world.
01:01
So I work, I've been running my own company for nearly 15 years. I coach, I train, I act as an interim senior data scientist in teams that are lacking a senior. And one of the reasons that I do that is that I love to learn new things, so I keep challenging myself to go off and work in new domains. A couple of the stories today will be medically focused, but I've worked in a number of domains
01:22
just because I really enjoy learning new things. And then I love sharing those new things that I learn, and where possible, I run experiments, including upon my wife, as you've already heard, and then see what I can learn. So along the way, I've written a book on High Performance Python with my co-author, Micha,
01:43
who was at Bitly and has moved on to Cloudera, I think. I co-founded the PyData London meetup, which Alexander has mentioned. So I'm really proud of PyData. We've got over 100 PyData events around the world
02:00
and a set of conferences. Of the 100, my London one, built with Marco, who's here, and a number of other colleagues, is the largest in the world. I'm super happy about that. We've got 7,500 members. I would love to invite some of you to come and join us at PyData London. But also we've got a PyData Edinburgh, we've got a PyData Cardiff in the UK. There's a number of PyDatas throughout Europe.
02:22
There's a PyData Frankfurt, I'm reminded. You really, if you're interested in this at all, it's a very friendly and welcoming community. It's all the nice things about the Python community with people who like talking about data. So if you're at all into that, go and join. I consult through ModelInsight, and I work with companies like Mitsubishi Finance Bank,
02:40
Channel 4. I've been working with QBE, a very large insurer, helping them figure out how to apply machine learning to their insurance. So I work with large companies where there are large projects; they can be very slow, they can have a big impact, but they're big corporate things. That's not what I'm talking about today. So I'm gonna give you some stories on citizen science.
03:01
These are small, either individual or lightweight projects inside various organizations, that are public projects. I'll be giving a crowd-led demo, and I mean, I'm sacrificing chickens to the demo gods here: I'm giving a live demo with JupyterLab, so that those of you who haven't seen
03:22
the JupyterLab environment before, you can see how a data scientist might work, and you might be inspired to go and try this as well. I need you to participate with me. I'm gonna be sharing a link for a Google form twice. No login required, you just visit the link, and then there's a single question,
03:40
you type in an answer, you hit submit. So you can do that on your phone. I know that the Wi-Fi in here might be a bit tricky, but the form is very lightweight. It works over your 4G connection. So for anyone who's got a mobile phone, I would very much like you to take part in this little experiment. You just go to this form twice, type in a number, hit submit, and then having done that, you can submit another answer.
04:00
So if your neighbor doesn't have a device, you could pass your device over, and they could submit their answer as well. It's all anonymous. No logins, no complexity there at all. There are two appendix slides as well that I'll show; they've got all the links for the talks that I'm using, these stories in here, so you can follow up and learn more about them. So first of all, I'm gonna talk
04:21
about Macedonian air quality. Has anyone been to a city with really bad air quality before? Okay, so a bunch of us have. So when I was at PyData Amsterdam about six weeks ago, I met a chap called Gorjan,
04:42
I think Jovanovski, I think that's his name, and he told the story of the Macedonian smelly fog. And as he tells it, every year the smelly fog descends. So this is not a photograph of some strange cloud layer over the city. This is the smog in the city
05:00
taken from above, looking down. And this is what the populace lives in. You can see some of the skyscrapers just peeking through at the top there. So this bad weather descends for many months of the year, every year. It's a known thing. Everyone just says, it's the smelly fog. There'll be rains later, it'll clear, it'll be okay. In between, the government issues warnings. Anyone with breathing difficulties
05:21
or anyone who's a baby maybe shouldn't leave the house today because the air is particularly lethal. And then they changed the limits. My wife and I, when we lived in Chile, we had a similar thing where the government would change the red levels and increase the limits for what it meant for a day to be a red day when you shouldn't leave the house. But it's kind of terrifying when you can't see down the street because the pollution is so thick.
05:41
Gorjan, who was learning programming at the time, took some government open data about pollution measurements. He wasn't very confident in what he was doing. This was about five years ago, I think. He takes the data and he draws some graphs, and he knows he's made mistakes, because when he draws the graphs, the numbers make no sense.
06:00
He takes it, it's an undocumented data set. So he hasn't got any guidance there. But every time he's drawing these numbers, it's crazy. These numbers are significantly higher than anything he sees around him. He does some reading online. These numbers are consistently four times higher than the bad pollution that he's read about in Beijing and 20 times the numbers expected in the EU
06:20
at the worst possible case. And these are the daily readings that he's experiencing. And after a while, he realizes, oh, they're true. These numbers are actually correct. And he's the first person that he's found who's playing with this data set. So this is really awful, right? There's killer air, 20 times EU pollution limits. No one in a country is talking about it. So he writes a webpage graphing these results.
06:44
And then he finds some other people who care about this topic as well. There's a lot of people, it turns out, in Macedonia who care about the fact that they're being poisoned on a daily basis. So he finds some people and they popularize these results. And within a month, they've got a million people consuming these results, first off a webpage and then off of a mobile app.
07:01
And it gets to the point, that diagram on the right there, that's a member of parliament holding up printouts from the website in parliament, discussing the fact that there's this issue that transcends any nationality, sex, education, wealth bracket. This is the air that everybody, every politician, every child is breathing, and it's killing them all.
07:21
And maybe this needs to be discussed. And the incumbent government doesn't want to discuss this because this is a bad topic. And then the government that wants change is talking about it and using this to generate some action. There was an interesting part in Gordon's story where he talks about how the Minister for Ecology
07:40
goes online, I think goes onto the radio, and says, this is all lies, the data is wrong, it's all a lie, it's a conspiracy. So Gorjan comes back and says, look, I'll come into the government with my data, compare it to your official data, we'll compare it to your paper records. If I'm wrong, I will apologize, delete the app and remove everything. And if you're wrong, you resign. And then that was the last they heard from the Minister of Ecology.
08:01
And I asked Gorjan, how did you get the data? If the data's this bad, why would the government release it? He said, ah, well, Macedonia wants to be in the EU. The EU provided these sensors. A requirement of having the sensors is the data has to be published. The data was published, not documented, but published. So as a result, the data was made available, but then just not pushed, not documented,
08:20
not investigated in any way. And it took someone like Gorjan to go and do something with it. So they're improving upon this now. They've gone from this single dump of data to frequent updates. So it drove government policy change during the election, and the incoming government promised rapid improvements in air quality standards.
08:41
They won the election; nothing changed after that. And so this is clearly gonna be an ongoing slow process. But some things did change. Using the mobilized population, they tracked down a highly polluting incinerator. It turns out it was a British-supplied, highly polluting incinerator, something that we in the UK got rid of over a decade ago
09:01
because we wouldn't meet the EU limits with it. And the Guardian wrote about this in 2001. So this was gifted out. And as I read it, in fairness, it was better than what was available at the time out there. So it was an improvement; it just should have been better. But as a result of highlighting the problems with this unit, the fact that it was running 24/7 rather than within strict timetables,
09:21
and that it didn't have certain filters on it, they got it fixed. So it was far less polluting. The big step up they're doing now is a collaboration with the European Space Agency and the Copernicus Project, looking at real-time satellite data, which doesn't depend upon where a government places sensors,
09:40
which may or may not be in a sensitive location, but take satellite data, which sees everything. And they're beginning to analyze that to drive further change. So what can we learn out of this? Well, the simple lesson here is, draw a graph of unseen data. So if you can find the data set that no one's looked at, and there's a lot of open data, I've got links to that in the appendix here,
10:00
go and find some data that no one's drawn before and draw it. And you can draw it in Excel if you want. A lot of this stuff, it's really easy. It's CSV files typically. Draw the data and tell a story. Find some people if you want to try and get some change around this, and then see where that can take you. This is a really easy entry point into data science, just getting some data and drawing it.
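As a minimal sketch of that load-and-draw step, assuming you've found a CSV of pollution readings (the file name and column names here are hypothetical stand-ins for whatever open data set you find; the 50 µg/m³ reference line is the published EU daily PM10 limit, not a figure from the talk):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-ins: swap in whatever open CSV you have found.
df = pd.read_csv("pm10_readings.csv", parse_dates=["timestamp"])

# A first look; summary statistics often expose impossible values.
print(df["pm10"].describe())

# Draw the unseen data: one line is often enough to tell the story.
df.set_index("timestamp")["pm10"].plot(title="PM10 over time")
plt.axhline(50, color="red", linestyle="--", label="EU daily limit (50 µg/m³)")
plt.legend()
plt.show()
```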
10:20
And there's an additional slide here that I've just put in. A month ago, at PyLondinium, my colleagues Robert and Olivia talked about this personal air quality project they're working on, a Raspberry Pi device with a low-cost sensor. That kit up there costs about 60 pounds. They can use it for monitoring in the house. There's the infamous dirty sausage story,
10:41
which you can see told in that presentation if you go and watch it. And what they're doing is they're mounting these sensors on the backs of pedestrians and cyclists. So as they go around, they're monitoring their own personal air quality as they travel around the street so they can make better choices about the streets they're taking and the pollution they may or may not consume. And there's a talk this afternoon by Douglas Finch on air quality in Python
11:00
that if you're interested, I'd suggest you go and attend. So here's the first audience participation moment. So I would like you to guess the weight of my dog and you know nothing about my dog. So this is a wide open, simple survey. There's, if you go to that, bit.ly.com keynoteada1,
11:21
you can guess that the second one will have a two on the end, but don't go there yet. So, bit.ly/keynoteada1. There's no sign-in, and you can go there on your mobile phone. That link will appear on the next set of slides. So you can go in there in the next couple of minutes. Please guess the weight of my dog in kilograms. Only put in a number. So if you put in any text, it gets stripped out.
11:42
And I'm gonna give you no information right now. Later on, I'll give you some more information. You make a second guess and then we'll compare those two sets of results in a Jupyter notebook. So just a number, no negative numbers, kilograms, so nice round numbers or low numbers. Nothing too crazy. Please don't be the clever person
12:01
when I gave this last time, who types in N-A-N to see if my parsing routines work. They do, but there's no need to test this. Kilograms, kilograms, please. Yes, when I ran this last time, I left that deliberately blank, and immediately an engineer that I know dived in asking about my requirements for units,
12:21
and I love that. But yeah, kilograms only. So we've mentioned my wife sneezing. She's here in the audience. And I love the fact that she supports me in running these experiments upon members of my family, including her. So my wife sneezes a lot. And when I say a lot, there's a histogram on the bottom right hand corner.
12:42
We wrote an app where she can record when she sneezed. So it just records every time she sneezed. So the left hand bar is the days when she sneezes zero times. This is over the course of about a year. So there are about 35 days when Emily didn't sneeze at all during a day. The next bar is when she sneezed one time in a day.
13:01
Well, there are about 20 of those days. And then two times a day, three times, four times: there were about 40 days when she sneezed about four times a day over the course of the year. The far right side is 28 sneezes in a day. That was a particularly bad day. And then the question was, Emily was a mobile developer at the time, could she write an app, an open source app that had benefits to other people
13:22
suffering from different conditions, a generalized app for personal medical healthcare? And could I analyze the data to see if we could find possibly correlated, possibly causal connections between events, to see what might drive the sneezing? So we had a hypothesis: there are environmental factors that drive sneezing. If we record all of these factors,
13:41
can we do something about it? So this is the app that Emily built. It's an iOS-based app, open source. It has event logs with a simple button interface. You can just tap when something has happened: you've got a runny nose, your eyes are itchy, and particularly, 'I've sneezed'. We talked about whether you could use the device to automatically record sneezes.
14:02
So you get a physical jerk, you get a loud noise. That would be quite a lot of work. I didn't want to go quite that far. Tapping a button was easy enough for the first version of the experiment, but I can see lots of ways you could automate elements of your personal reactions collection over time if you suffer in that way, which we might imagine seeing in future devices.
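As an aside, the per-day sneeze histogram described a minute ago is only a few lines of pandas. This is a minimal sketch; the file name and column are hypothetical stand-ins for the app's export:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export from the app: one row per logged sneeze.
log = pd.read_csv("sneeze_log.csv", parse_dates=["timestamp"])

# Count sneezes per calendar day; resample keeps zero-sneeze days.
per_day = log.set_index("timestamp").resample("D").size()

# Days on the y-axis, sneezes-per-day on the x-axis, as in the talk.
per_day.value_counts().sort_index().plot.bar()
plt.xlabel("Sneezes in a day")
plt.ylabel("Number of days")
plt.show()
```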
14:22
Oh, is that a hand up? Oh yes, for the survey please, just one answer per person. And then there'll be the second survey where you put in a second answer later on. Thank you. And so yes, open source app, editable history, records GPS traces.
14:40
I will say one thing there with the GPS traces: I take periodic updates from Emily and then I do the analysis. It was really weird realizing that I had the same kind of view that Google and Apple have watching a person's movements over time. It feels incredibly intrusive. And of course it is incredibly intrusive. I only got it with a lag, but nonetheless, you get this view. And it's a view that Apple and Google
15:00
and any other controllers of our data, any mobile phone company has all the time. And if we aren't looking at that, we never think about it. We kind of just take it for granted, but when I actually had it in my hands, it's really weird to have that. So one of the reasons I encourage people to run these kinds of experiments is it makes you think a little bit outside of what's normal to you in your everyday life and how you're interacting with the world and the data that's available.
15:21
So we're gathering all this data and there are a number of things we got out of it. I've given a couple of talks on this. I'm just gonna show one little result here. So here we're looking at a single-patient antihistamine effect. So Emily sneezes a lot. She takes antihistamines roughly every other day. On a day when an antihistamine is taken, what is the effect? Well clearly Emily thinks she needs the antihistamine.
15:41
She's sneezing. She already feels like she's sneezing. It's a day with a high propensity to sneeze. So what effect does the drug have? So on the left hand side, we've got all of the traces for when individual sneezes have occurred. So this is a period of 12 hours after the first, or only, antihistamine of the day has been taken. So when the antihistamine has been taken,
16:01
whenever Emily sneezes, she's tapping away, but she's already recorded an antihistamine has been taken. So if I take those days and then say at the zeroth hour when an antihistamine was taken, count all of the sneezes and then we just get a single count. That's the blue line on the right hand side. So hours zero and one after an antihistamine was taken, the sneezes are high.
16:20
They're close to 50. Two hours after the antihistamine was taken, we see a marked drop. The total number of sneezes over all of these antihistamine days is markedly lower, and it stays low for about eight hours and then it increases again. And we might ask, well, what's driving that? So the dotted line behind, that's just an extrapolated line that I put together.
16:43
I know that the antihistamine that Emily was taking at the time takes about two hours to enter the bloodstream and have an effect. And then it has an exponential decay curve, so that it drops off with a certain half-life. And so I can plot that extrapolated line based on the simulation. And we see that that two-hour point
17:00
is when the sneezes drop down and then as it decays to a certain point, the sneezes pick up again. And of course, this is a general result but this applies to everyone in different ways based on personal biology. So based on the kind of medication you're taking, you might have a different reaction, a different effect. It might last for days. It might last for only hours. This particular drug, other drugs,
17:20
might work in different ways, better and worse ways. So here's a nice simple way to record the data and see how it works for you, to improve your own personal healthcare; a sketch of that hourly counting follows below. Now I had the strong hypothesis that there were causal factors in the environment that drove the sneezing. And I worked awfully hard, with a couple of colleagues, really, really hard, trying to find any evidence of this causal connection.
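Here is a minimal sketch of that align-and-count analysis plus the extrapolated decay curve. The file names and columns are hypothetical; the two-hour onset is the figure from the talk, while the eight-hour half-life is an assumption for illustration only:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical exports: one timestamp per sneeze, one per antihistamine dose.
sneezes = pd.read_csv("sneezes.csv", parse_dates=["timestamp"])["timestamp"]
doses = pd.read_csv("doses.csv", parse_dates=["timestamp"])["timestamp"]

# For each sneeze, whole hours elapsed since the most recent dose.
offsets = []
for s in sneezes:
    prior = doses[doses <= s]
    if not prior.empty:
        hours = (s - prior.max()).total_seconds() / 3600
        if hours < 12:
            offsets.append(int(hours))

# Total sneezes in each hour-after-dose bucket, over all dose days.
counts = pd.Series(offsets).value_counts().sort_index()
counts.plot(drawstyle="steps-mid", label="sneezes per hour after dose")

# Extrapolated drug level: ~2 h to take effect (from the talk), then an
# exponential decay; the 8 h half-life is an assumed value.
t = np.linspace(0, 12, 200)
level = np.where(t < 2, 0.0, np.exp(-np.log(2) * (t - 2) / 8.0))
plt.plot(t, level * counts.max(), "k--", label="modelled drug level")
plt.xlabel("Hours after antihistamine")
plt.ylabel("Sneeze count")
plt.legend()
plt.show()
```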
17:41
We couldn't find that causal connection. We found one result: there was a weak relationship with humidity. As the air got drier, the propensity to sneeze increased. As the air got damper, the propensity to sneeze decreased. And it turns out your nasal lining, the mucous membrane in the nose, when it's drier, it's more irritable. And so you're more likely to sneeze,
18:01
all things being equal, just because the nasal lining is drier. So we can't control humidity, but it is interesting at least to find a proper result in there. Now we escalated this, took it to a King's College professor, one of the top professors in the world, connected via our PyData London community. And he said, this is an amazing result. Clearly this is a non-allergic reaction going on. It's a chronic and persistent rhinitis.
18:22
So Emily is primed to sneeze just because that's the way her body is working. And there are no environmental factors. We had data for different countries, different seasons, different allergen types in the air, what kind of travel we were doing at the time, London Underground buses, all sorts, no connection at all with any of that.
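The humidity finding above is the kind of relationship a simple correlation check can surface. A minimal sketch, with hypothetical file and column names; a correlation like this suggests a link, it doesn't prove causation:

```python
import pandas as pd

# Hypothetical daily file: sneeze counts joined with weather readings.
daily = pd.read_csv("daily_sneezes_weather.csv", parse_dates=["date"])

# Spearman rank correlation is robust to outliers and non-linearity.
corr = daily["sneeze_count"].corr(daily["humidity"], method="spearman")
print(f"Spearman correlation, sneezes vs humidity: {corr:.2f}")

# Eyeball the (weak, negative) relationship before trusting any number.
daily.plot.scatter(x="humidity", y="sneeze_count", alpha=0.5)
```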
18:41
He did suggest some new treatments, which we tried, but we didn't get any improved results out of them. So it had some benefit: it ruled out another treatment method. I mean, the antihistamine works just fine, but we were looking to see if there was a better solution here. But the important takeaway here is that graphing was enough to get a diagnosis, and the machine learning did give us something new, but you don't need to go all the way through
19:01
to machine learning in a data analysis project. Typically, getting good enough data, drawing graphs and having someone who can interpret them is what you need. And that's the key takeaway here. And I'm gonna repeat that lesson a little bit more. And you might want to see Marco Bonzanini's Lies, Damned Lies and Statistics talk in a couple of hours' time,
19:20
where he talks through some of the issues around data analysis. If you're interested in this, that might move you on a little step forwards. Okay, the second guess-the-weight exercise. So I've got an English Springer Spaniel. You get some sizing evidence there from some of those photographs. The photographs appear in some of the subsequent slides and then there'll be the second link,
19:41
bit.ly/keynoteada2. Just go there and give a second guess for her weight in kilograms, so a number only. I'll let you look at those photos, those lovely photos, for just a minute. I'm gonna move on, but you'll see a few more photos.
20:02
So you can make a guess in a minute or two if you want, when you've seen a few more. So, oh, and that's my dog, who I clearly upgrade with sensors as well. That's a video camera on her back, from one of the experiments that we run on her. So, updating outdated medical results. This was a talk given at PyData Warsaw
20:23
last year by Anna. So here she's looking at updating outdated medical results. It was a really nice lightning talk. I didn't realize this. It turns out that in birthing centers, maternity units, when a woman is coming up to giving birth, there is a critical curve developed about 60 years ago
20:43
by Friedman, which is used to judge whether the woman is on track to give birth, based on time and cervical dilation. At 10 centimeters, the baby is ready to come out, and so you want to track that the cervix is dilating appropriately over a period of time. And if the woman is progressing too slowly,
21:00
it's a failure to progress, I think that's the technical term, and then you need to intervene to make sure that the baby comes out successfully. All hospitals around the world typically use the Friedman result from 60 years ago. The Friedman result from 60 years ago was developed when we had different technology. Women gave birth at different ages. Women had different levels of health.
21:22
The drug interventions and mechanical interventions were very different, and our understanding of bodies was very different. And yet, 60 years later, we still use the same guidelines. And it turns out, increasingly around the world, there is discussion about whether this is actually wrong. And so Anna was part of a team looking into how this might be wrong and how it might be fixed.
21:42
And the important point is when a doctor chooses to intervene because of a failure to progress, that nice phrase covers either drug intervention or perhaps a cesarean operation, which could have significant negative impacts on the patient and on the baby. And then the question is, well, do you need to worry about this at this stage? Or actually, are we intervening too soon?
22:03
So she and colleagues conducted recordings on a couple of hundred mothers, I think first-time and second-time mothers. The link is in the appendix. You can go and watch for the details. They recorded the cervical dilation of all of these mothers over a number of hours,
22:22
about 12 hours, I think. And then what you see there with those box plots: those boxes represent the majority of mothers' readings at each of those hourly bars. So at the one hour point, cervical dilation was between zero and three centimeters. And then by the four hour point,
22:41
it's between what, three and eight centimeters. And then typically by say six hours, at least some of the mothers have reached the 10 centimeter dilation. The baby has popped out and they're finished. And then there are other mothers still progressing in their birthing. This center doesn't practice cesarean operations and drug intervention. So they typically see all of their mothers
23:00
through to successful delivery without intervening. But there are medical facilities if that was required as an intervention. Lots of other hospitals follow the Friedman curve and intervene early if they believe it's necessary. So the red line is the Friedman curve. If any mother is above this curve, she's progressing on track or faster than expected, and that's fine.
23:21
If she's below that point, and on the right hand side, that's those black dots below the red line. And those black dots might be one or more mothers at that point. Then they're not progressing fast enough and that's when a doctor has to intervene according to the classical result. But all of these mothers had no intervention and gave birth successfully. And so this is one of a growing body of evidence
23:40
being gathered around the world in different birthing centers showing that this intervention strategy is inappropriate or could be inappropriate and that some refinement is required to improve the quality of healthcare for these mothers. So what do they do just having graphed this and shown it? Well, they then took an extra step. Can they give an interpretable result that staff in the healthcare unit can use?
24:03
And they used some machine learning to develop a decision tree. From a machine learning perspective, that's an incredibly trivial result. It's a really, really simple, old-fashioned, single decision tree. It's not deep learning, it's not big data, it's none of the buzzy things. But this is an incredibly useful result. This is interpretable by the staff
24:20
in the birthing center. It's a flow chart effectively saying, help me make a better decision than what's available in the textbook. This is incredibly useful. And so you can see if you're a first time mother, go to the left side. And then based on your weight, go left or right. And then based on your height, go left or right. And then we predict how long it should take you to have the baby. And then if you're not progressing within that time,
24:42
appropriately, that's a secondary bit of evidence, suggesting that maybe an intervention is required, or that actually you're on track, under the time, and everything still looks sensible. So this has been introduced to the staff there. They like the idea of this, they want to do something with it, and they're doing further experiments. So what are our lessons here? Well, check for outdated assumptions.
25:01
Many of you work in organizations that are old. They're large, old organizations. They will carry lots of historic baggage. Maybe some of that baggage is outdated. Lots of it probably is. Some of it, if you fixed it, just by reviewing the data that you've got available, maybe you could make better decisions. So maybe that saves time or money or improves people's interventions or whatever the metric is you want to use.
25:22
People forget to go and check on these outdated assumptions. They just become a matter of fact. But if you've got access to the data, because you've got access to a database or an Excel spreadsheet, or whatever it is that you've got, maybe you can go and draw some graphs and think about interpreting that evidence in a way that helps make better decisions. And one of the important outputs there is to make interpretable advice.
25:42
Don't make a really complicated system just because you could. Instead, go and make something that is interpretable by your colleagues. One of the big challenges I've been talking about in the last couple of years in my public talks is around interpreting machine learning output so that you can go to a non-machine learning colleague and explain why this system is saying a certain thing.
26:00
And that flow chart there, that decision tree, is exactly that kind of output that you want. So if you wanted to make a guess for Ada's weight, having seen some more photos, now is about the time you want to do it: bit.ly/keynoteada2. I think we've run out of pictures as I go on to the last little story. So this story, this is the last of the stories
26:23
before we do the little demo: where are the orangutans? So my colleague Dirk Gorissen, he runs the London Machine Learning Meetup, which is a rival to my PyData meetup, but rather than the data science stories, it focuses far more specifically on machine learning
26:40
and advances in machine learning. It's a similarly large meetup, very, very popular, hosted in the same hedge fund, AHL, who hosts my meetup. We're both super grateful to that company for hosting us there. I mean, the meetups that Dirk and I have, there are about 200 attendees every month, free beer, free pizza, fully hosted,
27:00
which is the size of a small conference for free every month, which is lovely. That's a lovely example of community contribution to help us progress our own goals. So Dirk runs this machine learning meetup and he's got this personal project. So some years ago, he was involved in a company, a commercial organization looking to track animals
27:20
in the wild to see if you could intervene and monitor to provide better care for animals that are in the wild. That company didn't work out. He managed to acquire some rights to carry on working with the underlying technology and he found a charity who wants to work with this, specifically around orangutans. So it turns out orangutans, they are very bright primates, they can be a pet,
27:43
and then they get bigger and they get less cute and then people just get rid of the pet. They live in areas that suffer deforestation and farming and they can be mistreated. And so you have aid agencies, that's the picture in the middle there, going to rehome the animals that have been found.
28:02
And one of the problems with rehoming is, once you've rehomed, how do you know you've done that successfully? How do you know that the animal is happy and has integrated into the new environment, and that your strategy for rehoming is a good one? If you can demonstrate success, you're likely to raise more funding, and if you can't demonstrate success, you've got a problem. And of course, you want to be successful
28:21
with rehoming these animals into a nice environment. So the way you do this is you take a little radio transmitter, the device on the right, and you embed it in the body under the skin. You can't put on a big tracker. These are very bright creatures that don't want a big bracelet strapped onto them. They don't have necks. They've got these big, thick, stubby bits.
28:41
And whatever you try to adorn them with would have to survive years in a rugged environment with an animal who's not afraid to be a bit heavy-handed. So they put these subcutaneous trackers in. One of the problems there is there's limited range then. You've got a radio tracker that gives out a weak signal. And the way you track it is a human turns up to the point where they saw the animal yesterday, that's their best guess
29:01
as to where the animal is today, and walks around with a radio tracker, and if it starts beeping, brilliant! That means there's a signal within 200 meters, in dense jungle, and they walk back and forth trying to make the signal stronger. And if they get the signal stronger, hey, they found the ape, brilliant! And if they don't, well, they try again tomorrow. And at the beginning, when they release an animal,
29:21
there are teams of two tracking 24-7 for several weeks and then it becomes more intermittent. And then coincidentally they discover other animals that were released and they can start tracking them. But it's kind of bitty and it's really time intensive. So can we automate this? So Dirk's project, can we use drones to automate this? Really sensible idea. Can you send a drone back and forth across the sky with a radio receiver,
29:41
picking up the radio signal, processing it and then providing some kind of GPS locations? Really sensible idea. Turns out doing this on your own, when you don't have a background, for example, in radio signal processing and drone dynamics and automated flight systems and the like,
30:02
means that you take some time to build this up. Now, Dirk's a very smart guy. He also works on autonomous self-driving vehicles at a large funded company. So he does have a good strong background in engineered robotics. But nonetheless, building a drone to fly in a jungle autonomously is a non-trivial operation.
30:20
So if you were to watch his keynote talk, he talks about the Python-powered software-defined radio behind this because they have to pick up the raw radio signals over quite a wide spectrum and then do post-processing, things like the humidity in the jungle affects the signal propagation and the wavelength being used. So they have to process to find these pings. There is no simple API that just finds the pinging device.
30:42
They have to go and do the raw processing themselves when the drone comes back. And that means then that you send this drone off, it flies a flight path, it comes back, hopefully it comes back, and then you can process it to find out what it has recorded. It's not a real-time system, which can lead to some problems.
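As a rough illustration of that offline processing, finding a beacon's pings in recorded samples can be as simple as watching the energy in one FFT bin. This is a toy sketch on synthetic data, with assumed sample rate and tone; the real pipeline has to cope with a wide spectrum, frequency drift and propagation effects:

```python
import numpy as np

# Synthetic stand-in for recorded SDR samples: noise plus a 20 ms
# beacon ping once a second. Sample rate and tone are assumptions.
fs, beacon_hz = 48_000, 4_000
t = np.arange(10 * fs) / fs
samples = 0.05 * np.random.randn(t.size)
for start in np.arange(0.5, 10.0, 1.0):
    ping = (t >= start) & (t < start + 0.02)
    samples[ping] += np.sin(2 * np.pi * beacon_hz * t[ping])

# Slide a short FFT window over the recording and watch the energy
# in the beacon's frequency bin; peaks above threshold mark pings.
win = 1024
bin_idx = round(beacon_hz * win / fs)
hops = range(0, samples.size - win, win)
energy = np.array([abs(np.fft.rfft(samples[i:i + win])[bin_idx])
                   for i in hops])
threshold = energy.mean() + 3 * energy.std()
ping_seconds = [i / fs for i, e in zip(hops, energy) if e > threshold]
print(f"{len(ping_seconds)} ping windows detected")
```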
31:01
So here are the results from one of their test runs. They were releasing an orangutan called Susie. They knew where she was being led away by a keeper to be released. So they took the drone, and I'll show you some videos of this drone in just a minute. They took the drone, set it off, and it starts flying. In the middle top diagram, the green one, you can see those black dots.
31:21
You can see its traces, basically flying up and down, like flying up and down a field, but it's flying over an area of jungle. And then it gets to the bottom piece and then it flies straight back up and it's returning home. And then when it gets home, you can process the data. Now, Dirk developed this in the UK, on heathland, which is very different to a Bornean jungle. When they were there, they discovered
31:41
they had to fly the device lower because the signal quality was worse. But nonetheless, you can see areas of poor signal and then bright signal, strong signal, and that's where this orangutan was. They had a successful test flight. So then the question was, well, how do we take this further out and do some more work?
32:01
So I'll show you two videos. Now, there's no audio. There should be audio, but it didn't want to work. So you're going to pretend that there's a buzzing sound going, bzzz, because that's all the audio really is here. Bzzz, oh, there we go, right, bzzz, right.
32:21
So here we've got, that's the drone that Dirk is using. And what's recording it is a professional drone with a camera rig on it, which is incredibly stable. So you can see this other drone in the background. It's now going off on an autonomous flight run, just on a calibration run. And so it flies off across. This is out in the jungle, but on the edge of the jungle,
32:41
in a very safe area where they're developing. And so this thing flies out, and it's all very sensible. And then I think we see it somewhere in here. You'll see it. So you can just see the shadow coming down in the middle, and then that's the drone going down to land. So it flies autonomously, brilliant,
33:00
in a nice, wide, open area. Then you get to the release site. And so this is on the drone itself. This is Dirk's unit. It's flying up. We know this is one of the test runs because it came back, and they got the video. So this one flies up. You notice the hole in the canopy, and then you notice there are no other holes in the canopy around it.
33:20
So when this thing flies off, it has to fly back to exactly the right spot to come down and land. And so it's going to fly off on quite a large route. So it's got tens of kilometers that it'll be flying off, but there's no radio link. And so this thing flies up, and I think it flies around a little bit. You get the idea, dense jungle. You can't see the orangutan.
33:41
From the ground, you can't see the sky, so you can't see the drone. But then this one comes all the way back, and then it comes in, and it lands again. And so this is nice. It comes in. It successfully lands. Everybody's happy. They're ready for this. They know that there's an orangutan out there. This is a year ago, and Dirk was out a couple of weeks ago for a second run,
34:00
but this footage is from a year ago. They knew roughly where they wanted to send this device. So they say, go. The device takes off. It's got a little signal tracker. He knows that the device is in the nearby area. It flies off. They hear it go, and then it's going to go off for some period of time. Fair enough. And then they wait.
34:21
And they wait. And they wait. And then the signal tracker shows a bit of signal. So this device is coming back, and then there's nothing. And then they wait. And then they wait. And then there's nothing. And so they decide, right, we've lost the drone, and it's quite an expensive big bit of kit. It's disappeared in the jungle somewhere. And it turns out, by looking back at the maps,
34:41
they thought they had a flat elevation as they were flying a crisscross pattern. And then when this unit flew back, it turns out there was some kind of knoll somewhere in there. So the trees were higher, and then the device flies through. And it's not a smart device, so it flew through, hit a tree, probably crashed, and then that was that. And then actually some months later, it turns out the aid agency found the remains
35:01
of the drone crashed into the tree, and that was exactly what happened, and they sent it home. So that was a disappointing first result, but it did prove that this thing works. And if you follow the keynotes, you'll see that Dirk had lots of problems even getting lithium ion batteries out of the Eurozone. He lost some of them along the way. They were captured by customs.
35:20
And once he was out in the middle of nowhere, then you've only got so much kit you can carry, and then something else breaks, and then you have to start gerry-rigging parts that might just about keep it going. So it's quite difficult to keep this kind of thing working. But the aid agency funded another device, so they went out and did a test run again. I was hoping to get some video of that second run.
35:41
Dirk tells me that it worked better this time when he got the device back, but he doesn't have successful results yet, but they're going to continue with this project. And there'll be links in the appendix if you want to read about that and follow where this project might go next. So hardware is hard. I mean, hardware really is hard. If you've never done hardware, hardware's hard. But freeing up human time is valuable.
36:01
If you could free up those tracking humans who wander around with a radio device just listening to it, and let them go and intervene more successfully and track more animals more consistently, they can only make better decisions with that kind of result. So if you tackle any kind of hardware problem, always expect to iterate a lot. So always break it down into a project you can achieve in stages, even like that handheld air quality monitor.
36:22
Always break these things down into tiny stages that are achievable. So now we're going to do the live demo. So we'll see if this works. I'm a little bit nervous because now I've got to fetch the data from your surveys. So if you remember, I asked you the question, how heavy was Ada,
36:40
without showing you any evidence of what kind of dog she was, and then after showing you evidence of what dog she was. So we should see two different distributions of data, and then maybe we can learn something from that. As a data scientist, I use Jupyter notebooks. This is in the new JupyterLab interface. So this is a web-enabled interactive Python environment
37:01
where you can do charting and graphing and 3D plots and JavaScript, and you can query SQL and big data systems and CSV files and anything that you need, and you can develop it in a way that provides for easy demonstrations. And if you've never used it before, I recommend you give it a go. So you're going to recognize some of the code, but I'm not really going to go into the code that's here.
37:21
So I'm just going to load in, let's see if over 4G, I can, okay, okay, okay, all right. So we loaded the data. We've got the data files down, fine. So these are some examples. These are the last rows of the last time that I ran this, but the rest of the data will be
37:43
for the answers that you've put in. Oh, good grief. A mean of infinity and a standard deviation of NaN. So this is having put in my most robust parsing process possible, in the hope,
38:01
and last time it ran just fine. Oh, well, that might be annoying. Oh, well, let's, if not, I've got the pre-rendered demo on the other thing, and I'll have to improve the slides. Let's, no, no, it's all, oh, good grief. Who put in range parameter? Okay, now skip that one.
38:21
Can, does the next, there we go, all right. So what I would have shown you, and I will show you, no, no, don't debug it. Don't debug it. What I would have shown you is, and this is a pre-rendered one, that one of the first things you always do is load in the raw data and look at it,
38:41
and then you process the data to get rid of your outliers and the weirdnesses, so you can look at the one that's hopefully a bit more sane, getting rid of any mistakes that might have crept through. I'm very curious to see what mistakes actually crept through. Come on, stop doing that. But I'll debug that offline. So having taken out some of the unusual guesses
39:01
that have generated infinite results, thank you to whoever did that. We've got, what, 448 responses in the clipped region, which is pretty sensible. So for this clipped region, I take any number that's one kilogram or more and 60 or less. So 60 is the weight of a large Rottweiler,
39:22
which is a pretty hefty dog. Dogs do go up to over 100 kilograms, but they're pretty rare. They're bigger than humans. They're fairly terrifying beasts. My springer spaniel was much smaller, as you saw. So we've got nearly 500 responses, so I'm really happy with this. We get an interesting distribution, so we get a lopsided distribution, skewed distribution.
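The cleaning step described above, robust parsing followed by clipping to a plausible range, might look like this minimal sketch; the file and column names are hypothetical stand-ins for the Google Form export:

```python
import numpy as np
import pandas as pd

def parse_guess(raw):
    """Robustly turn one free-text answer into a float, else NaN."""
    try:
        value = float(str(raw).strip().replace(",", "."))
    except ValueError:
        return np.nan
    # Guard against 'nan', 'inf' and other creative engineer answers.
    return value if np.isfinite(value) else np.nan

# Hypothetical column name for the Google Form export.
guesses = pd.read_csv("guesses1.csv")["guess_kg"].map(parse_guess).dropna()

# Clip to a plausible range: 1 kg up to a large Rottweiler at 60 kg.
clipped1 = guesses[(guesses >= 1) & (guesses <= 60)]
print(len(clipped1), "responses in the clipped region")
print("median guess:", clipped1.median())
clipped1.plot.hist(bins=30)
```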
39:43
So lots of guesses on the left-hand side. So many of you are guessing around, what's that, between five and 15 kilograms. There's a spike at around 15 kilograms, and there's a spike at around 20 kilograms. Now I expected this. If you don't have any evidence to work off, you're gonna probably pick a round number
40:02
that's pretty sensible, just not obviously wrong. So 15, 20, 25, 30: at every round point we're gonna see spikes, these kinds of artificial results. And then some of you are taking some punts much further out onto the larger weights. And if we could look at the raw data, we would have seen guesses, I'm guessing, going up to 100 plus, because it's not unreasonable to have a dog that heavy.
40:21
It's just unlikely. So we've got this skewed distribution with a median guess of 12.8 kilograms. So if we sort all of the guesses in numeric sequence and go halfway along that sequence, the median is at the 50th percentile, and that'll be 12.8 kilograms,
40:41
which is a reasonable guess, not knowing anything in advance. What happens when we introduce some evidence? How do your guesses change? So let's load in the second one. So, 412 guesses in the second case, now with a median of 12. So it turns out you're not all dog fanciers, because you're all wrong,
41:00
or a lot of you are wrong. So here we see this distribution. This is what I wanted to see. This distribution has closed down. By providing more evidence, those of you who would have guessed higher probably have come lower. Those of you who guessed very low might have guessed higher. So the distribution has closed down a bit. So it's still a skewed distribution. There's still a lot of weight on the left-hand side and a longer tail to the right-hand side.
41:22
The median hasn't changed very much, which is kind of interesting, but the spikes that we saw at 15 have disappeared. There's still one at 20, but there are spikes here, just under 10 and just over 10 kilograms, which means you're guessing around nine to 13 kilograms, which isn't crazy at all.
41:41
She is a smaller dog, and if you're not a dog owner, it'd be maybe hard to guess her weight. It turns out she's actually 17 kilograms, but you're not too far out there. Now, one thing we could do if you want to start comparing your results is to take these two individual sequences of numbers and put them together into a data frame, which allows multiple sequences of numbers. So it'd be like an Excel spreadsheet
42:01
with multiple columns. So I combine these two, and then we can look at these. So because there are fewer results in one than the other, one of them has got these missing numbers, and that's fine from a graphing perspective. And then if we just draw these and overlay them, here we can see a simple visual comparison of your before and after guesses. So the blues are the before,
42:21
and then the orangey-red one is after. And so what we see is that the blues are higher as we go to the right-hand side. So more of you were making larger guesses, in particular at those round number points. We see 15, 20, 30, 40, and kind of 55 jumping out. And then once you've got some more information,
42:41
your guesses have come towards the left-hand side, towards the lower numbers, and we see a greater volume of those guesses around the 10 kilogram points, and it's all kind of bunched up in there. So the wisdom of the crowd is kind of working here. You've made sensible guesses, but you're not dog fanciers. This is not some kind of dog competition. So you don't have great information about what the weight of a dog might be.
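The combine-and-overlay step he describes might look like this sketch, reusing the cleaned series from the parsing sketch above (clipped1 and clipped2 are the hypothetical before-and-after guess series):

```python
import pandas as pd
import matplotlib.pyplot as plt

# The two surveys have different lengths, so the shorter column is
# padded with NaN, which the plotting simply ignores.
df = pd.DataFrame({
    "before": clipped1.reset_index(drop=True),
    "after": clipped2.reset_index(drop=True),
})

df.plot.hist(bins=30, alpha=0.5)
plt.xlabel("Guessed weight (kg)")
plt.show()

# Did the extra evidence shrink the spread of the guesses?
print(df.std())
```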
43:01
So you're not spot on. The correct answer is somewhere in here, which is a low point in the result, which is interesting. But the purpose of this was to see that the variance of the result shrunk, rather than exactly where the median or the mean guess might be. So I'm really happy with that demo, and hopefully what you've seen from that is you can take some raw data, draw some graphs,
43:20
make some comparisons and ask some questions, and inevitably raise some new questions which drives you back to the beginning to get more data and draw more graphs so you can go around in a circle. Okay, so it's time to wrap up. So closing thoughts. It's all about collecting data and visualizing it and then sharing your results.
43:41
There's an awful lot of hype about big data, deep learning and the cleverest, smartest next thing coming. But almost all of my work with clients involves finding their data, realizing they haven't got the data they thought they had, fixing it up so that it's useful, drawing graphs, interrogating people about what it actually means, and then providing some results.
44:00
And then iterating, making things slightly more complex, and iterating and iterating. It's all about getting the data and visualizing it. And you've all got access to that data. There are data sets in the appendix. You're very welcome to go and follow those when these slides go online, and then find some data sets if you don't have access to your own data. But working off the data you understand is the right way to go. That domain knowledge is incredibly important
44:21
and only you have the domain knowledge about the data that you've got. I have a request of you. If I've made you think about something new and if you're interested in this topic and if you want to go and make some change around your own environment, I'd love to get a postcard. I've been collecting postcards for the last year. They remind me that these talks actually work. They make people think about what they're doing.
44:41
So I've got a lovely collection of postcards at home. If you would like to send me a postcard, just send me an email. I'll send you my address. I don't care when you send it or where you send it from. I just like getting postcards with nice messages saying, hey, you made me think. So if that's a thing you would like to do, please get in contact. And more importantly, please,
45:01
if you haven't yet thanked an organizer and a speaker here, please go and thank an organizer and a speaker. Many people forget that these are volunteer-run events. The speakers put a lot of time in. The organizers put a lot of time in. And they forget to go and say thank you. So we consume from the ecosystem without contributing back even to say thank you. We're a lovely group here.
45:20
Pythonistas are a very lovely bunch. Please go and thank the people around you for the work that they've put into this. The write-up will be on my blog. Thank you very much.