Lies, damned lies, and statistics
Formal Metadata

Title: Lies, damned lies, and statistics
Title of Series: EuroPython 2018 (part 82 of 132)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/44923 (DOI)
Language: English
Transcript: English (auto-generated)
00:06
Thank you. So there are three types of lies. There are lies, there are big lies, and there are statistics. So as a starting point, please consider the following statement. In the Vatican City, there are 5.88 popes per square mile.
00:25
This is not a lie. This is correct. The number is correct. And I've never been in the Vatican City, but I kind of expect one pope to be there at a time. So something fun. Two popes?
00:42
Well, sometimes it happens, yeah. But in principle, you know, in the kind of in the long term, you expect one pope at a time, even if you don't know much about the Vatican like me. So something fun is going on, and that's the idea for the talk. So we are exposed to
01:00
statistics in everyday life, and not all of us have advanced degrees in maths and whatnot, so this talk will be about the use and misuse and abuse of statistics in everyday life, and essentially how not to lie with statistics. So the idea is we're not talking about Python or any advanced
01:21
statistical modeling or machine learning. We just want to be sort of good citizens and be prepared for when we are exposed to statistics, and we want to understand what's going on. We're not talking about Python, but just out of curiosity, how many of you are Python users at different levels, beginners, experts, being exposed to Python, more or less?
01:45
Almost everybody. A few people too tired to raise their hands, but you know, almost everybody. Everybody feeling okay? Anybody feeling sick? Nobody sick? So there you go. Statistics are telling us that knowing some Python is
02:03
positive for your well-being. So with this statement, there are two problems: one that we'll discuss later, and the other one is kind of the starting point, which is correlation. So, correlation. As an informal
02:21
definition, it's already in the name: a correlation is some sort of relationship, a connection between two things, two events, two variables. A bit more formally, we also want to measure the strength of the relationship, of the association between two variables. When we talk about correlation, the simplest thing that comes to mind is linear correlation.
02:46
It's just easier to visualize, right? Linear correlation, when one variable is increasing, the other variable is either increasing or decreasing, following some sort of line. So you see the line here, therefore linear correlation.
03:01
We talk about positive or negative linear correlation, but the idea is one variable moves and the other variable follows the line. To give you a more concrete example, let's say the temperature goes up, and if you have an ice cream shop, also your revenue will go up, right?
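A minimal sketch in Python of that kind of linear relationship (the numbers below are invented for illustration), using the Pearson correlation coefficient, which runs from -1 to +1:

    # Hypothetical daily temperatures vs. made-up ice cream revenue.
    from scipy.stats import pearsonr

    temperature = [14, 16, 19, 22, 25, 28, 31]     # degrees Celsius
    revenue = [215, 240, 290, 355, 400, 470, 510]  # euros per day

    r, p = pearsonr(temperature, revenue)
    print(f"Pearson r = {r:.2f}")  # close to +1: strong positive linear correlation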
03:21
Nice weather, you sell more ice cream. And the way we look at this, there's kind of a, you know, a cause and effect. Nice weather, therefore we eat more ice cream. But in the general case, that's not always true, right? Maybe you heard the expression, correlation does not imply causation.
03:42
So again on the ice cream example, we can see how there is a correlation between revenue for ice cream sales and the number of people who die drowning. So what's going on here? Is ice cream really the killer? To understand what's going on here, we need to introduce the notion of lurking variable. A lurking variable is
04:05
a variable that we don't really see, but it's there. It's kind of looking at us, so it's lurking. Back to our ice cream example, that would be temperature, of course. So nice weather, more people eat ice cream, so revenue goes up.
04:21
But also nice weather, more people go swimming, and therefore more people die drowning, unfortunately. So there is a third variable here explaining the connection between ice cream and drowning. One more example, often people observe that whenever there is some sort of fire accident, if you deploy more firefighters on the scene, you will also have
04:46
bigger damage. So from a decision-making point of view, it makes sense to say, okay, let's deploy less people, less firefighters, so the damage will be smaller. Of course, you know, big fire means you have to deploy more firefighters, so big fire also
05:03
has a higher chance of causing bigger damage. So that's the idea: there is the fire severity behind the scenes, explaining the relationship. So long story short, if we try to explain correlation and causation, it can be complicated, right? So here I'm kind of summarizing all the options. Either
05:21
there is actually a cause, so A causes B, or the other way around, or maybe the two variables A and B together explain something else, or something else is the cause of A and B, or maybe there is a transitive relationship, A causes something, and something causes B, or maybe there is just no connection between the two variables. A few examples of things that correlate.
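A minimal simulation sketch of the common-cause case, with invented numbers: a lurking variable drives two otherwise unrelated quantities, which then correlate strongly.

    # Temperature (the lurking variable) drives both quantities;
    # neither causes the other, yet they correlate strongly.
    import numpy as np

    rng = np.random.default_rng(42)
    temperature = rng.uniform(10, 35, size=365)
    ice_cream = 20 * temperature + rng.normal(0, 40, size=365)
    drownings = 0.3 * temperature + rng.normal(0, 1.5, size=365)

    print(np.corrcoef(ice_cream, drownings)[0, 1])  # high, with no causal link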
05:46
So the number of movies with Nicolas Cage and the number of people who drown in a pool. So Nicolas Cage, please don't do other movies. The consumption of margarine and the number of murders by a blunt object, so margarine makes you kind of more nervous, more aggressive.
06:09
Facebook, it's quite easy nowadays to blame Facebook for everything, but Facebook, number of users of Facebook and the national debt of Greece, they can go together, so more users of Facebook, bigger problems for Greece.
06:26
Number of users of Microsoft Internet Explorer and murder rate, yeah.
06:40
Again, numbers are true. There is no lie here. Finally, my favorite one is the consumption of chocolate and the number of Nobel Prizes. So you see how, you know, every country is kind of following the line, the more chocolate you consume, the more Nobel Prizes you win, and
07:01
well, there are a couple of outliers, Sweden having more Nobel Prizes than expected, who knows why, and Germany, Germany not very efficient at converting chocolate into Nobel Prizes. So that was correlation.
07:21
Now, moving on to the next topic: what's going on when you analyze data and you sort of slice and dice your data set? This is called Simpson's paradox, first observed and described by somebody not called Simpson, but still we call it Simpson's paradox, and
07:41
I'm going to use a textbook example here to describe the Simpson's paradox. That's from Wikipedia, essentially. If you look at the number of admissions in grad school in the 70s for the University of California and then you group by men and women, you see that there is a difference in the proportion between men being admitted and women.
08:02
So the difference is kind of big enough to ask the question, is there some sort of gender bias going on? Now, the numbers here are correct. If you start digging into the details and you break down the numbers per department, so each line is a different department, A, B, C, D, and so on,
08:22
what you observe is something funny. So you see how for many departments the proportion of women being accepted is actually higher compared to the proportion of men. So these numbers are also correct and they're kind of telling the opposite story. If you look at the absolute numbers, you see how
08:43
men tend to apply for departments with a higher admission rate, and on the other side, women tend to apply for departments with a lower admission rate. So essentially, well, one could say maybe women are applying to more competitive departments.
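A minimal pandas sketch of the paradox, with made-up numbers shaped like the Berkeley example: women do better in each department, yet worse in aggregate, because the groups are skewed across departments.

    import pandas as pd

    df = pd.DataFrame({
        "dept":     ["A", "A", "B", "B"],
        "gender":   ["men", "women", "men", "women"],
        "applied":  [800, 100, 100, 800],
        "admitted": [500, 70, 10, 90],
    })

    # Per department, women have the higher admission rate...
    per_dept = df.set_index(["dept", "gender"])
    print(per_dept["admitted"] / per_dept["applied"])

    # ...but in aggregate men do, because most women applied to the
    # more competitive department B.
    totals = df.groupby("gender")[["applied", "admitted"]].sum()
    print(totals["admitted"] / totals["applied"])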
09:04
Long story short, you will observe this kind of paradox whenever you have a dataset and you kind of slice and dice the dataset and your classes, your groups will not be equally distributed. So the distribution across departments is highly skewed and
09:23
that's why you observe this kind of phenomenon, Simpson's paradox. So all the numbers are correct. If you have some sort of agenda to push, you can choose one or the other, right? The next type of lies here is related to
09:41
sampling bias. So, sampling bias. You know, when I asked, do you know Python? Well, we are at the Python conference, so we kind of expect a lot of people to know Python, and I shouldn't use this information to draw conclusions on a bigger population, right? So, back to the terminology: sampling. The idea of sampling is
10:04
selecting a subset of individuals with the purpose of doing some sort of estimate on a bigger population. Sometimes you cannot do estimates on the full population, right? So you need to build some sort of model and you do sampling. In the age of big data, that's what you have to do.
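A minimal simulation sketch of the danger, with invented numbers: a random sample estimates the population mean well, while a systematically skewed sample does not, regardless of its size.

    import numpy as np

    rng = np.random.default_rng(0)
    income = rng.lognormal(mean=10, sigma=0.5, size=100_000)  # whole population

    random_sample = rng.choice(income, size=1_000)  # unbiased sampling
    wealthiest = np.sort(income)[-20_000:]          # a skewed subgroup
    biased_sample = rng.choice(wealthiest, size=1_000)

    print(income.mean(), random_sample.mean(), biased_sample.mean())
    # The biased estimate is systematically too high: a sampling bias.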
10:23
On the other side, bias. In everyday language, we have maybe a bit of a negative connotation to the word bias. We associate bias with prejudice. In science, maybe there is not explicitly this kind of negative connotation. So a bias is just a systematic error.
10:41
We don't know if the error was on purpose or by accident. So sampling bias is simply an error done during your sampling process. And again, a bit of a textbook example. Dewey defeats Truman. That's Truman, president of the U.S. I'm gonna say 1940 something, 48 maybe. That's in the morning
11:05
when he became president. So he was elected and he's waving a newspaper that says Dewey defeats Truman. So the newspaper says the opposite and you see the guy smiling. So what happened here is that the newspaper put the wrong headline because they ran some sort of survey, a
11:25
phone survey, precisely. They phoned people and they asked, who are you going to vote for? And remember, this is 1948, so not everybody has a phone. So the kind of people with a phone at the time, who were actually readers of the Chicago Tribune, were all
11:42
Republicans essentially and they were voting for Dewey. So the survey was clearly biased. Therefore, the wrong headline. There's a special case of sampling bias. There's also survivorship bias that was mentioned yesterday in the keynote.
12:00
So survivorship bias is when you focus only on the lottery winners and you forget about all the people who bought a ticket but didn't win the lottery. And also when you hear all the stories of success, you know, all the billionaires, Bill Gates, Jobs, and so on, they are all college dropouts. So should you quit studying and become a billionaire? Well, you're old enough to make your own decisions. I
12:27
didn't quit studying and I'm not a billionaire. The next segment is on data visualization. So data visualization in data analytics in general is a very powerful tool.
12:47
You can essentially use one image to describe a complex, a very complex kind of concept. And also as a data analyst, when you're doing data analytics, you still need to use visualization to understand what's going on with your data.
13:04
So here you have, for example, four different data sets and they all share some summary statistics. So the average X, the average Y, they're all the same. The variance will be the same. Some sort of correlation coefficient will be the same. So if you only look at the summary statistics of a data set, maybe you don't fully understand what the data set is about.
13:27
Once you plot it, you will see how these data sets are really very different. And again, this is a bit of a textbook example. But the idea is: data visualization gives you better insights into your data set.
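The textbook example described here is Anscombe's quartet, which ships with seaborn; a minimal sketch to reproduce the point:

    import seaborn as sns

    df = sns.load_dataset("anscombe")  # columns: dataset, x, y
    for name, g in df.groupby("dataset"):
        # Means, variances and correlation are (nearly) identical...
        print(name, round(g.x.mean(), 2), round(g.y.mean(), 2),
              round(g.y.var(), 2), round(g.x.corr(g.y), 2))

    # ...but the four plots look completely different.
    sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2)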
13:41
But also, data visualization is used to communicate to the broader public, right? If there is a complex kind of topic, you can use just an image to communicate. So here there was some sort of court decision, and a newspaper just wanted to showcase how different parties support
14:02
this particular court decision. And you see how the bar for Democrats is much, much higher, almost three times bigger than the others. So it looks like Democrats are very much in favor of this particular decision. But then something funny is going on: the vertical axis is starting not from zero, but from 50. So once you normalize
14:24
everything, so the one on the right is the correct version of the plot, you see how, yes, the bars are different, but the difference is not so huge. So maybe the story on the right seems less interesting from a newspaper point of view.
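A minimal matplotlib sketch of the trick, with made-up percentages: the same bars plotted twice, once with the axis starting at 50 and once from zero.

    import matplotlib.pyplot as plt

    parties = ["Democrats", "Republicans", "Independents"]
    support = [62, 54, 52]  # hypothetical poll numbers

    fig, (misleading, honest) = plt.subplots(1, 2, figsize=(8, 3))
    misleading.bar(parties, support)
    misleading.set_ylim(50, 65)  # truncated axis: differences look huge
    misleading.set_title("Axis from 50")
    honest.bar(parties, support)
    honest.set_ylim(0, 100)      # axis from zero: differences look modest
    honest.set_title("Axis from 0")
    plt.show()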
14:42
More visualization, so guns in the U.S., very hot topic. So in 2005 in Florida, they introduced what is called the Stand Your Ground law. And here you see how when the law is introduced, there is a kind of a drop in this graph that is representing the number of
15:03
murders committed using firearms. Again, something funny going on, right? For some reason the vertical axis starts from 1,000 and goes down. So once you fix the plot, reality is literally upside down.
15:21
Okay, so this was published in the Business Insider. The original visualization was by Reuters. One more example, this is from the Italian public service. And
15:41
essentially, this is a talk show, a political talk show. They did a survey, and they asked whether the government is friends with the lobbies. And of course being friends with the lobbies is bad. So if you don't like the results, you take 44%, which is a big slice of the cake, and you squeeze it into a tiny slice.
16:07
And I always thought, you know, in Italy we are kind of world champions at this. But then I moved to the UK about 10 years ago, and I realized things are not any better anyway. So some political leaflets.
16:22
In the UK, just to give some context, the system is called first-past-the-post. So essentially, the narrative from the main parties is always: don't vote for the small guys, because the vote is going to be wasted; you should vote for us. So it's always kind of a race between two main parties. Here, in a leaflet by the Conservative Party, in blue,
16:44
they say don't waste your vote on UKIP, you should vote for us, because we're going to be ahead anyway. And it's funny how the bar for the Labour Party, which is at 42%, is drawn smaller than the one for the Conservatives, which is at 32%. So it's kind of giving the message that they are ahead.
17:04
But they're not the only ones doing these kinds of little tricks. So this one is from the Lib Dems, for some sort of local election, I think. And you see how the yellow bar for the Lib Dems is kind of catching up with Labour, you know, almost there. We need your help. We need just a couple more votes.
17:26
But then when you normalize it, you see there is a huge difference. So it's kind of like, why bother? Now to complete the picture with all the main parties. So again,
17:40
the story is going to be a race between two horses. These are from Labour, and they say: don't waste your vote on these small yellow guys, vote for us. And it is indeed a race between two horses, but they completely forgot about the Green Party, which is the one competing for that particular
18:02
constituency. So yeah, kind of just to be a little bit politically correct, you see how all the parties are kind of doing the same, the same little tricks. Okay, so that was the idea on data visualization. You can use visualization to kind of convey any kind of message.
18:21
Now for a slightly more advanced kind of topic, statistical significance. Statistical significance is one of the most unfortunate names in science, probably, because in everyday language when we say something is significant, we kind of assume it's also important, right?
18:41
So often statistical significance is used as a synonym for importance, but it's not really the case. So when we talk about statistically significant results, we simply mean that we are kind of sure about the results. So the results are more reliable. They're not by chance. But statistically significant results are not about how big the results are,
19:03
it's not about how important the results are, and it's not about how useful they are. So they're simply statistically significant. So we're just more sure about the results, and that's it. A notion connected to statistical significance is
19:21
the p-values. And so, p-values: when I was a student, it was one of the most confusing topics for me, and I wish I could tell you that now I fully understand p-values, but it's not really the case, and at the time we didn't have Wikipedia.
19:41
Anyway, nowadays, you know, the topic is so confusing that it has its own Wikipedia page on what the p-value is not, so a lot of misunderstandings around the p-values. So I was chatting about the topic earlier with Vincent who gave a presentation, so we were kind of preparing the slides before the presentation, and I know he knows a lot about statistics, so I asked him,
20:04
do you know about p-values? And I could tell he's an expert, because he didn't answer the question. He says, hmm, I'm a Bayesian. Okay, so let's put it on the slide. So even people who know about statistics, like Vincent, don't really have an answer on what the p-value is about.
20:21
So let me try to upset the statisticians in the room. Let's see if I can. So the p-value, as a kind of basic definition, is the probability of observing the results that we get, or more extreme ones, when the null hypothesis is true. So that's the basic definition of p-values.
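A minimal worked sketch of that definition, with a made-up coin experiment: under the null hypothesis that the coin is fair, the p-value is the probability of a result at least as extreme as the one observed.

    from scipy.stats import binomtest

    # Observed: 16 heads out of 20 flips; null hypothesis: a fair coin.
    result = binomtest(16, n=20, p=0.5)  # two-sided by default
    print(result.pvalue)  # ~0.012: unlikely, if the coin were fair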
20:41
Remember, it's about probability, not certainty, and what you see in a scientific publication usually is some sort of threshold, which is arbitrary and usually set to 0.05, so p smaller than 0.05. Other fields might have different standards,
21:01
but what you see more often is 0.05. It means one out of 20, right? That's the idea. So essentially, can we afford to be fooled by randomness every one time out of 20? That's the idea behind the p-values. Connected to the notion of p-values, there's an idea called data dredging.
21:22
So dredging, in real life, is a kind of fishing. And in fact, data dredging is also called data fishing, or p-hacking, to say we're trying to hack the p-values. So what's going on with data dredging? Essentially, the
21:40
conventional way of going about it, you know, you formulate some hypothesis, you collect data, and then you either prove or disprove your hypothesis. In data dredging, you kind of go the other way around. So you have your data, and you look for patterns until something interesting and statistically significant comes up. So you kind of build your hypothesis in retrospective.
22:05
So looking for patterns in your data, you know, it's fine. It's exploratory analysis. So you can sort of understand more about your data set. That's totally fine. But testing your hypothesis on the same data set, that's typically wrong. That's what data dredging is about.
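A minimal simulation sketch of why that is wrong: test enough meaningless hypotheses on pure noise and something "significant" turns up by chance alone, at roughly the advertised one-in-twenty rate.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(7)
    significant = 0
    for _ in range(100):
        a = rng.normal(size=30)  # pure noise
        b = rng.normal(size=30)  # more noise from the same distribution
        if ttest_ind(a, b).pvalue < 0.05:
            significant += 1

    print(significant)  # around 5 out of 100: fooled by randomness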
22:20
Often, it's quite easy to spot. Sometimes it gets through, and you see publications where you might feel like they were going for some fishing, but you're not really sure. So, wrapping up: we have seen a lot of examples in different directions where
22:43
essentially you can use statistics to push your kind of agenda, and it feels like we can't really trust anybody. Well, the purpose of the talk was not to create or prepare the next generation of conspiracy theorists. The purpose was simply to, you know, remind you
23:02
there's a big difference between bigger headlines from the media and, you know, proper science. Anyway, this is something that can affect everybody, so nobody is really immune. So even if you are in good faith, from time to time you stumble upon these problems, and you might
23:21
kind of introduce your own bias. So the point is always try to ask questions, in particular, you know, what is the bigger context? If you're observing something about the study, you know, who's paying for the study? Is there anything that is missing at all? What is the bigger picture? And
23:40
long story short, you know, the best question would be, so what? You observe some data, you observe some numbers, so what are the connections? What are we trying to describe here? Is there anything that we don't see? And so on. So that's the summary of it. That's essentially closing the presentation. The slides are on the speaker deck. They will be around on the
24:04
conference app, on Twitter, usual things, and more links if you want to know what I do. And just to quickly plug PyData London: I'm one of the organizers of PyData London, which was mentioned this morning. So yeah, you will find me there, and you can ask me about PyData London or other PyData chapters around the world.
24:27
Thank you very much. Thank you, Marco. We've got time for a couple of questions if anyone has any questions.
24:47
So hi, thank you for the talk. It was very nice. I didn't quite get why data dredging is bad. I mean, I understand that if I have a hypothesis and I try one data set, it doesn't work, so I can try the next data set. Okay, that is bad. But if I have a very big data set and I just look for interesting patterns and I find one,
25:04
why is it bad looking further into that? So looking for patterns is fine. In fact, that's what we do with a new data set. So during exploratory analysis, you kind of look for patterns. The problem is when you
25:22
kind of assess your hypothesis on the same data, so you do it in retrospect, and that's kind of cheating. Imagine, you know, coming from a machine learning background, in my case: imagine I do training and testing on the same data set. It's kind of similar. It's like cheating, basically.
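A minimal sklearn sketch of that analogy, on synthetic data: scoring on the training set flatters the model, while the held-out test set gives the honest number.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(model.score(X_train, y_train))  # ~1.0: "testing" on seen data
    print(model.score(X_test, y_test))    # lower: the honest estimate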
25:41
But looking for patterns is totally fine. You just shouldn't validate your hypothesis in that way. Okay, thank you for the presentation. And you said what are lies, but then what is true and how to spot the proper analysis?
26:03
Yeah, I think we need a lot of time to discuss what is true and what is not, and it's a bit of a philosophical question, I guess. The title of the talk is taken from, you know, a famous expression. Yeah, that's the problem. There are facts and there are lies, but sometimes there are facts that are presented in a way that is clearly pushing some kind of agenda.
26:30
And I guess the message is you need to be prepared for it, right? I'm not saying everything is fake, right? Sometimes things are kind of
26:41
representing reality just in a way that is packaged for you to kind of go in some particular direction. It's difficult to break down when you get something from the news, it's difficult to break down reality into smaller chunks to fully understand whether
27:01
things are really lies or facts. Still, you know, if you want to be a good citizen, you should make an effort. That's just the message. But I totally agree, you know, it's a deep philosophical question and I guess we need a couple of beers to approach the conversation.
27:23
Alright, any other questions? Got time for maybe one more if anyone's interested. Alright, in that case, thank you very much Marco. Thank you.