I like Big Data and I can not lie!
Formal Metadata
Title: I like Big Data and I can not lie!
Title of Series: re:publica 2016
Part Number: 56
Number of Parts: 188
License: CC Attribution - ShareAlike 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/20837 (DOI)
Language: English
re:publica 2016, part 56 of 188
Transcript: English (auto-generated)
00:23
Hi, and thanks for having me. As you've heard before, I just recently finished my doctorate in economics and social sciences. And part of that was teaching statistics and empirical methods. I've been working before as a business analyst
00:42
in strategy planning and operations analysis. And I love all things digital. So I recently talked to some people, and I found that many people, when dealing with data analysis or big data samples,
01:02
lack the grounding of a basic statistics or social sciences introduction. So, catchy title; you probably have a vague idea of what this could be about. The subtitle was about bias and responsibility. And the first thing I'd like to mention is bias.
01:24
What is a bias? Bias is a form of prejudice, something that happens unconsciously, sometimes consciously, and it's something that we all have. We can't really escape it. It happens to us every day. We think we're unbiased, but we are biased.
01:41
It's that we see patterns where there aren't any patterns or we give more significance to things that, in reality, are just minor occurrences. And that's what bias is, in case you didn't know. Today, I'm going to talk a little bit about responsibility and why I think it's very important that we all think
02:03
about bias and what it means for us in the process of data collection, data analysis, and data interpretation and presentation. Responsibility, because there were several talks about big and small data amounts over the last few days.
02:21
And they can have quite grave implications if we think about how people or how governments try to identify potential terrorists. They do that because they have a lot of data and try to find some variables, some indicators about who
02:41
could be a potential terrorist or where a terrorist attack could happen. And unfortunately, this also means that the wrong people might be targeted and labeled as terrorists. So in order to prevent that and not draw any wrong conclusions, we have a responsibility to think about how we handle data in day-to-day life.
03:05
First thing is data collection. What is data collection? Well, the whole process starts somewhere. So data is not something you find flying around or lying around on the street. It's something that you somehow
03:21
collect from your users, either by a questionnaire or by other means that create data. For example, user data on your home page, how's the user engagement rate, who's clicking what, et cetera, but also health services.
03:43
They collect a ton of data from your height to your weight to whether you smoke or whether you go to the gym, and if so, how often, et cetera, et cetera. So a lot of data is collected on a daily basis. And there are some things that can go wrong when you collect data.
04:01
For example, you pick the wrong sample. What is a sample? A sample is a group, a portion of the data that you're after. Well, you can't collect all the data from everyone yet. Maybe someday in the not too distant future, you can.
04:21
But usually, you have some sort of sample of data available. So for example, you want data about women who go to the gym. You can't ask all women out there about their gym habits.
04:41
Instead, you use a sample. And this sample can be representative of the group you want to learn about, or it can be completely wrong. Samples vary a lot in size, and they vary a lot in significance.
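To make the sampling point concrete, here is a minimal sketch in Python; all the numbers are invented for illustration. Suppose we want the average number of weekly gym visits among women, but we recruit respondents through a fitness app, which oversamples frequent gym-goers:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: weekly gym visits of 100,000 women.
population = rng.poisson(lam=1.0, size=100_000)

# Random sample: every woman is equally likely to be asked.
random_sample = rng.choice(population, size=500, replace=False)

# Convenience sample: recruiting via a fitness app makes frequent
# gym-goers far more likely to end up in the sample.
weights = np.exp(population.astype(float))
weights /= weights.sum()
biased_sample = rng.choice(population, size=500, replace=False, p=weights)

print(f"true mean:          {population.mean():.2f}")
print(f"random sample:      {random_sample.mean():.2f}")   # close to the truth
print(f"convenience sample: {biased_sample.mean():.2f}")   # far too high
```

The convenience sample overshoots the true mean by a factor of two or more, even though none of the individual answers is false: the error comes entirely from who got asked.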
05:04
And all these things start with access. What do I mean by access? If you collect data, especially if you do a questionnaire, and you want to ask your users things, you want information from them, and they are supposed to give that to you for free,
05:22
then they need to be able to access that questionnaire somehow. They have to leave their answers. They need to understand the question. So that's the first thing: you need to have questions that are simple, and people need to be able to understand them. They need to know what you mean by them.
05:41
They need to speak the language. They need to be able to read it, read the font. They need to be able to give truthful data to you. For other things, for automatically generated data, you need to have access to that as well. So if you want to measure user engagement,
06:01
you need to have some sort of tracker that measures that. And the tracker needs to be accurate and measure all the ways people can engage with your site, not only a small fraction of them. So access is usually the first point
06:21
where people can go wrong. The second point is trust. If people don't trust you, they will likely give you false information. Why is that? If you collect data and you are very untrustworthy, or you force people to give you
06:40
data or some sort of information, people will just say whatever to make it go away. So you need to be trustworthy, or at least seem trustworthy, in order for people to give you actually truthful information. But even if you manage to do all that, you have proper access and you're trustworthy,
07:04
it doesn't really tell you anything about the quality of the data. Because people, especially if you ask them outright, usually tend to be biased as well. As I said, it affects everyone. And people have a tendency to want to fit into the norm.
07:22
So for example, if the first thing in a questionnaire is asking the gender of a person, they are much more likely to answer in the stereotype that this gender represents. Same goes for age, race, religion, country of birth,
07:42
et cetera. Because if you remind them at the very start of the questionnaire of their general identity, they are subconsciously reminded that there is a certain norm attached to it. And they are going to be more likely to reply to everything you ask them in a manner
08:02
that they think is socially expected of them, in order to fit the general assumption. That's something many, many people forget. And it highly affects the quality of your data. So be careful to watch out for that. For big data that's generated automatically,
08:23
you usually have these problems to a smaller degree. What happens here instead is that the data analysis tends to go wrong or has far more problems. The first thing people often do is some number crunching.
08:41
They have data available, and they think, wow, I found this interesting fact. And they confuse causality with correlation. Correlation, for example: the first thing we had back in a PhD workshop on statistics, an introduction to statistics, was
09:01
that if you put the average penis size for a country next to the average income, it correlates for some countries. But this doesn't actually tell you anything about causality.
09:21
Because if you said this was a cause, you would be saying that a bigger penis leads to higher economic income. And you would probably need some other studies or theories to back that up. So that's something that correlates, but it's not a cause, not a causal effect.
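Here is a minimal sketch of how easily such a correlation appears between quantities that have nothing to do with each other; the two series below are invented and share nothing but an upward trend over time:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
years = np.arange(2000, 2016)

# Two made-up series that both simply grow over the years.
ice_cream_sales = 100 + 5 * (years - 2000) + rng.normal(0, 3, len(years))
phd_graduates = 2000 + 90 * (years - 2000) + rng.normal(0, 60, len(years))

r, p = pearsonr(ice_cream_sales, phd_graduates)
print(f"r = {r:.2f}, p = {p:.4f}")
# r comes out close to 1 and "significant" -- yet neither causes the other;
# a shared trend over time drives both.
```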
09:44
But even if you found something that's correlating, you don't always know if it's significant. What does significance mean? In statistics, it's something where you say, usually with a 5% / 95% threshold, that something is 95% likely
10:05
to be a truthful correlation. So it's a number that gives you some indication of whether the correlation you found while number crunching, or maybe even while looking for it,
10:21
is actually significant. Or if you say, for example, that 55% of the women at re:publica went to see Laurie Penny's talk this morning and only 38% of the men did, then you would have to calculate the statistical significance
10:43
of that difference in order to tell whether it's really relevant or just a coincidence.
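Here is a minimal sketch of such a test, a two-proportion z-test written out by hand. The attendance counts are invented, since the talk only mentions the percentages; the point is that significance depends on the sample sizes, not just on 55% versus 38%:

```python
import math
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """Z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))  # two-sided p-value

# Hypothetical counts: 55 of 100 women, 38 of 100 men saw the talk.
z, p = two_proportion_ztest(55, 100, 38, 100)
print(f"n=100 per group: z = {z:.2f}, p = {p:.4f}")  # p < 0.05, significant

# A similar split with only 20 people per group (11 vs. 8):
z, p = two_proportion_ztest(11, 20, 8, 20)
print(f"n=20 per group:  z = {z:.2f}, p = {p:.4f}")  # not significant
```

With 100 people per group the difference is significant (p around 0.02); with 20 per group the very same percentages could easily be coincidence.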
11:02
Another thing is variables. For example, if you have ever read a women's magazine, then you certainly know of the body mass index. People use that as an indicator, a variable, for body health, the relation of height and weight. It was invented, I think, in the 19th century. Studies have shown again and again
11:21
that it has absolutely nothing to say about the health of a person. It's still used by health insurance companies because it's easy for them to use. But if you look at Dwayne Johnson, otherwise known as The Rock, or maybe Arnold Schwarzenegger, they have huge bodies with lots of muscle.
11:42
Obviously, they're very heavy. And if you calculated their body mass index, it would probably tell you that they are highly overweight, adipose, et cetera. But obviously, that's not true.
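A quick back-of-the-envelope check makes this concrete; the height and weight here are rough publicly quoted figures, so treat them as approximations:

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2

def who_category(value):
    """Standard WHO adult BMI categories."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"

# Roughly quoted figures for Dwayne "The Rock" Johnson: 1.96 m, 118 kg.
value = bmi(118, 1.96)
print(f"BMI = {value:.1f} -> {who_category(value)}")
# BMI = 30.7 -> "obese", although most of that weight is muscle.
```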
12:00
So how you construct your variables, and the way you use your data in order to get truthful information, is very important. And this can be misleading. It starts at the very beginning, when you design your data analysis and your data collection. And if you're biased, if you think the body mass index is very truthful,
12:23
then what you're going to calculate from your data sets will probably be misleading as well. But even if you did everything right, you picked a perfect sample that's actually representative of your group and you found a correlation that's actually significant,
12:42
even if you did all that right, still, when it comes to interpretation, it can still go wrong. Because what happens often is that people look at the data without the context and they generalize things. They say, well, in this specific issue, this told me something, so I'm extrapolating that
13:02
and saying everything will be related to this, and this correlation means that this will stay this way forever and it always has been this way, because the data doesn't lie. But in fact, data lies all the time.
13:20
Well, there's a saying that goes: you shouldn't trust a statistic that you didn't make yourself. And I think that's very true, because the way you show a graph, for example, or the way you select which variables go into your calculation, everything that you put into it
13:43
means that you can generate very misleading results, especially if it's not relevant for your business or the problem you're trying to solve with data. Because usually, if you do some data analysis, there's some sort of question
14:01
you want to have answered, or some sort of problem that you want to solve, whether it's "give me a financial forecast for the next two years" or "tell me which health groups are most likely to suffer a heart attack", or whatever. So don't overestimate the relevance of your data
14:23
without the proper context. Because data being used without context, well, I didn't bring my notebook up front, but in preparation I thought about an example of how I could convey what it means to use data without context.
14:42
And I thought using data analysis without context is like unprotected sex. It yields very unexpected results and it's always dangerous. So don't be the very dangerous person
15:01
who's going around without protection. If you have data and you've analysed everything, use some proper context and try to find out what your data analysis really means. So, I mean, I've told you a lot about statistics and social research methods and approaches,
15:23
and probably to some of you it's not new. So why did I do this talk? This is where I'm returning to bias, because if you look at who's handling data nowadays, and who's handling the data collection, it's often people who build databases,
15:42
developers, engineers, these sorts of people, and they usually don't have much knowledge of statistics or social research, and I think that's very dangerous
16:00
because that means that people just operate in a vacuum, where they can't really put all the data they collect into a proper context, and they don't really know what could go wrong. I mean, for example, if you have a multiple-choice questionnaire,
16:21
the number of choices you put there influences the results. The order of the questions you ask influences the results. The way you phrase a question influences the results, et cetera, et cetera, and there are decades of research on this.
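One small, practical countermeasure, sketched here under the assumption of a simple survey backend: randomize the order of the answer options per respondent, so that order effects average out across the sample instead of nudging everyone the same way. (This applies to unordered options; ordered scales are usually reversed rather than shuffled.)

```python
import random

# Hypothetical multiple-choice question with unordered options.
OPTIONS = ["politics", "technology", "culture", "science", "education"]

def options_for_respondent(respondent_id):
    """Return the options in a per-respondent random order, seeded by the
    respondent ID so the same person always sees the same order."""
    shuffled = OPTIONS.copy()
    random.Random(respondent_id).shuffle(shuffled)
    return shuffled

for rid in (101, 102):
    print(rid, options_for_respondent(rid))
```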
16:41
So all the people who are now facing the internet, building these huge databases and collecting these incredibly huge amounts of data, or maybe small amounts of data, they usually lack this introduction to social research and statistics and its dangers. So I think it's very important to address bias,
17:02
and how do you do that? Through diversity, of course, but not only that. Diversity means that, first of all, you get not only engineers into your team; if you build something like that, you bring in other people from other functions, cross-functional support,
17:21
but also if you want to address racial bias, gender bias, age bias, then you should make sure that you have a vast representation of these types. So an all-male team, they will probably overestimate some things that are related to men
17:41
in relevance or significance or whatever. And a very young team might completely forget that there's a whole generation of users out there, of people who are very old, who take much more time answering questions, who need a bigger font size, et cetera, et cetera.
18:02
So diversity is very important and then also, of course, when it comes to discussing the results, when you think about the context again, get people on board who are from diverse functions, who are diverse in the team setup you have and discuss the correlations or variables you found
18:24
and discuss with them what it means to them and get their input so that you can learn different viewpoints and don't make the mistake of generalising things that you maybe expected to find. I'm sorry, I was a little nervous because the stage is so big so I talked very fast,
18:44
which leads me already to the last slide. I'm sorry for the terrible puns, by the way. Baby got backbone, of course, another pun relating to the song that I quoted. This is my last slide and it means be authentic,
19:03
be truthful, and be responsible in what you do. Have a strong backbone: if someone wants you to find a correlation, if someone wants you to find certain data that supports a very specific expectation, find your inner backbone and look for it, but be open,
19:24
and don't try to force statistical significance onto numbers that don't really deliver it. And since I've finished so early, I think it would be okay to take some questions if there are any. Otherwise, if you don't have any questions,
19:42
I'll be running around a little bit the rest of the day. I'm here, come find me, chat with me. But yes, if there are any questions, I'm happy to talk to you now. Thank you.
20:02
Hello, my name is Michael. I am one of those developers that you're talking about. Hi. I did mathematics at school, but we had the choice between calculus and statistics, and my teacher chose calculus for us. Yeah, that's... I wish he'd chosen statistics. So the question is, is there any hope? Like, all these developers that you know
20:22
who make all these mistakes, do you have any success stories of people who've maybe taken some course online on statistics and now they're really good? Is there help? Yes, well, I am naturally a very optimistic person. So I'm biased, ha ha ha ha, another great pun.
20:42
Sorry about that. No, I do think there's hope, actually. I mean, of course, it's a structural thing. So first of all, you need to be aware that there might be a problem, which is why I gave this talk today. So I'm hoping that maybe people who are leading teams
21:02
who are either building databases or building tools to collect data, that they may think about this, talk about this, and then start thinking about how to change this. And I think a simple way would be to get introductory courses to statistics, for example,
21:21
or social research. These things, I mean, these problems have been discussed for several decades now, so it should be rather easy to give people an introduction to concepts like validity and reliability of data, how to reproduce findings, et cetera.
21:42
So I do think there's hope, but of course, you have to go out there and talk about it and chat with your colleagues and maybe do a little work yourself to get into the topic again. Are there any other questions?
22:10
Hey, Mina, I'm Laura. I was wondering, confronting people with their biases very often makes them uncomfortable. What's your approach to actually getting people to
22:22
think about their biases instead of just pushing back and saying, no, I'm objective, I know it? Yeah, I think the first step is acknowledging that everyone is biased, me included, obviously. It maybe seems a little pastoral,
22:41
standing up here and preaching to you about diversity and being responsible and not biased, but it's a process; you usually can't help being biased, and I think people need to learn that it occurs naturally. And if you have a diverse team,
23:01
you're confronted with your own bias on an everyday basis, because your colleague who is 30 years older, or your female colleague, or your trans colleague, they might have very, very different experiences from you and a very different viewpoint on things, and they will usually confront you about it,
23:22
or maybe you don't really know that it's coming from those differences, but you learn that not everybody thinks the same, so you're challenged in a way, and I think that's a good way for people to learn that you're biased, but that it's okay to be biased as long as you acknowledge it and work on it.
23:44
Are there any other questions? We still have a couple of minutes, so. Well, I can set a record for being the first to cut the time short. Are you, were you trying to, no. Grab the chance.
24:02
It's not often the case that we have this much time for discussion, or, yeah, I'm sure that Jasmina will be here. I'm here, and I'm really happy to talk to any engineers, any people who are leading teams or analysis teams. Come find me today. Have a beer with me or something else, a coffee,
24:23
and let's chat, and thank you so much for your patience and your time, and it was a pleasure talking to you. Thank you. Thank you, Jasmina. Thank you.