
I like Big Data and I can not lie!


Formal Metadata

Title
I like Big Data and I can not lie!
Part Number
56
Number of Parts
188
License
CC Attribution - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Data analysis can be fun – and horrible all at the same time. So here's a perspective from a network researcher, sociologist and former business analyst on how to improve our daily approach to data. What traps can be avoided? How do we know when we're biased? Is there such a thing as "good"/"bad" data? Let's talk, discuss and maybe change our approach. The talk will cover some foundations: what's a bias – and how do our biases get reflected in our data collection, analysis and interpretation? The way we tackle our own biases with regards – but not limited – to gender, race, social origin, abilities, nationality and other factors shapes not only the quality of data collected, but also directly the outcome of data analysis and interpretation!
Transcript: English (auto-generated)
Hi, and thanks for having me. As you've heard before, I just recently finished my doctorate in economics and social sciences, and part of that was teaching statistics and empirical methods. Before that, I worked as a business analyst
in strategy planning and operations analysis. And I love all things digital. I recently talked to some people and found that many of them, when dealing with data analysis or big data samples,
lack a basic introduction to statistics or social science methods. So, catchy title; you probably have a vague idea of what this could be about. The subtitle was about bias and responsibility. And the first thing I'd like to mention is bias.
What is a bias? Bias is a form of prejudice, something that happens unconsciously, sometimes consciously, and it's something that we all have. We can't really escape it. It happens to us every day. We think we're unbiased, but we are biased.
It's that we see patterns where there aren't any patterns or we give more significance to things that, in reality, are just minor occurrences. And that's what bias is, in case you didn't know. Today, I'm going to talk a little bit about responsibility and why I think it's very important that we all think
about bias and what it means for us in the process of data collection, data analysis, and data interpretation and presentation. Responsibility, because there were several talks about big and small data amounts over the last few days.
And they can have quite grave implications if we think about how people or how governments try to identify potential terrorists. They do that because they have a lot of data and try to find some variables, some indicators about who
could be a potential terrorist or where a terrorist attack could happen. And unfortunately, this also means that the wrong people might be targeted and labeled as terrorists. So in order to prevent that and not draw any wrong conclusions, we have a responsibility to think about how we handle data in day-to-day life.
First thing is data collection. What is data collection? Well, the whole process starts somewhere. So data is not something you find flying around or lying around on the street. It's something that you somehow
collect from your users, either by a questionnaire or by other means that create data. For example, user data on your home page, how's the user engagement rate, who's clicking what, et cetera, but also health services.
They collect a ton of data from your height to your weight to whether you smoke or whether you go to the gym, and if so, how often, et cetera, et cetera. So a lot of data is collected on a daily basis. And there are some things that can go wrong when you collect data.
For example, you pick the wrong sample. What is a sample? A sample is a subset of data about the group you want to study. Well, you can't collect all the data from everyone yet; maybe someday in the not-too-distant future you can.
But usually, you have some sort of sample of data available. So for example, you want data about women who go to the gym. You can't ask all women out there about their gym habits.
Instead, you use a sample. And this sample can be representative of the group you want to learn about, or it can be completely wrong. Samples vary a lot in size, and they vary a lot in significance.
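To make the sampling pitfall concrete, here is a minimal sketch in Python; the population, its values and the sample sizes are invented for illustration. A convenience sample that only reaches women who are already at the gym will overestimate how often women go:

```python
import random
from statistics import mean

random.seed(1)
# Hypothetical population: weekly gym visits of 100,000 women.
population = [random.choice([0, 0, 0, 1, 1, 2, 3, 5]) for _ in range(100_000)]

# Random sample: every woman has the same chance of being asked.
random_sample = random.sample(population, 500)

# Convenience sample: we only ask women we meet at the gym, so everyone
# who never goes (value 0) is invisible to us.
convenience_sample = [v for v in population if v > 0][:500]

print(f"population mean:         {mean(population):.2f}")
print(f"random sample mean:      {mean(random_sample):.2f}")       # close to the truth
print(f"convenience sample mean: {mean(convenience_sample):.2f}")  # far too high
```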
And all these things start with access. What do I mean by access? If you collect data, especially if you run a questionnaire where you ask your users things and want information from them, and they are supposed to give that to you for free,
then they need to be able to access that questionnaire somehow. They have to leave their answers. They need to understand the questions. So that's the first thing: you need questions that are simple, and people need to be able to understand them. They need to know what you mean by them.
They need to speak the language. They need to be able to read it, to read the font. They need to be able to give truthful data to you. For other things, for automatically generated data, you need access as well. So if you want to measure user engagement,
you need some sort of tracker that measures it. And the tracker needs to be accurate and capture all the ways people can engage with your site, not only a small fraction of them. So access is usually the first point
where people can go wrong. The second point is trust. If people don't trust you, they will likely give you false information. Why is that? If you collect data and you are very untrustworthy, or you force people to give you
data or some sort of information, people will just say whatever to make it go away. So you need to be trustworthy, or at least seem trustworthy, in order for people to give you actually truthful information. But even if you manage to do all that, you have proper access and you're trustworthy,
it doesn't really tell you anything about the quality of the data. Because people, especially if you ask them outright, usually tend to be biased as well. As I said, it affects everyone. And people have a tendency to want to fit into the norm.
So for example, if the first thing in a questionnaire is asking the gender of a person, they are much more likely to answer in the stereotype that this gender represents. Same goes for age, race, religion, country of birth,
et cetera. Because if you remind them at the very start of the questionnaire of their group identity, they are subconsciously reminded that there is a certain norm attached to it. And they are going to be more likely to reply to everything you ask them in a manner
that they think is socially expected of them, in order to fit the general assumption. So that's something many, many people forget, and it highly affects the quality of your data. So be careful to watch out for that. For big data that is generated automatically,
you usually have these problems to a smaller degree. What happens here instead is that the data analysis tends to go wrong, or has far more problems. The first thing people often do is some number crunching.
They have data available, and they think: wow, I found this interesting fact. And they confuse causality with correlation. Correlation, for example: the first thing we covered back in a PhD workshop on introductory statistics was
that if you put the average penis size for a certain country next to the average income, it correlates for some countries. But this doesn't actually tell you anything about causality.
Because if you called this a cause, you would be saying that a bigger penis leads to a higher income, and you would need other studies or theories to back that up. So it's something that correlates, but it's not a causal effect.
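A minimal sketch of this point, with invented country-level averages (shoe size standing in for the speaker's example): two series can correlate almost perfectly without any causal link between them.

```python
from statistics import correlation  # Python 3.10+

# Invented country-level averages; any resemblance to real data is accidental.
avg_income = [28_000, 34_000, 41_000, 47_000, 52_000, 60_000]  # EUR per year
avg_shoe_size = [40.1, 40.8, 41.5, 42.0, 42.4, 43.0]           # EU sizes

r = correlation(avg_income, avg_shoe_size)
print(f"Pearson r = {r:.2f}")
# r is close to 1.0 here, yet bigger feet don't cause higher incomes;
# a confounder (say, average body size or nutrition) could drive both,
# or the match could be pure coincidence.
```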
But even if you found something that's correlating, you don't always know if it's significant. What does significance mean? In statistics, it's usually quoted as a 5% significance level, or 95% confidence: an indication of how unlikely it would be to see such a correlation by pure chance. So it's a number that tells you whether the correlation you found while number crunching, or maybe even while looking for it, is actually meaningful. Or if you say, for example, that 55% of the women at re:publica went to see Laurie Penny's talk this morning and only 38% of the men did, then you would have to calculate the statistical significance of that difference in order to tell whether it's really relevant or just a coincidence.
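A sketch of how such a significance calculation could look, as a two-proportion z-test in plain Python. The head counts below are invented; only the 55% and 38% shares come from the talk.

```python
from math import sqrt, erf

def two_proportion_z_test(hits1: int, n1: int, hits2: int, n2: int):
    """Two-sided z-test: is the difference between two shares significant?"""
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: 110 of 200 women (55%) vs 76 of 200 men (38%).
z, p = two_proportion_z_test(110, 200, 76, 200)
print(f"z = {z:.2f}, p = {p:.4f}")  # p < 0.05, significant at the usual 5% level
```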
Another thing is variables. If you have ever read any women's magazine, then you certainly know of the body mass index. People use it as an indicator or variable for body health, a relation of height and weight. It was, I think, invented in the 19th century. Studies have shown again and again
that it has absolutely nothing to say about the health of a person. It's still being used by health insurance companies because it's easier for them. But if you look at, for example, Dwayne Johnson, otherwise known as The Rock, or maybe Arnold Schwarzenegger: they have huge bodies with lots of muscle.
Obviously, they're very heavy. And if you calculated their body mass index, it would probably tell you that they are highly overweight, adipose, et cetera. But obviously, that's not true.
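For illustration, BMI is simply weight divided by height squared; the figures below are rough public estimates, not verified data.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres, squared."""
    return weight_kg / height_m ** 2

# Rough public estimates for Dwayne "The Rock" Johnson: ~118 kg at ~1.96 m.
print(f"BMI = {bmi(118, 1.96):.1f}")
# ~30.7, which the usual cutoffs label "obese", even though most of that
# weight is muscle, not fat.
```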
So the way you construct your variables and the way you use your data in order to get truthful information is very important, and it can be misleading. This starts at the very beginning, when you design the concept for your data analysis and data collection. And if you're biased, if you think the body mass index is very truthful,
then probably what you're gonna calculate from your data sets will be misleading as well. But even if you did everything right, you picked a perfect sample that's actually representative for your group and you found a correlation that's actually significant,
even if you did all that right, still, when it comes to interpretation, it can still go wrong. Because what happens often is that people look at the data without the context and they generalize things. They say, well, in this specific issue, this told me something, so I'm extrapolating that
and saying everything will be related to this, and that this correlation means it will stay this way forever and has always been this way, because the data doesn't lie. But in fact, data lies all the time.
Well, there's a saying that goes: you shouldn't trust a statistic you didn't make yourself. And I think that's very true, because the way you show a graph, for example, or the way you select which variables go into your calculation, everything you put into it, means you can generate very misleading results. Especially if it's not relevant for your business or the problem you're trying to solve with data. Because usually, if you do some data analysis, there's some sort of question you want answered or some sort of problem you want to solve, whether it's "give me a financial forecast for the next two years" or "tell me which health groups are most likely to suffer a heart attack" or whatever. So don't overestimate the relevance of data
without the proper context. Well, I didn't bring my notebook up here with me, but in preparation I thought about an example of how to describe what it means to use data without context.
And I thought using data analysis without context is like unprotected sex. It yields very unexpected results and it's always dangerous. So don't be the very dangerous person
who's going around without protection. If you have data and you've analysed everything, add the proper context and try to find out what your data analysis really means. So, I mean, I've told you a lot of things about statistics and social research methods and approaches,
and probably to some of you it's not new. Why did I do this talk? And this is where I'm returning to the bias because if you look at who's nowadays handling data and also who's handling the data collection, it's often people who build databases,
developers, engineers, these sorts of people, and they usually don't have much knowledge of statistics or social research, and I think that's very dangerous,
because it means that people operate in a vacuum where they can't really put all the data they collect into a proper context, and they don't really know what could go wrong. I mean, for example, if you have a multiple-choice questionnaire,
the number of choices you offer influences the results. The order of the questions you ask influences the results. The way you phrase a question influences the results, et cetera, et cetera. And there are decades of research on this.
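One common mitigation, sketched below with hypothetical questions: randomise the question order, and the option order where the options have no natural ranking, independently per respondent, so order effects average out rather than skewing every answer the same way.

```python
import random

# Hypothetical questionnaire: question -> answer options. Shuffle only
# non-ordinal options; ordered scales like "never ... daily" keep their order.
QUESTIONS = {
    "Which feature do you use most?": ["search", "feed", "messages", "profile"],
    "How did you hear about us?": ["friends", "press", "social media", "ads"],
}

def randomized_questionnaire(questions: dict[str, list[str]]):
    """Return a freshly shuffled copy of the questionnaire for one respondent."""
    items = list(questions.items())
    random.shuffle(items)  # randomise the order of the questions
    return [(q, random.sample(opts, len(opts)))  # and of each option list
            for q, opts in items]

for question, options in randomized_questionnaire(QUESTIONS):
    print(question, options)
```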
All the people who are now facing the internet, building these huge databases and collecting these incredibly huge amounts of data, or maybe small amounts of data, usually lack this introduction to social research and statistics and their dangers. So I think it's very important to address bias
and how do you do that? Through diversity, of course, but not only. Diversity means that, first of all, you get not only engineers into your team; if you build something like that, you bring in people from other functions, cross-functional support.
But also, if you want to address racial bias, gender bias, age bias, then you should make sure you have a broad representation of these groups. An all-male team will probably overestimate the relevance or significance of things that are related to men.
And a very young team might completely forget that there's a whole generation of much older users out there, who take much more time answering questions, who need a bigger font size, et cetera, et cetera.
So diversity is very important. And then also, of course, when it comes to discussing the results, when you think about the context again: get people on board who come from diverse functions and who are diverse within your team setup, discuss the correlations or variables you found,
discuss what they mean to them, and get their input, so that you can learn different viewpoints and don't make the mistake of generalising things you maybe expected to find. I'm sorry, I was a little nervous because the stage is so big, so I talked very fast,
which leads me already to the last slide. I'm sorry for the terrible puns, by the way. Baby got backbone, of course, another pun relating to the song that I quoted. This is my last slide and it means be authentic,
be truthful and be responsible in what you do. Have a strong backbone: if someone wants you to find a correlation, if someone wants you to find certain data that underpins a very specific expectation, find your inner backbone and look for it, but be open
and don't try to force statistical significance onto numbers that don't really deliver it. And since I've finished so early, I think it would be okay to take some questions if there are any. Otherwise, if you don't have any questions,
I'll be running around a little bit the rest of the day. I'm here, come find me, chat with me. But yes, if there are any questions, I'm happy to talk to you now. Thank you.
Hello, my name is Michael. I am one of those developers that you're talking about. Hi. I did mathematics at school, but we had the choice between calculus and statistics, and my teacher chose calculus for us. Yeah, that's... I wish he'd chosen statistics. So the question is: is there any hope? Like, all these developers that you know
who make all these mistakes, do you have any success stories of people who've maybe taken some course online on statistics and now they're really good? Is there help? Yes, well, I am naturally a very optimistic person. So I'm biased, ha ha ha ha, another great pun.
Sorry about that. No, I do think there's hope, actually. I mean, of course, it's a structural thing. So first of all, you need to be aware that there might be a problem, which is why I gave this talk today. So I'm hoping that maybe people who are leading teams
who are either building databases or building tools to collect data, that they may think about this, talk about this, and then start thinking about how to change this. And I think a simple way would be to get introductory courses to statistics, for example,
or social research. These things, I mean, these problems, have been discussed for several decades now, so it should be rather easy to give people an introduction to concepts like validity, reliability of data, how to reproduce results, et cetera.
So I do think there's hope, but of course, you have to go out there and talk about it and chat with your colleagues and maybe do a little work yourself to get into the topic again. Are there any other questions?
Hey, Mina, I'm Laura. I was wondering: confronting people with their biases very often makes them uncomfortable. What's your approach to actually getting people to
think about their biases instead of just pushing back and saying, no, I'm objective, I know it? Yeah, I think the first step is acknowledging that everyone is biased, me included, obviously. It may seem a little pastoral,
standing up here and preaching to you about diversity and being responsible and not biased, but it's a process: you usually can't help being biased, and I think people need to learn that it occurs naturally. And if you have a diverse team,
you're confronted with your own bias on an everyday basis, because your colleague who is 30 years older, or your female colleague, or your trans colleague might have very, very different experiences from you and a very different viewpoint on things, and they will usually confront you about it,
or maybe you don't really know that it's coming from those differences, but you learn that not everybody thinks the same, so you're challenged in a way, and I think that's a good way for people to learn that you're biased, but that it's okay to be biased as long as you acknowledge it and work on it.
Are there any other questions? We still have a couple of minutes, so. Well, I can set a record for being the first to cut the time short. Were you trying to... no? Grab the chance.
It's not often the case that we have this much time for discussion, or, yeah, I'm sure that Jasmina will be here. I'm here, and I'm really happy to talk to any engineers, any people who are leading teams or analysis teams. Come find me today. Have a beer with me or something else, a coffee,
and let's chat, and thank you so much for your patience and your time, and it was a pleasure talking to you. Thank you. Thank you, Jasmina. Thank you.