I like Big Data and I can not lie!

Automated media analysis (beta)

Speech transcript
Hi, and thanks for having me. I recently finished my doctorate in economics and social sciences, and part of that was teaching statistics and empirical methods. I have also been working as a business analyst in strategy planning and operations analysis, and I love all things digital.

I recently talked to some people and found that many of them, when they are dealing with data analysis, big data or even small samples, lack an education in the common foundations of statistics and the social sciences. So, catchy title: you probably have a vague idea what this could be about. The subtitle was about bias and responsibility, and the first thing I'd like to address is bias. What is a bias? A bias is something that happens unconsciously, sometimes consciously, and it is something that we all have. We can't really escape it, and it happens every day. We think we're unbiased, but we are biased: we see patterns where there aren't any, or we give more significance to things that are in reality just minor occurrences. That is what bias is.

Today I'm also going to talk a little bit about responsibility, and about why I think it's very important that we all think about bias and what it means for us in the process of data collection, data analysis, and data interpretation and presentation.
Responsibility, because there were several talks about big and small amounts of data over the last few days, and data can have quite grave implications. Think about how governments try to identify potential terrorists: they have a lot of data and try to find some variables, some indicators, of who could be a potential terrorist or where a terrorist attack could happen, and so on. Unfortunately, this also means that the wrong people might be targeted and labeled as terrorists. So in order to prevent that and not draw any wrong conclusions, we have a responsibility to think about how we handle data.

The first thing is data collection. What is data collection? Well, the whole process starts somewhere. Data is not something you find flying around or lying around on the street; it is something that you somehow collect from your users, either by a questionnaire or by other means that create data, for example the user data on your homepage: how high the user engagement rate is, who is clicking what, et cetera. Health services also collect a ton of data, from your height to your weight to whether you smoke or whether you go to the gym and, if so, how often. So a lot of data is collected on a daily basis, and there are some things that can go wrong when you collect data. For example, you can pick the wrong sample.

What is a sample? A sample is a group, a subset of the data, because you can't collect all data from everyone. Maybe some day in the not-too-distant future you can, but usually you have some sort of sample of the data available. For example, you want to have data about women who go to the gym. You can't ask all the women out there about their gym habits, so you use a sample. This sample can be representative of the group that you want to learn about, or it can be completely wrong. Samples vary a lot in size and in significance, and all these things start with access. What do I mean by access?
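The point about representative and unrepresentative samples can be sketched with a small simulation. This is only an illustration under made-up assumptions: the population of 10,000 women, the distribution of weekly gym visits, and the idea of surveying only at the gym entrance are all hypothetical and not taken from the talk.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: weekly gym visits of 10,000 women (made-up distribution).
population = [random.choice([0, 0, 0, 0, 1, 1, 2, 3, 4]) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Random sample: every woman has the same chance of being asked.
random_sample = random.sample(population, 500)

# Convenience sample: only women met at the gym entrance,
# so nobody with zero weekly visits can ever be included.
gym_goers = [visits for visits in population if visits >= 1]
convenience_sample = random.sample(gym_goers, 500)

random_mean = statistics.mean(random_sample)
convenience_mean = statistics.mean(convenience_sample)

print(f"population mean:         {true_mean:.2f}")
print(f"random sample mean:      {random_mean:.2f}")
print(f"convenience sample mean: {convenience_mean:.2f}")
```

The random sample should land close to the population mean, while the convenience sample overestimates it, simply because of who had access to the survey.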
If you collect data, especially if you use a questionnaire, you want to ask your users things, you want information from them, and they are supposed to give it to you for free. Then they need to have access to that questionnaire somehow; they have to be able to leave their answers. They also need to understand the questions. So the first thing is that you need questions that are simple and that people are able to understand. They need to know what you mean, they need to speak the language, they need to be able to read the form, and they need to be able to give truthful data to you. For other things, for automatically generated data, you need access as well. If you want to measure user engagement, you need some sort of tracker that measures it, and the tracker needs to be accurate and measure all the ways in which people can engage with your site, not only a small fraction of them. So that is usually the first point where people can go wrong.

The second point is trust. If people don't trust you, they will likely give you false information. Why is that? If you collect data and you are very untrustworthy, or you force people to give you some sort of information, people will just say whatever to make it go away. So you need to be trustworthy, or at least seem trustworthy, in order for people to give you truthful information.

But even if you manage to do all that, to have proper access and be trustworthy, it doesn't really tell you anything about the quality of the data, because people, especially if you ask them outright, usually tend to be biased as well. As I said, it affects everyone. People have the tendency to want to fit into the norm. For example, if the first thing in the questionnaire asks for the gender of the person, they are much more likely to answer in line with the stereotype attached to that gender. The same goes for age, race, religion, country of birth, et cetera. If you remind people of their identity at the very start of the questionnaire, they are subconsciously reminded that there is a certain norm attached to it, and they are more likely to apply that to everything and to answer in a manner that they think is socially accepted, in order to fit the general assumptions. That is something many, many people forget, and it strongly affects the quality of your data, so be careful to watch out for that. For big data,
things that are generated automatically, you usually have these problems to a smaller degree. Next is data analysis, which can also go wrong and which has far more problems.

The first thing that people often do is some number crunching: they have data available and they think, wow, I found this interesting fact, and they confuse causality with correlation. An example of a correlation, from one of the first things we had back in a PhD workshop, an introduction to statistics: it was found that if you put the average penis size for a country next to the average income, the two correlate for some countries. But this doesn't actually say anything about causality, because if you claimed this was causal, you would be saying that a bigger penis leads to more income, and you would probably need some other studies or theories to back that up. So that is something that correlates but does not show a causal effect.

And even if you have found something that correlates, you don't always know whether the effect is significant. What does significance mean? In statistics it is usually tied to a 5 per cent or 95 per cent level: a number saying that A and B are 95 per cent likely to be truly correlated. So it is a number that gives you some indication of whether the correlation that you found while number crunching, or maybe were even looking for, is actually significant. Or say, for example, that 55 per cent of the women at re:publica went to see Laurie Penny's talk this morning and only 38 per cent of the men did. Then you would have to calculate the statistical significance of that difference in order to tell whether it is really relevant or just a coincidence. Another
thing is variables. For example, if you have ever read any women's magazine, then you certainly know of the body mass index. People use it as an indicator, a variable, for body health, based on the relation of height and weight. It was, I think, invented in the 19th century. Studies have shown again and again that it has absolutely nothing to say about the health of a person, yet it is still being used by health insurance companies because it is easier for them to use. For example, if you look at Dwayne Johnson, otherwise known as The Rock, or maybe Arnold Schwarzenegger: they have huge bodies with lots of muscle, so obviously they are very heavy, and if you calculated their body mass index it would probably tell you that they are highly obese and so on. But obviously that's not true. So it matters how you construct your variables and how you use your data in order to get truthful information, and this can be misleading. It starts at the very beginning, when you design the concept for your data analysis and your data collection. If you are biased, if you think that the body mass index is very truthful, then whatever you calculate from your data sets will probably be misleading as well.

But even if you did everything right, you picked the perfect sample that is actually representative of your group and you found a correlation that is actually significant,
even if you get all of that right, it can still go wrong when it comes to interpretation. What often happens is that people look at the data without context and generalize. They say: in this specific case the data told me something, so I'm extrapolating that and saying everything will be related to this, and this correlation means that it will stay this way forever and that it has always been this way, because data doesn't lie. But in fact, data lying is the most obvious thing, if you will. There is a saying that you shouldn't trust a statistic that you didn't make yourself, and I think that's very true, because the way you show a graph, for example, or the way you select which variables go into your calculation, everything that you put into it can generate very misleading results, especially if it is not relevant for your business or for the problem that you are trying to solve with data. Usually, if you do data analysis, there is some sort of question you want to have answered or some sort of problem you want to solve, whether it's "give me a financial forecast for the next two years" or "tell me which health groups are most likely to suffer a heart attack" or whatever. So don't overestimate the relevance of data without a proper context.

In preparation I thought about an example of what it means to use data without context, and I thought: using data analysis without context is like unprotected sex. It yields very unexpected results and it's always dangerous. So don't be that dangerous person who goes around without protection. If you have data and have analyzed everything, use some proper context and try to find out what it really means with
your data analysis.

So I mentioned a lot of things about statistics and social research methods, and to some of you it is probably not new. Why did I do this talk? This is where I return to bias. If you look at who is handling data nowadays, and who is handling the data collection, it is often the people who build databases: developers, engineers, those sorts of people. Usually they don't tend to have some sort of base of knowledge about statistics or social research, and I think that is very dangerous, because it means that people just operate in a vacuum, in the area where they are. They can't really put all the data they collect into a proper context, and they don't really know what could go wrong. For example, if you have a multiple-choice questionnaire, the number of choices you put there influences the results, the order of the questions you ask influences the results, the way you phrase the question influences the results, et cetera. There are decades of research on this. So the people who are now shaping the Internet and building these huge databases, who collect these incredibly huge amounts of data, or maybe small amounts of data, usually lack this introduction to social research and statistics and their pitfalls.

So I think it's very important to address bias. How do you do that? Through diversity, of course, but not only. Diversity means, first of all, that you get not only engineers into your team: if you build something like that, you maybe get people from other functions, cross-functional support. But also, if you want to address racial bias, gender bias or age bias, you should make sure that you have the best possible representation of all of these groups. An all-male team will probably overestimate the relevance or significance of things that are related to men, and a very young team might completely forget that there is a whole generation of users out there who are very old, who take much more time answering questions, who need a bigger font size, et cetera. So diversity is very important. And then, of course, when it comes to discussing the results, when you think about the context again, get people on board who are from diverse functions and who are diverse in the team setup you have. Discuss the correlations or variables you found, discuss with them what it means to them, and get their input, so that you can learn different viewpoints and don't make the mistake of generalizing things that you maybe expected to find.

Sorry, I was a little nervous because the stage is so big, so I talked very fast, which leads me already to the last slide. I'm sorry for the terrible puns: this one is about a bigger backbone, and
of course it is another pun relating to the song that I quoted. This is my last slide, and it means: be authentic, be truthful and be responsible in what you do, and have a strong backbone. If someone wants you to find a correlation, if someone wants you to find certain data that underlines a certain expectation, find your inner backbone. Look for it, but be open, and don't try to force statistical significance onto numbers that don't really deliver it.

Since I've finished so early, I think it would be OK to take some questions, if there are any. Otherwise, if you don't have any questions, I'll be running around for the rest of the day, and you can come find me and chat with me. But yes, are there any questions?
Hello, my name is [unintelligible], and I'm one of those developers. [Unintelligible remark about mathematics and statistics at school.] So the question is: is there any hope for someone like me, who made all these mistakes? Do you have any success stories of someone who tried and took some online course in statistics, or something like that?

Yes. Well, you know, I am naturally a very optimistic person, so I'm biased. Another great pun, sorry about this. But I do think there is hope, actually. Of course it's a structural thing. First of all, you need to be aware that there might be a problem, which is why I held this talk today. I hope that maybe people who are leading teams that build databases or build tools to collect data think about this, talk about this and then start thinking about how to change it. I think the simple way would be to get introduction courses to statistics, for example, or social research. These problems have been discussed for several decades now, so it should be rather easy to give people an introduction to concepts like validity, reproducibility of data and reliability of data. So I do think there's hope, but of course you have to go out there and talk about it, check with your colleagues, and maybe do a little work yourself to get into the topic.

Are there any other questions?
I was wondering: confronting people with their biases very often makes them uncomfortable. What's your approach to actually getting people to think about their biases instead of just pushing back and saying, "No, I'm objective, I know it"?

I think the first step is acknowledging that everyone is biased, including me, obviously. It maybe seems a little pastoral, me standing up here and preaching to you about diversity and being responsible and not biased. But it's a process; you usually can't help being biased, and I think it's something people need to learn: that it occurs naturally. If you have a diverse team, you are confronted with bias on an everyday basis, because your colleague who is 30 years older, or your female colleague, or your trans colleague might have very different experiences from yours and a very different viewpoint on things. They will usually confront you with that, or maybe you don't really know that it's coming from those differences, but you learn that not everybody thinks the same. So you are challenged in that way, and I think that's a good way for people to learn that they are biased and that it's OK to be biased as long as you acknowledge it and work on it.

Are there any other questions? We still have a couple of minutes. It's not often the case that we have this much time for discussion. [Partly unintelligible.]

I'm here, and I'm really happy to talk to any engineers and any people who are leading user analysis teams. So come find me today and have a beer with me, or something else, a coffee, and let's chat. Thank you so much for your patience and your time; it was a pleasure talking to you.
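The talk's example of 55 per cent of women versus 38 per cent of men can be worked through as a two-proportion z-test, one common way to check such a difference. This is a minimal sketch, not the speaker's method: the talk gives no audience sizes, so the group sizes below are hypothetical, and the helper name `two_proportion_z_test` is made up for the illustration.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical group sizes; only the 55 % and 38 % shares come from the talk.
z_small, p_small = two_proportion_z_test(22, 40, 19, 50)      # 40 women, 50 men
z_large, p_large = two_proportion_z_test(220, 400, 190, 500)  # ten times as many

print(f"small groups: z = {z_small:.2f}, p = {p_small:.3f}")
print(f"large groups: z = {z_large:.2f}, p = {p_large:.2g}")
```

Under these assumed sizes, the same percentage gap is not significant at the 5 per cent level for the small groups but is clearly significant for the large ones, which is why the percentages alone cannot tell you whether the difference is relevant or a coincidence.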

Metadata

Formal metadata

Title I like Big Data and I can not lie!
Series title re:publica 2016
Part 56
Number of parts 188
Author Banaszczuk, Yasmina
License CC Attribution - ShareAlike 3.0 Germany:
You are free to use, adapt, copy, distribute and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, as long as you credit the author/rights holder in the manner they specify and pass on the work or this content, including in adapted form, only under the conditions of this license.
DOI 10.5446/20837
Publisher re:publica
Year of publication 2016
Language English

Content metadata

Subject area Computer science
Abstract Data analysis can be fun – and horrible all at the same time. So here's a perspective from a network researcher, sociologist and former business analyst on how to improve our daily approach to data. What traps can be avoided? How do we know when we're biased? Is there such a thing as "good"/"bad" data? Let's talk, discuss and maybe change our approach. The talk will cover some foundations: what's a bias – and how do our biases get reflected in our data collection, analysis and interpretation? The way we tackle our own biases with regards – but not limited – to gender, race, social origin, abilities, nationality and other factors shapes not only the quality of data collected, but also directly the outcome of data analysis and interpretation!
