The toxicity of personal information in the "Big Data" age

Video in TIB AV-Portal: The toxicity of personal information in the "Big Data" age

Formal Metadata

The toxicity of personal information in the "Big Data" age
Title of Series
Part Number
Number of Parts
CC Attribution - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
When asked about their online privacy, most people think they have nothing to hide. With my talk, I want to show that this is probably not true. To do this, I'll show a series of experiments that demonstrate how easy it is to learn interesting and sometimes very private things about people by analyzing the data trails they leave behind. I will discuss the risk of permanent de-anonymization of user data and propose technical as well as societal strategies that we can employ to protect our privacy.
[unintelligible]
Hello everyone. Today I would like to talk about big data, algorithms and personal information. The motivation for this talk is quite simple: today algorithms already control many aspects of our lives. In particular, they decide many things that affect us as Internet users and consumers, and soon they will also make decisions that can affect our lives on a much deeper level.
Yet most of us don't really know how algorithms work and how they analyze our personal information to make decisions about us, and I want to talk about that and change it. I have a technical background, so I will approach the subject from quite a technical perspective, but I will always try to highlight how the choice of technologies affects the outcome we get. There will be some graphs and technical details in the talk, so please don't leave the room. I will try to keep that to an absolute minimum, and I will also try to explain everything that you see on the slides.
So the talk is roughly divided into three sections. First I will briefly talk about what big data really means to us personally. Then I will try to explain how algorithms actually learn things about us, and why they probably know more about us than we would think. And finally I will discuss a few new ways of thinking about personal data that might actually be useful.
Usually when I talk about big data, I show graphs that explain how the volume of data that we analyze is this many bytes today and will be that many exabytes in ten years. This is nice, but I realized that it is very unintuitive, because no one can put that kind of information in relation to oneself. So today I want you to think a bit differently about big data. If you think about the difference between our modern industrialized society and the late medieval societies of 500 years ago, one key difference is that today the amount of energy, in the form of electricity, fossil fuels or other resources, that each one of us can consume is higher by a factor of a few million than what we could consume earlier.
With data it's the same thing. We are today the first generation of people that actually leaves a data trace behind that would be measured in gigabytes or terabytes, instead of maybe a few kilobytes as for our ancestors, who left behind only a limited number of documents and a few texts. So I think we really live in a new age of data, and I put a smartphone on this slide.
The smartphone is the device that embodies this new paradigm for most of us, so to say, because it is a small device that everyone has in their pocket, only a few hundred grams in weight, and yet it measures data across more than 20 different channels. For example, we have the obvious things like the GPS position, the device ID, the phone calls that you make, their duration, the time, the quality and so on. But you also have things like temperature, humidity, acceleration data, your contacts and a whole host of behavioral data: when you use the phone, when you switch it on for the first time in the morning, when you switch it off in the evening, how long you browse, what kind of things you are doing. So there are a lot of interesting data streams that we can use and analyze.
At the same time, what is also amazing is that today we not only have these data sources available, but we also have the means to analyze them. If I gave you two euros fifty, you could do several things with that today. You could either go out and buy an iced coffee, or you could go to your favorite cloud provider, sign up there and get one hour of private time with a server that has 40 cores and 160 gigabytes of RAM, or alternatively you could get 100 gigabytes of storage for one month. That's really impressive, because it means that today we can use computing power on demand: if I want to do some data processing, I can just turn on the faucet, do my processing, and when I'm done I close it again. So it really changes the way we can analyze and process data. In addition, many of the tools that we need for data processing are also freely available, either as open source or as free images that we can use with the cloud providers.
This means that a lot of projects which would have been really, really difficult 15 years ago, like analyzing a human genome for example, are now well within the reach of single individuals. And in 10 or 15 years I think that many of the things that big corporations like Facebook and Google are doing today will also be doable by single persons.
So if you're still asking yourself when is a good time to learn about algorithms, I would say now is a good time, because algorithms will affect your life whether you like it or not. The first step in understanding how algorithms work is to learn the basics of machine learning, and this is what I'll try in the next few minutes.
I was looking for an example that would be easy enough to understand but at the same time realistic enough that it could really come from a real-world scenario. What I decided on is to tell you a bit about what we can learn from clicks, because a click is probably the most fundamental building block of the World Wide Web. It's how a user navigates from one website to another, it's how you like something on Facebook, it's how you interact with many of the applications on your phone. So it's the prime form of interaction of the user with the web and with any software today. And here we have our user, who is a bit unhappy that we're doing experiments with him, but no worries, it's just a simulation. Our user has an opinion about a certain subject, which we will classify between 0 and 1, where 0 would be the most left-wing opinion and 1 would be the most right-wing opinion that we could think of. Now this user interacts with a number of articles online which also express a certain opinion, so they can be left-wing, right-wing or moderate. And of course, if you interact with an article, the likelihood that you will like it or find it good is higher if the article expresses the same opinion as you have. That means that the opinion of the user affects his behavior as he interacts with the articles. In our simulation we will just have 40 users who interact with 60 articles, and we will only record whether the user likes the article or not, and then we will see what kind of information we can learn from that.
If we plot that, we first need to think about how we measure whether two users are similar or not. The easiest method that we can think of is just to compare the number of articles that two given users both like or both don't like, and this is what I plot on the y axis. This is also what a machine learning system could measure. The more interesting information for us is the difference in the opinions of the users themselves, which I plot on the x axis, but which we can't measure.
If you look at the data, you can see that it looks pretty chaotic, so there's a lot of noise, but you can maybe also see an upward slope between the difference in opinion and the difference that we have measured through the click rates. Now we can use a machine learning algorithm to extract this information and learn something about the users. The algorithm that we use here is called k-means clustering.
Basically, we ask the computer to divide the users into three groups, where each group is as similar as possible within the group but as different as possible from the other two groups. The result is quite interesting. As you can see, the algorithm returns three groups, and here I plotted again the number of users in each group as a function of the opinion of the given user. As you can see, the algorithm has correctly classified our users into left-wing, moderate and right-wing users. And this is basically the essence of what machine learning is about, because we have not told the algorithm anything about the meaning of the likes or the different types of articles. We just gave it some click data and asked it to group the observations in a way that seems reasonable to it, and the information that emerged about the opinion of the user was generated by the system itself. It's not something that we have explicitly programmed into it, and this is what we call machine learning in this sense.
What this means for any of us is that the data that we give to the apps or to the websites that we visit contains a lot more information than we think. To illustrate this a bit better, I made a proposal for a redesigned permission screen. You all know the screen here from your mobile phone, where some app asks for permissions to read your contacts, to see your identity or to see data about your device. That data is basically only a means to gain more information about you, and I would really like to see a second permission screen which would show what kind of information the app could infer from these things. It's quite a lot that you can actually learn from simple data like the clicks that you make on a service: for example your religious beliefs, your political and ideological ideas, where you live, work or study, maybe your sexual orientation, your income, your social class, maybe your relationship status and whether you're cheating. I could show a lot of examples where researchers actually went and extracted this kind of information from social media data. For example, here I have a link to an interesting paper doing this with Facebook likes. So I just want to point out that, as a user, you should really try to think about the data that applications ask of you not as something that is just an isolated thing, but as something that can convey additional information about you.
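The click simulation and the clustering step described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used for the talk: the click model (like probability equal to one minus the opinion distance), the uniform opinion distribution, the random seeds and the function names `simulate_clicks` and `kmeans` are all chosen here for illustration. Only the 40 users, 60 articles, binary like data and three clusters come from the talk.

```python
import random

def simulate_clicks(n_users=40, n_articles=60, seed=0):
    """Return (user_opinions, likes); likes[u][a] is 1 if user u liked article a."""
    rng = random.Random(seed)
    user_opinions = [rng.random() for _ in range(n_users)]
    article_opinions = [rng.random() for _ in range(n_articles)]
    likes = []
    for u in user_opinions:
        # Assumed click model: the closer the two opinions, the likelier a like.
        likes.append([1 if rng.random() < 1.0 - abs(u - a) else 0
                      for a in article_opinions])
    return user_opinions, likes

def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k=3, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# The algorithm only ever sees the like data, never the hidden opinions.
user_opinions, likes = simulate_clicks()
labels = kmeans(likes, k=3)
for c in range(3):
    group = [o for o, label in zip(user_opinions, labels) if label == c]
    if group:
        print(c, len(group), round(sum(group) / len(group), 2))
```

The printed mean opinion per cluster is the check to look at: if the clusters separate users by opinion, the three means should differ clearly. With another seed or click model the groups need not line up as cleanly as on the slide.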
Another interesting aspect of machine learning is that we can't always understand how algorithms make decisions.
To discuss that, I want to talk about a new machine learning technique which has been gaining wide adoption in the last ten years. It's called deep learning, and it basically works by mimicking the structure of our own brain in order to learn higher-level and more complicated relationships between data points more efficiently. The algorithm that I showed you before is quite powerful in what it's doing, but unfortunately it's quite limited as well, because we have to give it an explicit representation of the data. So we basically have to predigest everything for the algorithm so that we can get useful results. Deep learning tries to overcome this by letting the algorithm itself learn the high-level concepts from the data that we feed to it, so while it's ingesting data it can construct new representations, new concepts from the data that it can then use to accomplish its goal.
The technique has been quite successful for image recognition, which is a very simple task for people but very difficult for computers. Probably no one here in the room would have any difficulty distinguishing the parrot on the left from the bowl of guacamole on the right, because we can see that this thing has feathers and a sky behind it, whereas this is a bowl with some fruits and vegetables inside. A computer, on the other hand, would have a much more difficult time distinguishing the two, because it only sees the pixel values, and in terms of the color composition and the shapes in the picture those two images are actually quite similar. So let's see how deep learning tackles this problem.
The deep learning architecture is quite simple; I show it here. You can imagine that the information flows from the left of this graph to the right. We have one input layer where we feed in, for example, the image data, and then we have several so-called hidden layers that take this image data, crunch it together, process it and generate new representations from it. At the end of our processing pipeline we have one output layer, which in many cases contains only a single output value, which for example would tell us: yes, this is a parrot, or no, this is not a parrot. The way the algorithm constructs the representation of the data is kind of similar to what our brain does when we're seeing things. So you can imagine that on the left we have the individual pixel values of the image, in the middle we have higher-level representations of different shapes, for example rectangular shapes, and on the right we would have high-level concepts such as feathers or the concept of a parrot. For this technique to work, it needs a large number of parameters. Typically, if you have a 500 by 500 image, you would have 250,000 input parameters, and then in each of the intermediate layers we can again have several hundred thousand parameters. All in all, that means that we have tens of millions of different parameters in this scheme, and this means that this kind of machine can only work if we have a very large input data set that we can use to train our algorithm.
The strength of the system is at the same time its weakness, because this parameter space is much too large for us to understand. So this is the first time, or one of only a few times, that we have an engineered system where we know all the components but still can't understand how exactly the system makes its decisions.
In fact, you might have seen these quite beautiful images from the deep learning project at Google. Those are actually attempts to understand how such a deep neural network is working, because they are the result of feeding information backwards through the system: giving some signal into the right side of the system and then amplifying whatever makes this signal strong on the left side. So today we're really in a position where we need to reverse engineer our own systems to understand them.
This can be problematic, because if we have a system that we don't understand, and possibly some input data whose information content we don't control, we cannot know how the system will process it. As Kate Crawford said yesterday, for a machine learning system we always need some human input, so we always need some training data that we feed to the system in order to optimize the outcome of the machine learning process. If this data is contaminated, as she called it, for example if it contains discrimination against certain groups of people, and the algorithm has information about these groups of people available in the data that we give to it, then it can use that to perform the same kind of discrimination that a human would. In that sense the algorithm can take up our bad habits if we don't control really well what kind of information we process with the system, and today this kind of control is not guaranteed.
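To make the layered architecture described above more concrete, here is a toy forward pass through a small fully connected network, together with a parameter count. It is only a sketch under assumptions: the layer sizes (an 8 by 8 "image", two hidden layers, one output), the sigmoid activation, the random untrained weights and the helper names are all invented for illustration and are not taken from the talk.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """One fully connected layer: a weight matrix plus one bias per unit."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, x):
    """Push the input through every layer, from the input side to the output."""
    for weights, biases in layers:
        # Each unit takes a weighted sum of the previous layer, then a sigmoid.
        x = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
             for row, b in zip(weights, biases)]
    return x

def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

rng = random.Random(0)
sizes = [64, 16, 8, 1]  # assumed toy sizes: 8x8 input, two hidden layers, one output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

image = [rng.random() for _ in range(sizes[0])]  # stand-in for pixel values
score = forward(layers, image)[0]                # single output in (0, 1)

print("output:", score)
print("parameters in the toy network:", count_parameters(layers))
print("input values for a 500 x 500 image:", 500 * 500)
```

Even this toy network has over a thousand parameters, none of which means anything on its own. Because the weights here are random and untrained, the output score carries no meaning either; it only shows the flow of information from input to output.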
As a final aspect, I want to talk about why it becomes more and more difficult to assure the privacy of users the more data we actually have about them.
To do that, we have to understand what fingerprinting is. You probably all know [unintelligible] from the CCC, but what I want to talk about is the so-called data fingerprint. A data fingerprint is something that we can put together from the various data attributes that you have, and which is suitable to identify you uniquely, or with sufficiently high probability, in a given context. So how exactly does that work?
The math behind it is actually delightfully simple. Most of you have probably played this game where you have to think of somebody famous and another person tries to guess who it is by asking you a series of yes/no questions. Each one of these questions, if you answer it correctly, narrows down the number of people that can actually correspond to the person you're thinking of. Fingerprinting works in the same way, just by defining a number of attributes, which can be binary or categorical, that we can set for each individual. If we have enough of these attributes, we can generate a unique attribute vector for each one of the individuals that we want to track.
Then, in order to check whether somebody is the same person, we can just take those attribute vectors, compare them and see if they're equal. If they're not, we conclude that it is not the same person, and if they are, we can say with high probability that it is the same person.
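The guessing-game argument and the attribute-vector comparison can be illustrated with a short Python sketch. It rests on assumptions that are not in the talk: the questions are idealized so that each one halves the remaining candidates, and the attribute names, user labels and helper functions are hypothetical.

```python
import math

def questions_needed(population):
    """Number of ideal yes/no questions needed to single out one person,
    assuming every question splits the remaining candidates in half."""
    return math.ceil(math.log2(population))

def matching_candidates(known_vectors, observed):
    """Return every known individual whose attribute vector equals the observed one."""
    return [name for name, vector in known_vectors.items() if vector == observed]

# Hypothetical attribute vectors: (browser, language, uses_ad_blocker).
known_vectors = {
    "user_a": ("firefox", "de", True),
    "user_b": ("chrome", "de", True),
    "user_c": ("firefox", "en", False),
}

print(questions_needed(8_000_000_000))
print(matching_candidates(known_vectors, ("firefox", "de", True)))
print(matching_candidates(known_vectors, ("safari", "fr", False)))
```

Real attributes are rarely this informative individually, so more of them are needed in practice, but the principle is the same: each additional attribute shrinks the set of people that still match.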
Let's have a look at some really nice data. This is a dataset from Microsoft Research called GeoLife, and it contains GPS data of about 200 people, measured over a time frame of about one to two years. The question is how easy it is to re-identify single users from their data.
Let's have a look at the data. As you can see here, the individual trajectories are plotted over time. We can see different modes of transportation: some of them are by walking, some by car, some by airplane. In case you're wondering what city this is, this is Beijing, and in case you wondered who would give up his or her data for that, I think this is the university district there. As you can see, the dataset is quite rich.
If you plot the individual trajectories, you can see that the dataset contains a lot of diverse information about each subject; here each color encodes a different person. Now the question is how easy it is to construct a fingerprint from this kind of data.
I took a very naive approach, just putting a grid on the data, here an 8 by 8 grid, and measuring how often a given individual has been measured in each of the squares.
If we do this, we can plot the result as a color-coded diagram. Here you see the results for almost 100 of our individuals, and you can see that the data footprints are actually quite different. There are some cases where they are very similar, but there are other cases where we can see very different patterns depending on the individual.
Now, if we want to compare two fingerprints, we can just multiply them together, and as a result we get a third fingerprint that we can then just sum up. The sum that we obtain from those two vectors is, so to say, an indication of whether it is the same person or not.
So what we're going to do now is to test how easy it is to re-identify a user through historic data. You can imagine that somebody would use a mobile phone, then throw it away and get a new phone, but apart from that not change any of his or her life habits. The question is how easy it is to re-identify the person even if she or he has a new mobile phone. We're going to use 75% of our data as training data and 25% of the data to test our assumption, and the result is shown here.
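The grid fingerprint and the matching experiment just described might look roughly like the following Python sketch. It is not the analysis from the talk. The two tracks are tiny hand-made examples instead of GeoLife data, the coordinates are assumed to be scaled to the unit square, and normalizing each fingerprint to unit length is an added assumption so that the multiply-and-sum score is comparable between users. All function and variable names are invented here.

```python
import math

def grid_fingerprint(points, n=8, bounds=(0.0, 1.0, 0.0, 1.0)):
    """Count how often a person was seen in each cell of an n x n grid."""
    x_min, x_max, y_min, y_max = bounds
    counts = [0.0] * (n * n)
    for x, y in points:
        col = min(max(int((x - x_min) / (x_max - x_min) * n), 0), n - 1)
        row = min(max(int((y - y_min) / (y_max - y_min) * n), 0), n - 1)
        counts[row * n + col] += 1.0
    # Assumption: scale to unit length so users with more points do not dominate.
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def similarity(fp_a, fp_b):
    """Multiply two fingerprints cell by cell and sum up the result."""
    return sum(a * b for a, b in zip(fp_a, fp_b))

def rank_candidates(known, unknown_fp):
    """Sort known users from most to least similar to an unknown fingerprint."""
    return sorted(known, key=lambda name: similarity(known[name], unknown_fp),
                  reverse=True)

# Hand-made toy tracks; both users share one location to make it less trivial.
tracks = {
    "user_a": [(0.1, 0.1), (0.1, 0.1), (0.3, 0.1), (0.1, 0.3)] * 4,
    "user_b": [(0.9, 0.9), (0.9, 0.9), (0.7, 0.9), (0.1, 0.1)] * 4,
}

train, test = {}, {}
for name, points in tracks.items():
    cut = int(len(points) * 0.75)  # 75% "old phone", 25% "new phone"
    train[name] = grid_fingerprint(points[:cut])
    test[name] = grid_fingerprint(points[cut:])

for name, fingerprint in test.items():
    print(name, "->", rank_candidates(train, fingerprint))
```

The position of the true user in the returned list is the rank that the result plot refers to; rank one means the person was re-identified uniquely.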
So we have the rank of the correctly identified user plotted versus the percentage of cases, for different grid sizes between 16 and 1024 grid units. As you can see, the identification rate is already quite good with this very simple method, because it's between 20 and 40 percent. That means that even with a very coarse grid we are already able to uniquely identify 20% of the users, and if we count all the identifications where the user that we're looking for is within the first 10 proposals of the algorithm, our success rate would be even higher, at about 60%.
And remember, this is only one data dimension that we have used here. In fact, in real life there are many more dimensions that we could use: your GPS data, your email, your phone number, your browsing behavior, your social networks, the settings of your browser. So it would be much easier to construct a richer and probably more unique fingerprint. And you can think about whether this is a problem or not.
I think it can be a problem, and to illustrate that I can compare it to something from my earlier life as a physicist. My colleagues in astronomy were always worried about the risk of the satellites they were sending into space being destroyed by some junk flying around there. As you can see behind me, there are quite a few objects in orbit around Earth today, and it can happen that two objects collide, and such objects get destroyed and leave behind more debris. The most catastrophic failure mode that one could imagine in such a case is the so-called Kessler syndrome, a state where the collision between two pieces of junk would create enough debris to set off a chain reaction that basically destroys everything that's in orbit and makes space unusable for future generations. And you can ask yourself whether something like this could also happen with the data leaks that we have today, where several hundred millions or even billions of user records are being published in the wild or centralized by governments or other organizations.
There could be a time where it would no longer be possible to be private, to be anonymous on the internet, because every bit of information about you that you can't easily change, like your behavior, your character traits, maybe your facial features and so on, can be used to piece together a full picture of you, regardless of whether you try to be anonymous or not.
So this leads to the question of whether we should think of private data, of personal data, more like a toxic asset instead of a precious resource that we need to exploit. This is an interesting notion which I read about for the first time maybe a year ago in a blog, and which is becoming more popular now with the recent leaks. I think this way of thinking about it has some merit, because it leads us to be more cautious about what we do.
When I was learning about data analysis, it was mostly about having fun exploring different datasets and squeezing out a few more percent of success rate from the given data with some new algorithm. What nobody wanted to talk about was safety, or whether we could actually do harm with the things that we do with the data. I think this is something that we have to change, because data analysis is now becoming so pervasive that it really has a big effect on our lives, and we should be careful in how we handle the data.
On the other hand, as users, I think what we need today is a better quantitative and intuitive understanding of our data. I think today everyone already knows that giving away data can be dangerous, but what we still lack is a good understanding of how exactly our data can be used. I think this is something where all of us have to put in the effort and just try to understand better what we can do with the data, how algorithms work, how long-lived our data is and how it can affect us in the future. At the last conference where I spoke about this subject, somebody asked me if it would help to just act randomly in order to confuse the algorithms and make them think you're somebody different. I thought about this, and I have to say that I don't have a quantitative answer as to whether that works or not, but I think it's a fun thing to do, so maybe you want to try it.
I want to end this presentation on a good note: I really think that data analysis is an amazing tool that can really help us improve our lives in many ways that we can anticipate and many other ways that we can't even grasp today. So I would like us to use the technology to its fullest potential and to be careful not to destroy the trust by abusing it.
Thank you very much.
Thank you so much, Andreas. One question: is John Fast here already? The next speaker, just come up on stage.
And if there are any questions, I think Andreas will be at the side of the stage, answering every question you still have. If you do not have