How Our Browsing History Is Leaking into the Cloud

Video in TIB AV-Portal: How Our Browsing History Is Leaking into the Cloud

Formal Metadata

Title
How Our Browsing History Is Leaking into the Cloud
Alternative Title
Tracking the Trackers
Title of Series
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2013
Language
English

Content Metadata

Subject Area
Abstract
Brian Kennish - Tracking the Trackers: How Our Browsing History Is Leaking into the Cloud https://www.defcon.org/images/defcon-19/dc-19-presentations/Kennish/DEFCON-19-Kennish-Tracking-the-Trackers.pdf What companies and organizations are collecting our web-browsing activity? How complete is their data? Do they have personally-identifiable information? What do they do with the data? The speaker, an ex-Google and DoubleClick engineer, will answer these questions by detailing the research he did for The Wall Street Journal (http://j.mp/tttwsj) and CNN (http://j.mp/tttcnn), talking about the crawler he built to collect reverse-tracking data, and launching a tool you can use to do your own research. Brian Kennish is the developer of Disconnect (http://j.mp/dchrome) and Facebook Disconnect (http://j.mp/fbdisconnect), browser extensions that stop tracking by third parties and search engines, and founder of Disconnect, Inc. (http://disconnect.me/), a startup that makes tools to help people understand and control the data they share online. Brian was an early DoubleClick and Google engineer, writing web and mobile ad servers for DoubleClick then working on AdWords, Wave, and Chrome for Google. He has spoken at SXSW Interactive, CTIA Wireless, Google I/O, Launch, and pii. Twitter: @byoogle

So, my name is Brian Kennish. I'm gonna be talking about how our web browsing history is leaking into the cloud. So I
never actually talk about myself much at these things, but I think given the topic I have to do a little bit here, and it's really gonna be more of a confession than an autobiography. So about 10 years ago I showed up at
DoubleClick, and my job was to figure out mobile advertising. At the time, you know, no one knew anything about mobile advertising, and I especially had absolutely no clue. So I used DoubleClick's money to get a hold of every mobile device in the world that I could.
I got a big pile of ugly phones that looked something like this, well, except I had way more of them, and there were a couple of cool-looking Japanese phones thrown in the mix too. I plugged these things into a proxy server to see what data they were sending and what data we could target ads against. I still really clearly remember being kind of shocked when I saw that these things were transmitting location, and I thought to myself, why the hell would anyone want DoubleClick to know almost exactly where they are to see an advertisement? But there it was, and of course I figured advertisers would be into this stuff, so I put it in our mobile ad server. It turned out we were like seven years too early on mobile advertising. Not being satisfied just working at the biggest data collection company in the advertising space, I went to the biggest data collector in the world, period: Google. I was an engineer at Google
for a long time and worked on a lot of stuff, but mostly ad stuff as well, so AdWords and AdSense. Later on I worked on Wave, and the last thing I was working on was Chrome. So about ten months ago, while I was happily working on the Chrome team, I saw this article in The Wall
Street Journal, which was about Facebook leaking personally identifiable information to third-party app developers, and it sort of got me thinking about the huge amount of data that Facebook was collecting about us, specifically all the data that they were collecting sort of in an invisible way when we weren't on facebook.com. So I went home that night and whipped up this quick Chrome extension called
Facebook Disconnect. I spent about four hours doing this thing. It was really like a throwaway thing. People seem to be impressed when I tell them it took me four hours, but to be honest I spent two and a half of those hours making the logo, and the entire code base of this thing
is like 20 lines of code. It's pretty embarrassing. So I had done a couple of, like, personal browser extensions up to that point. I think one of them had something like 36 users. I just released this thing; I figured there might be a worldwide audience of like 50 people,
the size of a football team, but within two weeks there was an entire stadium full of people using this thing: more
than 50,000 people had installed and were running it. That got me thinking, hey, maybe people actually care about this privacy stuff. I know I do. So I wanted to do a follow-up extension that did more than just stop your data from going to Facebook, but the problem was, again, I was working at the biggest data collector in the world. So I asked a lawyer what would happen if I did a broader extension that included, you know, depersonalizing stuff on Google, and he said I probably would get sued. I didn't like that idea, so I quit Google and I spent three weeks making this follow-up extension, which I just called plain Disconnect, and
that stopped your browsing history from going to all the major social networks, and it also depersonalized your searches, so when you did a search on Google or Yahoo it wouldn't be tied to your name anymore. I'll talk about all this stuff in more detail in a second. So anyway, this stuff got a little bit of press. In particular, a reporter from The Wall Street Journal asked me if I thought there were any big privacy stories that hadn't been told yet, and I said yes: social widgets. So I explained to him
what was going on in the social widget space, and it went a little something like this. I'm gonna do this very quickly, because it's a little bit of web 101, so as not to bore you. But let's say we go to a web page, and the web page might
contain some sensitive information. In this case, this page is about depression treatment. Besides the first-party content, the actual article on this page, there's a bunch of third-party widgets and content on this page, and one in particular here is an advertisement. In order for your browser to render this ad, it sends a request to the ad server, and the request is just a bunch of plain text that looks
like this. Obviously it tells your browser where to send this request; in this case this is an ad from DoubleClick. The request also contains the referrer URL, which tells the server where the request came from, and in this case it tells the server that we were looking at this page about depression treatment. It has the URL for that page. And finally, there can be a bunch of cookies in the request. In this case one of the cookies has an ID in it, and this ID uniquely identifies me. Now, most people are probably okay with this set of data being sent to an ad server, because presumably this number, while it's uniquely attached to me, is not my name. It's just this random set of numbers. Now, I'll talk about later why that's maybe not such a good assumption anymore, but for the last 15 years or so we could have assumed that this was anonymous information that was being sent to the ad server. Now if we go
back to this page and look at what else is on this page, we also have this bunch of social widgets. So we have stuff from Facebook and Twitter and a new +1 button, and if we look at the request that gets
sent out in this case, it's gonna look really similar. So here we're looking at the request for the Facebook widget. It's going to facebook.com, we get that identical referrer URL, and finally, again, we have a cookie with a unique ID in it. Now this looks almost identical to the request that we just looked at, but there's a huge difference here. The difference is that this ID is no longer just a string of numbers; it actually points to my Facebook profile. So it's
not just my browsing history with a set of numbers that Facebook is getting; it's like Facebook is actually getting my name. They know that Brian Kennish is actually looking at that page, and not only are they getting that information, they're getting all the other information that I've explicitly given them, like my age and where I live and who my friends are. So you'd think, with all this browsing history attached to our names, that these companies would at least say what they're doing with the data. At the time I looked up what they were doing, and all I found was these 404 pages. There was nothing.
Facebook didn't say what they were doing with the data, nor did Google, nor Twitter.
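The request described above can be sketched in a few lines of Python. This is only an illustration, not anything shown in the talk: the function names, the example URLs and the cookie name `uid` are all made up. The one detail taken from HTTP itself is that the referrer header is spelled `Referer` on the wire.

```python
from urllib.parse import urlsplit


def build_third_party_request(page_url, resource_url, cookies):
    """Return the request line and headers a browser might send
    when a page embeds a resource hosted by a third party."""
    target = urlsplit(resource_url)
    path = target.path or "/"
    if target.query:
        path += "?" + target.query
    headers = {
        "Host": target.netloc,   # where the request is going
        "Referer": page_url,     # the page the user was reading
    }
    if cookies:
        # Cookies previously set by the third party ride along.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return f"GET {path} HTTP/1.1", headers


def what_the_third_party_learns(headers):
    """Pull out the two facts the talk focuses on: the page and the ID."""
    cookie = headers.get("Cookie", "")
    ids = dict(pair.split("=", 1) for pair in cookie.split("; ") if "=" in pair)
    return {"visited_page": headers.get("Referer"), "cookie_ids": ids}


if __name__ == "__main__":
    # Hypothetical page and widget URLs, purely for illustration.
    line, hdrs = build_third_party_request(
        "https://health.example/depression-treatment",
        "https://widgets.social.example/like?layout=button",
        {"uid": "1234567890"},
    )
    print(line)
    for name, value in hdrs.items():
        print(f"{name}: {value}")
    print(what_the_third_party_learns(hdrs))
```

The ad request and the social widget request have the same shape. What differs is whether the party receiving the cookie ID can look it up against a profile that carries a real name.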
I explained this whole scenario to this Wall Street Journal reporter, and he said, well, that's kind of interesting, but how big of a problem is this really? I mean, can you quantify how much of our browsing history they're really getting? And I said, hmm, that's a really good question, good luck finding out the answer. But he was a good reporter, so he kept asking me over and over again, and finally I relented, and I figured I would answer the question in the way that a Googler would answer the question: by writing a web crawler to figure out
the prevalence of all these different tracking companies. So our goals with this crawler were to get a
list of the most popular sites on the web and then to go to each of those and crawl them to a link depth of one. The way search engine crawlers normally work is they'll crawl at least to a link depth of three, which means that they go to the homepage and get all the links on that page, that's one; then go to all those pages and get all the links on those pages, that's two; and then go to all those pages and get all the links on those pages, that's three. But since I no longer worked at Google and didn't have access to a million computers anymore, or unlimited bandwidth, we figured for the sake of this experiment it would be good enough to get a small sample and do a link depth of one. And then,
for all those pages that we got, we were going to extract the third-party domain names from all the resources on those pages that sent these HTTP requests with the referrer URLs. So I ran this thing over the course of a week. I decided I was going to run it out of Starbucks, just for fun, and after a week we ended
up indexing the thousand most popular sites. We analyzed just over two hundred thousand pages, and on these thousand sites we identified nearly 7,000 different third parties. The output of this crawler was this big ugly
spreadsheet that looks something like this. This happens to be an annotated version of it, but I've broken the results out into more viewable chunks here.
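The extraction step of a crawler like the one described can be sketched as follows. This is not the crawler from the talk, whose code is not shown: the class and function names are hypothetical, pages are passed in as already-fetched HTML, only `script`, `img` and `iframe` tags are inspected, and a domain is approximated by the last two labels of the hostname (real code would need the Public Suffix List).

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ResourceCollector(HTMLParser):
    """Collect URLs of resources a browser fetches automatically."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "img", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def registrable_domain(host):
    # Crude assumption: the last two labels identify the owner.
    return ".".join(host.lower().split(".")[-2:])


def third_party_domains(page_url, html):
    """Return the set of third-party domains referenced by one page."""
    first_party = registrable_domain(urlsplit(page_url).hostname or "")
    collector = ResourceCollector()
    collector.feed(html)
    found = set()
    for src in collector.urls:
        # urljoin resolves relative and protocol-relative URLs.
        host = urlsplit(urljoin(page_url, src)).hostname
        if host and registrable_domain(host) != first_party:
            found.add(registrable_domain(host))
    return found


def prevalence(site_to_pages):
    """Map each third party to the fraction of sites it appears on.

    site_to_pages: {site: [(page_url, html), ...]}
    """
    counts = Counter()
    for pages in site_to_pages.values():
        seen = set()
        for page_url, html in pages:
            seen |= third_party_domains(page_url, html)
        counts.update(seen)  # count each third party once per site
    total = len(site_to_pages)
    return {domain: n / total for domain, n in counts.items()}


def services_at_or_above(stats, threshold):
    """Count third parties present on at least `threshold` of sites."""
    return sum(1 for share in stats.values() if share >= threshold)
```

With a real crawl, `site_to_pages` would hold the homepage plus every page one link away for each site. A call such as `services_at_or_above(prevalence(site_to_pages), 0.10)` then gives the kind of count quoted in the summary numbers.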
The first set of stats that we're going to look at here are the non-social services, so things like advertising, analytics and content services. These are the services that we could presume are anonymous. The first thing I want to point out is how prevalent they are. The top service here appeared on 23% of the top thousand websites, so essentially they're seeing 23% of our browsing history. If you think about opening your web browser, going to your browsing history, randomly picking 23% of the pages in there, and then sending them to, in this case, googleapis.com, that's basically what we're already doing. The next thing I want to point out is how much Google stuff there is. The top five services are all from Google, and the way we did this analysis is that we broke out each service separately, but other researchers have looked at this in aggregate. So for example, they found that some Google services appear on ninety-seven of the top hundred sites. It's pretty amazing just how prevalent Google stuff is. And the last thing I want to point out gets at that anonymity issue, which is that most of the services on this list are part of big data companies that also have personal information. So Google obviously has personal information: we log into things like Gmail and Docs and so forth. Adobe has personal information: they have Photoshop online. Amazon, obviously, you go and buy books there. Just under the top ten on this list is Atlas, which was purchased by Microsoft, who obviously has personal information. So at any point these big data companies could decide to link up their anonymous data sets with their personal data sets, and what that would mean is that not only is your browsing history going forward being tracked, but all the past fifteen-plus years of your browsing history could instantly be attached to your name. And this isn't some hypothetical scenario. It's actually something that happened at Google a couple of years ago: The Wall
Street Journal published some leaked documents where Google was debating linking up their personal and anonymous data. So it's something that could actually be happening already, or certainly could happen in the future. So this is
advertising, analytics, content: everything that's not social networking, everything that doesn't have your name. This next set of stats is the
social services, everything that does have your name, and you can see that Facebook is hugely prevalent. They're on a third of the top thousand websites. The really amazing thing about this number is that at the time we did this analysis, the Facebook Like button had just turned one year old, so in one year they went from 0% to 33%. Likewise, when we did this analysis, Google, which was on a quarter of all the top thousand websites, didn't have the +1 button yet. So these stats are really just directional; they're probably going to increase hugely over the next year or so. And the stat I was probably the most surprised about was Twitter's: their social widgets were younger than Facebook's, and they were already on a fifth of the top thousand websites. So these guys are getting a huge chunk of our browsing history with our names. So in summary, we identified
350 different services that get at least 1% of our browsing history. We identified 33 that get at least 5%, and 16 that get at least 10%. Now, this data ended up
getting published in this Wall Street Journal article. I'll provide a link at
the end. Some longer-tail data got published in the CNN article. But we
really, like I said, the data was directional. We wanted people to be able to see what was happening on an ongoing basis, so we've created this tool that we're putting out today, and I bet you
haven't seen anyone type into their slides yet at the con, which I'm going to do here. This is a little address bar.
Sorry, that address was disconnect.me/db, as in database, and we're trying to accomplish two things with this tool. All those sets of stats that I just went over quickly, we're throwing into this tool. So we have a set of anonymity stats about all the top websites, and we have a list of them in here.
You can drill down and look at specific
data on any site. I'm gonna look up Yahoo here, and I'll just zoom in here.
You can see Yahoo has 75 different unique third parties on their site. When you go to an average Yahoo page, there are almost five different third parties on the page, which means that not only are you sending your browsing history, obviously, to yahoo.com, which is where you are, but your browsing history is going to five other places. So the second thing
that we wanted to address with this tool is the problem that I mentioned earlier, which is that while we can see where our data is going, can see that it's going to Facebook or Google, they don't really do a good job of telling us what they're doing with our data. So we've teamed up with Mozilla to work on this icon project,
where we can turn every website into a set of privacy icons that make it easy to identify what they're doing once they actually get our data. If we go back
and look at this Yahoo page here, you can see we have these five icons that
represent whether Yahoo is selling our data, whether they give it to advertisers,
how readily they turn it over to authorities, and how long they keep that
data. And this is actually a crowdsourced
project, so we have a wiki-based platform
here. You can go read the privacy policy of any of the sites that we have in here and then set their icons according to what they're doing. We already have a JSON API, so we're hoping to make this, you know, widely available to other tools beyond our own. So if you're
interested in learning more about this stuff, I'll give you a quick few URLs
here. So that Wall Street Journal article is at j.mp/tttwsj, that's "ttt" as in tracking the trackers, then "wsj". The CNN article that I pointed to is at j.mp/tttcnn, and that database tool that I just quickly demoed is at disconnect.me/db. Now, I understand there's a Q&A room where I get to answer questions. So I have a bunch of VC money
to wreck the advertising industry. The best question you can ask me is if we're hiring, and I'm happy to answer that one. So thank you very much.