The Dark Arts of OSINT
Formal metadata
Number of parts | 112
License | CC Attribution 3.0 Unported: You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/38959 (DOI)
DEF CON 2145 / 112
Transcript: English (automatically generated)
00:00
So welcome to the dark arts of OSINT. This would be Dr. Noah Schiffman, a.k.a. security freak. He is the academic of our team, the one who actually finished college. He is
00:21
way more intelligent than I am, a snappy dresser and an absolutely wonderful guy. Can I have a red shirt kick the shit out of this guy over here? Sorry. There we go. Not like they didn't tell us that earlier. And I'm Skydog, of
00:44
course, by the picture there. We are part of the Dead Bunny Club which, whether you've heard of it or not, is the pseudo-philanthropic arm of everything Skydog does. So we got together, I met you a couple of years ago, and we found that we're fast friends and we have a lot
01:04
of fun getting together and getting into major trouble. Sometimes a little more than friends. Do I? Sometimes a little more than friends. You weren't going to talk about that. I took that out of the presentation. Sorry. Sorry. He is a great cuddler, though. So as they announced earlier, this is my 11th year coming
01:29
to DEF CON. I actually was back in the AP days. Who has been to the AP days? Everyone is a newbie. That's wonderful. I just heard about DEF CON two weeks ago.
01:45
I get to celebrate, ironically, at my 11th year here. I've been a goon for nine years. For my 11th year here, I get to celebrate three firsts, which is kind of odd. It's not losing my virginity. I'm hoping, I'm really holding out. I understand
02:08
I have to talk to a girl, though, and I'm not ready for that. The first one, it was really wonderful. My son got to participate in DEF CON Kids.
02:20
I'm old enough now that I have offspring. He got to celebrate in DEF CON Kids. He placed fourth in social engineering and second in hacker jeopardy. Definitely a first for me. My second would be my first Mohawk ever. I got to participate in Mohawk Con this year.
02:43
Round of applause for those guys. They did a wonderful job. I had to leave Vanderbilt to make that one happen. And my third is being accepted to speak at DEF CON, which is a great honor. I did find out that they do require you to submit
03:00
a paper. That's why it took me so long. I didn't read the fine print, but here we are. We're talking about our live demo. There's this live demo thing that we maybe kind of discussed in the CFP. Well, so I don't
03:21
know how many people here are familiar with something called MATLAB or R or other letters of the alphabet. Yes. What's your favorite letter? No. So I didn't have a licensed copy of MATLAB and went with Octave and got into a battle with Octave and Octave
03:41
won and I lost. So we're doing a different kind of live demo that's sort of audience participation based. So it's going to be really fun and everyone's going to get to meet people sitting next to you. It's going to be a fun ice breaker opportunity. No, it's not. It's actually ‑‑ but it's going to be a
04:00
fun one. I'll participate in and make a point. So I hate Octave. Hate it. Okay. So that's all ‑‑ yeah. So our talk today is about the dark arts of OSINT. So the path we're going to take, we're going to talk about what is OSINT.
04:24
We're going to move on to ‑‑ Evan, if you call me again, I'll fucking kill you. I swear. Fucking kill you. Anyway, I digress. So we're going to speak about what is OSINT. We're going to talk about some acquisition tools and techniques. I am then
04:42
going to sit down and the guy with the math background is going to speak about anonymizing data. You don't remember? I'm going to leave the stage and Noah's going to speak about anonymizing and de‑anonymizing data. Open source intelligence. Open source
05:05
intelligence. Thank you for putting the pause in there. Did you get the transitions in there? The cool one that wipes? The dissolve. You pay for the dissolve. That's good. So what is open source intelligence? Essentially open source intelligence is anything out there
05:24
that you can reach without having to be a LEO or something similar, belong to a large organization, or file paperwork to get to it. It's anything you can get to online or that's readily available. Why do you care? Who had a picture taken of him this weekend by some jackass with a camera? Not one of our photographers but someone with a phone
05:43
or whatever. Okay. Guess what? You're now hooked up with open source. The information is out there. You appear in a picture. Now it's something I can catalog and index. Congratulations. We weren't going to talk about it. So how can it be optimized? We're looking
06:04
at big data sets. One of the things that Noah's going to get to is taking the big data sets and crunching the numbers and actually extracting some information, some interesting information out of what's readily available.
06:21
So OSINT comprises many things. One of them would be text, whether it is emails that you sent back in 73 where you were talking about something bizarre. Was it ‑‑ did you send anything back? Never mind. I've gone back and found some of the things I've done on forums way, way, way back in the day using a different name that I was able to actually
06:45
find online, things that probably would have shown how ignorant I was at the time. But anyway, you have text that's out there that can be searched for. You also have imagery. We have Facebook. We have appearing at DEF CON, whether you realize it or not.
07:01
You probably had a picture taken of you at some point in time. That appears there. Video. I think last night Evan played the little VR system where you had to move around the map and began to do the robot, which was an absolute hoot, which will appear on YouTube with a little bit of captioning later on.
07:21
Yeah, the black hat robot, absolutely. So we also have audio. The video of this presentation will be available on DVD later. But they also put the audio of that up, so if you're not into driving while looking at your iPhone, you can listen to the audio. And then you have geospatial, which would
07:44
be the images you take from a device that's GPS enabled and records your longitude and latitude and altitude and fun things like that. Other information that doesn't always get removed from imagery when it's put online. There is a certain signal to noise ratio. If you've been online and looked for data, a lot of times the
08:04
aggregators of that data may have some really bizarre things that show up. No, I never lived in Henderson, Nevada, but for some reason my name and phone number are associated with that. So there's a certain amount of it that's out there that doesn't really fall into place correctly. You have to go through and decrease the noise
08:25
to get the true signal. So out of that, once you clean up enough data, you're able to go through and put enough things together, layer them together, find where the high points in the graph appear, you will find actionable data. Anyone that's actually in the law enforcement community, which I'm not, anyone who is in
08:44
that community realizes that when enough data is collected, it becomes actionable and then it becomes intelligence, something that can be used to actually do something. So ‑‑ sorry, I won't cough there. The history and origins of ‑‑ how is it ‑‑
09:04
fur ball. No, I don't want to drink anymore. Not yet. Wait until you get on stage. So print media, originally you had newspaper clippings from other parts of the United States, someone would catalog those things and actually write up a report on it. We moved
09:23
into the radio age, things were actually transcribed and then cataloged and indexed. The search time on information like that was a little long, if you want to complain about Oracle or MySQL or something like that. The paper version of it really sucked. We moved to television, things that got compressed down to videotape and things of
09:44
that nature. Like I said, I recently worked for Vanderbilt. They have the largest compendium of news broadcasts. They go back as far as anyone else. That information can also be searched by metadata. And then of course we're down to the Internet age where every jackass can get out there and dance and then put online their robot
10:05
at a large security conference. That's coming back to haunt you, ass hat. So the evolution began in news sources, of course, with radio and print. Then we moved to government repositories. For some reason they decided it would be a good idea
10:22
to collect information and store it. Who knew? Then you went to academic publications where they began to collect data and sort everything and put it together. Theoretically they anonymized it. Now we've moved into the age of electronic databases where we know everything about you. Those are sexy. Those will get you laid, definitely.
10:46
The current forms and uses of those are definitely tool sets, websites you can go to and of course databases you can get your hands on, too, depending on what your flavor is. So let's see. Who's ever used Maltego? Show of hands. Cool.
11:05
So Maltego is basically used primarily to dig down on an organization. You can look at their WHOIS records and DNS and IPs and e‑mails and things of that nature. I'm going
11:24
to have someone else come up here and stomp your ass, too. But anyway, Maltego is really good for drilling down on a company by looking at e‑mail addresses and things to compile a large amount of data. Who's used FOCA? If you haven't played with FOCA, FOCA is
11:43
a lot of fun. It basically looks at the metadata in Microsoft Office documents, PDFs, OpenOffice files and the EXIF metadata in pictures. You can compile information just from the hidden information in all the documents you can get ahold of. Randy from accounting puts
12:04
out some sort of a document and inside that it contains information about where it's stored on the local network and it actually makes it to the outside world and gives me some information about how the interior network is built. That one is a really nice fun one to play with. SearchDiggity. Anyone use that one?
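The kind of hidden document metadata described above can be illustrated with a minimal sketch. It is not how FOCA itself works; it only assumes that an Office Open XML file (such as a .docx) is a ZIP archive whose `docProps/core.xml` part may carry creator fields. The helper names `read_core_properties` and `make_sample_package` are hypothetical, and the sample package is a stand-in built in memory, not a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an Office Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(package_bytes):
    """Return creator-related fields from docProps/core.xml, if present."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    fields = {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "last_modified_by": root.findtext("cp:lastModifiedBy", namespaces=NS),
    }
    # Drop fields that are missing or empty.
    return {key: value for key, value in fields.items() if value}


def make_sample_package(creator, last_modified_by):
    """Build a tiny in-memory ZIP holding only a core-properties part (illustration only)."""
    core = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
        f"<dc:creator>{creator}</dc:creator>"
        f"<cp:lastModifiedBy>{last_modified_by}</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
    return buffer.getvalue()


sample = make_sample_package("randy", "randy.accounting")
print(read_core_properties(sample))
```

Only the creator fields are read here; file paths and other details of the sort mentioned in the talk live in other parts of a document and are not covered by this sketch.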
12:26
Not in my backyard. Apparently SearchDiggity isn't used as much as everyone would like, but it basically is another form of being able to sift through data. It takes information
12:41
from Bing and Google and other sources and compiles it together and gives you a nice little interface to be able to get to it. So a lot of different pieces of software out there. Has anyone heard of Recorded Future? This is one of those that makes you kind of cringe a little bit. It's a temporal analysis engine. It forecasts and
13:05
does analysis to predict future events based on information from social networks and patterns that they can find. So they're able to go in and put some information in and actually determine what could possibly happen based on the information that's flowing right now. And of course there's Facebook. Who's put their music preferences on Facebook?
13:29
It's all right. We're among friends. You can raise your hand. Do we get a picture of that? So if you've put on to Facebook, hey, I like REO Speedwagon and for all
13:41
the young guys in the crowd, that's really a rocking band. Hey, I went to REO Speedwagon. Well, I can go back in with graph search now and say, hey, I want to know anyone who lives in Tennessee who likes REO Speedwagon and blah, blah, blah, and I can then mine some data out and I guess give you a jingle and say, hey, why don't you listen to
14:01
records, at which point you would probably run. So there are a lot of ways, things are actually being put out there now for you to be able to look at the data and try to grind through it. There are other websites: Social Mention, Spokeo, Meltwater. I have my own personal preferences on what to use. Johnny Long isn't here, but who's ever seen
14:25
the Google hacking database? Okay. So a bunch of things that people have put together. If you're looking for certain types of information, they've put query structures together for you to use. This is what it's like to hang out with Noah and I at any point in
14:45
time. So basically you have three different types of public data. You have cooperatively provided data, which would be this is my name and this is what I like, which is
15:01
social networking. It's what I put on Facebook. I like REO Speedwagon and Smurfs. It's things that you willingly ‑‑ it's things that you put out there, your personal preferences and things of that nature, or posts that you've made that can actually be mined to
15:20
look at, but you've willingly given it up. Did I say that right? Okay. Just checking. Things that are confidentially provided, something that has a session ID attached to it. I give that information. I filled out a questionnaire or survey. I said, yes, I'm more than happy to allow you to look at this information. I put something in there enough that it's
15:44
very identifiable, be it my address, my phone number, my credit card, things of that nature. So you have to actually ‑‑ it's a site with a privacy policy where you say, I agree to it. So you've given that information up and you've agreed to their legal statement there. And then you have the unknowingly provided or the ‑‑ wait,
16:07
where did they get this from? It's the DMV records. It's other information. Maybe it's your medical records or how the fuck did they get my APGAR scores? He was slow at
16:21
birth and it never got better. Things that are third‑party generated, government and academia, things ‑‑ who has ever participated in something in college where you got paid 20 bucks to get an ass probing or something like that for research? Yeah. So they take that data and put it into a database and they put it online and theoretically
16:41
your name is not associated with it. So who publishes these data sets? A lot of academia. Now there's a commercial market for data that's been pieced together and for a certain fee you can go in and cruise through that data. The more you pay the more granular
17:06
your data becomes and the more revealing it is. Why are these data sets published? For statistical analysis ‑‑ it's coming up. Try not to laugh. For statistical analysis we want to go back and look at the information and do some predictions. Looking for trends
17:25
and patterns that are out there. Retrospective outcomes. We struggled trying to find the proper example for this. We decided on which is better, Viagra or Cialis. We go back
17:43
and look at the information and see the satisfaction ‑‑ I guess that's not the right word. I heard someone say Cialis. A buddy of mine, I swear. It was a friend of mine, too. Where did Evan go? He's hiding. That's good.
18:08
And of course this information is used for decision‑making for future things. Maybe it is product design or coming up with something new, whether it's actually going to be popular
18:21
in any way, shape or form. A lot of the things that are used in here, the tools and the websites, I don't do the math. That's this gentleman's side of things. Occasionally I get asked to find things. Who in the crowd, who finished high school? Show of hands.
18:41
It's okay. Who went to college? Now who finished college? Okay. Okay. This is your crowd. So anyway ‑‑ do you want to do that? No. So I did not finish college. I had
19:12
to learn what you can and can't do. It was not taught out of me. Oh, you can't do it
19:22
that way. I've never heard that before and I don't pay attention to it. It makes it a lot easier for me to do things like drill data on somebody. Occasionally I'll get a phone call and a couple pieces of criteria and they say find someone. I've become very
19:42
successful. Anyone stayed at the Bellagio? This is audience participation. You're awake, right? Anyone been there? Cabana by the refrigerated pool. Absolutely wonderful. If at any point in your lifetime you can make that happen, definitely do it. I'm in the sun. I've got the MacBook Air with me. I'm trying to get on the shitty wireless there that does
20:04
not work. And there's a gentleman to my immediate right and he notices I have a computer, which for all of us that is typically the sticking point to, yeah, dude, my computer at home doesn't work. Who's ever answered that question? So I'm in a swim suit by
20:22
the pool and a guy starts talking to me. Okay. I'll bite. No problem. So we start discussing China, politics, the economy, fun things like that really make you happy. We have a few drinks and he says, so you're in Vegas. Are you here for business or pleasure?
20:41
And I said, well, currently for pleasure. I would think that would be the case if I'm by the pool. And he says, so you're here for pleasure, that's good. And I said, well, actually, in two or three weeks I'm coming back out to the largest hacker conference in the United States called DEF CON. And you could hear his asshole pucker in the seat.
21:04
So, you know, that's one of those things where who in the crowd hasn't had to explain what that means? Put your hand down, ass hat. So I began to explain what DEF CON is. Since we didn't have the documentary, it was very interesting trying to explain it
21:21
to him. The what? The hearing impaired con. The hearing impaired con, definitely. I got to spend some time trying to explain to a mundane what we actually do and why we get together for all this. And then his jackass friend shows up who has come to Vegas to go to the Pawn Stars place downtown and
21:41
he comes back, dude, I got to meet Hoss. And I'm thinking, okay, let's go get a steak. So he, you know, packs everything up and he says, yeah, we're going to head off and get a steak at so and so place and really nice meeting you, later. And I said, just a second. I said, your name is Brian. And your family owns a civil construction
22:06
firm in Seattle, Washington. And the guy says, yeah. And I said, I'll send you an email to your work email within the next 48 hours. Again, you could hear his asshole pucker. And I said, don't worry. I'm going to show you. I have two bits of information
22:25
on you. I don't have your last name. I don't have much more than that. But I'm going to send you an email and show you what's possible. So we went out and had a nice dinner. Went out to the pool the next day and at some point I thought, I got to go
22:40
find Brian. So I sit down on the bed and fire up the laptop and in 45 minutes I owned this guy. I have where he lives, pictures of his house, what he paid for it, pictures of all of his relatives. I then took it upon myself to scan the exterior of his network and tell his system administrator, you probably should change this. It's not good to have
23:02
this open. Brian never responded to the email, oddly enough. I didn't think it was a problem. I didn't send him an invoice. I did it gratis. But that's a good example. I had two bits of information on the guy. Fortunately, one of them was unique enough. It allowed me
23:22
to find him. I was able to correlate civil construction, oddly enough, against a YouTube video in which I was able to pick this guy out, and from there just went to town on him. So I guess if you get an email from a guy that you met by the pool who says he's a hacker and he has a picture of your house from the driveway, it might
23:43
be a little bit unnerving. So ‑‑ Was that legal? That was ‑‑ I don't give a shit. So anyway, I don't have to have a court order and apparently no one else does. But anyhow, the open source side of it can be a lot of fun.
24:04
One of the things that Noah is going to discuss is finding outliers in the data. Brian had enough for me to be able to find. Had he said my name is John, the problem would be a little bit more difficult if he said, yeah, I work at Starbucks. Okay, not as much of an outlier there, but given time and effort and how much he pissed
24:25
me off, I probably would have found him eventually. But based on the information, it took me about 45 minutes to track him down. So if you ever get bored and you're by the pool at the Bellagio, just wait for someone to come by. It's a lot of fun. You like talking to guys at pools, don't you?
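The point about outliers in the story above (a rare combination of attributes narrows the candidates far faster than a common one) can be sketched as follows. This is a minimal illustration under stated assumptions: the records are invented, and the function names `candidates` and `anonymity_set_size` are hypothetical, not anything shown in the talk.

```python
# Hypothetical records for illustration only; none describe real people.
PEOPLE = [
    {"first_name": "John", "industry": "coffee retail", "city": "Seattle"},
    {"first_name": "John", "industry": "coffee retail", "city": "Seattle"},
    {"first_name": "John", "industry": "coffee retail", "city": "Tacoma"},
    {"first_name": "Brian", "industry": "coffee retail", "city": "Seattle"},
    {"first_name": "Brian", "industry": "civil construction", "city": "Seattle"},
    {"first_name": "Dana", "industry": "civil construction", "city": "Seattle"},
]


def candidates(records, **known):
    """Return the records consistent with every known attribute."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in known.items())
    ]


def anonymity_set_size(records, **known):
    """Count how many records remain indistinguishable given what is known."""
    return len(candidates(records, **known))


# A common combination leaves several candidates...
print(anonymity_set_size(PEOPLE, first_name="John", industry="coffee retail"))
# ...while a rarer combination narrows the set much further.
print(anonymity_set_size(PEOPLE, first_name="Brian", industry="civil construction"))
```

In this toy data the first query leaves three candidates and the second leaves one, which is the sense in which two bits of information can be enough when one of them is unusual.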
24:47
Have you ever been given a wedgie on stage? I would love to. You take the little microphone. Okay. Wow. Sky claimed that I'm going to talk about a lot of things that I don't
25:07
know where he got that from. You're drunk. You're really, really drunk. I know a little bit of math, some basic addition, subtraction stuff. And, yeah, I'm not really going to talk about anything really hard and advanced because
25:23
that's for smart people, but data ‑‑ actually a lot of these slides ‑‑ hello? There's an echo. I don't like that echo. I picked them out of other people's sets. Sorry. Have fun. Damn it. So these slides are semi new to me.
25:44
But I think I did make them. So let's go through them. Data science. This is a big field, right? Data science. The science of data. Science has been around for a long time. Data has been around for a long time. You put them together and it's ‑‑ okay.
26:05
It's emerged mostly over the past decade as, like, real data science; you know, information scientists have really ‑‑ that's been a past-decade kind of thing. And it sort of came out of the whole, you know, business analytics, competitive intelligence,
26:23
like everything else, driven by big business because they're just looking out for our best interest. And, you know, so all of a sudden people who are like statisticians who are experts at data mining and all of these types of advanced mathematical analyses are very valuable
26:42
to big businesses and other entities that like to analyze large data sets. Are there other entities that collect lots of data? None that I've heard of. I haven't heard of any either, but I'm sure there are organizations out there that are collecting lots of data and doing something with this, but ‑‑
27:02
Purely for benevolent reasons. What's that? Purely for benevolent reasons. Yeah, exactly. But, yeah, it's mostly to enhance our shopping experience, right? Like you ‑‑ other people who bought this also bought this. Statistics, just given data, you try to come up with a model.
27:25
Probability: given a model, let's try to predict the data. Simple concept. Okay. Here's a little graphic demonstrating what I just said and it's useless. Okay. Historic data, model, future. Ignore.
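The distinction drawn above (statistics goes from data to a model, probability goes from a model to predicted data) can be sketched with the simplest possible model. This is an illustration only, assuming a coin-flip style Bernoulli model that the talk does not itself specify; `fit_bernoulli` and `binomial_pmf` are hypothetical helper names.

```python
from math import comb


def fit_bernoulli(observations):
    """Statistics: given 0/1 data, estimate the success probability p."""
    return sum(observations) / len(observations)


def binomial_pmf(k, n, p):
    """Probability: given the model (p), how likely are k successes in n trials?"""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


historic = [1, 0, 1, 1]           # historic data
p_hat = fit_bernoulli(historic)   # model fitted from the data
print(p_hat)
print(binomial_pmf(2, 4, p_hat))  # predicted chance of 2 successes in 4 future trials
```

The same two steps apply to richer models; only the fitting and the prediction formulas change.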
27:42
Data sources. Okay. These are some just random examples of readily available public data sets. And we've actually gone from like having databases of information to databases that are cataloging the databases of information and it's increasing exponentially.
28:03
And my favorite was, well, Freebase I came across when I was searching for something else. But apparently it's a database. So I also like Infochimps, too. I don't know why. It's just a funny name.
28:21
Okay. Big data. Not just data, but big data. Buzzword? Who thinks it's a buzzword? I was thinking more buzzword. Some people do, and the other people think it's really like a legitimate, real thing. Okay. That's cool. I don't judge.
28:46
Well, I don't know, I mean, it's hard to define what that really means, big data. Like, you know, is it big data? Is it in the cloud? It's a large type thing.
29:00
What's the cutoff for being big? You know, 8 inches, 10 inches? When does it become really big data? How big is your data? My data is huge. I work with a very small data set. And I'm okay with that. And at this point this is yet another presentation we cannot put in our portfolio for public speaking.
29:26
Oh, boy. That's true. So technically at least what I found is that it's sort of defined as big data, these incredibly large amounts of data that are being rapidly generated and have lots of variability.
29:43
Okay. You know, sure. But it's still big data. But the interesting thing about it from our perspective is that the creation of big data has also sort of brought forth the development of tools to work with big data, to analyze these big data sets.
30:06
Visual representation, doing number crunching on them. So all these new mathematical and advanced platforms for performing all kinds of functions on big data, which is of interest to us. And we're going to look at that in a few minutes.
30:22
Terminology. That means sort of defining words, kind of. We Googled it backstage. A lot of Googling. Depending on who you talk to or what publication you read or what book, anonymization and de‑identification basically mean the same thing.
30:53
Again, some studies, some groups will distinguish between them, but for the purposes of our talk, they're synonymous. De‑anonymization and re‑identification are sort of their antonyms, the opposite meaning.
31:04
So the ‑‑ so you reverse one of these processes, you get to the other. Pretty simple. Anyone with fifth grade background should get that. Okay. Sweet. Moving on.
31:21
This is real simple stuff. Data. When it's initially collected, a lot of times it contains personally identifiable information like social security number or address or ‑‑ something else. Your name. That would be personally identifiable.
31:42
So there needs to be some kind of process that takes this data and makes it sort of anonymous. I love you, too. Oh, what was that, 10? Holy ‑‑ okay. Super, super fast. Dude, you took up all the damn time. Wow.
32:02
Okay. So we need to find a way to make this personally identifiable information ‑‑ why ‑‑ okay. Make it into anonymous public data. So there's a couple different ways that can be done in general. Just removing variables all together. A variable that actually is unique enough to be identifying by itself.
32:23
Like, you know, I've had eight kids and been in porn. That's, you know, Octomom or something. You know, whatever. Just remove those. Global recoding. Local suppression. Where, again, recoding certain variables or suppressing certain values in different columns that are really identifiable.
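The three techniques just named (removing variables, global recoding, local suppression) can be sketched roughly as below. This is a minimal illustration, not the speakers' tooling: the records, field names, and the `max_children` cutoff are all made up for the example.

```python
# Hypothetical records; names, fields, and values are invented for illustration.
records = [
    {"name": "Alice", "ssn": "000-00-0001", "age": 34, "zip": "02139", "children": 2},
    {"name": "Bob", "ssn": "000-00-0002", "age": 37, "zip": "02139", "children": 8},
    {"name": "Carol", "ssn": "000-00-0003", "age": 52, "zip": "02141", "children": 1},
]

# Assumed set of columns that identify a person all by themselves.
DIRECT_IDENTIFIERS = {"name", "ssn"}


def remove_variables(record):
    """Removing variables: drop directly identifying columns entirely."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def global_recode(record):
    """Global recoding: coarsen a column the same way in every record."""
    out = dict(record)
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"   # exact age -> ten-year band
    out["zip"] = record["zip"][:3] + "**"   # 5-digit zip -> 3-digit prefix
    return out


def local_suppress(record, max_children=4):
    """Local suppression: blank one unusual value, keep the rest of the row."""
    out = dict(record)
    if out["children"] > max_children:
        out["children"] = None
    return out


anonymized = [local_suppress(global_recode(remove_variables(r))) for r in records]
for row in anonymized:
    print(row)
```

Each step throws information away, which is exactly the utility trade-off the talk turns to next.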
32:46
A whole bunch of different ways to ‑‑ yeah. Okay. Anonymization metrics. We have to figure out a way to look at the way we anonymize data and figure out, hey, is this working? Is this, like, actually making the data anonymous? And at the same time making it usable.
33:02
So the whole utility versus actual ‑‑ I mean, that's a balance right there. So two metrics. Disclosure risk: the likelihood of revealing data in the public set. And then information retention: the utility of that data. So we take away all this information, it's anonymous, but is it still usable?
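One simple way to put numbers on those two metrics is sketched here. The definitions are assumptions made for illustration (the talk does not give formulas): disclosure risk is taken as the share of rows whose combination of attributes is unique, and information retention as the share of cell values that survive unchanged.

```python
from collections import Counter


def disclosure_risk(rows, quasi_identifiers):
    """Fraction of rows whose quasi-identifier combination is unique in the set."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


def information_retention(original, released):
    """Fraction of original cell values that appear unchanged in the release."""
    total = kept = 0
    for before, after in zip(original, released):
        for field, value in before.items():
            total += 1
            kept += after.get(field) == value
    return kept / total


# Hypothetical data: the release replaces exact ages with ten-year bands.
original = [
    {"age": 34, "zip": "02139", "sex": "F"},
    {"age": 37, "zip": "02139", "sex": "F"},
    {"age": 52, "zip": "02141", "sex": "M"},
    {"age": 58, "zip": "02141", "sex": "M"},
]
released = [
    {"age": "30-39", "zip": "02139", "sex": "F"},
    {"age": "30-39", "zip": "02139", "sex": "F"},
    {"age": "50-59", "zip": "02141", "sex": "M"},
    {"age": "50-59", "zip": "02141", "sex": "M"},
]

fields = ["age", "zip", "sex"]
risk_before = disclosure_risk(original, fields)
risk_after = disclosure_risk(released, fields)
retention = information_retention(original, released)
print(risk_before, risk_after, retention)
```

In this toy case the recoding removes every unique row but also changes a third of the cells, which is the balance being described.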
33:22
So that's the balance you have to strike. Yeah. It's a tough problem. You want to minimize disclosure risk, maximize information retention. Easier said than done. But information entropy. Anyone familiar with this? Entropy? Yes, yes. And not the entropy from thermodynamics, which I spent a long semester trying to go through.
33:46
So, yeah. Information theory. So the idea is the ‑‑ oh, my goodness. I have, like, a million slides to go through. Basically the amount of information that can be ‑‑
34:07
for a given state, like the ‑‑ I actually use an eight‑sided die in an example that obviously you can roll and you get, like, one through eight because it's got eight sides.
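The eight-sided die example works out as follows, assuming every outcome is equally likely, so the entropy is just the base-2 logarithm of the number of outcomes. The 8 billion world population figure is the round number used in the talk.

```python
import math


def bits_for_outcomes(n):
    """Entropy in bits of a choice among n equally likely outcomes."""
    return math.log2(n)


die_bits = bits_for_outcomes(8)                # eight-sided die
world_bits = bits_for_outcomes(8_000_000_000)  # one person out of ~8 billion

print(f"eight-sided die: {die_bits:.1f} bits")
print(f"world population: {world_bits:.2f} bits, so {math.ceil(world_bits)} bits suffice")
```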
34:21
So, yeah. Information entropy is going to be three bits. And, yeah. So population of the world, let's just say 8 billion, that's, like, 33 bits. Awesome website. 33bits.org. Very good. Anyway. All right. I'm going to cruise over lots of stuff. Audience participation. Everyone
34:41
just get up and participate in some way real quick because we got to do something. Get up and ‑‑ no, I don't ‑‑ should we do this? Do we have time for this? I think we have all the time we want. Really? You got that poll? I didn't do that. I'm wrong. Let me get the radio and get a couple of red shirts in there.
35:06
We were going to look at audience participation and kind of go through and sort people out based on some criteria. We can skip it if you want or if you want to stand up and raise your hand. Do you want to do that? All right. First question. Everyone here
35:22
who ‑‑ this is their first time attending DEF CON, please stand up. Noob. Noob, noob, noob, noob. Anyone from the east coast, stay standing. Everyone else,
35:44
sit down. You guys paid the highest airfares. Thank you very much. We enjoyed that. Yours. Anyone here from New Jersey ‑‑ wait, I didn't say what to do. I just said anyone from New Jersey ‑‑ Simon says ‑‑ you can sit down.
36:07
What do we got, like 7, 8, 10 people? What are the states below New Jersey? No, no, no. I was going to say you had a hangover, but I guess it's not publicly available data unless we query everyone in the room. I would say anyone who's male
36:24
stays standing, but that's pretty much everyone. Any female? If you're female, raise your hand. Shitty data set. Never mind. No, actually, that would be ‑‑ there we go. Okay. I tell you what. Anyone 29 years of age or younger remains standing. All
36:46
the old folks in the room, sit down. That's good. How many we got left? One, two, three, four. Yours. Oh, man. Anyone here sort of living below the North Carolina, South Carolina border, sit down.
37:03
Did we do New Jersey and up? Yeah, so we're now between like North Carolina, Jersey, so ‑‑ what do we have? You said New Jersey and up still stay standing, so you're in the upper quadrant there. So I did age. We can't do male, female. Who got laid last night? Okay, that's
37:29
three people we're up to. Who is remaining standing? Count them off. I can't see for the lights. How many people you think are in this room right now? 7, 800, 1,000,
37:40
something like that? I don't know. Of that, we're down to, what, four people, three people who remain standing? And how many questions? Well, that was ‑‑ five questions? Well, it was ‑‑ so it was maybe four or five questions, but the entropy for those questions, so what, north, west coast, east
38:00
coast? Entropy there is one bit. We had ‑‑ what was the other question? First time at DEF CON. Information entropy there is two bits.
38:20
Anyone above what, New Jersey and above? Is that what you said? Pretty much. I think all the questions were like two bit, yeah, entropy questions. So five ‑‑ yeah. So basically five bits of entropy and we were able to narrow down the population to, what, three, four people. And it's all innocuous information, but
38:41
the point is that the combination of all this innocuous information can actually be quite identifiable and ‑‑ yeah. Thank you for participating. A round of applause for yourselves. Thank you. Okay. So how much time left? Just keep going. Three? Okay. I have 20 slides to do in three minutes. Okay. Thank you, Scott.
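The arithmetic behind the exercise can be checked with a short sketch. The room size of 1,000 and the four people left standing are the rough on-stage guesses, used here as assumptions. Under those numbers the questions together carried closer to eight bits than five, since each bit only halves the crowd.

```python
import math


def bits_revealed(before, after):
    """Bits of information needed to narrow `before` people down to `after`."""
    return math.log2(before / after)


def expected_remaining(population, bits):
    """People expected to remain after learning `bits` bits about each one."""
    return population / 2 ** bits


room = 1000     # assumed audience size
standing = 4    # assumed number still standing at the end

print(f"{bits_revealed(room, standing):.2f} bits to go from {room} to {standing}")
print(f"{expected_remaining(room, 5):.2f} people expected after exactly 5 bits")
```

Either way the point stands: a handful of innocuous answers is enough to single out a few people in a large room.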
39:06
I appreciate that. Outliers: values, traits, anything that falls outside of a normal distribution, yeah. Single outliers, easy to pick up. If you have them in combinations
39:21
or sets which are unique, a little bit trickier to detect, but mathematically possible. Graphical example of an outlier ‑‑ this is the IQ of probably everyone here in the audience, and I was an outlier, kind of ‑‑ I'm special.
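Both kinds of outlier can be sketched in a few lines. This is an illustration under assumptions, not the speakers' method: single outliers are flagged with a plain z-score threshold, and combination outliers are rows whose combination of values is unique even though each value on its own is common. The sample data is invented.

```python
import statistics
from collections import Counter


def zscore_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def unique_combinations(rows, fields):
    """Rows whose combination of `fields` occurs exactly once in the data set."""
    keys = [tuple(r[f] for f in fields) for r in rows]
    counts = Counter(keys)
    return [r for r, k in zip(rows, keys) if counts[k] == 1]


# Hypothetical scores: one value sits far from the rest.
scores = [98, 101, 100, 102, 99, 100, 103, 97, 100, 160]
single = zscore_outliers(scores)

# Hypothetical attendees: "F" and "NY" are each common, the pair is not.
attendees = [
    {"sex": "M", "state": "NJ"}, {"sex": "M", "state": "NJ"},
    {"sex": "M", "state": "NY"}, {"sex": "M", "state": "NY"},
    {"sex": "F", "state": "NJ"}, {"sex": "F", "state": "NJ"},
    {"sex": "F", "state": "NY"},
]
combo = unique_combinations(attendees, ["sex", "state"])
print(single, combo)
```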
39:40
Data set intersections, Venn diagrams, who's heard of them? Yeah. Okay. You have sets of data. You have set A, set B. What's the intersection there? A. Look at that. A and B. Amazing. Now you add C. Look what you have. A and C. B and C. And what's
40:05
in the middle? Holy crap. Isn't that amazing? That's a math joke, isn't it? That's the math thing happening. Yeah, well, that's a good point. Unique variable overlap. You know what?
40:22
If you have outliers for different types of data and they ‑‑ you know what? Just move on. Mathematical attacks with three minutes. Yeah, that's not ‑‑ Slow down. Just do it. I got it covered. Sweet. Inferential analysis and an example
40:45
of it, remember the targeted advertising? The teenage woman who was pregnant and was getting all this targeted advertising based on her purchasing behavior to her household and then her dad was upset that she was getting targeted ads for like Enfamil and
41:06
diapers when ‑‑ this is my teenage daughter, she's not pregnant ‑‑ and got all pissed off at the manager. Anyway, she was pregnant and that's how he found out ‑‑ yeah, that's not a good way to tell your parents you're pregnant, through Target. That's not how I'll tell my parents.
41:24
So database linkage, classic example is the whole Netflix and IMDb thing that was ‑‑ yeah, I'm sure you all remember that. U.S. census data. This happens. They don't knock on your door anymore. I don't think they ‑‑ when did they stop knocking
41:41
on your door? I don't answer the door, me and my 12 roommates. Do they really? Man, all right. Another reason not to answer the door. Actually, a researcher in 1990 ‑‑ ah, Latanya Sweeney ‑‑ showed that just using information from the census data, which was date of birth, gender, zip
42:04
code, 87% of the population was unique. Amazing. And it's just based on principles of information entropy. Amazing. She exposed the health care records of the governor of Massachusetts at the time, which is kind of funny. And screw you, Weld ‑‑ yeah, so ‑‑ and
42:22
applied entropy. So how she did it, I mean, zip code. There's 43,000 zip codes in the U.S. roughly. Birth dates, 365. Birth year, about 70 different ‑‑ an age range of 70. And two different genders. Hermaphrodites were excluded. So 30 bits of entropy, which
42:40
includes all of the population of the U.S. Simple as that. PGP. Ever heard of PGP, the Personal Genome Project? This is another program going on where people voluntarily submit all this genetic information about themselves. They want to correlate genotype, phenotype to learn about themselves ‑‑ oh, dude.
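The zip code, birth date, and gender arithmetic quoted a moment ago can be checked directly. The counts are the rough figures from the talk. The U.S. population of about 330 million is an assumption added here for comparison. With these inputs the total comes out near 31 bits, in line with the "30 bits" quoted on stage, and comfortably more than the roughly 28 bits needed to single out one U.S. resident.

```python
import math

# Rough figures quoted in the talk.
zip_codes = 43_000
birthdays = 365
birth_years = 70
genders = 2

combinations = zip_codes * birthdays * birth_years * genders
identifier_bits = math.log2(combinations)

# Assumed U.S. population, for comparison only.
us_population = 330_000_000
population_bits = math.log2(us_population)

print(f"{combinations:,} possible combinations = {identifier_bits:.2f} bits")
print(f"U.S. population needs about {population_bits:.2f} bits")
```

Because there are several times more possible combinations than people, most people end up alone in their combination, which is the intuition behind the 87% figure.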
43:08
Anyway, again, this is ‑‑ the project's gone bad. Yeah. No one saw that. That didn't happen. Pass. Yeah. Record linkage ‑‑
43:25
Is this a cool diagram? You got to see this. Take care of him, dude. He's stressing me out. Record linkage. This is where you have a public data set and a private data set. The public data set maybe has metadata that's publicly available and might have some innocuous but identifying information
43:43
about an individual. The private data set, well, that's got personally identifiable information that you don't want people to know. With record linkage, it's possible to actually correlate the two and discover sort of these anonymous or so‑called anonymous traits
44:01
about a person by combining the two data sets. And I'll get to mathematically how to do that in a second. Or not if I get kicked off stage. So flying through these slides. Vectors. This is where it gets ‑‑ this is where I get into the math. So either go to sleep or those who ‑‑ anyone math torn?
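Before the vector math, the basic idea of record linkage can be sketched as a join on shared attributes. Everything here is hypothetical: the names, the fields, and the choice of zip code, birth date, and sex as the linking keys are assumptions for illustration, and a real attack would have to cope with messy and missing values.

```python
def link_records(public_rows, private_rows, keys):
    """Join two data sets on `keys`, keeping only unambiguous matches."""
    index = {}
    for row in private_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    linked = []
    for row in public_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one match: re-identified
            linked.append({**row, **candidates[0]})
    return linked


# Hypothetical public data: names plus innocuous attributes.
public_rows = [
    {"name": "Pat Example", "zip": "02139", "birth_date": "1960-07-31", "sex": "M"},
    {"name": "Sam Sample", "zip": "02141", "birth_date": "1985-03-02", "sex": "F"},
]
# Hypothetical "anonymous" data: no names, but the same attributes.
private_rows = [
    {"zip": "02139", "birth_date": "1960-07-31", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02141", "birth_date": "1985-03-02", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02141", "birth_date": "1985-03-02", "sex": "F", "diagnosis": "condition C"},
]

linked = link_records(public_rows, private_rows, ["zip", "birth_date", "sex"])
print(linked)
```

Only the first person is linked. The second matches two private records, so that match stays ambiguous.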
44:23
Okay. Your data points now become a vector. Your record attributes ‑‑ yeah. Boom. Okay. We're now dealing with vector math. Take it one step further. The whole database, it's a matrix. Boom. Records, people, attributes, database. Okay. Cool. And again,
44:48
we now can apply matrix math to this: matrix inversions and dot products, Gram‑Schmidt orthonormalization, all kinds of wonderful things like that. And actually a cosine similarity,
45:01
actually measuring the angular difference between two vectors or matrices, can actually find the similarities in large data sets. Yeah. Boring, boring, boring, boring. Math, math, math. The one cool thing that we did do is ‑‑ hold on. So this is the actual mathematical formula for the similarity function, in case any of you want to try this at home or see me after class and we'll discuss it. Yeah.
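For anyone trying this at home, here is a minimal sketch, assuming the similarity function on the slide is ordinary cosine similarity: the dot product of two record vectors divided by the product of their lengths. The numeric encoding of the records and the candidate names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical records encoded as attribute vectors (1 = has the trait).
target = [1, 0, 1, 1]
candidates = {
    "record_a": [1, 0, 1, 0],
    "record_b": [0, 1, 0, 0],
    "record_c": [1, 0, 1, 1],
}

scores = {name: cosine_similarity(target, vec) for name, vec in candidates.items()}
best_match = max(scores, key=scores.get)
print(scores, best_match)
```

A score near 1 means two records point in almost the same direction, so the highest-scoring candidate is the most likely match across the two data sets.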
45:25
Venn diagrams, this is really cool. So to be able to visually understand and represent and identify overlapping data sets, we had two data sets. A, B. Multiple variables
45:41
that were in common, that were the same descriptive traits. Look at the intersections of them, noted here by these little lines across. Okay. So these data sets have independent descriptive variables, and they're in common. Then we take those little sections that are in common and we Venn the Venns, as we say. So take those and watch this. Bam, bam, bam,
46:11
bam. Right there. Look at that. And then based on that, we can actually now actually
46:21
the subspace defined by that area is the intersection of all of these groups, and it actually identifies records for which all the attributes are identical, and actually identifies an actual person. Wait, we got ‑‑ okay. So we're ‑‑ so we're talking about
46:41
in summation, the rising dark side of OSINT. So, yeah, emergence of big data: big problem, big data. Lots of tools are being used for analysis and visualization. More data sets are being developed, and these mathematical attacks are going to become easier and easier. It's another weapon for social engineering tool kits because this
47:05
is information about individuals that we're going to be able to ascertain and they're not going to be aware of it and they're not voluntarily giving this information but it's going to be actually sort of re‑identified about them from these anonymous data sets.
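The "Venn the Venns" step described earlier amounts to intersecting the sets of records that match each known trait. A minimal sketch follows, with invented records and traits borrowed from the audience exercise. None of this is the speakers' actual code.

```python
def matching_ids(rows, field, value):
    """IDs of all records where `field` equals `value`."""
    return {r["id"] for r in rows if r[field] == value}


# Hypothetical "anonymous" records.
rows = [
    {"id": 1, "coast": "east", "first_defcon": True, "under_30": False},
    {"id": 2, "coast": "east", "first_defcon": False, "under_30": True},
    {"id": 3, "coast": "east", "first_defcon": True, "under_30": True},
    {"id": 4, "coast": "west", "first_defcon": True, "under_30": True},
    {"id": 5, "coast": "west", "first_defcon": False, "under_30": False},
]

# Innocuous facts known about the target from other sources.
known = {"coast": "east", "first_defcon": True, "under_30": True}

# Each trait alone matches several records; the intersection matches one.
per_trait = {field: matching_ids(rows, field, value) for field, value in known.items()}
candidates = set.intersection(*per_trait.values())
print(per_trait, candidates)
```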
47:20
So cool for us, bad for them. What can we do to defend against the dark arts? Proper sanitization methods. There are not ‑‑ there's no way to ‑‑ there are no standards to actually implement anonymization metrics that actually ‑‑ provide the
47:41
utility requirements but also provide true anonymity. They don't exist. So we need access controls, or my recommendation is to falsify everything and just make ‑‑ so that's what I would do. And conclusion ‑‑ yeah, just ‑‑
48:01
questions and answers will be handled at the bar. You guys are buying. Ladies and gentlemen, the full presentation will be seen at Skydog Con later this year. Absolutely. Can I have a round of applause for the speaker goons for letting us go a little long. Thank you. How can we take out Skydog and his buddy?