
Open Source Data Analysis and Trend Monitoring


Formal Metadata

Title
Open Source Data Analysis and Trend Monitoring
Alternative Title
Open Public Sensors
Title of Series
Number of Parts
112
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Our world is instrumented with countless sensors. While many are outside of our direct control, there is an incredible amount of publicly available information being generated and gathered all the time. While much of this data goes unnoticed or ignored, it contains fascinating insight into the behavior and trends that we see throughout society. The trick is being able to identify and isolate the useful patterns in this data and separate them from all the noise. Previously, we looked at using sites such as Craigslist to provide a wealth of wonderfully categorized information and then used that to answer questions such as "What job categories are trending upward?", "What cities show the most (or the least) promise for technology careers?", and "What relationship is there between the number of bikes for sale and the number of prostitution ads?" After achieving initial success looking at a single source of data, the challenge becomes to generate more meaningful results by combining separate data sources that each view the world in a different way. Now we look across multiple, disparate sources of such data and attempt to build models based on the trends and relationships found therein. The initial inspiration for this work was a fantastic talk at DEF CON 13, "Meme Mining for Fun and Profit". It also builds upon a similar talk I presented at DEF CON 18, and once again seeks to inspire others to explore the exploitation of such publicly available sensor systems.
Daniel Burroughs first became interested in computer security shortly after getting a 300 baud modem to connect his C64 to the outside world. After getting kicked off his favorite BBS for "accidentally" breaking into it, he decided that he needed to get smarter about such things. Since that time he has moved on to bigger and (somewhat) better things. These have included work in virtual reality systems at the Institute for Simulation and Training at the University of Central Florida, high speed hardware motion control software for laser engraving systems, parallel and distributed simulation research at Dartmouth College, distributed intrusion detection and analysis at the Institute for Security Technology Studies, and the development of a state-wide data sharing system for law enforcement agencies in Florida. Daniel was an associate professor of engineering at the University of Central Florida for 10 years prior to his current position as the Associate Technology Director for the Center for Law Enforcement Technology, Training, & Research. He also is a co-founder of Hoverfly Technologies, an aerial robotics company, and serves on the board of directors for Familab, a hackerspace located in Orlando. He is also the proud owner of two DEF CON leather jackets won at Hacker Jeopardy at DEF CON 8 & 9 (as well as a few hangovers from trying to win more).
Transcript: English (auto-generated)
What I'm talking about today is basically a follow-up to a talk I gave a few years ago at DEF CON 18 about looking at information that's freely available out there on the net and doing some trending and analysis of it and trying to make something useful out of it. So a little bit about my background. I'm currently the director of technology at the Center for Law Enforcement Technology,
Training, and Research, which is a nonprofit research center that got spun out of work that I used to do when I was a professor at the University of Central Florida. I was there for about ten years and, in the engineering program, I taught computer engineering.
I developed the computer security curriculum there and did embedded systems among some other things. Eventually I moved away from teaching and more into research, and we ended up spinning out that research into an independent nonprofit center. I'm also CTO for Hoverfly Technologies and, prior to this, I used to work as a research
associate up at the Institute for Security Technology Studies at Dartmouth College. So over the course of the last 20 years, some of the things that I've worked on are up here on this list and, you know, it took me quite a while to catch on to what the common theme between all of the things I was working on is. I'm kind
of slow to pick up on these things at times. And eventually, as I started putting it together and kind of realizing some of the same things that I was coming across and the same things I was doing, I realized that all of this stuff from information sharing that I'm working on now to hardware sensor networks to intrusion
detection systems, they really all rely on some of the basic concepts of sensor data collection and in particular sensor fusion. Because like everything that we're doing in all of those things that I listed up there,
they're all based on taking some sort of sensor and using it to try to get some measure of reality. But the sensor always has some limitations. Sometimes it's a significant one, sometimes it's not so bad. But every sensor that we use to look at reality, including ourselves, including when we view
things, it's always got some sort of limitation and it's one particular view and that influences the data we're seeing. And we have to work towards trying to get more meaning out of the data that we have. One of the ways that we do this, and one of the techniques that I find most versatile,
I would say, is sensor fusion where we take multiple sensors, we take multiple ways of looking at the same thing and kind of put that together with the hope that we can take the limitations of one observation and cancel it out with a different observation that has a different set of limitations. So at least that's the hope.
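As a rough illustration of the sensor fusion idea described here, the sketch below combines two noisy readings of the same quantity with inverse-variance weighting. This is one common textbook fusion rule, not necessarily the method used in the talk, and it only helps with random noise: it assumes the sensors' errors are independent and unbiased. The function name `fuse` and the example readings are hypothetical.

```python
def fuse(estimates):
    """Inverse-variance weighted fusion of independent estimates.

    estimates: list of (value, variance) pairs, one per sensor.
    Returns the fused value and its (smaller) variance.
    Assumption: sensor errors are independent and unbiased.
    """
    weights = [1.0 / variance for _, variance in estimates]
    total_weight = sum(weights)
    fused_value = sum(w * value for w, (value, _) in zip(weights, estimates)) / total_weight
    fused_variance = 1.0 / total_weight
    return fused_value, fused_variance


# Hypothetical readings: a noisy sensor (variance 4.0) and a better one (variance 1.0).
fused_value, fused_variance = fuse([(10.0, 4.0), (12.0, 1.0)])
print(fused_value, fused_variance)
```

The fused variance comes out lower than that of either sensor alone, which is the "more than the sum of its parts" effect in its simplest form.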
At least, you know, we can put two halfway decent things together and get something that's more than the sum of its parts. So before I get more into my stuff, I always feel like in this particular subject I have to give an acknowledgment to the guy that inspired some of these thoughts in my head. It was actually at DEF CON, way back at DEF CON 13, that Broward Horne gave
this talk on "Meme Mining for Fun and Profit", and his problem, you know, all great ideas come out of a problem, I guess a lot of bad ideas come out of trying to solve a problem too, but his was a really good idea. His problem was that he would start learning some new technology,
some new tool or at least it was new to him and by the time he felt he had mastered it, it was kind of on the way out or the market, the job market was just saturated with people doing that now or it had just fallen by the wayside, nobody cared about it. And he was always kind of struggling with trying to figure out what should I spend my
time studying? What should I learn to kind of get ahead? And he ended up kind of thinking about this as like everything's got this sort of saturation curve where a trend starts happening and there's a little bit of chatter about it and eventually it starts taking off and everybody hears about it when it's big and growing
and then it kind of gets boring and old, but he wanted to try and identify these things earlier on and went through and did it. This is a slide pulled out of his old presentation where what he would do is he would look at news sources and forums and blogs for information and keywords and kind of pull those out and
see what was trending on there, with the idea that seeing that early chatter about something is kind of a precursor to it taking off. In this particular case, the red line shows how many times the word
palladium showed up in news reports and forums and the blue is the price of palladium and you can see clearly there's a lot of chatter about it before the price spiked up and then actually the chatter dropped off before the price comes back down. So it's apparently a really good indicator for predicting the future there, what's going
on. So anyway, that thought inspired me, and when I was teaching, I'd have students come to me and they would want to know what skills they needed to get a good job and all of that, and I tried to apply what Broward had done in a similar way by monitoring and observing trends. This is mostly single-variable observation, with some correlation,
and it started off looking at Craigslist data just because Craigslist is nicely available, it's well organized by geographic location, and you can go in certain categories, like where they have the job postings in there, it's categorized by different types of jobs. I know Craigslist isn't necessarily the best place to look for jobs, but it had
some interesting properties in that it's a lot of small companies that post on there or are maybe trying new things, a lot of entrepreneurial companies, startups, things like that are posting there, not so much the big ones, so that actually tends to skew it a little bit more towards being a leading indicator, something that will come out ahead
of the curve. So some of the things I ended up looking at, just because I found correlations in here, were jobs, items for sale and adult services. And I mean, I'm not saying I looked for adult services on Craigslist, it's just my research took me there. So, you know, things I saw looked like this. This is an example. This is
just showing job postings by date, and the dips you see there are a weekly trend, and these are some different cities. It goes kind of dead on the weekends. There's a spike on a Monday, a spike on a Friday. You see this kind of pattern. It's okay, fine, whatever. It's kind of boring but sort of interesting,
not unexpected but there are certain things that started standing out when you look at this data. In this particular case, one of the things that jumped out at me was Austin never had a spike on a Friday. It always dropped off. It's kind of hard to see but it's the orange line in there. It never has a second spike in it. Thought that was kind of interesting. The other thing, this is what came out of the adult
services was that there was a correlation between adult services being offered and bicycles being for sale, or actually a lot of items being for sale. And this led to a couple of interesting discussions. One of my favorite moments at DEF CON was when somebody stood up in the audience and said, hey, I think I can help you out. I'm from Austin and my sister is a prostitute. So that then led
into a discussion of things you can sell one time like a bicycle and something you can sell over and over and over again. So okay. That's what I had done before. We had looked at that. There's some interesting stuff there. But I wanted to dig a bit deeper
into the data and look for more relationships and more correlations between data and hopefully be able to pull in other sources and do some fusions on this. So I started looking for things like different cycles in like the job postings or correlations in them because at the time when I was working on this, keep in mind I was really trying to help out some of the students that were graduating looking for jobs, trying to help them find
out what skills they needed, what would really kind of help them get ahead. There were definitely correlations in there. You know, there were things in the cycles you'd see, but nothing unexpected, nothing really interesting that jumped out in related skills. You could say that if a job was going to have one particular tool
set or skill set listed, there are other ones that are likely to be listed with it as well. Again, nothing really jumped out at me as being unexpected out of it. But eventually there were a couple of interesting things that showed up. One that I think is just kind of funny was how often the words
drug test or drug screen showed up in a job advertisement, correlated with the different skills in it. And apparently, if you don't think you're going to pass a drug test, don't bother learning SAP because it's not going to do you any good. On the other hand, if you want to develop iOS applications, you know, go knock yourself
out. I guess there's probably some logic here, like how corporate or uncorporate the environment is, I suppose. Another thing was looking at jobs that had benefits, like retirement and health and medical. You know, the interesting one, the
best one was COBOL but I think it was a bit of an outlier because there were just so few jobs offered with COBOL and I guess to get like any old grizzled COBOL programmer to come work for you, you got to give them a lot of benefits. You know, things like Python and Android and HTML, looking for somebody to develop your web page, you're
not going to give them much in benefits, I suppose. So as I was looking into this, and actually this is much more recently, earlier this year, I came across this article. This is actually out of the Journal of Psychology, where psychologist Dorothy Gambrell was doing something similar and actually went
through and looked at the missed connections part of Craigslist and if you haven't ever been there, this is where people like say, oh, I saw you as I was walking across the parking lot and tried to catch your eye and then they go and post this up on the Internet hoping that person will find this and somehow make a connection with them.
And these are organized by state. These are where people had the most missed connections. And there are some things that I just find funny, like Wal-Mart's got a lock on the South. You know, Oklahoma, it's the state fair, of course. You know, it makes perfect sense. And, you know, in Nevada it's casinos.
And I just had to put this up there, one thing that just jumped out at me like crazy was Indiana. It's at home. Like, I don't know what they're doing in Indiana, but I'm pretty sure they're doing it wrong. So I was talking
with a friend of mine about this stuff, Dave Grableski, and his eyes lit up and he started telling me about this thing that he had done where in his neighborhood, this is back in Orlando, Florida, his neighborhood, they had a rash of crime recently. And they didn't really know they had a rash of crime until all the neighbors got together
and started talking with each other. And they found out everybody knew of a different little incident that had happened. So he went and did some searching and found out there was some open source data that the sheriff's office and police department would post about their CAD, their dispatch calls. And he started writing this little tool to take that, do some geolocating on it and tweet it out, and
then you can subscribe to it and get tweets from this thing, like really hyperlocal things for your neighborhood about what's going on there. And one thing that's funny, I just pulled this up earlier today, and this is in the Orlando area. The first tweet that's
on there, and I'm amazed that the sheriff's office is putting this out, they're basically saying there's a designated patrol area available, which means there's an area where there's nobody patrolling it currently. And this is down, like, in a real tourist trap part of Orlando, so, you know, that could be useful information to somebody to know there are no cops there right now. And then there's a few accidents, and then I guess the people
at the bottom down on Poppy Avenue would be happy to know there's a fugitive from justice running around in their area. So this kind of led us to, like, look into more sources for data, because what they offered where we were wasn't very useful or organized. And we started looking in places that kind
of subscribe more to the open gov system, and this is a movement to have more transparent government data. Some cities publish huge amounts of data about what's going on in their city with the fire department, police department. There's live, interesting data in Seattle,
Boston, Chicago, a number of others. These are three that we spent a bit of time looking at. There's information about incidents that are going on, like police and fire. In Chicago you can actually track where the snow plows are in the city. You can track where garbage trucks are in real time from the city, which I just find really kind of fascinating.
There's information about where bicycle racks, public toilets, landmarks and even cameras are, where the city has all of its cameras posted, and that one I thought was actually particularly interesting. You can really go on here and make a map of what is an observable location throughout the city and what is not an observable location, which again could be useful information for somebody.
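The "observable versus not observable" map can be sketched very simply once camera positions are known. The example below is a toy under loud assumptions: the city is a small flat grid, every camera sees a fixed circular radius, and buildings and camera direction are ignored. The camera coordinates, the grid size, and the function name `observable_cells` are all made up for illustration and do not come from any city's published data.

```python
import math


def observable_cells(cameras, width, height, view_range):
    """Return the set of (x, y) grid cells within view_range of any camera.

    Assumption: flat grid, circular field of view, no occlusion.
    """
    seen = set()
    for x in range(width):
        for y in range(height):
            if any(math.hypot(x - cx, y - cy) <= view_range for cx, cy in cameras):
                seen.add((x, y))
    return seen


# Hypothetical camera positions on a 6 x 6 grid, each seeing 1 cell in every direction.
cameras = [(1, 1), (4, 4)]
seen = observable_cells(cameras, width=6, height=6, view_range=1.0)
coverage = len(seen) / (6 * 6)
print(f"{len(seen)} of 36 cells observable ({coverage:.0%})")
```

Every cell not in `seen` is a candidate "not observable" location under these simplified assumptions.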
Here's something, the Seattle one is great. They've got their visualization tools built right into this thing. And this is showing a map of police incidents over a period of time in part of Seattle. And I pulled up this area and you'll notice, most of it, everything is kind of in that same yellow-orange except for
this one big glowing red blob out there. And over in Georgetown, I don't know if anybody is from Seattle here, but I'm wondering what the heck is going on over in Georgetown. And you can look in a little bit closer and right next to it is the Boeing propulsion engineering labs, which, you know, that makes me feel really good.
So, coming back to, like, an area I know a bit more about, back in Orlando, we pulled out data on traffic tickets. They don't publish information about who got the ticket or exactly what the ticket was for, but you can see when a traffic stop occurred. And we looked at it and pulled data that covered
three roads in the area. And this is right out by the University of Central Florida. These are three roads that all run east-west and they're kind of the three major roads: one's right into the university, one's a bit north, one's a bit south. And they all have about the same amount of traffic on them and they
all have a very similar traffic pattern. And what this chart is showing here is that each one of the groupings is a week-long period, five weekdays. And then it's repeated over six weeks. And one
of the things that I found really interesting was the chance of a traffic ticket occurring on one of these roads: it was always likely at different times of the day. It always followed the same sort of pattern, particularly between Highway
50 and University Boulevard, in that the Highway 50 traffic stops always preceded the University Boulevard traffic stops. And when you go out there and you look at the traffic, the traffic pattern is not really any different. So if you start thinking about this and start putting it together, well, why do you always see one before the other?
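One way to put a number on "one always precedes the other" is a lagged correlation: shift one road's stop counts against the other's and see which shift lines them up best. The sketch below is a minimal version of that idea, not the analysis from the talk. The hourly counts are invented, and the names `pearson` and `best_lag` are hypothetical helpers.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def best_lag(leader, follower, max_lag):
    """Find the lag (in time bins) that best aligns leader[t] with follower[t + lag].

    A positive result means the first series tends to come first.
    """
    n = len(leader)
    best = (0, -2.0)  # (lag, correlation); -2.0 is below any real correlation
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = leader[:n - lag], follower[lag:]
        else:
            a, b = leader[-lag:], follower[:n + lag]
        r = pearson(a, b)
        if r > best[1]:
            best = (lag, r)
    return best


# Made-up hourly traffic-stop counts; the second road repeats the first two hours later.
highway_50 = [0, 1, 4, 2, 0, 0, 1, 5, 2, 0, 0, 0]
university_blvd = [0, 0, 0, 1, 4, 2, 0, 0, 1, 5, 2, 0]
lag, r = best_lag(highway_50, university_blvd, max_lag=4)
print(f"best lag: {lag} hours (correlation {r:.2f})")
```

A consistent positive lag across many weeks would be in line with the patrol-pattern explanation, though on its own it would not prove it.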
You know, I don't have hard evidence to back this up, but our belief is that you're seeing an influence of the patrol pattern of the police in the city. So you're actually able to kind of get in there and, through the information that they're putting out, sort of start tracking them. It's kind of like, you know,
there's a talk I went to yesterday, I guess it was, a great talk with Brendan O'Connor that was about tracking people by seeing the information their devices are spitting out on wireless networks. It's a similar concept, that they're putting out a lot of information here, and if you look at it the right way and you
take the right pieces of data and put them together, you can pull a lot more information out about what they're doing and what's going on. So by this time, I've kind of changed what I was
interested in doing, probably because I quit teaching and I left the university, so I don't have students anymore, so I'm not that interested in helping people find jobs. So now I found it kind of interesting to look at these government entities and the police and other things that are going on, and also because I've worked with law enforcement a lot and it's kind of interesting to see like how
on one hand they're very protective of their data, but at the same time they're putting out a lot of information that I'm not sure that they quite realize how much that they're putting out there. Frankly, I think it's actually kind of a good thing. I like being able to have more information and being able to look back on them and like I say, why
should the NSA have all the fun spying on people? So what's next with this? There's so much more I'd like to talk about, but these 20-minute talks you have to be kind of fast in. What I'm really interested
in is expanding the model that we've been using to analyze this data. We kind of built things that are very purpose-driven: the first set of analysis we did was very structured around seeking out the jobs, and
then we kind of got sidetracked by the crime and went off in that direction. And I want to bring this back together and try to build a more robust model for analyzing this data and throw some data mining at this, where so far a lot of what we've done has been
what I'd call hypothesis-based, where I make a prediction about something I think I should see, some correlation, and then go looking for it and see if it exists in the data or doesn't exist. And I'm sure there's a lot of relations in there that are things that, you know, I wouldn't expect or I wouldn't find otherwise. I want to throw a bit of data mining at it, that sort
of blind, either AI or brute-force type approach to finding relations throughout the data. So I think I'm about out of time right now and I'm getting a nod from the back, so I'll wrap it up there, and if there are any questions
I'd be happy to take a couple until they cut me off. Thank you.
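The "blind, brute-force" search for relations mentioned at the end of the talk can be sketched as correlating every pair of series and keeping the strong ones, rather than testing one hypothesis at a time. This is only an illustration of that contrast. The series names and values are invented, and `strongest_pairs` is a hypothetical helper, not a tool from the talk.

```python
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def strongest_pairs(series, threshold):
    """Correlate every pair of named series; keep those with |r| >= threshold."""
    results = []
    for (name_a, a), (name_b, b) in combinations(series.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            results.append((name_a, name_b, r))
    return sorted(results, key=lambda item: abs(item[2]), reverse=True)


# Invented daily counts standing in for scraped data sources.
series = {
    "job_posts": [1, 2, 3, 4, 5],
    "bikes_for_sale": [2, 4, 6, 8, 10],
    "noise": [3, 1, 2, 1, 3],
}
pairs = strongest_pairs(series, threshold=0.7)
for name_a, name_b, r in pairs:
    print(f"{name_a} ~ {name_b}: r = {r:.2f}")
```

With many series, this kind of search will surface some strong correlations purely by chance, so anything it finds is a lead to check against fresh data rather than a result.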