Open Source Data Analysis and Trend Monitoring

Video in TIB AV-Portal: Open Source Data Analysis and Trend Monitoring

Formal Metadata

Title
Open Source Data Analysis and Trend Monitoring
Alternative Title
Open Public Sensors
Title of Series
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2013
Language
English

Content Metadata

Subject Area
Abstract
Our world is instrumented with countless sensors. While many are outside of our direct control, there is an incredible amount of publicly available information being generated and gathered all the time. While much of this data goes by unnoticed or ignored, it contains fascinating insight into the behavior and trends that we see throughout society. The trick is being able to identify and isolate the useful patterns in this data and separate them from all the noise. Previously, we looked at using sites such as Craigslist to provide a wealth of wonderfully categorized information and then used that to answer questions such as "What job categories are trending upward?", "What cities show the most (or the least) promise for technology careers?", and "What relationship is there between the number of bikes for sale and the number of prostitution ads?" After achieving initial success looking at a single source of data, the challenge becomes to generate more meaningful results by combining separate data sources that each view the world in a different way. Now we look across multiple, disparate sources of such data and attempt to build models based on the trends and relationships found therein. The initial inspiration for this work was a fantastic talk at DC13, "Meme Mining for Fun and Profit". It also builds upon a similar talk I presented at DC18, and once again seeks to inspire others to explore the exploitation of such publicly available sensor systems.
Daniel Burroughs first became interested in computer security shortly after getting a 300 baud modem to connect his C64 to the outside world. After getting kicked off his favorite BBS for "accidentally" breaking into it, he decided that he needed to get smarter about such things. Since that time he has moved on to bigger and (somewhat) better things. These have included work in virtual reality systems at the Institute for Simulation and Training at the University of Central Florida, high speed hardware motion control software for laser engraving systems, parallel and distributed simulation research at Dartmouth College, distributed intrusion detection and analysis at the Institute for Security Technology Studies, and the development of a state-wide data sharing system for law enforcement agencies in Florida. Daniel was an associate professor of engineering at the University of Central Florida for 10 years prior to his current position as the Associate Technology Director for the Center for Law Enforcement Technology, Training, & Research. He is also a co-founder of Hoverfly Technologies, an aerial robotics company, and serves on the board of directors for Familab, a hackerspace located in Orlando. He is also the proud owner of two DEF CON leather jackets won at Hacker Jeopardy at DEF CON 8 & 9 (as well as a few hangovers from trying to win more).
What I'm talking about today is basically a follow-up to a talk I gave a few years ago at DEF CON 18, about looking at information that's freely available out there on the net, doing some trending and analysis of it, and trying to make something useful out of it. So, a
little bit about my background: I'm currently the director of technology at the Center for Law Enforcement Technology, Training, & Research, which is a nonprofit research center that got spun out of work I used to do when I was a professor at the University of Central Florida. I was there for about 10 years in the engineering program. I taught computer engineering, developed the computer security curriculum there, and did embedded systems, among some other things. Eventually I moved away from teaching and more into research, and we ended up spinning that research out into an independent nonprofit center. I'm also CTO for Hoverfly Technologies, and prior to this I worked as a research associate up at the Institute for Security Technology Studies at Dartmouth College. So over the course of the last
20 years, some of the things that I've worked on are up here on this list. It took me quite a while to catch on to the common theme between all of the things I was working on; I'm kind of slow to pick up on these things at times. Eventually, as I started putting it together and noticing that I kept coming across the same things and doing the same things, I realized that all of this stuff, from the information sharing that I'm working on now to hardware sensor networks to intrusion detection systems, really all relies on some of
the basic concepts of sensor data collection and, in particular, sensor fusion. Everything that we're doing in all of those things I listed up there is based on taking some sort of sensor and using it to try to get some measure of reality. But the sensor always has some limitations. Sometimes it's a significant one, sometimes it's not so bad, but every sensor that looks at reality, including ourselves when we view things, has some sort of limitation. It's one particular view, and that influences the data we're seeing, so we have to work towards getting something more meaningful out of the data that we have. One of the ways that we do this, and one of the techniques that I find most versatile, I'd say, is sensor fusion, where we take multiple sensors, multiple ways of looking at the same thing, and put them together with the hope that we can take the limitations of one observation and cancel them out with a different observation that has a different set of limitations. At least that's the hope: that we can put two halfway decent things together and get something that's more than the sum of its parts. So before I get more
into my stuff, I always feel with this particular subject that I have to give an acknowledgement to the guy who inspired some of these thoughts in my head. It was actually at DEF CON, way back at DEF CON 13: Broward Horne gave this talk, "Meme Mining for Fun and Profit". All great ideas come out of a problem (I guess a lot of bad ideas come out of trying to solve a problem too), but his was a really good idea. His problem was that he would start learning some new technology, some new tool, or at least it was new to him, and by the time he felt he had mastered it, it was on the way out, or the job market was saturated with people doing it, or it had just fallen by the wayside and nobody cared about it. He was always struggling to figure out, "What should I spend my time studying? What should I learn to get ahead?" He ended up thinking about this as everything having a sort of saturation curve, where a trend starts happening and there's a little bit of chatter about it, and eventually it starts taking off and everybody hears about it when it's big and growing, and then it gets boring and old. He wanted to try to identify these things earlier on, and he went through and did it. This is a
slide pulled out of his old presentation. What he would do is look at news sources, forums, and blogs for information and keywords, pull those out, and see what was trending there, with the idea that this early chatter is a precursor to something taking off. In this particular case, the red line shows how many times the word "palladium" showed up in news reports and forums, and the blue is the price of palladium. You can see that there's clearly a lot of chatter about it before the price spiked up, and then the chatter actually dropped off before the price came back down. So it's apparently a really good indicator for predicting what's going on there in the future.
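One way to check whether chatter really runs ahead of a price is to shift one series against the other and see which shift correlates best. The sketch below is only an illustration of that idea, not the method used in either talk: the function names `pearson` and `best_lead` are hypothetical, and the two weekly series are made up so that the price simply repeats the mention counts three weeks later.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def best_lead(mentions, prices, max_lag):
    """Return (lag, r) for the lag in [0, max_lag] at which mentions[t]
    correlates most strongly with prices[t + lag]."""
    best = (0, pearson(mentions, prices))
    for lag in range(1, max_lag + 1):
        # Drop the last `lag` mentions and the first `lag` prices so the
        # remaining points pair each mention with the price `lag` steps later.
        r = pearson(mentions[:-lag], prices[lag:])
        if r > best[1]:
            best = (lag, r)
    return best

# Made-up weekly data (an assumption for illustration only):
# the price follows the keyword mentions with a three-week delay.
mentions = [1, 1, 2, 5, 9, 6, 3, 2, 1, 1, 1, 1]
prices = [1, 1, 1] + mentions[:-3]

lag, r = best_lead(mentions, prices, max_lag=5)
print(f"chatter leads price by {lag} weeks (r = {r:.2f})")
```

On real data the peak would be far less clean, and a strong lagged correlation on a single keyword is a hint worth investigating rather than evidence of a dependable predictor.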
Anyway, that thought inspired me. When I was teaching, students would come to me wanting to know what skills they needed to get a good job and all that, and I tried to apply what Broward had done in a similar way, by monitoring and observing trends. This is mostly single-variable observation, plus some correlation. I started off looking at Craigslist data, just because Craigslist is nicely available, it's well organized by geographic location, and in certain categories, like the job postings, it's categorized by different types of jobs. I know Craigslist isn't necessarily the best place to look for jobs, but it had some interesting properties in that it's a lot of small companies that post on there, ones that are maybe trying new things: a lot of entrepreneurial companies, startups, things like that are posting there, not so much the big ones. That actually tends to skew it a little more towards being a leading indicator, something that will come out a bit ahead of the curve. Some of the things I ended up looking at, just because I found correlations in them, were jobs, items for sale, and adult services. And I'm not saying I looked for adult services on Craigslist; it's just that my research took me there. So the things I saw look
like this. This is just an example showing job postings by date. The dips you see there are a weekly trend, and these are some different cities. It goes kind of dead on the weekends, there's a spike on a Monday and a spike on a Friday, and you see this kind of pattern. It's okay, fine, whatever: kind of boring, sort of interesting, not unexpected. But certain things started standing out when you look at this data. In this particular case, one of the things that jumped out at me was that Austin never had a spike on a Friday; it always dropped off. It's kind of hard to see, but it's the orange line in there, and it never has a second spike in it. I thought that was kind of interesting. The other thing, and this is what came out of the adult services, was that there was a correlation between adult services being offered and bicycles being for sale, or actually a lot of items being for sale. This led to a couple of interesting discussions. One of my favorite moments at DEF CON was when somebody stood up in the audience and said, "Hey, I think I can help you out. I'm from Austin and my sister's a prostitute." And then that led right into a discussion of things you can sell one time, like a bicycle, and things you can sell over and over and over again.
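The weekly pattern described here (dead weekends, a Monday spike, and a Friday spike that one city lacks) can be made explicit by averaging daily counts per weekday. This is a minimal sketch under stated assumptions: `weekday_profile` and `has_friday_spike` are hypothetical helper names, and the counts are invented rather than taken from Craigslist.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def weekday_profile(daily_counts):
    """Average postings per weekday from a {date: count} mapping."""
    totals = defaultdict(int)
    seen = defaultdict(int)
    for day, count in daily_counts.items():
        totals[day.weekday()] += count  # Monday is 0, Sunday is 6
        seen[day.weekday()] += 1
    return {WEEKDAYS[i]: totals[i] / seen[i] for i in sorted(seen)}

def has_friday_spike(profile):
    """True when Friday's average is higher than Thursday's."""
    return profile["Fri"] > profile["Thu"]

# Invented counts for two cities over two weeks, starting Monday 2013-07-01.
start = date(2013, 7, 1)
typical_week = [30, 20, 20, 20, 28, 5, 5]      # Monday and Friday spikes
flat_friday_week = [30, 20, 20, 18, 12, 5, 5]  # no second spike

def two_weeks(week):
    """Repeat a 7-day pattern over 14 consecutive dates."""
    return {start + timedelta(days=i): week[i % 7] for i in range(14)}

typical = weekday_profile(two_weeks(typical_week))
flat = weekday_profile(two_weeks(flat_friday_week))
print(has_friday_spike(typical), has_friday_spike(flat))
```

Comparing Friday against Thursday is just one simple definition of a "spike"; a real analysis would probably want several weeks of data and some tolerance for noise.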
Okay, that's what I'd done before. We had looked at that, and there's some interesting stuff there, but I wanted to dig a bit deeper into the data and look for more relationships and more correlations, and hopefully be able to pull in other sources and do some fusion on this. So I started looking for things like different cycles in the job postings, or correlations in them, because at the time when I was working on this, keep in mind, I was really trying to help out some of the students who were graduating and looking for jobs, trying to help them find out what skills they needed and what would really help them get
ahead. There were definitely correlations in there. There are things in that cycle you'd expect to see, but nothing unexpected, nothing really interesting that jumped out. In related skills, you could say that if a job was going to have one particular tool set or skill set listed, there are other ones that are likely to be listed with it as well. Again, nothing really jumped out at me as being unexpected. But eventually a couple of interesting things showed up. One that I think is just kind of funny was how often the words drug
test or drug screen showed up in a job advertisement, correlated with the different skills in it. Apparently, if you don't think you're going to pass a drug test, don't bother learning SIP, because it's not going to do you any good. On the other hand, if you want to develop iOS applications, go knock yourself out. I guess there's probably some logic here in how corporate or un-corporate the environment is, I suppose. Another thing was looking at a
job that had benefits, like retirement and health and medical. The interesting one, the best one, was COBOL, but I think it was a bit of an outlier because there were just so few jobs offered with COBOL. I guess to get any old grizzled COBOL programmer to come work for you, you've got to give them a lot of benefits. With things like Python and Android, or HTML for somebody to develop your web page, you're not going to give them much in benefits, I suppose. So as I was looking
into this, I came across this article. This is much more recent, from earlier this year. It's actually out of the Journal of Psychology, where some psychologist, Dorothy Gambrell, was doing something similar and went through and looked at the missed connections part of Craigslist. If you haven't ever been there, this is where people say, "Oh, I saw you as I was walking across the parking lot and tried to catch your eye," and then they post this up on the internet hoping that person will find it and somehow make a connection with them. These are organized by state: these are where people had the most missed connections. There are some things I just find funny. Walmart's got a lock on the South. In Oklahoma it's the state fair, of course; it makes perfect sense. In Nevada it's casinos. And the one thing that I just had to put up there this morning, that just jumped out at me like crazy, was Indiana: it's "at home". I don't know what they're doing in Indiana, but I'm pretty sure they're doing it wrong. So I was talking with
a friend of mine about this stuff, Dave Kehr Blonsky, and his eyes lit up and he started telling me about this thing that he had done. In his neighborhood, back in Orlando, Florida, they'd had a rash of crime recently, and they didn't really know they had a rash of crime until all the neighbors got together and started talking with each other and found out that everybody knew of a different incident that had happened. So he did some searching and found out there was some open source data that the sheriff's office and police department would post about their dispatch calls, and he started writing this little tool to take that, do some geolocating on it, and tweet it out. Then you can subscribe to it and get tweets from this thing, really hyperlocal things for your neighborhood about what's going on there. And it's actually
one thing that's funny: I just pulled this up earlier today, and I was just noticing things. This is in the Orlando area. The first tweet that's on there, and I'm amazed that the sheriff's office is putting this out there, is basically saying there's a designated patrol area available, which means there's an area where there's nobody patrolling currently. And this is down in a real tourist trap part of Orlando. So, I mean, that could be useful information for somebody, to know there are no cops there right now. Then there's a few accidents, and then I guess the people down at the bottom, down on Papi Avenue, would be happy to know there's a fugitive from justice running around in their area. So this kind of led
us to look into more sources for data, because what they offered where we were wasn't very useful or organized. We started looking in places that subscribe more to the open gov idea, which is meant to make government data more transparent. Some cities post large amounts of data about what's going on in their city with the fire department and the police department. There's interesting data in Seattle, Boston, Chicago, and a number of others; these are three that I spent a bit of time looking at. There's information
about incidents that are going on, like police and fire. In Chicago you can actually track where the snowplows are in the city, and you can track where garbage trucks are in real time, from the city, which I just find really fascinating. There's information about where bicycle racks, public toilets, and landmarks are, and even where cameras are, where the city has all of its cameras posted. That one I thought was particularly interesting, because you can really go on here and make a map of what is an observable location throughout the city and what is not an observable location, which again could be useful information for somebody. Here's something that the
Seattle one's great at: they've got their visualization tools built right into this thing. This is a map showing police incidents over a period of time in part of Seattle. I pulled up this area, and you'll notice that most of it is in that same yellow-orange, except for this one big glowing red blob out there, over in Georgetown. I don't know if anybody's from Seattle here, but I'm wondering what the heck's going on over in Georgetown. You can look in a little bit closer, and right next to it is the Boeing propulsion engineering labs, which makes me feel really good. So, coming back to an area I
know a bit more about, back in Orlando, we pulled out traffic tickets. They don't publish information about who got the ticket or exactly what the ticket was for, but you can see when a traffic stop occurred. We pulled data that covered three roads in the area, right out by the University of Central Florida. These three roads all run east-west, and they're kind of the three major roads: one runs right into the university, one's a bit north, one's a bit south. They all have about the same amount of traffic on them, and they all have a very similar traffic pattern. What this chart is showing is that each one of the groupings is a week-long period of five weekdays, and then it's repeated over six weeks. One of the things that I found really interesting was the chance of a traffic ticket occurring on one of these roads: the order in which it was likely at different times of the day always followed the same sort of pattern, particularly between Highway 50 and University Boulevard. The Highway 50 traffic stops always preceded the University Boulevard traffic stops, and when you go out there and look at the traffic, the traffic pattern is not really any different. So if you start thinking about this and putting it together, why do you always see one before the other? I don't have hard evidence to back this up, but our belief is that you're seeing an influence of the patrol pattern of the police in the city. So you're actually able to get in there and, through the information that they're putting out, sort of start tracking them. There's a talk I went to earlier, yesterday I guess it was, a great talk by Brendan O'Connor about tracking people by seeing the information their devices are
spitting out on wireless networks. It's a similar concept: they're putting out a lot of information here, and if you look at it the right way, and you take the right pieces of data and put them together, you can pull out a lot more information about what they're doing and what's going on. So by this time I've
kind of changed what I was interested in doing, probably because I quit teaching and left the university, so I don't have students anymore and I'm not that interested in helping people find jobs. Now I find it interesting to look at these government entities, the police and other things that are going on. Also, because I've worked with law enforcement a lot, it's interesting to see how on one hand they're very protective of their data, but at the same time they're putting out a lot of information, and I'm not sure they quite realize how much they're putting out there. Frankly, I think it's actually kind of a good thing. I like being able to have more information and being able to look back at them. Like I say, why should the NSA have all the fun spying on people? So, what's next
with this? There's so much more I'd like to talk about, but with these 20-minute talks you have to be kind of fast. What I'm really interested in is expanding the model that we've been using to analyze this data. We tend to build things that are very purpose-driven: the first set of analysis we did was structured around seeking out the jobs, and then we got sidetracked by the crime and went off in that direction. I want to bring this back together and try to build a more robust model for analyzing this data, and throw some data mining at it. So far, a lot of what we've done has been what I'd call hypothesis-based, where I make a prediction about something I think I should see in there, some correlation, and then go looking for it to see whether it exists in the data or not. I'm sure there are a lot of relations in there that I wouldn't expect or wouldn't find otherwise. I want to throw a bit of data mining at it, that sort of blind approach, either AI or brute force, to finding relations throughout the data. So I think I'm about
out of time right now, and I'm getting a nod from the back, so I'll wrap it up there. If there are any questions, I'd be happy to take a couple until they cut me off. Thank you.
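The "blind" brute-force approach mentioned near the end of the talk, as opposed to testing one hypothesis at a time, could look something like the sketch below: correlate every pair of series and keep the strong ones. The helper names `pearson` and `strongest_pairs`, the 0.8 threshold, and all of the numbers are assumptions for illustration, not data or code from the talk.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def strongest_pairs(series, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of series with |r| >= threshold,
    strongest first. No hypothesis is needed up front; every pair is tried."""
    hits = []
    for (a, xs), (b, ys) in combinations(series.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            hits.append((a, b, r))
    return sorted(hits, key=lambda hit: abs(hit[2]), reverse=True)

# Invented daily counts for three listing categories in one city.
series = {
    "bikes_for_sale": [3, 5, 4, 8, 9, 7],
    "adult_services": [6, 10, 8, 16, 18, 14],
    "job_posts": [9, 2, 7, 3, 8, 4],
}

for a, b, r in strongest_pairs(series):
    print(f"{a} ~ {b}: r = {r:.2f}")
```

The trade-off with scanning every pair is that the more pairs are tested, the more strong-looking correlations turn up by chance, so anything a scan like this surfaces still needs checking against fresh data.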