AI VILLAGE - Identifying and correlating anomalies in internet-wide scan traffic to newsworthy security events

Video in TIB AV-Portal: AI VILLAGE - Identifying and correlating anomalies in internet-wide scan traffic to newsworthy security events

Formal Metadata

Title
AI VILLAGE - Identifying and correlating anomalies in internet-wide scan traffic to newsworthy security events
Title of Series
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2018
Language
English

Content Metadata

Subject Area
Abstract
DEF CON,DEFCON,DEF CON 26,DC26,computer security,security conference,hacker conference,information security,cyber security,def con 2018,hackers,hacker videos,security research,artificial intelligence,machine learning, AI research, AI security,adversarial machine learning,
Okay, I'm gonna go ahead and get started, even as people are coming in, and that's just how it's gonna go. So this is "Identifying and correlating anomalies in internet-wide scan traffic to newsworthy security events", also known as the longest title of a talk I've ever given in my entire life. My name is Andrew Morris. I work for a company called GreyNoise. All the slides are gray; now they're purple, so I work for a company called Purple Noise. Again, my name is
Andrew. I work at GreyNoise. I also am GreyNoise: there's no one else in the company, it's literally just me. But it's kind of an open secret, and I like referring to the company as "we" because it makes us sound more legit and professional. But make no mistake, it's just a guy in an apartment doing all of this. Before I started GreyNoise I was on the R&D team at Endgame, and prior to that I was always kind of private-sector red team, doing various different things. I've been staring at internet scan traffic for so, so, so very long. I'm not a data scientist. I'm not good at math, I'm not even good at stats, I'm not good at machine learning. I don't know anything about any of these things at all, but I had to learn a little bit about some very basic statistics to do the thing that I'm trying to do and write this big old bastard of a SQL query that I'm gonna show you guys. So today what I
want to do is talk about the process of going from a bunch of firewall logs in disparate systems to some kind of indicator of an anomaly that can be correlated with an actual thing that makes sense. So, from just a bunch of Apache and firewall logs in a lot of systems to: hey, at this time, in this place, there was a giant uptick in people scanning and probing for those things, and it probably had something to do with this vulnerability that is associated with that traffic, or this event, this thing that happened. Right? And so that's it. And the way that I did it is with this. That's it:
giant-ass SQL query. But, you guys, to be fair: has anyone in here already heard of GreyNoise before? Raise your hand if you've heard of GreyNoise. All right, so we've got, I don't know, maybe three percent of the people in the room who have already heard of GreyNoise. So we do this thing (and I'm always gonna say "we", it's just hardwired in my brain) where we write these tweets any time we see upticks in scan traffic that are explainable, that have value. And this is how we do that: we just run this query. I mean, it takes forever, but this is it. So I'm gonna break this whole thing down, and I'm gonna kind of get
there. Right, so what's internet background noise? It's basically the baseline omnidirectional scan traffic that's generated by all these people that are scanning the internet. The Shodans and the Mirais and the Censyses and the WannaCrys: everybody scanning the internet creates this thing called the internet background noise.

So what does scanning the internet actually mean? Really it just means querying all 4 billion-odd routable IP addresses, sending them a SYN packet or a UDP packet or something like that, to try to figure out a certain open port or protocol. Way back in the day, if you had one box and you wanted to figure out what was running on it, you'd port scan it, and that would basically be one IP, many ports, and you'd get information. Now, mass scanning is the inverse of that: it's one port, every IP, to figure out what's open on the internet.

So why would people scan the internet? There are many reasons to do it. Find exposed devices. Measure risk exposure: try to figure out how many Apache servers there are, or how many people have this port open, or how many places this version of software is running. Or "I just want to hack a bunch of devices", I don't know, whatever.

So who actually does it? Lots of good guys do it: Shodan, Project Sonar, BinaryEdge, Censys. I actually have many giant lists of all the labels that I've provided in here, and I'm gonna go through that. A lot of bad guys too: Mirai, WannaCry, Satori, Muhstik.

What's an anomaly? It just means anything more than we're used to seeing, an actual uptick; it's some amount more. And what is a newsworthy event? That's just a thing that happens that we're trying to talk about, that falls into something of interest. Right. This is like
one list of the labels that we have already made in GreyNoise right now. This is in GreyNoise: what we do is we're looking at all of this omnidirectional internet-wide scan traffic and we're trying to label all of this stuff. So these are some of the actors that do it. Here are even more of the actors that do
it. I took these screenshots right before I came up here on stage. So I guess before I go any further, I'm gonna go and explain what GreyNoise is
and how it works. So it is a system that collects and analyzes all internet-wide, opportunistic, omnidirectional scan and attack traffic. Why? Well, the reason is that every single one of those packets flying around the internet has its own unique and special story. And that's actually true: there's no such thing as random noise that happens for no reason. Every single thing is explainable. It's just that everyone thinks that it's just noise, but it's not. You can look at the trends of this internet-wide backscatter and scan traffic to figure out what's going on, to try to peel back why people are looking for these things.

And from a security operations center's perspective, if you're a network defender and you're actually looking at the firewall of your network, you need to be able to differentiate between the things that matter, that are hitting you specifically, and the things that are just hitting everybody on the entire internet. How many times have people been responding to an incident that ended up just being a Chinese SSH scanner or something like that? And it's like: this is not a big deal, this thing is hitting everybody, this isn't some sexy APT. So really what we're trying to do is provide rock-solid negative ground truth of what everyone should be expected to see. This is actually something that Alex Sierra over at Verizon said yesterday, or two days ago, and I love that. Negative ground truth: that's exactly what we're trying to do.

So the technical mission statement of GreyNoise is: label every single internet-wide scanner as either good or bad. First of all take the category of everyone, and then try to label as many of them as humanly possible: what are you doing, why are you doing it? This is kind of a State of the Union of where we are right now. Good: the makeup of good is about
a fraction of a percent; it's about one tenth of a percent of internet-wide scanners. Bad is about 10 to 20 percent. That's the label where we know they've done something violating something like the Computer Fraud and Abuse Act: they're logging into a system that doesn't belong to them, they're slinging an exploit, they're compromised and they're slinging an exploit on behalf of somebody else, blah blah blah. And then unknown, that's everybody else. That's the gray noise, or I guess the purple noise.

So how do we do it? We have a big network of nodes in a kajillion different networks all around the internet. They're constantly shifting around in AWS and Google Cloud and DigitalOcean and all these different providers. They have no business value whatsoever, and they just hang back and wait for people to talk to them. They're completely passive. It's just like a ton of people with their ears to the ground, listening for these little teeny-tiny minute signals, and then aggregating all of those together in one place and doing all this labeling and analytics to try to find value in all of that. So again, we want to go from all of the traffic that's hitting everyone to an actionable "hey, this thing happened, and it probably had to do with this; this is probably why".

So this is what the raw data looks like a lot of the time. These are just iptables logs, and I ripped these straight out from GreyNoise nodes, so at some point a long time ago those were GreyNoise nodes. It's bad OPSEC.

And so now I need to figure out: okay, why is what we're doing challenging? What are some of the things we have to overcome? First of all, when you're doing this kind of thing you need to have a very diverse optic, a very diverse set of data. You need to have data in a lot of different places, and you need to make sure that the anomalies you're measuring are justifiable across an equal number of the places that you are observing.
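The raw data mentioned above is plain iptables logging. As a rough illustration only, the following Python sketch pulls the fields a sensor would need (time, source IP, destination IP, protocol, destination port) out of one such line. The sample line, the `Observation` type, and the `parse_line` helper are assumptions made for this example; the actual logs on the slide are not reproduced in the transcript.

```python
import re
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    timestamp: str   # syslog-style timestamp, kept as text
    src_ip: str
    dst_ip: str
    protocol: str
    dst_port: int


# Hypothetical sample line in the usual iptables LOG layout (KEY=VALUE pairs).
SAMPLE = (
    "Aug  9 12:00:01 sensor kernel: IN=eth0 OUT= "
    "SRC=198.51.100.7 DST=203.0.113.10 LEN=40 TTL=241 "
    "PROTO=TCP SPT=51234 DPT=5555 WINDOW=1024 SYN"
)

_FIELD = re.compile(r"(\w+)=(\S*)")


def parse_line(line: str) -> Optional[Observation]:
    """Return the fields of interest, or None if the line lacks them."""
    fields = dict(_FIELD.findall(line))
    if not {"SRC", "DST", "PROTO", "DPT"} <= fields.keys():
        return None  # e.g. ICMP entries carry no destination port
    return Observation(
        timestamp=line[:15],  # classic syslog timestamps are 15 characters
        src_ip=fields["SRC"],
        dst_ip=fields["DST"],
        protocol=fields["PROTO"].lower(),
        dst_port=int(fields["DPT"]),
    )


print(parse_line(SAMPLE))
```

Keeping only these five fields is a deliberate simplification: everything else in the log line is discarded because counting unique scanners per port does not need it.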
Which is to say that you have to avoid collection biases. If you have one little optic in one place and that starts seeing all kinds of crazy stuff, that is not an internet-wide anomaly; that's probably just somebody scanning your machine or something like that. You really need to be able to correlate it across many different places that are different kinds of networks. So it's insufficient to just have cloud providers to do this. You really need to have residential IP space as well, and business IP space as well, because there is an idea of macro-targeting that mass scanners do. I'm not even gonna get into that right now.

Another reason is you have to make sure that you're getting an unbiased opinion with your data. You can't just install a honeypot on a network that you own, because that device has business value, and if something has business value then bad guys are gonna pay close attention to it; they're not just gonna accidentally see it. You need to have things that have no business value.

You need a lot of data, so you need to have a lot of volume, and managing that amount of volume can be difficult, especially if you're not used to dealing with relatively large amounts of data and putting them into databases and querying them and things like that.

And then money: who cares about this kind of stuff, right? For me, GreyNoise is all I do, so I need to be working. The kind of thing that I'm talking about right now is more of an R&D thing; it's less of a thing that is easy to package and productize and make money with, which is mostly why I'm just lobbing it out for free for everybody.

So my proposed solution to this is: collect all this stuff, put it into a database, average it over time, and then when we
see any more unique IPs, any more than two or three times the normal amount of what's expected over some period of time, then tell you, and then do a little bit of research and try to tie it to some kind of event.

So you're gonna parse out the data, and this is really what you need: the time, the source IP, the destination IP, the protocol, the port. That's it. And if you really want to do this on a budget, you can have a unique constraint on this. You don't need to record any of these data points more than once. Once you've seen somebody scan a box that you own on a certain port at a certain time, and then they do it again, you don't need to record that again.

Okay, and so then what we're gonna do is: given a 30-day rolling average of the unique IP addresses scanning the internet for a given port/protocol pair, show me if there is an increase in unique IP address count that is higher than three or four or whatever times the rest of the month's day-to-day increase. Which is to say (and this is where things are gonna get a little tough for me, because I am bad at math and I am still grasping this stuff myself): what is the average number of unique IP addresses on the internet that are scanning for a given port/protocol over the past thirty days, what's the daily average? And also, what is the difference from yesterday to today: is it a 0.9x multiplier, is it a 1.1x multiplier of that rolling 30-day average? And then show me everything that is above four times the regular 30-day average. I can see some people right now that are like "this is so simple", and it is. I know this is easy stuff, but it's really, really effective when you have really good, clean data.

And then there are all kinds of statistical trade-offs. In this case, if you decrease the window from 30 days to like a
week, then you're gonna be able to get a measure of the anomaly faster, but it's gonna be less accurate. It's gonna be more chaotic and sporadic and volatile. If you increase the number of IPs that you need to see, then you're gonna miss some of the smaller anomalies. So how did I actually do it? I did it in SQL, which is the literal worst possible way that you can do this, because I hate myself. And it
looks like this. So I'm gonna basically break down this whole thing. I'm not a DBA; I'm not a lot of things today, but I'm really not a DBA.

So first of all, we're gonna take a window of the last 30 days, with the port, the protocol and the day, and the number of times we've seen a distinct IP hit a node that belongs to us in the last 30 days. So: where date is greater than current date minus interval 30 days. That's easy to understand. And then we're just gonna mash it all together. But we have a HAVING clause, which is basically just to avoid another gross subquery; we're filtering the groups after the GROUP BY. And so we're gonna say: I only want that count to be affected when we've seen a unique IP address that has touched more than one distinct node. So don't show me anything that's only touched one of our devices; it has to be two or more.

And then, now that we've put that into its own little list, show me the average. I don't understand window functions very well; this is very cobbled together from Stack Overflow. But basically, at the end I want to see all of the different days, the port/protocols, the unique amount of IPs, and what the 30-day rolling average was for that given thing on that day. And then, out of all of those things, at the very end (select date, protocol, port, IPs, round x mean): show me everything that is above a 4x multiplier on the 30-day rolling average in the last five days. I never want to do that again. How can we make that better? By
doing literally anything else, probably. I mean, this takes forever to run. It's inefficient, it's in SQL, it's gross, it's slow, it's not real-time, it's limited to dates, not times; we don't have any timestamps. It factors in very little information: an anomaly here is just dictated by how many people are looking at something, but there are way more ways to calculate an anomaly, like maybe an ASN that's scanning the internet for a certain given thing, or a given organization, or boxes that look like a certain thing, etc. And I'm sure any actual decent library you could just cram this into, and it would do everything that I just spent the last six months working on, immediately.

So then the correlation piece of it is: okay, why did you see that? What is the most likely cause to explain that increase? Most often it's that a big botnet has just operationalized a new vulnerability and they want to capture as many devices as humanly possible on the internet that are exposed to that vulnerability. Sometimes it's because a new CVE comes out, like Heartbleed or Shellshock, and everyone and their grandma is like "oh, let me scan the internet to figure out how many devices are vulnerable to this thing". Which is funny, because whenever this happens, it's always security researchers that are that big first wave. You can see them all using the standard security researcher tools, and then the bad guys are way slower. They come in months later, because it's all these bottom-of-the-barrel bad guys; it's a numbers game for them, they're just trying to compromise as many devices as possible. It's always like six months later that they've gotten around to figuring out how many boxes are exposed, but there's not a big jump. It's
always the people in this room that are doing it, basically. Or a worm breaks out, right? Which we've seen with, like, [unclear]. I mean, way back in the day, obviously the great example would be Conficker, or MS08-067. If we had the same optic at the time, then we would have seen, on one day, many people scanning for port 445, and on the next day, that times a quintillion people scanning for port 445. And we see other worm stuff now, which I'm not really gonna get too terribly into right now.

So, some easy hacks if you want to do some of the correlation piece: just google the port number. Search GitHub for the port number to try to associate it with something that you're seeing. Look on Exploit-DB for new exploits for a given vulnerability. Use Metasploit to figure out the default port number for different things; that's an effective tactic. Look at CVEs: sometimes the actual CVE page will contain the port that the vulnerability is on. Most of the time it doesn't, actually almost none of the time it does, but one of the resources in those things will include that information. Check out RouterSploit, which is like Metasploit only for routers. So here's some of the things
that we've found, some of the observations over the last seven months. These are just screenshots of tweets, but I'm gonna go through each one of these, a couple of the success cases.

Top left was port 52869. We weren't the first ones to catch this; 360 Netlab was, and I have a shout-out to them, they do great work. It was a Universal Plug and Play service that Satori weaponized. So we tagged all of the new people that were doing that all of a sudden, and figured out a signature to be able to have that tagged as bad.

Top right: GreyNoise observed port 5555. This one's wack. It is the ADB, Android Debug Bridge, vulnerability, which is like ADB in your Android device. It's not technically a vuln; it's like a design thing where you can do arbitrary code execution if you have access to that daemon. Back in February we detected a big uptick of people looking for it, but there was no real smoking gun at the time, and then about a month ago we saw people start to actually exploit it on the internet.

Bottom left: oh yeah, Belkin N750. That was just some crappy router that had some vulnerability. We started seeing this giant uptick, we dug into it, and we found that. And then, yeah, this D-Link 2750B on port 8000: we saw a 1000% increase from one day to the next, and it was like, something's going on here. This is the
ADB thing. This is what the actual numbers kind of looked like, of how many people on average were scanning for some of these. The reason that I'm putting a news article in here is not because I'm trying to show how cool we are; I'm trying to show you that this works. It's effective. We wrote this, we saw this, and it matched perfectly with what the rest of the world was seeing, with what some other people were reporting on. The
same thing happened with this one. This one was gnarly: this was Chimay Red, this MikroTik vulnerability that I think might have been a Vault 7 thing. So this was a nasty one, and the uptick in this was gigantic. No one was really looking for it. The other fun thing with GreyNoise is, once we have these upticks, we can look back in time and figure out who's scanned for this thing, like, ever; who scanned for it last year. So, yeah.
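The "look back in time" idea can be pictured with a small Python sketch: given stored observations, find the earliest day each source IP was seen probing a particular port. The tuple layout, the `first_seen` helper, and the sample history are hypothetical and only illustrate the kind of retrospective lookup described; the actual storage and query are not shown in the talk.

```python
from datetime import date


def first_seen(observations, protocol, port):
    """Earliest day each source IP probed the given protocol/port.

    observations: iterable of (day, src_ip, protocol, port) tuples.
    Returns (src_ip, day) pairs, oldest first.
    """
    earliest = {}
    for day, src_ip, proto, dport in observations:
        if (proto, dport) != (protocol, port):
            continue
        if src_ip not in earliest or day < earliest[src_ip]:
            earliest[src_ip] = day
    return sorted(earliest.items(), key=lambda item: item[1])


# Made-up history using documentation-range addresses.
history = [
    (date(2018, 2, 3), "198.51.100.7", "tcp", 5555),
    (date(2017, 11, 20), "203.0.113.9", "tcp", 5555),
    (date(2018, 7, 1), "198.51.100.7", "tcp", 5555),
    (date(2018, 7, 1), "192.0.2.44", "tcp", 23),
]

for ip, day in first_seen(history, "tcp", 5555):
    print(day.isoformat(), ip)
```

Because the sensors only record who talked to them and when, this kind of question can be asked long after the traffic arrived, which is what makes the retrospective view possible at all.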
Then this is just kind of a summary of some of the ones that we've been able to find. These were the successful predictions that we've been able to find using this methodology. I think I just went over all of these. Drupalgeddon, that was gnarly. We do the same thing with URIs that people request, so we were able to do the same exact thing to see when Drupalgeddon was beginning to be opportunistically exploited. Oracle WebLogic is like the gift that keeps on giving: it's permanently screwed up and it's just always being exploited. And I think I've talked about all the rest of them. Failed
predictions: there was one maybe four months ago where we saw this uptick in port 5647 TCP, and the people that were scanning for that port also had the port open on their device, which is like a gold-mine indicator. And so we were like, yeah, we are clearly the smartest people to ever exist. Well, I was like, I'm clearly the smartest person to ever exist.
I was so wrong. So I worked with the Red Hat people, since it was the default port for Red Hat Satellite Server. I was like, "yeah man, this is so cool, I'm saving the world". And then I ended up working with their team, and they were like, "yeah, that's not us at all. Our devices are never exposed to the internet, they're deep inside of networks, and the IPs that you published are not running Red Hat; a lot of them are running Windows." I was like, "ah, okay". So then I had to kind of
eat crow on that one, blah blah blah. But they were awesome to work with. They were really, really great, super cool about it. They knew that we were trying to do the right thing, but they were like, "yeah, this does not affect us". I was like, okay. So I
just kind of want to give a couple of tips for people if you're gonna be doing the same thing, or want to do the same thing. Filter out the known-good actors when you're doing stuff like this. So Bob Rudis, or hrbrmstr on Twitter, over at Rapid7: he was the first person that I know of to bring in known-good labels when he's calculating upticks. What that is to say is, if Shodan's out there scanning the internet for a given thing, and then they add a different thing from a thousand different places, he doesn't want that to
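The detection method described in this talk (daily unique scanner counts per port/protocol, a 30-day rolling average, a 4x multiplier, and the tip about excluding known-good scanners) can be sketched in a few lines of Python instead of SQL. This is not the speaker's query, which is not reproduced in the transcript. The `find_upticks` name, the tuple layout, and the `known_good` set are hypothetical, and two choices are this sketch's assumptions: an IP counts for a day only if it touched at least two sensor nodes (one reading of the HAVING clause), and the baseline averages the preceding days that have data, excluding the day being tested.

```python
from collections import defaultdict
from datetime import date, timedelta


def find_upticks(observations, known_good=frozenset(), window_days=30,
                 multiplier=4.0, min_nodes=2):
    """Flag (day, protocol, port) where unique scanner IPs exceed
    `multiplier` times the rolling average of the preceding window.

    observations: iterable of (day, src_ip, node_id, protocol, port).
    """
    # Which sensor nodes did each source IP touch, per day and port/protocol?
    nodes_hit = defaultdict(set)
    for day, src_ip, node_id, protocol, port in observations:
        if src_ip in known_good:   # drop scanners already labeled as good
            continue
        nodes_hit[(protocol, port, day, src_ip)].add(node_id)

    # Daily count of distinct IPs that were seen by at least `min_nodes` nodes.
    daily = defaultdict(dict)      # (protocol, port) -> {day: unique_ips}
    for (protocol, port, day, _ip), nodes in nodes_hit.items():
        if len(nodes) >= min_nodes:
            counts = daily[(protocol, port)]
            counts[day] = counts.get(day, 0) + 1

    upticks = []
    for (protocol, port), counts in daily.items():
        for day, ips in counts.items():
            start = day - timedelta(days=window_days)
            prior = [n for d, n in counts.items() if start <= d < day]
            if not prior:
                continue           # no baseline yet for this port/protocol
            ratio = ips / (sum(prior) / len(prior))
            if ratio > multiplier:
                upticks.append((day, protocol, port, ips, round(ratio, 1)))
    return sorted(upticks)


# Synthetic data: ten quiet days (2 scanners/day), then 20 scanners in one day.
base = date(2018, 7, 1)
obs = [(base + timedelta(days=i), ip, node, "tcp", 5555)
       for i in range(10)
       for ip in ("198.51.100.1", "198.51.100.2")
       for node in ("n1", "n2")]
spike = base + timedelta(days=10)
obs += [(spike, f"203.0.113.{k}", node, "tcp", 5555)
        for k in range(20) for node in ("n1", "n2")]

print(find_upticks(obs))
```

Shrinking `window_days` makes the baseline react faster but more noisily, and raising `min_nodes` or `multiplier` suppresses smaller anomalies, which mirrors the trade-offs mentioned in the talk.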