AI VILLAGE - Generating Labeled Data From Adversary Simulations with MITRE ATT&CK

Formal Metadata

Title
AI VILLAGE - Generating Labeled Data From Adversary Simulations with MITRE ATT&CK
Title of Series
Number of Parts
322
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Attackers have a seemingly endless arsenal of tools and techniques at their disposal, while defenders must continuously strive to improve detection capabilities across the full spectrum of possible vectors. The MITRE ATT&CK Framework provides a useful collection of attacker tactics and techniques that enables a threat-focused approach to detection. This technical talk will highlight key lessons learned from an internal adversary simulation at a Fortune 100 company that evolved into a series of data science experiments designed to improve threat detection.
Transcript: English (auto-generated)
For our first speaker today, we have Brian on generating labeled data from adversary simulations with MITRE ATT&CK. Please give it up for him. Thank you very much for coming so early on Sunday morning. That's awesome. I appreciate you guys coming and listening to this for a little
bit. I wanted to talk to you about a couple of things and just give you a little bit of background on how I see this problem set. So the general premise here is that whatever I'm looking at, whether it's Bro logs or whatever the problem is that I'm trying to solve,
I try to recognize the biases that I have. Right. So I looked at this last week, I looked at this last month, that kind of idea. So if I can abstract away some of that bias and have a repeatable methodology, something that's based on math, maybe I can find some insights. And the interesting thing about what I'm talking about today
is for me, it's not theoretical at all. We have an internal red team that's really proficient. Is anyone here from the red team? We have a red team that sometimes will perform some activities
based on MITRE ATT&CK, and whether that's DNS exfiltration, like we're talking about today, or some other technique, they'll hit a canary URL first. So think about white box and overt, out in the open, versus black box. So if you've heard of threat hunting, the hypothesis,
you know, I think that there might be DNS exfiltration and therefore you come up with a plan and look for the artifacts. We'll get into that in a minute. But for me, it's not theoretical at all. I know based on what I saw here, that those guys, my friends that I drink beer and bourbon with, they ripped us off, they broke in and they
stole some stuff on May 18th, 2018. And that was the white box overt time where they hit the canary URL. And I know from patterns, that means that they probably broke in again in a covert black box way. So when we talk about assumed breach, completely not theoretical,
whether you believe in that philosophy or not, which I do, you know, I know that these guys that bought me a beer the other night, they're probably sitting on some data that they exfilled. So that's kind of the background that we'll get into here with the threat hunting and how this ties in. But here's the quick agenda.
Do a very quick intro. Believe it or not, the MITRE ATT&CK, it's gonna be real quick. Love MITRE ATT&CK, absolutely do. But I think a lot of CFPs, a lot of cons, a lot of stuff's getting saturated, right? So if anybody wants more information than I'm going to provide in the slides here, please come up. I'll talk about it as long as you want to afterwards,
but I'm going to trim that down just a little bit because I think everyone's probably heard a lot about it recently. So harvesting labeled training data, I'll get into what I mean by that. EDA, exploratory data analysis, a machine learning worked example, and I'll talk about just very candidly,
some challenges that I've run into and a little bit about future work. So before I do that, I just want to get a quick sense of what the background is in the room. So if we could just start here and we'll go around and tell everybody what you do and name. Okay. Well, how about, could I just say a show of hands? How many people do something like threat hunting?
How many people do any kind of data analytics outside of Excel? Awesome. And how many people have some sort of a program where you're doing adversary simulation where you've got actual purple team,
both folks internal. Okay, cool. Thank you very much. That was a terrible idea. I don't know what I was thinking. All right. So real quickly about me, my name is Brian. I'm a threat hunting lead at a Fortune 100 financial services company. I also help out with threat intelligence and now security orchestration, automation, and response, or SOAR.
But the bottom line for me is that there's one prime directive: find evil. You know, Rob Lee from SANS talks about know normal, find evil. I mean, I think about this all day, sometimes all night. I hear about that a little bit from my wife sometimes,
but she's been very patient with that. It's almost an obsession. So the MITRE ATT&CK framework, we're probably mostly familiar with it, but just to level set, it's Adversarial Tactics, Techniques, and Common Knowledge, and it's a curated knowledge base and model for cyber adversary behavior
reflecting the various phases of an adversary's life cycle and the platforms they're known to target. So my buddy Zach, the lead red team guy at our place in Milwaukee, and I did a talk at DerbyCon in 2016. It's a small world.
Anybody see that talk in 2016? So we were just talking about very open kimono, very transparent. Here's what we were trying to do with limited resources and budget and everything else. Cause there's a lot of techniques. Here's what we tried. Here's why we're doing it. And here were the results. So I didn't put the ATT&CK timeline on here,
but I know that we have PRE-ATT&CK now, but at that point, we were just primarily focusing on the later stages of ATT&CK. So that's the context in which I'm talking about some of these techniques. So in particular, we're talking about exfiltration over alternative protocol.
You know, I didn't blow this up because I wanted to fit everything on there. So you don't need to see what's on there. I will make sure that I have my Twitter handle, which is just at Brian Ginns. If anybody wants to watch that, I'll have all these slides up by Tuesday, midnight Central Daylight Time.
So exfiltration over alternative protocol: here, I'm focused on DNS. So can anybody think of any tools that you might use for DNS exfiltration? Go ahead and shout it out. Somebody said iodine. Yep. Anybody else? Cobalt Strike fans, or write your own. So there's a lot of different ways you can do this, right? Interestingly,
I don't want to create an overfitted model. Did anybody ever see that thing? There was some kind of picture I saw on Twitter from the Bay Area and it was talking about overfitting a model. I can't verify whether this is true or not, but maybe somebody in the back can tell me if you heard about this. Essentially, a lot of the models were trained on roads,
the Teslas with the self-driving car mode, in the Bay Area. And then when they were driving on roads outside of that area that didn't fit the initial kind of things the model was used to, there were five salt lines that were laid down by a salt truck, and that was just messing with it. So if anybody heard about that, cool. If,
if not, it's one example in my mind of how, if I try to detect my buddies that are breaking in and stealing data, because if I can't detect them, I can't detect somebody else doing it, right? If I focus only on what it looks like with Cobalt Strike,
that's too narrow, right? You use iodine, maybe I'll catch it, maybe I won't. So the point is that this is one of many techniques, but it's one that we focused on because we had the instrumentation and the telemetry to dig into. So really for me, all MITRE ATT&CK is, it's a true north.
It's a true north where I and we can sit in front of our CISO and executive leadership and say, certainly there are compliance requirements. There are other things that we have to do and boxes that need to be checked, but let's focus on what the attackers are doing. They don't have to do something on the menu. On the menu,
off the menu, it's up to them, but let's start with known TTPs and tighten up our monitoring and our detection engineering efforts. So I'm not going to spend a lot of time on this. If you follow this stuff, you're probably familiar with Katie Nickels,
and there's a couple of other folks that did that, but MITRE's got CALDERA, Red Canary's got Atomic Red Team, Uber's got Metta, and Endgame's got Red Team Automation. There are varying degrees of what they're automating, but these are open source, essentially helping teams figure out how to do repeatable processes for adversary
simulation. And then Cyb3rWard0g, Roberto Rodriguez, who I think is at SpecterOps now, and Devon Kerr at Endgame, right? They had an article that will be linked if you want to go to it when I send the slides out.
Basically they were using the API and hooking it in, basically letting you dig into that. Hang on just a minute and get some more vodka. Just kidding. And then that talk that we did is online if you want to look at that. So there I was minding my own business, red team sitting here,
I'm sitting on the right and red team is lighting things up, probably Cobalt Strike at the time. And honestly I can see them: they hit enter and they're waiting for something to call back, and then it starts popping up. I think that they're counting the milliseconds to boom. Okay.
I got a callback here, and then they're kind of looking at me like, are your systems lighting up yet? Why aren't you guys hunting for this? Why aren't you guys looking at a ticket from Splunk or whatever SIEM you use? I looked back at it and there were 300 rows that were specifically related to what my buddies Matt and Zach had done.
Has anybody heard of Locard's exchange principle? Every contact leaves a trace. The burglar that breaks in might break a window, might leave some skin, might leave some hair, some kind of sample that law enforcement could use to trace back in a DNA
sample. You know, footprints outside if it's muddy. This is what we're trying to do with MITRE ATT&CK, to identify, you know, as Chris Sanders, who has the Investigation Theory course, we brought him in in December to do some training for our folks. And he talks about a triangle, a pyramid he's got of four different kinds of evidence sources, and he
breaks it down like network, host, memory, and OSINT. And I want to know what digital artifacts are left when my buddies or somebody else breaks in and steals stuff. And our architecture is not like yours, and
your architecture probably isn't exactly like it was six months ago or a year ago. So I think there's a lot of value in seeing what this looks like. And why, why is everything moving on the screen? Because it's early. I know, I know how this goes. Um,
it was slightly amusing to do that cause it's parallax and that's, that's better than PowerPoint animation. So I think that counts. I don't think that's against the rules, but something that moves, you know, so how many people have heard of EDA exploratory data analysis? Well, let's start with, um, and again,
don't worry about trying to read the small text there. What I wanted to show is the entirety of this rectangle, which is from Corelight's Bro cheat sheet. And this is all the DNS log fields and the type and the description of each.
So in a minute, when we're trying to figure out, how do we represent the knowledge of what we're seeing on the network? How do we represent that and convert it into a feature or think of a column in a spreadsheet, right? How do we convert that into a feature that we can kind of hook into and just in the same way that, you know,
maybe you're training a child in some experiment, I don't know, to classify a fork versus a spoon. If none of that's labeled, if you don't know what the ground truth is, you're dependent on somebody coming and doing that. Otherwise, all you can do is kind of cluster them based on similarities, right?
But the first thing we have to do before we make those decisions is understand what AA is, you know, what does that mean in your environment? Um, you know, protocol, proto, and then there's some other stuff we'll get into in a minute. But when I say EDA, I'm talking about starting with that. Uh, Jupyter notebook, used to be IPython notebook.
Um, this is actually from Clarence's book that I've got here. Shameless plug. But I say that because, uh, this is from chapter two of one of my new favorite books. Anybody else buy Data-Driven Security in 2014, like the day it came out?
Yeah. Um, uh, this, this is something I've been digging into and it's very helpful and it just reminds me that, you know, there's always something to learn and I always find it extremely valuable to get somebody else's perspective here. So this is actually from the O'Reilly GitHub and it's just an example of,
you know, we're bringing in some imports, loading the data, and just kind of standard pieces there. A pandas DataFrame, just to oversimplify if you haven't heard of it before: I like to think of it visually as an Excel spreadsheet. I'm probably going to have pitchforks and torches after that comment,
but it's tabular. So you can think of it that way for now, but it does much, much more. Has anybody heard of Kitware's Bro Analysis Tools, BAT? So there's a guy named Brian Wylie, he's one of the developers there.
And I just can't say enough good things about these folks too, because I wasn't at BroCon last year, but I saw the video that he'd done. And again, there'll be a link for this one at the bottom. And I contacted Brian because I was stuck on something in his open source code. I don't like to do that.
I don't like to ask people to Google stuff for me to figure something out. I just, you know, a few weeks ago I was playing around with something. And I said, you know, I'll figure this out. I'm just not sure how long it's going to take. And I just sent him my question and without getting into the details, it was essentially around why can't I join two of these data frames together? And it was because of something on the backend,
the way they were doing the pre-processing with Bro Analysis Tools, which they describe as a software bridge. So you can get from Bro to pandas and then from pandas to scikit-learn, which we'll talk about in a minute. But he appreciated the feedback from somebody kind of in the field, in the trenches, saying, this is what I'm trying to do.
And he explained to me what the workaround was, because it was a different data type. So just another great example of people, you know, pitching in in the open source community. And I mean, I sent the dude an email at like nine or 10 at night and he responded that night by midnight.
So it's just really encouraging when you're kind of working through something and somebody else helps you out a little bit. So, feature engineering. Again, we're trying to figure out what are the things that we can use to categorize something? You know, you could talk about height, diameter, top.
So in the same way, we want to find ways to represent the knowledge to describe what's going on on the network with DNS. And hopefully that's going to allow us to figure out what features we can hook into and then train a model so we can catch my buddies the next time they break in. Okay. Griffon, the data science
virtual environment. This is from Charles Givre. He does a class here with Austin Taylor and sometimes with Jay Jacobs from Data-Driven Security. Awesome folks. And yeah, I don't know if this is accurate, but I've always thought of it as the Kali Linux for data science.
So I use this, it's, it's pretty decent. And again, there'll be a link there. So I said we're going to do a machine learning worked example. I pulled,
I pulled some stuff out of this after listening to some of the other talks, just because I wanted to make sure that I don't cram some kind of crazy algorithm in and, and try to show everything that I'm doing. Cause again, I mentioned I'm doing a little bit of orchestration automation. So I want to kind of have that cycle going where I'm getting an internal IP
address, you know, going forward and then enriching that with friendly intelligence or okay, here's the IP address, which host grabbed that? Let's assume 24 hour lease from it. Okay. Which host, which internal host name has it? Who's the last logged in user with that, then go collect some other stuff. The more you can find out and the faster you can find out, feed that back in.
There might be a feature or a column that you can compute or some other insight. I'm going to say reputation score. I know it's a terrible example, but some other verifiable piece of information that you can create another column about and a very fine print in the bottom.
I'm being very explicit about giving credit to Charles Givre because I literally lifted this from his slide. I just made it a different color. So thank you, Charles. There are different descriptions of how this works. I like this one from his training class because it's consumable for me.
You get and you clean the data, you preprocess, do the feature engineering. Now some of this stuff, this is naive of me. When I thought Bro Logs, I'm like, ah, Bro Logs are pretty structured,
right? I'm not going to have a lot of this. No, because there's a lot of getting it from where it is to where it needs to go. A lot of the data engineering and the pipeline, that kind of stuff. And believe it or not, id.orig_h, that's the one I'm going to think of as the source IP that initiated that DNS
request. Just that fact: can anybody think of a problem when you start doing stuff in Python and the field or column name has a dot something in it? You're going to throw an error, right? And it's just simple things like that. You'll see that in a few minutes
here, where we have a column that's renamed. Not a big deal, but I wouldn't have thought that I'd run into that. It's just a different use than maybe was originally thought of for it. But then with the preprocessing and feature engineering, Bro Analysis Tools, again, Kitware describes that as an open source software bridge that's going to kind of
do some of the behind the scenes heavy lifting. So you can just use it kind of as a gray box and move forward with what you're trying to analyze. And then advanced feature selection. Then we have the data that we're going to split into train and test.
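The train/test split mentioned here can be sketched in a few lines. This is a minimal illustration, not the speaker's notebook: it uses plain pandas as a stand-in for scikit-learn's splitting helper, and the toy rows and the column names query_length and is_malicious are assumptions chosen for illustration. Sampling within each class keeps the rare labeled rows represented in both splits.

```python
import pandas as pd

# Hypothetical toy data: 20 benign rows and 4 rows labeled malicious.
df = pd.DataFrame({
    "query_length": list(range(20)) + [60, 62, 64, 66],
    "is_malicious": [0] * 20 + [1] * 4,
})

# Hold out 25% of each class so the rare class appears in both splits.
test = df.groupby("is_malicious").sample(frac=0.25, random_state=0)
train = df.drop(test.index)

print(len(train), len(test))
```

Splitting per class matters most when one label is very rare, since a purely random split could leave the test set with no malicious rows at all.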
And then we'll build the model, we'll evaluate the model. And I tried a couple of different things. So I'll show you a couple of differences. But the main thing that popped into my head when we started talking about this is if I have labeled data, if I can use labeled data when we train that model,
now I can move from unsupervised, which is clustering, to supervised. Now I know, from the 300 records that are from the DNS exfiltration from Cobalt Strike or whatever it is, what they did and when they did it on May 18th.
So I have another column where it's one if it's known malicious and zero if it's not. Does that solve everything? No, there are some issues there, because what do the attackers and red team want to do? They want to be stealthy. And the better your tradecraft is, the more stealthy you are,
and the more stealthy and the quieter you are, the fewer artifacts I have, which leads to something we call class imbalance. You can correct for that. You can adjust for that. But I kind of wonder sometimes, do I want to make that seem like it's a bigger part of the log
data than it really is? So I'm importing pandas, and then we just have as pd. I'm importing NumPy to get into the matrix. And then from bat, Bro Analysis Tools,
which he's going to rename at some point, so keep in mind that this will be called something else because Bro itself has changed the name of their offering, import log_to_dataframe. And then a lot of times you'll say df equals;
I just put dns_df equals, and I'm calling log_to_dataframe.LogToDataFrame with the path to the log. This is one hour's worth of logs. And then next I say dns_df.rename, columns.
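To make the load-and-rename step concrete without depending on a particular BAT release, here is a hedged sketch that reads Bro's tab-separated log layout with plain pandas instead of Bro Analysis Tools. The sample log text, the read_bro_log helper, and the trimmed field list are all assumptions for illustration, not the speaker's code.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in Bro's tab-separated dns.log layout.
SAMPLE_LOG = "\n".join([
    "#separator \\x09",
    "#path\tdns",
    "#fields\tts\tid.orig_h\tid.resp_h\tquery",
    "#types\ttime\taddr\taddr\tstring",
    "1526630400.1\t10.1.2.3\t10.0.0.53\twww.example.com",
    "1526630401.2\t10.1.2.4\t10.0.0.53\tmail.example.org",
    "1526630402.3\t10.1.2.3\t10.0.0.53\tcdn.example.net",
    "#close\t2018-05-18-09-00-00",
])

def read_bro_log(handle):
    """Read a Bro TSV log, taking column names from its #fields header."""
    lines = handle.read().splitlines()
    fields = next(l for l in lines if l.startswith("#fields")).split("\t")[1:]
    rows = [l.split("\t") for l in lines if l and not l.startswith("#")]
    return pd.DataFrame(rows, columns=fields)

dns_df = read_bro_log(io.StringIO(SAMPLE_LOG))

# Dots in column names clash with attribute-style access (dns_df.id.orig_h)
# and with DataFrame.query(), so swap them for underscores.
dns_df = dns_df.rename(columns=lambda name: name.replace(".", "_"))
print(list(dns_df.columns))
```

Renaming every dotted column in one pass is a design choice; renaming only id.orig_h with an explicit mapping would work equally well.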
So you see what I was talking about with id.orig_h. If you have something dot id dot orig, you're going to throw an error. And with anything I say, there are two or three ways you could probably get around that. This is the quickest for me,
filtered_dns_df. All I'm saying there is the data frame we called dns_df; the data frame is after the equals sign. So we're saying dns_df, we're referencing that pandas data frame. And then in the square brackets, we're saying id_orig_h.
That's the host that initiated that DNS request, and then string contains. And I masked this, but there is a large subnet that wasn't relevant to this. And I won't get into that for OPSEC reasons, of course, but the point is you might segment that,
you might go through and convert the IPs to integers, you can do ranges, you can do a lot of different stuff with that. And I think that's actually covered in the Data-Driven Security book, among other places. And then I did type of filtered_dns_df just to make sure, you know, after I talked to Brian from Kitware,
so I'm kind of leaving myself some breadcrumbs going through. I pulled all the comments out of this just to have less on the screen. So this is just a nuance, but filtered_dns_df.is_copy equals False. pandas is trying to be helpful if you don't do that and says, hey,
you keep slicing these things off and you're trying to do things on a copy. So I had to Google that, and it turns out if you just set it equal to False, it stops throwing those errors. It sounds pretty scientific, right? It worked. Next, filtered_dns_df, query length.
So here what I'm doing, this is the current version of the data frame or the tabular data structure that we're dealing with. And then I say query length in quotes equals, and then I'm saying add a new column and what I want in that column for each
row is the data frame and then give me the length of what's in the query. So again, we start talking about malicious URLs. When you start talking about message length and that kind of stuff, this might seem like one of the go-to things.
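The filter and the new length column can be sketched as follows. The rows are made up, the 10.99. prefix is a hypothetical stand-in for the masked subnet, and the sketch assumes that subnet is being excluded. It also swaps in two alternatives to what the talk describes: str.startswith for a prefix match rather than str.contains, and an explicit .copy() rather than the is_copy flag.

```python
import pandas as pd

# Hypothetical rows; the 10.99. prefix stands in for the masked subnet.
dns_df = pd.DataFrame({
    "id_orig_h": ["10.1.2.3", "10.99.0.7", "10.1.2.4"],
    "query": [
        "www.example.com",
        "time.example.net",
        "a" * 40 + ".canary.example",
    ],
})

# Keep rows whose source host is NOT in the excluded subnet. The .copy()
# makes an independent frame, so adding columns later does not trigger
# pandas' SettingWithCopyWarning.
in_excluded = dns_df["id_orig_h"].str.startswith("10.99.")
filtered_dns_df = dns_df[~in_excluded].copy()

# New feature: number of characters in each DNS query name.
filtered_dns_df["query_length"] = filtered_dns_df["query"].str.len()
print(filtered_dns_df[["query", "query_length"]])
```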
However, I took a bunch of the other stuff off, because originally, you know, when I'm trying to do this in a production environment, I want to know for that IP address for that time period, or maybe expand it to longer, like 24 hours: what does that IP address look like in terms of conn.log?
A lot of you I'm sure are familiar with conn.log, but if you're not, I think of conn.log in Bro as the closest I'm going to get to a hundred percent NetFlow, right? So basically just the phone record instead of the phone conversation. So I'm trying to take an entity-based view of this,
a user 360 and a device 360, and essentially understand what behaviors are being exhibited by that host during that timeframe. So, just a real quick aside, how many people have heard of Black Hills Information Security's RITA?
Is anybody using RITA? I think I've got a link in there. I'll make sure there is before I send it out. But I've been using that for a while. Basically you pipe the Bro logs into it. You import a directory full of Bro logs for one 24-hour period, I should be more precise.
And from there it creates a MongoDB collection, and then you run analyze and it's going to tell you about beaconing. John Strand and a couple of guys did a talk at DerbyCon and a couple of other places. They're using some kind of crazy math behind the scenes, like fast Fourier transform, and looking at the signals. And what you get,
I just use the command line, but they've got an AI-Hunter product, what you get is basically a table or a CSV, and when I cat it out on the command line, what I see is a score on the left. Yeah, we're 99% sure this thing's beaconing. Well, there's other stuff that looks like beaconing, right?
So I don't ever want to have one view into something. I want to have a more holistic approach and enrich these things by either computing new features, you know, add a column, perform an operation on a different column. And now I know something else about that entity in that record.
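One way to picture this kind of entity-based enrichment is a per-host aggregate broadcast back onto every row. The sketch below is an illustration only: the rows and the feature names host_query_count and host_unique_queries are hypothetical, and the talk does not say which per-host features were actually computed.

```python
import pandas as pd

# Hypothetical DNS rows for two internal hosts.
dns_df = pd.DataFrame({
    "id_orig_h": ["10.1.2.3", "10.1.2.3", "10.1.2.3", "10.1.2.4"],
    "query": [
        "a.example.com",
        "b.example.com",
        "a.example.com",
        "www.example.org",
    ],
})

# Group by the originating host, then broadcast each host-level value
# back to that host's rows so it becomes one more column (feature).
by_host = dns_df.groupby("id_orig_h")["query"]
dns_df["host_query_count"] = by_host.transform("count")
dns_df["host_unique_queries"] = by_host.transform("nunique")
print(dns_df)
```

Because transform returns a result aligned to the original rows, each record now carries context about the entity that produced it.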
So the next one, In [22]: you have to put %matplotlib inline so that you can have the plot actually display in a minute here. And the rest of In [22] is just the formatting for how I want that plot to look.
There's a lot of cleaner and, you know, more sophisticated-looking things. I just wanted to have the basics out there. And then in In [23], we're just saying import math, and we're going to look at entropy. So with entropy, you might have something along the lines of base64 or base32 encoding, or you might have some encryption.
So either way, I find that to be pretty helpful. So filtered_dns_df again: we're creating a new column, entropy, and we're running a lambda. So in the data frame, you don't have to do for loops, and shouldn't do for loops.
You want to, you know, map or apply or use a lambda function, and you're hitting it on that series, which is that column, right? So essentially, very quickly, I'm populating the value of this new column, entropy, with the results of that mathematical function.
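The talk does not show the entropy formula itself, so the following is a sketch under the assumption that it is character-level Shannon entropy, which is a common choice for spotting encoded or encrypted query names. The function name shannon_entropy is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum -p * log2(p) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A repetitive name scores low; a mixed-character label scores higher.
print(shannon_entropy("aaaaaaaa.example.com"))
print(shannon_entropy("mzxw6ytboi4dqmrs.example.com"))
```

Applied to a data frame, this would be a single call such as filtered_dns_df["query"].map(shannon_entropy), which matches the no-for-loops approach described above.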
And now I know two more things about each of these rows. I know the message length and I know the entropy. When I look at the length after I filled it out, that other, uh, those segments I didn't need were at 14,000.
In [27], do you remember I was talking about that canary URL? It's not actually called canary URL. I did a sophisticated find and replace and I masked it, because that's also scientific. So I'm trying to understand the length of all of that inside the parentheses, which means, just to zoom back out for a minute, that my friends on the red
team internally have 121 records, DNS requests or messages, that were logged from 8 to 9 AM. Well, look at that: 121 out of 14,000. That's what I'm talking about in terms of the class imbalance.
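To put a number on that imbalance, here is a quick worked calculation using the two counts quoted above. It treats 14,000 as the approximate row count, and the "always predict benign" baseline is added here for illustration.

```python
malicious = 121   # red-team DNS records, as quoted in the talk
total = 14_000    # approximate size of the filtered data set, as quoted

positive_rate = malicious / total
# Accuracy of a useless model that labels every record benign.
baseline_accuracy = 1 - positive_rate

print(f"positive rate: {positive_rate:.2%}")                      # positive rate: 0.86%
print(f"all-benign baseline accuracy: {baseline_accuracy:.2%}")   # all-benign baseline accuracy: 99.14%
```

With fewer than 1% positives, a model can score above 99% accuracy without catching anything, which is why accuracy alone says little on data like this.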
So MITRE ATT&CK, here's one of the ways this fits in. It can help with the detection engineering. It can help me look for those artifacts. Every contact leaves a trace,
and that will help me dig in and start dissecting things. And I have one goal in mind: I'm trying to protect this house. I'm trying to find how they did the exfil and then see if there are any similarities that I can come up with. And if that works out mathematically, then maybe I can run that against everything else from the 18th until yesterday
and see what I find. So now the point is, I keyed in on that canary URL, and now I can isolate the traces that they left based on that overt white-box attack. And from there, this is a little bit early. You know,
I don't really need to add this column just yet, but that's where it was, and I didn't want to mess around with it. So essentially what I'm doing is, on filter_dns_df, I'm creating a new column called is_malicious. This is my label. Essentially, if the query contains this canary URL, is_malicious is going to have a value of one.
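As a rough sketch of that labeling step: the canary domain below is a made-up stand-in (the real one was masked in the talk), and the data frame and column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the masked canary URL.
CANARY = "canary.example.net"

filter_dns_df = pd.DataFrame({
    "query": [
        "www.example.com",
        "mail.example.org",
        "mzxw6ytboi4dsnrqgezdgnbvgy3tqojq.canary.example.net",
    ]
})

# map() applies the function to every value in the series:
# 1 if the query contains the canary domain, otherwise 0.
filter_dns_df["is_malicious"] = filter_dns_df["query"].map(
    lambda q: int(CANARY in q)
)

print(filter_dns_df[["query", "is_malicious"]])
```

A zero here only means "not known to be red-team traffic", not "known benign".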
Notice I did a .map. It's going to hit everything. The other interesting thing that I saw: has anybody ever seen Bro logs with DNS exfil where once in a while you'll see an "api." followed by an encrypted string,
200 characters long, and then a "post."? Anybody have any ideas what that is? Yes, sir? No, I was hoping you'd tell me, man. So basically, again, this is a pattern. I don't know what this is; that's kind of a Hail Mary when I throw that out there. Um,
normally I'm going to make sure that these variables aren't related. I'm going to make sure that I look at the feature importance. We're not going to get into that right now because of time, so I'm going to hurry up just a little bit. If it has "post" in it,
now I'm looking for a string, right? So this is the thing about the spy versus spy: anybody here in the room that sees how I'm doing this, you're going to come up with a different way around it. So that's why I have to keep doing this, making sure that the model doesn't degrade and I don't get lazy in the detection here.
All right. We talked about query length. So I just took the data frame, filter_dns_df, and I wanted to know about the column that has the values that we computed, which is the query length. So we computed a feature, populated the column for each row, and now we have a histogram.
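The binning behind a histogram like that can be sketched as follows. The lengths are synthetic numbers chosen to mimic the shape described (many short queries, a handful near 200), and NumPy is used here to compute the counts rather than draw the plot.

```python
import numpy as np
import pandas as pd

# Synthetic query lengths (hypothetical): mostly 25-30, a few at 200.
lengths = [25] * 60 + [30] * 35 + [200] * 5
filter_dns_df = pd.DataFrame({"query_length": lengths})

# Count how many queries fall into each length bucket.
counts, edges = np.histogram(
    filter_dns_df["query_length"], bins=[0, 50, 100, 150, 200, 250]
)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left):>3}-{int(right):<3} {int(count)}")

# In a notebook with matplotlib available,
# filter_dns_df["query_length"].hist() would draw the same distribution.
```

Nearly everything lands in the first bucket and a small group sits far out on the right, which is the picture the histogram gives.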
So it might be kind of hard to see in the back; I didn't blow this up. There are some different things you can do with the scale here, and that visualization didn't look much better. But you see a preponderance (try to work that word in on a Sunday morning, every possible chance) of DNS requests,
a high number, over 14,000 it looks like, that are what, 25, 30 characters long. And then way over there on the right, you see just a few, I'm guessing not 121, maybe like 106 or something, that are 200.
Can you write a signature? Can you write a rule that says, hey, anything that's got a message length in DNS over 40 is malicious, flag it? What's going to happen when you do that? Yeah, it's going to light up, right? Because there's stuff that looks like that.
Now, we computed two things, though. We figured out the entropy, you know, the degree of randomness, and in a moment I'll show what the ranges are for the values. But in the upper right-hand corner, that's weird, right?
So when we look at entropy against query length, something's definitely unique about those. So I'm going to bust through a lot of this, since we're short on time. If anybody has any questions about this, again, I'll have the slides up by Tuesday, at Brian Gans on Twitter.
I'll put out the link there. I'm going to push through the rest of this. So I said, here are the columns that I want for features. So now I made a new data frame. I said, make me a new tabular data structure here, but only give me the data that's in these columns, these series. And that's my new features_df.
We imported some other things. And again, with scikit-learn, we had a software bridge from Bro to pandas to scikit-learn in Bro Analysis Tools, and Bro Analysis Tools is doing a lot of this transformation for us. Um, again,
without us doing adversary simulation, I'm stuck with clustering, because I don't have any ground-truth labels, right? So where's the evil? Where's Waldo? Where are my buddies Zach and Matt? Anybody have any idea which one's malicious?
You shouldn't be able to tell. I mean, you might have some ideas, but this is one issue that I run into with just clustering stuff. So with MITRE ATT&CK, I've got a column that's got a one if it's known malicious, because my buddy just
did it, and a zero if I don't know, right? I don't know that it's not malicious. So what I'm doing here is creating another data frame, and I'm going to push past that. Essentially, I'm going to split that into train and test sets,
train the classifier model, and make predictions. I'm using logistic regression, maybe not the kind you might think about from a stats class. And then I'm saying, hey, predict: how well is this model going to do? We'll see once we get to the results.
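A minimal sketch of that split, train, and predict flow with scikit-learn. The data is synthetic and deliberately imbalanced; the feature values, class sizes, split ratio, and model settings are all assumptions for illustration, not the values used in the talk.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for features_df: many benign rows, few malicious ones.
n_benign, n_malicious = 2000, 40
features_df = pd.DataFrame({
    "query_length": np.concatenate([
        rng.normal(28, 5, n_benign), rng.normal(200, 10, n_malicious)
    ]),
    "entropy": np.concatenate([
        rng.normal(3.2, 0.3, n_benign), rng.normal(4.5, 0.2, n_malicious)
    ]),
})
labels = np.concatenate([
    np.zeros(n_benign, dtype=int), np.ones(n_malicious, dtype=int)
])

# stratify keeps the rare positive class present in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features_df, labels, test_size=0.2, stratify=labels, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, predictions))

# Rows are true labels, columns are predicted labels:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_test, predictions)
print(cm)
```

On imbalanced data the confusion matrix is the more honest view: the bottom-left cell counts malicious records the model missed, which a high accuracy figure can hide.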
And in this case it was 99.85% accurate when we look at the model evaluation, the model results. But that's not the whole story. Overall, in the 2,775,
you know, we're okay with the top left and the bottom right. The four on the bottom left, that means there were four malicious ones, ones that I told the model were malicious, and we missed those. So again, it has to do with your threshold and how it works. So again,
that's the model we looked at. We need, news flash, right, more signal and less noise. And this is just something that I've come across. If anybody has any other perspectives on it, come see me afterwards; I'd be interested to hear them. But I just think that the more stealthy attackers are,
the fewer footprints, the fewer contacts they're going to leave, which makes it harder for me to hook into something. Future work, I'll just push through that. But like I said, looking at other Bro logs, I want to generate some features based on the presence or absence of beaconing. So take the insight that I'm getting out of RITA from Black Hills Information
Security, or Offensive Countermeasures now, enrich that, and then do some other enrichment on IP addresses. Also very excited to look at some Neo4j and some Graphistry stuff as well. So thank you very much for coming out on Sunday morning. I appreciate your time. I'll be in the back, and I hope you have a great rest of the conference.