AI Village - Loss Is More! Improving Malware Detectors by Learning Additional Tasks
Formal Metadata

Title: Loss Is More! Improving Malware Detectors by Learning Additional Tasks
Number of Parts: 335
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/48321 (DOI)
DEF CON 27 | 19 / 335
Transcript: English (auto-generated)
00:00
Our next speaker is Dr. Ethan Rudd, Senior Data Scientist at Sophos. Hello, everybody. Yes, my name is Dr. Ethan Rudd.
00:21
I'm a Senior Data Scientist at Sophos. And the title of my talk is Loss Is More: Improving Malware Detectors by Learning Additional Tasks. So before I go into the meat of the talk, I just wanted to let you guys know a little bit about who I am so that you know that I'm not just some random guy that sort of stumbled
00:42
in here off the street. This is my first time at Def Con. Very excited to be giving a first Def Con talk. And thank you. I've been on the Sophos data science team for about two and a half years in a research capacity.
01:01
Prior to that, I worked on several projects in several areas of applied machine learning. My PhD research was funded by the IARPA Janus face recognition project. I did a project at Google with their Advanced Technologies and Projects team. And then I've been involved in several other small business and university projects.
01:22
So I mentioned the face recognition stuff because we're also running a great facial recognition demo at the unwind session. So please check that out. And you can check me out on Twitter or Google Scholar for various research if you like this talk. So what is this talk about?
01:41
Well, as we've seen, there have been several great talks on machine learning for information security prior to this one. But for many machine learning malware detectors, we're looking at training on a single malicious or benign label when there's actually lots of additional information available,
02:01
lots of additional labels, lots of additional metadata, etc. And so really the question that we answer in this talk is: can we craft a bunch of auxiliary labels to train on, rather than just having a single malicious or benign label, and can we get better performance? Well, as it turns out, we can.
02:21
And interestingly enough, we also find that these performance gains can be attributed to a better informed classifier. And I'll explain what I mean by that a little bit later on. So in before and after photos, if you will, what we're talking about is adding additional loss functions during the optimization. So on the before side,
02:41
you'll see that we have only a single loss function. This is how many malicious and benign detectors are trained. And they work pretty well. There are a lot of them that are commercially deployed. You can get good performance. But if you add a bunch of auxiliary loss functions on a bunch of labels, hence the 'after' part of this before-and-after comparison,
03:04
we get way, way better performance as it turns out. So before I dive into exactly how we formulate this, I'll just give a brief review of machine learning for malware detection. So up until about 2015,
03:21
most malware detectors were largely signature driven. There were a few machine learning approaches, but ML really took off around then. Now a lot of the detectors actually consist of hybrids of ML and signatures, and they use signatures largely for blacklisting. And one can really triage detection as in this diagram here,
03:44
where ML and signature detectors actually both work on static and dynamic features. For ML, static is a little bit more common and we focus largely on static detection in this talk. And the reason by the way that static features are more common is that to get ML to work well,
04:03
it requires lots and lots and lots of data. And it's easier to collect a lot of static data. So we find that we can actually do very, very well with that. So a typical detection pipeline is built on some sort of a binary classifier,
04:20
deep neural network, or maybe a gradient boosted machine. I'll discuss the deep neural network use case for this talk. And we're talking about training on millions to hundreds of millions of labeled malicious and benign samples. And we're also talking about classifiers that are periodically retrained
04:40
to be able to reflect current threat landscapes. These can be deployed in a lot of different contexts. They can be deployed actually on endpoints. They can be deployed in the cloud. They can be deployed in security operations centers. It really doesn't matter for the purpose of this talk. Now, as far as labeling sources that a lot of vendors use,
05:02
they rely on vendor aggregation services or threat intelligence feeds, which basically take a bunch of vendors or different labeling sources throughout the industry and submit malicious and benign samples to those and say, okay, how do these label the samples? And then some sort of an aggregate label
05:22
is generally derived. Often there's also a little bit of time lag after the samples are submitted to the vendors, left in place to basically let vendors
05:43
update their blacklists and let the scores settle down. But long story short, most approaches take an aggregate malicious or benign label that is obtained from these threat intelligence feeds. Now, these detection engines,
06:04
also because they use ML, they need to convert the malicious and benign samples to some sort of an ML friendly numerical representation. There are a variety of ways to do this. Some try and do something that's closer to raw bytes.
06:22
Some use various types of feature vector representations. We presume a specific one, and we use portable executable (PE) malware and benignware in this work. But the approach is fairly broad and can be done in a lot of different ways.
06:41
So the way that a typical neural network will look is you'll have features that are extracted from malicious and benign samples during training. A forward pass is done through the network. The output of the classifier is taken and then some loss function along with the associated label is used to correct the representation
07:02
so that we have a good representation that we can then later deploy. At deployment time, we take this learned representation and here, I just want to highlight, we don't have any labels on the malware samples. That's what our classifier serves to do. We deploy that to wherever we're deploying,
07:21
whether we're deploying on endpoints, whether we're deploying on SOX, whether we're deploying the cloud. And then we submit our feature vectorized forms of our files in real time to the classifier. And in this case, our classifier's neural network. It could be whatever, but we're dealing with neural nets in this work.
07:43
And we use the predicted output basically as a score. This says, okay, how malicious is this file? A maliciousness score, if you will, that one can threshold in a variety of ways. So this is how things are currently done or commonly done, I should say.
08:03
However, just a little bit more on these threat intelligence feeds. They have lots and lots more information than just whether a given file is malicious or benign. In fact, that's even a simplification from what they're providing. They also provide information on individual vendor detections.
08:21
They provide, of course, the net number of vendor detections. And then they provide, at the very least, some information on the detection names per vendor. Some provide a lot more. So really revisiting the original question that I posed, can we craft and learn from auxiliary labels and get a better detector?
08:42
And the answer is yes, in fact, we can. The technique that we've derived to do this, we refer to as ALOHA, or Auxiliary Loss Optimization for Hypothesis Augmentation, hence the nice Hawaiian art here. All right.
09:03
So in short, we've got all this auxiliary information that we want to utilize. And as we saw in the case of just a malicious and benign label, well, we have this loss function that we use here. So why not just add more loss functions? And that is really the crux of what ALOHA does.
09:24
We have more labels, more loss functions. And this has a couple of nice advantages. First, although we have more labels and we use more network outputs during training, we do not actually have to use these during deployment time.
09:41
So we can notionally get a much better network representation during training. But at deployment, we don't have to update our infrastructure at all. Now, alternatively, we can use the additional auxiliary outputs to do certain additional tasks. So if we want to do things that are,
10:02
I'll say specific to like an EDR or an MDR type application, where we're getting more fine-grained information about the particular malicious samples, say maybe in a SOC, but maybe not on our endpoints, we can sort of dual-purpose this training and use the learned models in a variety of ways.
10:21
So for our labeling sources in this work, we selected nine vendors from our aggregation feeds and used detection labels from each of the respective vendors. We also used the net number of vendor detections; there were more than nine vendors in our feed, there were tens, and we used the integer value
10:42
of the number of vendor detections as an auxiliary target as well. We also use our main target of this aggregate malicious and benign label. And then we also use 11 semantic malware attribute tags. So these describe the content
11:02
of malicious and benign samples. These are derived from the detection names within our feeds. The derivation process, I can speak a little bit more to at the end, but I would actually refer you guys to a, in my opinion, very good paper that we wrote on it. And I'll include the link for that at the end.
11:22
But basically these tags are not mutually exclusive, and they summarize the content of malicious samples in ways a human can understand. So for each of these additional labels and network outputs,
11:41
we have additional loss functions. And so our main aggregate loss function is actually a binary cross entropy loss taken between the output of the network and the aggregate malicious/benign label. Now this is a pretty common loss function
12:02
for a lot of neural networks that are doing malware classification. Well, with respect to our auxiliary losses, we have a loss function that is specific to the vendors, one for the tags, the semantic tags, and then one for the counts. And for the vendor loss functions,
12:21
we actually take a sum of binary cross entropy losses for each individual vendor response. For the tag losses, we do a very analogous thing, but for each of the attribute tags. Now I would again point out that none of these tags are mutually exclusive. So we use binary cross entropy here rather than,
12:43
say, a softmax categorical cross entropy, and take the sum. Then for the count loss function, we use a Poisson loss. Now, prior to that, we do an exponential activation
13:01
to constrain our count output to be non-negative, as counts can't be negative. So our total loss is written at the bottom here. And this consists of the main malicious/benign loss with all of our auxiliary losses just summed
13:21
and multiplied by a constant. Now, for the constant in this case, we use 0.1. We didn't explore good values of this in depth, but other work has, and so we did this sort of in a principled manner, consistent with some other work that I'll reference towards the end here.
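To make that concrete, here's a minimal PyTorch-style sketch of the combined objective, roughly total = malicious/benign BCE + 0.1 x (vendor BCEs + tag BCEs + Poisson count loss). The layer sizes, head names, and mean-reduced losses are illustrative assumptions for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AlohaSketch(nn.Module):
    """Illustrative multi-output detector: one shared base, several heads."""
    def __init__(self, n_features=1024, n_vendors=9, n_tags=11):
        super().__init__()
        self.base = nn.Sequential(
            nn.Linear(n_features, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.mal_head = nn.Linear(256, 1)             # main malicious/benign logit
        self.vendor_head = nn.Linear(256, n_vendors)  # one logit per vendor
        self.tag_head = nn.Linear(256, n_tags)        # one logit per (non-exclusive) tag
        self.count_head = nn.Linear(256, 1)           # log of predicted vendor count

    def forward(self, x):
        h = self.base(x)
        return (self.mal_head(h), self.vendor_head(h),
                self.tag_head(h), self.count_head(h))

bce = nn.BCEWithLogitsLoss()
# log_input=True applies exp() internally, matching the exponential
# activation that keeps the predicted count non-negative.
poisson = nn.PoissonNLLLoss(log_input=True)

def total_loss(outputs, y_mal, y_vendors, y_tags, y_count, aux_weight=0.1):
    mal, vendors, tags, count = outputs
    main = bce(mal.squeeze(1), y_mal)                 # aggregate malicious/benign loss
    aux = (bce(vendors, y_vendors)                    # per-vendor binary cross entropies
           + bce(tags, y_tags)                        # same functional form for the tags
           + poisson(count.squeeze(1), y_count))      # vendor-count loss
    return main + aux_weight * aux
```

At deployment, only mal_head would be kept and the auxiliary heads pruned, which is why the serving infrastructure doesn't have to change.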
13:40
So during training, and you've seen this at the beginning of the talk, but basically we have all these loss functions aggregated together: a main malicious and benign loss, which is what we're ultimately trying to optimize for and detect, and then, with respect to the auxiliary loss functions, our vendor counts, our individual vendor detections,
14:01
and our attribute tags. Now, aggregating all of these losses together, we can use all of them, or just one or two of the auxiliary losses, or we could even potentially add more if we had more information in our feed. I would very much point out that this is,
14:22
you know, this is sort of a proof of concept model, a very general sort of architecture that I'm describing. But the point is that adding these auxiliary losses theoretically helps. At inference time, however, absolutely nothing has to change whatsoever
14:43
with respect to the network outputs. You'll see that in the prior slide, we had all of these different outputs that we added, but we pruned those, we pruned the associated model parameters at inference time. And so our deployment infrastructure can remain entirely the same.
15:01
We don't have to change that at all, which is nice from an engineering perspective. So I've made these claims that the Aloha model works very well. Now I intend to actually provide some evidence of that. And to do that, we collected a data set of approximately 9 million training samples,
15:23
100,000 validation samples, and 7.7 million test samples. And the training and validation splits were taken temporally before the test split to ensure basically a fair evaluation. I mean, in order to ensure
15:42
temporal consistency, we ordered our samples as follows. And for our aggregate malicious/benign label, we used what we call a 1-/5+ criterion here, which basically means that for one or fewer
16:02
vendor detections, we label as benign; for five or more, we label as malicious; and then we ignore those with two to four detections. Now, I'd mention there are more sophisticated ways to do this; this is just the one we chose, largely for simplicity.
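As a concrete illustration, here's a minimal sketch of that criterion (the function name and toy counts are just for the sketch):

```python
def aggregate_label(n_detections):
    """1-/5+ criterion: at most one vendor detection -> benign (0),
    five or more -> malicious (1), two to four -> ambiguous, dropped."""
    if n_detections <= 1:
        return 0
    if n_detections >= 5:
        return 1
    return None  # ambiguous; ignored at training time

counts = [0, 1, 3, 5, 23]                      # toy vendor-detection counts
labels = [aggregate_label(n) for n in counts]  # [0, 0, None, 1, 1]
```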
16:21
But this works pretty well; it's a fairly standard practice. When we actually look at vendor counts across our data set (and looking at this was one of the reasons why we chose the 1-/5+ criterion), what you'll see is that there are
16:40
a disproportionate number of 1- samples, specifically zero detections, and then a lot that have many, many vendor detections. And bear in mind, these are plotted on a logarithmic scale. However, we still see, and this was one of our motivations for using a count loss initially,
17:01
that just taking these basic thresholds washes out a lot of finer-grained information. And as you can see, it's not really a common occurrence, but we might be able to say something about relative sample difficulty by adding that.
17:24
We also looked at the respective vendor agreements with one another and these are plotted in this confusion matrix here for each of our nine selected so-called high coverage vendors. And as we can see, we see an agreement that occurs most of the time but not all the time.
17:43
I mean, vendors are consistent approximately 85 to 95% of the time, but they don't always agree. So perhaps there's some independent auxiliary information that we can glean from these. As features, we use the same features
18:00
as Saxe and Berlin did in their work, Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features. In full disclosure, Saxe is my boss. So that's one of the reasons why we chose to use these features. We used the features that he and others
18:20
within our group derived. And I won't go into these in depth, but I leave the paper there and I just wanna give sort of a semblance of what these are. So basically they fall into three different camps. So 512 of the dimensions of our net 1024 dimensional feature vector are based on
18:43
windowed byte statistic histograms, which are basically aggregate statistics over the entire file. We then have 256 dimensions devoted to a
19:01
two-dimensional string length hash histogram, basically across a logarithmic scale of different string lengths, where we apply the hashing trick. And then we also have specific PE metadata fields, like the exports, like the imports, et cetera, that are hashed into another 256-dimensional vector.
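As a rough illustration of the hashing trick on the PE metadata strings (a sketch under assumed bucket counts and hash choice, not the exact Saxe-Berlin extractor):

```python
import hashlib
import numpy as np

def hash_strings(strings, dim=256):
    """Hashing trick: fold an arbitrary set of strings (e.g. imported
    function names) into a fixed-size vector by hashing each to a bucket."""
    vec = np.zeros(dim, dtype=np.float32)
    for s in strings:
        bucket = int(hashlib.md5(s.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

imports = ["kernel32.dll:CreateFileA", "ws2_32.dll:connect"]  # hypothetical
pe_metadata_block = hash_strings(imports)  # one 256-d block of the 1024-d vector
```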
19:23
And all of these get concatenated, so that's our representation of individual files. So that's how our dataset breaks down. When we compare performance here, we tried using different combinations of
19:40
our main malicious and benign loss with different auxiliary losses. First, we used just our malicious and benign loss as our baseline; that's sort of tantamount to a lot of the types of models that are currently deployed. Then we applied each individual loss type. Then we applied everything combined,
20:03
which I guess you might say is the full Aloha model. And we fit each of these classifiers for each different loss combination. Actually, we fit five different classifiers and we report our results in terms of mean and variance statistics over
20:22
receiver operating characteristic curves to be able to gauge statistical significance. Now, I know that there's a lot of talent in the room with a lot of different backgrounds. So for those of you that might need a refresher on receiver operating characteristic curves, or ROC curves, basically we look at
20:42
this false positive rate across the x-axis, and then a true positive rate or a detection rate at that false positive rate across the y-axis. And so typically what's done in the industry is at various false positive rates that are deemed sort of acceptable to the user,
21:01
a threshold is chosen and then you'll get the true positive rate at that threshold.
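For a concrete (toy) illustration of reading the detection rate off an ROC curve at a fixed false positive budget:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 10000)                # toy ground-truth labels
y_score = 0.6 * y_true + 0.7 * rng.random(10000)  # toy maliciousness scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)

def tpr_at_fpr(target_fpr):
    """Detection rate at the last operating point with FPR <= target."""
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1
    return tpr[idx], thresholds[idx]

for budget in (1e-3, 1e-2, 1e-1):  # false positive budgets a user might accept
    rate, thresh = tpr_at_fpr(budget)
    print(f"FPR<={budget:g}: TPR={rate:.3f} at threshold {thresh:.3f}")
```

So what we see when we add our count loss is that we do in fact get better performance in terms of both the area under the receiver operating characteristic curve,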
21:22
which is basically a gauge of how good the curve is overall. But specifically we also see, at lower false positive rate regions, a particular bump, and this becomes
21:41
a bump in detection rate. Now, this becomes relevant because as we get to lower and lower FPR regions, there are more deployment scenarios that we can address with our models. Similarly for the vendor loss, when we add that to our baseline,
22:00
we see a boost in the receiver operating characteristics curve or ROC curve at the relevant region. We don't see quite as much of a boost in terms of the area under the curve. In fact, the area under the curve stays statistically pretty similar. And that's not to say that this isn't still a very significant result.
22:23
In fact, again, the AUC is a net statistic on the curve, but we don't really care so much about the higher false positive rate regions because the detection rates there are very, very good already. And so we can deploy those very easily.
22:42
But as we're getting down, we see that although the AUC is relatively similar, we still see this as basically a win. The tags loss gives us similar results. And I'd mention that actually both of these loss functions, the tags and the vendor loss functions,
23:02
not only do they assume sort of a similar functional form, but they also give us an even better result than the Poisson loss function did or the count loss at the lower false positive rate areas. But they are slightly worse in terms of AUC performance.
23:23
When we combine everything together, what we find is that we get even better results. And we find that not only are our results far better, but our variance between different model instantiations is reduced. And we see basically there are two modes of improvement here.
23:41
There's an improvement at the higher false positive rates that is above 10 to the negative third. And then there's an improvement below that. And basically the higher FPR improvements, these are really what are driving the area under the curve improvements.
24:00
But again, the lower FPR ones are still quite relevant. So in summary, we see that yes, adding additional losses does seem to improve performance. And notably it also reduces variance across different instantiations of the model.
24:20
We suspect that this variance reduction is actually occurring because, as you have more things to optimize for, you're inherently sort of constraining your optimization process. So there aren't as many different ways that parameters can vary. We also see that there is similar behavior
24:41
for similar loss types. So both the vendor losses and the tag losses consist of sums of binary cross entropy losses. And again, these seem to drive different things with respect to our ROC curve. We suspect that we see these higher FPR gains in detection
25:04
for the count loss, because it actually does communicate something about the difficulty of samples. And then with respect to the tag losses, perhaps the network's able to correlate some sort of information between when,
25:20
say, just one or two of these vendor tags trigger versus when, say, all of them trigger. And so it drives things really at lower FPRs. So, okay, we've presented, or I've presented some evidence hopefully that the ALOHA model is able to deliver
25:41
better detection performance. But now I'll just really briefly discuss what's driving this performance gain. Is it some sort of a smoother optimization surface that's brought about due to a regularization effect of multi-objective optimization? Or is it perhaps due to a more informed representation
26:03
from all of these different auxiliary label sources? And going into this, we sort of suspect the latter of these two. But we wanna actually make sure and then see what's going on here. So in order to test this, we used auxiliary loss functions
26:21
on so-called non-informative targets. So we employed various mechanisms of duplicating labels, or of providing labels that delivered no additional information about the sample.
26:41
One way that we did this was with a pseudo-random label, where we took the hash of the file contents and just took the sign of that as an auxiliary target. So for a given file, you're always going to be looking at the same label, but the labels are pretty much randomly assigned. We also tried adding a duplicate target
27:02
and optimizing for that. And then we also applied a duplicate target with a different type of loss function: we scaled and shifted a copy of the target label, and then we used a mean squared error loss on this, which is a common regression loss.
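As a minimal sketch of the pseudo-random target (deterministic per file but carrying no information; using the hash's low bit here as a stand-in for its sign):

```python
import hashlib

def pseudo_random_label(file_bytes):
    """Hash the raw file contents and take one bit: the same file always
    gets the same label, but the label says nothing about maliciousness."""
    return float(hashlib.sha256(file_bytes).digest()[0] & 1)

print(pseudo_random_label(b"MZ\x90\x00"))  # same input -> same 0.0/1.0 label
```

And so from these, what did we find?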
27:21
Well, we found that adding these non-informative losses did not improve our performance in a noticeable manner at all. In fact, it was statistically identical, if not worse, when we added auxiliary targets to our original target as well.
27:42
So this suggests that, yes, the Aloha network gains are actually coming from additional information from the additional labels. The network's doing what we want it to do, and it is learning a better representation, a better informed representation. So overall, what we find is that, yes,
28:02
our Aloha technique works well and it seems to be a result of the neural network's ability to actually correlate information from auxiliary labeling sources. It's not just simply an artifact of regularization. We also have the advantage that the network can be trained and deployed with minimal changes
28:22
to existing infrastructure that's out there. So no re-engineering of anything on the endpoint, anything on the SOC, anything in the cloud has to take effect. And then also there are additional applications that these outputs can be used for, like EDR and MDR.
28:41
So one additional application, as an example, is since we have outputs that describe the content of the malware, we can actually group malware by the predicted tags. And we might have an application where we might want to deploy that for internal use or say as a service,
29:01
but yet still be able to deploy our model. Well, we can do that all under this one training regime just by pruning our losses and pruning our outputs respectively. So before I take some Q&A here, I just wanted to also mention some related research and some directions for future work
29:21
if you found this topic interesting. There's been a lot of research by our group and also by other groups that is related. It's interesting to look into and it can perhaps be leveraged in some very similar ways. So first, I'd also mention that ALOHA is a USENIX paper now,
29:42
so I'll be presenting this at USENIX next week. Oh yeah, and it's available on arXiv as well, so feel free to check it out if you want more information. And interestingly, a gentleman by the name of Jason Trost actually did a nice blog post
30:01
where he used ALOHA architectures for a much different problem, using Endgame's domain generation algorithm detection code. He tailored that to use basically these ALOHA losses, and he wrote up his results nicely.
30:24
There's also a paper out of Microsoft Research called MtNet, and this paper is actually similar to ours in a variety of ways. It uses an impressively large dataset of dynamic features and it largely substantiates a lot of our findings.
30:41
However, they use only one type of loss function. They use multiple loss functions, but only softmax categorical cross entropies. They do something sort of similar to the tagging approach that we do, describing the content of each sample, except they use Microsoft malware family names,
31:03
and so they actually do employ some sort of a mutual exclusivity assumption here. But anyway, it's another great paper and it's very cool to see that they're also able to sort of substantiate our findings with a much different data modality. It's also PE files, but it is dynamic features.
31:23
I'll also mention the paper on malware attribute tagging. So SMART, Semantic Malware Attribute Relevance Tagging, is a paper that we also put out there, which will pretty much tell you everything that you want to know about malware tagging. The tagging approach
31:43
is the same as we employ in this work. So please see that for details on the tagging problem and the tag prediction problem. There are several other models that we employ in the SMART paper as well. So if you're interested in that, check it out.
32:03
I will also mention one other paper, MOON, a mixed objective optimization network. The reason why I bring this up, although it is applied to facial attributes and has nothing to do with malware, is that it is actually the approach that I use in my face recognition demo,
32:20
which again, please stop by during the wind down session. If you want to see basically how this type of optimization can be employed very powerfully in action, the approach is fundamentally the same as Aloha, but with a much, much different data modality. One final work that I'll mention, and then I promise I'm done,
32:41
is a paper that we did called Learning from Context. It uses multi-view learning, or multi-input learning, in contrast to our approach of using multiple labels and multiple loss functions. But using this approach, we are able to include extra information
33:03
in the representation, just in a different way. We're sort of turning the Aloha approach on its head. And this type of approach could be trivially combined with Aloha, I would mention. So that's maybe a nice direction for future work. So having multiple PE file features
33:24
and then also other auxiliary information; for example, we took embeddings of the PE file path on disk and concatenated those together. But also potentially multiple labels, you know, just adding multiple other sources of information
33:41
into the representation, it seems to work well. So it's definitely an avenue for future research. I'll finally close with an obligatory Sophos pitch. So I'm with the Sophos Data Science team. We do really cutting-edge research and we're always interested in transparency and collaboration and,
34:01
as hopefully I've communicated, publication. And while we're not the only one, we are one of the only research teams in the MLSec industry that is getting papers accepted at some of the top-tier academic venues like USENIX. Our group consists of about 10 to 15 people, split about half between research
34:22
and about half in development. And so check out our group if you're interested. You can talk with me or you can talk with Rich Harang, who's also here, and who's one of our directors of data science. Here's a picture of our team, lots of great colorful characters.
34:40
There's some more Sophos presentations going on this week. As I said, I have a facial recognition booth. Rich has a talk on hacking facial recognition on the 10th. And then he also presented a talk on security data science at B-Sides. So if any of you saw that, there's just a name to correlate.
35:01
And then I'd like to thank Sophos for funding and for promoting this research. And I'd particularly like to thank my collaborators and my co-authors here for all the work that they did. This was definitely a team effort. So with that, I'll open it up for questions.
35:29
Yes. Thank you.
35:50
Yeah, so that's a great question. The question is, so this talk was about incorporating these auxiliary losses on neural networks, but have we tried other classifiers like ensemble models,
36:02
random forests, boosting, et cetera? And while we don't have concrete results on those, there's nothing that would preclude a person from doing so. The representation that's learned by an ensemble model is a little bit different.
36:22
So in, say, a random forest or a boosted model, I guess the question is how you would do shared splits in a way that works well across the data. So I'd say that the technique could very well be applied; I don't know how well it would work.
36:41
I can say that I've looked at some of the libraries that are out there for this, like LightGBM, like XGBoost, et cetera. And they generally assume that you're gonna be using only one loss function, but there's nothing that would preclude somebody from implementing it. I just don't know how well it would work. Thank you.
37:02
Let's see, more questions. Yes, please. So the question is, what are some of the next steps of the process, and some of the features that we want to develop? So, features in terms of representation of the,
37:21
sorry, in terms of representations of the malware fed to the classifier or features in terms of just like extra things to tack onto the classifier? Sure, sure. Yeah, so extra additions to this overall technique. So I would certainly say that the approach that I'm most interested in is actually having
37:41
a unified multi-input and multi-output model that's really able to learn multiple labels or learn from multiple labels, but also have multiple just heterogeneous inputs. Like you could have as an example
38:02
the character embeddings of the file path. You could also apply this, I would say, to a lot of different malware types. I've been talking about PE files this entire time, but there are a lot of different types of malware that one could apply this to.
38:21
So I'd say that those are two different avenues that I'd certainly like to go down. And then I'd also say that there are other sources of data that are on some of these threat feeds. And so I think that looking into those would be very interesting as well.
38:42
Yes, please. Yeah, so I'd say that not only does it not do worse,
39:02
I mean, in aggregate it does better. And I would say that if you have multiple inputs, yes, we do see in fact, and we have seen (I'd actually point you to that Learning from Context paper), that we do get a nice performance bump. But actually, yes, having missing data
39:22
is a little bit more of a problem with that. So for our loss functions here, if we have a missing label, we can just zero that entry out in the loss and back-propagate that; a rough sketch of that masking is below.
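Here's a rough sketch of that per-label masking, assuming missing labels are flagged with NaN (the flag convention and shapes are assumptions for the sketch):

```python
import torch
import torch.nn.functional as F

def masked_bce(logits, targets):
    """Per-entry BCE where missing labels (NaN) contribute zero loss,
    so no gradient flows back from unlabeled auxiliary targets."""
    mask = ~torch.isnan(targets)
    safe = torch.where(mask, targets, torch.zeros_like(targets))
    per_entry = F.binary_cross_entropy_with_logits(logits, safe, reduction="none")
    per_entry = per_entry * mask.float()               # zero out missing entries
    return per_entry.sum() / mask.float().sum().clamp(min=1.0)

logits = torch.randn(4, 9)                     # e.g. nine per-vendor outputs
targets = torch.randint(0, 2, (4, 9)).float()
targets[0, 3] = float("nan")                   # one missing vendor label
print(masked_bce(logits, targets))
```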
39:41
But if we have a missing input, well, that becomes a lot more hairy, and that's an area of research that I'd really like to see addressed a little bit better. Let's see, I think we have time for one more. One more question. Yes, please.
40:08
Yeah, yeah, that's a really good question. So the question is which inputs are most prominent in terms of the respective output response? And so this goes back to a lot of the model interpretability literature.
40:22
So I'd say LIME, SHAP values, layer-wise relevance propagation, a lot of the literature in that area would be very good to look at, or techniques like activation maximization. Yeah, those are a few techniques, and this is definitely an area where I think
40:42
that not only I, but, without speaking for the entire industry, a lot of people are interested in. So anyway, I can chat more on that after, but yeah, thank you. Thank you for the question, good question. Thank you. Thank you.