
Detecting random strings; a language based approach


Formal Metadata

Title
Detecting random strings; a language based approach
Alternative Title
Detecting Randomly Generated Strings
Number of Parts
109
Author
Mahdi Namazifar
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Numerous botnets employ domain generation algorithms (DGA) to dynamically generate a large number of random domain names, from which a small subset is selected for their command and control. A vast majority of DGA algorithms create random sequences of characters. In this work we present a novel language-based technique for detecting strings that are generated by chaining random characters. To evaluate the randomness of a given string (a domain name in this context), we look up substrings of the string in the dictionary that we've built for this technique, and then we calculate a randomness score for the string based on several different factors, including the length of the string, the number of languages that cover the substrings, etc. This score is used for determining whether the given string is a random sequence of characters. In order to evaluate the performance of this technique, on the one hand we use 9 known DGA algorithms to create random domain names as DGA domains, and on the other hand we use domain names from the Alexa 10,000 as likely non-DGA domains. The results show that our technique is more than 99% accurate in detecting random and non-random domain names.

Speaker Bio: Mahdi Namazifar is currently a Senior Data Scientist with the Talos team of Cisco Systems' San Francisco Innovation Center (SFIC). He received his PhD in Operations Research from the University of Wisconsin-Madison in 2011. His PhD work was on theoretical and computational aspects of mathematical optimization. During his PhD, Mahdi was also affiliated with the Wisconsin Institute for Discovery (WID) and the French Institute for Research in Computer Science and Automation (INRIA). He was also a National Science Foundation (NSF) grantee at the San Diego Supercomputer Center in 2007 and a Research Intern at the IBM T.J. Watson Research Lab in 2008. After graduate school and before his current position at Cisco, he was a Scientist at Opera Solutions, working on applications of machine learning to a variety of problems from industries such as healthcare and finance.
Transcript: English (auto-generated)
Currently I'm a senior data scientist with Twitter, but before that I was lucky enough to be part of the amazing Talos team at Cisco, and this work on detecting random strings is something I did when I was with Talos. So first I want to give you the definition of the problem that I'm trying to address here. You're given an arbitrary string and you want to decide whether this string is a random sequence of characters. One thing to note here is that I said a random sequence of characters; this work does not address strings that are random sequences of dictionary words. The other thing is that my focus is on strings that are at least eight characters long, because for anything shorter than that it is very difficult, even for a human being, to detect randomness. So I'm focusing on strings of length eight or more. So why do
we even look at this problem? Our motivation for this was detecting domain names that are generated using domain generation algorithms. You know better than I do how these are used, and this is not a new problem. This problem has been
studied quite a lot; there's a rich literature around it, at least some of which I'm aware of, including a bunch of works done at Cisco. Usually these works look at this as a classification problem and take a machine learning classification approach to solve it, but here my approach is a little bit different. So I'll give you the big picture of the approach in one slide, and then I'll go deeper into
the details of it. So the first thing I do is put together as many dictionaries as I can find from different languages, and basically out of these I want a word list, as many words as possible. Once I have this, I put them all together and call it the mega dictionary. How do I use it? I basically take the arbitrary string that I'm given, I take substrings of it, and I
look them up in this mega dictionary. So based on the number of dictionary hits that I find out of these substrings, based on the length of these substrings, and based on the different languages that these substrings are from, I come
up with a randomness score, and based on this randomness score, I determine whether or not this is a random string. The idea is that if we can see how the substrings of a string can be covered by words from different languages, then we have an idea of whether it is a random string or not.
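To make that big picture concrete, here is a minimal, self-contained toy version in Python. The toy dictionary, the brute-force substring search, the coverage-based score, and the 0.5 threshold are all illustrative placeholders, not the actual implementation described in the rest of the talk.

```python
# Toy stand-in for the mega dictionary described below.
TOY_MEGA_DICTIONARY = {"face": ["english"], "book": ["english", "polish", "dutch"]}

def dictionary_hit_spans(s, mega_dictionary, min_len=3):
    """Brute-force search for substrings of s that are dictionary words."""
    return [(i, j) for i in range(len(s)) for j in range(i + min_len, len(s) + 1)
            if s[i:j] in mega_dictionary]

def looks_random(s, mega_dictionary=TOY_MEGA_DICTIONARY):
    """Crude stand-in for the real scoring: the smaller the fraction of the
    string covered by dictionary words, the more random it looks."""
    covered = set()
    for i, j in dictionary_hit_spans(s, mega_dictionary):
        covered.update(range(i, j))
    score = 1.0 - len(covered) / len(s)   # placeholder randomness score
    return score > 0.5                    # placeholder threshold

print(looks_random("facebook"))   # False: fully covered by "face" + "book"
print(looks_random("xkqzvbnp"))   # True: no dictionary hits
```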
So I'll get into some details here. The first part is the mega dictionary. First, I try to find as many language-based dictionaries as I can, basically from the Internet. These are almost 70 different languages that I found dictionaries for. Some of them
are constructed languages. For some of them, for instance English, I have multiple versions of the dictionary, like British English and American English dictionaries. So these are the languages that I was able to find free dictionaries for. I should also note here that out of these dictionaries, I only want the key, not the value: in each dictionary you have the word and you have its definition, and I don't care about the definition, I just want the words. So that was the languages. I also get some names based on census data: female names, male names, surnames. Also, I get a list
of Scrabble words. These are words that are not necessarily in the English dictionary; they could be acronyms, for instance. These two items were given to me by my good friend and former colleague, Adam Katz, and they
helped a lot, actually. The next thing is I get the Alexa 1,000 domain names and add them to my word list. This is again important: the word Yelp or the word eBay might not be in any dictionary, but these are actually important words. I also add some numbers, and I've also gotten my hands on
a dictionary or a list of texting acronyms. YOLO, TTYT, BRB, things like that. So for some of these words, I need to
do some special treatment. For instance, for words coming from Eastern European languages, I need to get rid of the accents on the characters. For Mandarin, I need to take the characters and basically transliterate them to Roman characters, and for that I use the Pinyin standard. For Russian and Ukrainian, I needed to use a special decoding, and I also needed to take care of the fact that I and Y in this decoding are used interchangeably, and a bunch of other special treatments like that. The next thing I need to note here is that the same word might appear in multiple dictionaries.
The word book appears in at least these three dictionaries: English, Polish, Dutch. To take care of this, I run a MapReduce job to find all the dictionaries that a given word appears in. So the result of this looks something
like this: for each given word that I have in these dictionaries, I have a list of the dictionaries that the word appears in. So here in this example, Sui appears in the French dictionary, in the Catalan dictionary, and a bunch of others. So this is, at the end, what my mega dictionary looks like: it's a Python dictionary, where the keys are the words and the values are lists of dictionary names.
The lookup complexity is constant, so it's pretty fast to look up anything here, and at the end I have about 12 million words in this mega dictionary. So that was the dictionary part.
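As a rough illustration, here is a minimal sketch of how such a mega dictionary could be assembled in Python, assuming each language's word list has already been loaded; normalization is reduced to lowercasing and accent stripping for brevity.

```python
import unicodedata
from collections import defaultdict

def strip_accents(word):
    """Drop combining marks, e.g. for accented Eastern European characters."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_mega_dictionary(word_lists):
    """word_lists: iterable of (language_name, iterable_of_words) pairs.

    Returns a dict mapping each normalized word to the list of languages
    (dictionaries) it appears in, e.g. {"book": ["dutch", "english", "polish"]}.
    Lookups on the result are constant time.
    """
    mega = defaultdict(set)
    for language, words in word_lists:
        for word in words:
            mega[strip_accents(word.strip().lower())].add(language)
    return {word: sorted(langs) for word, langs in mega.items()}

# Example usage with toy word lists:
mega = build_mega_dictionary([
    ("english", ["book", "face"]),
    ("polish",  ["book"]),
    ("dutch",   ["book"]),
])
print(mega["book"])   # ['dutch', 'english', 'polish']
```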
Now, how am I going to use this dictionary for detecting random strings? I find substrings of a given string and look them up in this dictionary. How do I do that? I do it by traversing strings, and this is how I define traversing a string. I
can traverse a string from left to right this way: I have two indices, one at the beginning and one at the end. If I'm traversing from the left, I fix the right index and move the left index one position at a time, and as a result these are the substrings that I get. If I traverse the same string from right to left, I get these substrings. It will become a bit clearer later why I do it once from right to left and once from left to right, but let's just fix the definition here and then look at an example.
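Here is a minimal sketch of one way this traversal could be written in Python, under one reading of the definition: the left-to-right pass fixes the right index at the end of the string and advances the left index, and the right-to-left pass is the mirror image. The exact substrings shown on the slides may differ.

```python
def traverse(s, from_left=True):
    """Yield the substrings visited while traversing s in one direction.

    from_left=True:  fix the right index at the end, move the left index
                     one position at a time -> s, s[1:], s[2:], ...
    from_left=False: fix the left index at the start, move the right index
                     one position at a time -> s, s[:-1], s[:-2], ...
    """
    n = len(s)
    for step in range(n):
        yield s[step:] if from_left else s[:n - step]

print(list(traverse("facebook")))                    # ['facebook', 'acebook', ..., 'k']
print(list(traverse("facebook", from_left=False)))   # ['facebook', 'faceboo', ..., 'f']
```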
So how do I find the substrings of a given string? Let's say that we are dealing with the English dictionary only, and let's say that this is my given string. I start traversing this string from the left, and at each step I find the substring and look it up in my mega dictionary. I continue until I find a hit. So none of these substrings appeared in my English dictionary until I hit this word, which did appear in my English dictionary. So I
take that word out of my string, put it aside, and now I'm left with this substring. Again, I reset the indexes and start the traversal from the left again, and these are the substrings. None of them are in the English dictionary until I find this word, which is a hit. I take it out, I'm left with this substring, and so on. So at the end, I get these three words out of the given string. I did this once from
the left. I do the same thing once from right to left. From left to right I get this list, and from right to left I get this list. You see that? These two passes, once from the right and once from the left, give me a better chance of finding the right words in the given string. So I need to pick between these two lists, and because the minimum length of the words in this list is four, I pick this one: the longer the words you find in the string, the higher the chance that they are not there by chance. So that was just looking up in the English dictionary.
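Building on the traversal sketch above, here is a minimal sketch of how the word extraction and the choice between the two passes could look. The min_len cutoff and the exact removal logic are assumptions; the rule of preferring the list whose shortest word is longer is the one described in the talk.

```python
def extract_words(s, dictionary, from_left=True, min_len=3):
    """Greedily pull dictionary words out of s using one traversal direction.

    min_len is an assumed cutoff so that single letters don't count as hits.
    """
    words = []
    remaining = s
    while remaining:
        hit = None
        for sub in traverse(remaining, from_left):        # traversal sketch above
            if len(sub) >= min_len and sub in dictionary:
                hit = sub
                break
        if hit is None:                                   # no more hits: stop
            break
        words.append(hit)
        # Remove the hit and keep going on what is left of the string.
        if from_left:
            remaining = remaining[:len(remaining) - len(hit)]  # hit was a suffix
        else:
            remaining = remaining[len(hit):]                   # hit was a prefix
    return words

def best_word_cover(s, dictionary):
    """Run both passes and keep the list whose shortest word is longer."""
    left = extract_words(s, dictionary, from_left=True)
    right = extract_words(s, dictionary, from_left=False)
    if not left or not right:
        return left or right
    return left if min(map(len, left)) >= min(map(len, right)) else right
```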
But what about the case where we have built a dictionary based on almost 70 different languages? How do we use that? Let's say that we are looking at this string, and let's say that I found these substrings in it, and these are all the dictionaries that each one of them appears in. So the question here is: at the end of the day, how many languages do I need to cover these substrings that I found? If I take the union of these dictionary lists, it's going to be way too many. If I take the intersection, it's going to be empty; they don't have any common element. So I need to find the minimal set of dictionaries, or languages, that covers these substrings. So how
do I do that? That's actually the minimum hitting set problem. It's a very well-known, very well-studied problem, and it's basically the father of a bunch of other well-known problems such as the set covering problem. This is the definition of the problem; I don't want to get into it, you can always look it up. Unfortunately, this is an NP-hard problem. But the good news here is that our sets are small enough that even if I do a greedy search in the space of possible solutions, I
can find the minimal hitting set. So that's exactly what I do: I have this very simple greedy algorithm for finding the minimal hitting set. It is by no means the optimal or best greedy algorithm for this, but I don't care, because it runs in a very small fraction of a second and gives me what I need. That's good enough for me. So back to our example. We had these sets that we
wanted to find the minimum hitting set of, and by just applying a very simple greedy search, I find these minimum hitting sets, and these are the cardinalities of these sets too. So my minimum hitting set number is two, meaning that I need at least two different languages to cover the substrings that I found.
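Here is a minimal sketch of a simple greedy heuristic of the kind described, assuming the input is the list of language sets for the substrings that were found; this is a generic greedy hitting-set routine, not necessarily the exact algorithm from the talk.

```python
def greedy_hitting_set(sets):
    """Greedy approximation of the minimum hitting set.

    sets: list of sets of language names, one per substring found.
    Returns a small set of languages that intersects every input set,
    by repeatedly picking the language that hits the most uncovered sets.
    """
    uncovered = [s for s in sets if s]
    chosen = set()
    while uncovered:
        # Count how many still-uncovered sets each candidate language would hit.
        counts = {}
        for s in uncovered:
            for lang in s:
                counts[lang] = counts.get(lang, 0) + 1
        best = max(counts, key=counts.get)
        chosen.add(best)
        uncovered = [s for s in uncovered if best not in s]
    return chosen

# Toy example: three substrings and the dictionaries they appear in.
langs = greedy_hitting_set([
    {"french", "catalan", "italian"},
    {"english", "french"},
    {"english", "dutch"},
])
print(len(langs), langs)   # 2 languages suffice, e.g. {'english', 'french'}
```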
So based on this minimum hitting set number, the length of the string itself, the percentage of it that was covered by the substrings I found, the sum of the lengths of the words found in the string, and the lengths of the substrings themselves, I define a randomness score, and that becomes my touchstone for detecting randomness. One last
thing to mention here is that I do it twice. I run the string first against the English language only, since English is, in a sense, universal and many people use it. If, according to English alone, I don't have a verdict about the randomness of this string, next I go to all the languages. So this is basically a two-phase filter.
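As a rough illustration of this two-phase idea, here is a sketch that reuses the helpers above: the string is first checked against English only and, only if that gives no verdict, against all languages. The score function, its weights, and the thresholds are all made up for illustration; the talk only lists the ingredients of the score.

```python
def randomness_score(s, cover_words, n_languages):
    """Hypothetical score combining the features named in the talk: string
    length, fraction of the string covered by dictionary words, and the
    minimum hitting set number. The weights below are made up."""
    covered = sum(len(w) for w in cover_words)
    coverage = covered / len(s) if s else 0.0
    return (1.0 - coverage) + 0.1 * n_languages + 0.01 * len(s)

def looks_random_two_phase(s, english_words, mega_dictionary):
    # Phase 1: English only. The 0.4 / 0.9 thresholds are placeholders.
    words = best_word_cover(s, english_words)              # sketch above
    score = randomness_score(s, words, 1)
    if score < 0.4:
        return False                                       # clearly not random
    if score > 0.9:
        return True                                        # clearly random
    # Phase 2: no verdict yet, so repeat against all languages.
    words = best_word_cover(s, mega_dictionary)
    sets = [set(mega_dictionary[w]) for w in words]
    n_languages = len(greedy_hitting_set(sets)) if sets else 0
    return randomness_score(s, words, n_languages) > 0.9
```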
There are a bunch of other considerations here. If you have a sequence of alternating vowels and consonants (try this yourself: sit down and write out a random sequence of alternating vowels and consonants), it looks pretty legitimate; you would think it has to be a word in some language somewhere. So I penalize the score if I see something like this. Another thing I consider is that if I see a dash or underscore in the string, it means there is some natural separation at that point, because otherwise why would there be an underscore or a dash in the middle of a string? So I treat that as a separation and look at each of the separate pieces separately.
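These two considerations are easy to sketch; the helpers below are illustrative, and the actual penalty applied in the score is not specified in the talk.

```python
import re

VOWELS = set("aeiou")

def alternates_vowel_consonant(s):
    """True if the letters strictly alternate between vowels and consonants,
    which the talk penalizes because such strings look deceptively word-like."""
    letters = [c for c in s.lower() if c.isalpha()]
    if len(letters) < 2:
        return False
    return all((letters[i] in VOWELS) != (letters[i + 1] in VOWELS)
               for i in range(len(letters) - 1))

def split_on_separators(s):
    """Dashes and underscores mark natural separations, so score each piece on its own."""
    return [piece for piece in re.split(r"[-_]", s) if piece]

print(alternates_vowel_consonant("dilifoweme"))   # True: pronounceable-looking but random
print(split_on_separators("my-random_string"))    # ['my', 'random', 'string']
```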
So now I want to get into some results here, based on the experiments that we ran, and I look at both false positive and false
negative rates for this. Let's first look at false negatives. I'm using nine domain generation algorithms; these are from known malware and botnets, and they were reverse engineered by the Talos team. They gave me the code and I basically generated these samples. So this is the number of samples that comes from each of these algorithms, and this is the number of strings generated by
this specific algorithm that were missed by my randomness detection. So if you look at the missed percentages, I don't know how clear it is on the screen, but basically the rate is pretty low. Across the board, it's probably around 1%,
less than 1% false negative rate. So how about false positives? For this, I took the Alexa 10,000 domain names and filtered out strings that are shorter than eight characters; I put them aside, as I mentioned in the first slide. So I'm left with almost 5,400 domain names, and I ran them through the code. The rationale here is that in the Alexa 10,000 you're unlikely to see lots of DGAs, so hopefully a lot of those are legitimate websites. And out of these 5,400 domains that I checked, 42 of them were
detected by my algorithm as being random. And this is the whole list of them. Some of them, like for instance this one, this one, if you showed it to me, I would say this is a random string. Or I don't know, like this one. This is
pretty random looking to me. But some other ones are not random. Like for instance, as a matter of fact, I know that this is a pretty legitimate Turkish website. Or this one is a legitimate Farsi website. So at the end, out of 5,400, we
have 42, so we can say that it's about a 1% false positive rate, which is not bad. So 1% false positives and 1% false negatives, and overall I think that's a good
rate compared to other studies that I've seen on this problem. And this would conclude my talk. Thank you very much.