Blue Team Village - QuarkEngine - TIB AV-Portal

Blue Team Village - QuarkEngine

00:00

4

Related Material

Song, JunWei Chen, KunYu

Formal Metadata

Title

Blue Team Village - QuarkEngine

Title of Series

DEF CON Safe Mode

Number of Parts

374

Author

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/49824 (DOI)

Publisher

Release Date

Language

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Android malware analysis engine is not a new story. Every antivirus company has their own secrets to build it. With python and curiosity, we develop a malware scoring system from the perspective of Taiwan Criminal Law in an easy but solid way.

Speech

Text

Image

00:00

Asynchronous Transfer ModeTheoryOrder (biology)MalwareAndroid (robot)Level (video gaming)CodePasswordUniform resource locatorInformationRule of inferenceOrder (biology)QuarkDirection (geometry)Functional (mathematics)Function (mathematics)Level (video gaming)Digital photographyCellular automatonWeightShared memoryStrategy gameMessage passingCalculationTotal S.A.Inheritance (object-oriented programming)ResultantMathematical analysisPower (physics)Library (computing)Physical systemOperator (mathematics)SequenceBefehlsprozessorNumberProduct (business)Computer fileBitLimit (category theory)BootingThresholding (image processing)CASE <Informatik>Wechselseitige InformationoutputElectric generatorDescriptive statisticsComputer fontCodeCore dumpSystem callCombinational logicVector spaceString (computer science)ExistenceSingle-precision floating-point formatArithmetic meanTraffic reportingPlanningMalwareDeterminantService (economics)Confidence intervalRevision controlGroup actionGodProcess (computing)ImplementationMereologyCartesian coordinate systemSinc functionMultiplication signSource codeHalbordnungStreaming mediaArray data structureOpen sourceBinary file1 (number)SoftwareParameter (computer programming)Slide ruleField (computer science)View (database)SummierbarkeitBinary multiplierReal numberTheoryIndependence (probability theory)3 (number)Insertion lossArithmetic progressionPhysicalismObject (grammar)Android (robot)Point (geometry)Right angleLogicSheaf (mathematics)Software testingMathematical optimizationException handlingTable (information)TwitterVirtual machineDifferent (Kate Ryan album)Prisoner's dilemmaExterior algebraElectronic mailing listMathematicsLine (geometry)Hash functionWritingTwo-dimensional spaceSimilarity (geometry)Presentation of a groupWordCybersexInformation securityPhysical lawOracleIntelMusical ensembleHidden Markov modelFisher informationData typeTask (computing)FamilyShooting methodPrime idealState observerLatent heatMatching (graph theory)Bit rateComputer configurationInterior (topology)AuthorizationDot productMetropolitan area network

Transcript: English(auto-generated)

00:00

and DevCon. So this is the outline. Number one, we will introduce our malware scoring system. And number two, we will show you how we designed the Delvik Bico loader. And number three, we will go through two cases of real malware analysis using Quark Engine. And number four, we will share our strategy of generating the detection rules.

00:22

And the last thing, yes, the future works because we still have a lot of things to do. All right, so let's introduce the malware scoring system. As we know, when developing a malware analysis engine, it is important to have a scoring system. However, those systems are either business secrets or just too complicated.

00:44

Therefore, we decided to create a simple but solid one and take that as a challenge. And since we wanted to design a novel scoring system, we stopped reading and decoding what other people do in the field of cybersecurity,

01:00

because we don't want our ideas to be subjected to the existing systems. So we started to find ideas in fields other than cybersecurity, and luckily we found one. Yes, the best practice we found is the criminal law. So when sentenced to penalty for the criminal, the judge weighs the penalties based on the

01:23

criminal law. And after decoding the law, we find principles behind it, and we develop a scoring system for Android malware. There are only eight principles decoded from the criminal law, and I'll go through it in the following slides. Now let's see principle number one.

01:40

A malware crime consists of action and targets. In the criminal law, the definition of a crime consists of action and targets. For example, steal money and kill people. So with this principle in mind, we developed the definition of a crime for Android malware.

02:01

And the definition is the malware crime consists of action and targets. For example, steal photos or steal your banking account password. Now let's see principle number two. We consider that the loss of fame is greater than the loss of wealth. So in the criminal law, physical body injury is more serious than psychological injury.

02:26

So the principle we decoded here is when things are hard to recover, we consider it a felony. So with this principle decoded, we developed our second principle. We consider the loss of fame is greater than the loss of wealth,

02:41

because it's easier to make money back than rebuild your reputation. Okay, now let's see principle number three. Arithmetic sequence. In the criminal law, when a murderer is sentenced 20 years in prison, and a robber is sentenced seven years in prison for his crime. Why the number?

03:00

Why 20 and seven years? Why the number? We found no obvious principles in the criminal law, so we use arithmetic sequence to weigh the penalty of each crime. For example, the penalty weight of Y1 is 10 and Y2 is 20 and Y3 30, etc. So now let's see the most important part of the scoring system.

03:23

We create an order theory, which consists of three principles. Here are principle number four, five, and number six. Now let's look at principle number four. The later the stage, the more we're sure that a crime is practiced. As mentioned in chapter four of Taiwan criminal law, each crime consists of a sequence of behaviors.

03:43

Those behaviors can be categorized in the specific order. So let's take murder for example. Determined in stage one. Determined means somebody decided to kill someone. And in stage two, conspiracy. It means he or she started to make a plan for the murder.

04:01

And in stage three, preparation. It means find stops for example weapons or arranging services for the murder plan. And in stage four, starts. It means when things are all set, the murderer takes action and is on the way to kill someone. And stage five, practice. It means the murderer does pull the trigger and shoot someone.

04:23

So as we can see here, the later the stage, the more we're sure that a crime is practiced. So with this principle in mind, we developed Android malware crime order theory. And in this theory, we also have five stages for the crime. For example, if a malware tries to send out your location data by using SMS in stage one,

04:46

we will check if related permission is requested by the malware. And then we will check if the key native API is called. And in stage three, we will see if certain combination of native API exist.

05:00

And then we will check if the APIs are called in a specific order. Finally, we check if the APIs are handling the same register. Okay, so now you can see from this picture, this is a two-dimensional map for Android malware crime. For the crimes, we put them in y-axis. And for each crime, we use x-axis to see the evidence we called for the crime.

05:25

So x5, y1 means in crime number one, we have found a native API that are called in a correct sequence and are handling the same register. And x3, y5 means in crime number five, we have found certain combinations of native API that is used in this APK.

05:45

So now let's look at principle number five. The more evidence we caught, the more penalty weight we give. So we give stage two more weights than stage one and stage three more weights than stage two, et cetera.

06:01

Okay, principle number six, proportional sequence. As we decode it from the criminal law, the later the stage, the more we're sure that a crime is practiced. So we consider proportional sequence, for example, two to the power of n to present such principle in our scoring system.

06:20

All right, principle number seven, crimes are independent events. For simplicity, we assume crimes are independent events and penalty weights can be added up directly. So this is an example of adding up two crimes. In a malware, we find two crimes. They're stealing photos and stealing your banking account password.

06:40

So the calculation of a total penalty weight is quite simple. For example, for each crime, we use penalty weights of crime to multiply proportion of caught evidence and add up the results of the two. The last principle, principle number eight, thresholds generate system. So after calculating the total penalty weights for a malware,

07:03

we need to have threat level threshold so that we can tell which threat level does the malware fill in. Unfortunately, we can't find them in criminal law, but we know we need to design a threshold generate system for that, not just give any number by intuition. So we decided that threshold for each level is the sum

07:24

of the same proportion of caught evidence multiplies penalty weight of crimes. We know it's not a perfect one, but we're sure that we build a foundation for future optimization. All right, now let's talk about the design logic

07:41

of Delphic Bycode Loader. And Junwei, he will take care of this part. Hello everyone, my name is Junwei and I will take care of this part. So now let's talk about the design logic of Delphic Bycode Loader.

08:01

Our Delphic Bycode Loader is actually the implementation of the Android malware crime order theory. We implement every stage of the theory. There are five stages. The first three stages are easy. We simply use APIs in another open source tool,

08:21

Android Guard, to implement the first three stages. As I just mentioned, the implementation of the first three stages are easy. But in stage four, we need to do a little bit more. So before the implementation, we need to know what does stage four do.

08:41

In stage four, we find the calling sequence of native APIs and check if they are caught in a specific order. For example, if a malware sends out your location data by SMS, then first it will call native API get cell location to get your location data.

09:05

And then call native API send text message to send your location data by SMS. Normally, native APIs are wrapped in functions. So we trace back to see which function is cross-reference from the native APIs.

09:25

And we call those functions the parent function. And we will keep tracing back until we find mutual parent function for both the native APIs. Here is the example. Send text message is called by send SMS,

09:43

which is the parent function of send text message. And get cell location is called by get location, which is the parent function of get cell location. And if we keep tracing back, we will see that both send SMS and get location shares the same parent function, send message.

10:09

And after we find the mutual parent function, we will scan through a small code of the mutual parent function and check which function is called first.

10:22

So this is the small code of send message. We can see that get location is called first to get location data of the cell phone. And the send SMS is called to send out the location data.

10:41

And in stage 4, we found that our design can also overcome the obfuscation techniques used by the malware. When applying obfuscation techniques, functions except native APIs are renamed. This has made the decompiled source code hard to read for human.

11:03

The machine can still run the code because the logic of the code remains the same. Here is the example. When applying obfuscation techniques, native API send test messaging is called by function k, and function k is called by function f.

11:24

The alternative API get cell location is called by function e, and both function e and f shares the same parent function, which is a. You see, if you start reading the decompiled source code of a,

11:41

it will be hard to figure out what is going on there. And by the way, since our goal is to find the mutual parent function, so it doesn't matter how many layers the vectors are. Now, let's see the implementation of stage 5.

12:03

Yes, this is the most important part. In stage 5, we need to confirm that if the native APIs are handling the same register. Let's use the same example. Send out your location data by using sns.

12:21

So when native API get cell location is called, it will return the location data of the cell phone. And what we do in stage 5 is to check if the alternative API send text messaging sends out the location data returned from get cell location.

12:43

So in stage 5, we simulate the CPU operation. We read line by line of the Somali-like source code and operate like CPUs to get two things. First, the value of every register.

13:02

Second, the information like functions who have operated the same register. To make this happen, we create a self-defined data type. We call it register object. In each register object, we store three kinds of information.

13:22

Number one, the register name. Number two, the value of register. And number three, the function who use this register. Let's see an example. So the register name is v7, and the value of the register is a string.

13:42

And the string appends the value of string 1 and the result of function 1. And then we can see that the register is used as the input resource of the function 2. By the way, when filling the value of use by which function in the register object,

14:05

we expand every register by cross-referencing of the register object in the table. So for example, by cross-referencing, we know that v8 is a string called user location,

14:21

and v3 is a function called get location. As you can see in the lower right corner, the result of get location is appended to the string, which is user location. And the new string is sent out by using function sendSns.

14:43

In other words, the value of register v7 is generated by using function getLocation, which has native API 1 in it. And the value is used as an input for function sendSns,

15:02

which has native API 2 in it. So now we prove that by using the register object, we can check if the APIs are handling the same register. So after we scan through the source code, we produce that of register objects.

15:23

And those register objects will be organized in a two-dimensional Python list. It is a similar idea like hash table. We use it to boot out the read and write of the list. So now let's see the table.

15:43

As you can see here, register v4 has three register objects. That means in the source code we scan, v4 was used three times. And every time when it was used, we store the present value of register

16:01

and the function who use it if there is one. So basically the whole table is the history of the registers. So when we finish constructing the table, we then scan through all register objects in the table to check if the native APIs are handling the same registers.

16:23

So now let's see how to use Quark Engine to analyze the malware. Now, let's get back to Kun-Yu. Okay, in this section we prepared two malware. One is non-obfuscated and the other one is obfuscated.

16:43

And for each malware, we will show how we detect the behavior of the malware with the detection rule. Now let's look at the first malware. This is a non-obfuscated one. We will use the rule in Quark Engine to detect whether if the malware sends out the cell phone location data by using SMS.

17:03

So this is the detailed report of Quark Engine. In this report, the engine shows the detection results of one single malware behavior, or you can say one single malware crime. For example, we tried to find if the malware sends out your location data by using SMS.

17:21

So in this report, we list out the evidence we found in each stage of the Android malware crime order theory. And this report shows we find evidence in every stage, which means we have 100% of confidence that the malware has this behavior. So let's see. In stage one, permissions like send SMS,

17:41

access course location, and find locations are requested. In the second stage, the key native APIs like get cell location and send text messages are used. And in stage three, we find certain combinations of native API exist. And in stage four, we found out that in functions like send message and do bytes,

18:01

the APIs are called in the right sequence. And in stage five, in function send message, we find out that those APIs are handling the same register. So now let's think, if you're analyzing this malware, and you want to trace the compiled source code to see the evidence,

18:20

how do you do it? Our suggestion is you start the backwards. That means you start from stage five. For example, in stage five, we know that inside a function of send message, it has two functions that contains the two native APIs respectively, and they're handling the same register.

18:41

So you start to locate function, send message in the decompiled source code. And in stage four, we know that those two functions are called in the right sequence. So we can start to find the functions that contains the native APIs and check if they are really called in the right sequence. The information of the two functions and the sequence

19:03

will be shown in the next version of clock engine. So now let's look at a real example. Let's locate the function, send message. And we found out that two functions that contains the two native APIs respectively, there are send SMS and get location.

19:21

And if we dive into the function of get location, we will see that it contains native API, get location. And if we dive in this function of send SMS, we will see that it contains native API, send text message. So the code here means it first collect your cell phone location data

19:40

and send it out through SMS. So now let's dive into the source code of get location. As you can see in the source code, it tries to call native API, get cell location and return this information at the end of the code. And now let's dive into the source code of send SMS. Native API send text message is used to send out the location information.

20:05

So quite simple, isn't it? Now let's look at the second malware. This is an obfuscated one. We will use the rule in clock engine to find whether if the malware detect Wi-Fi hotspot by gathering information like active network info and cell phone location.

20:23

Okay, so as a malware analyst, we read the report backwards. As you can see in stage five, there are functions like p.a, at view.c and af.run. And those functions has two functions that contains native APIs respectively, and they're handling the same register.

20:42

And in stage four, those two functions are also called in the right sequence in functions like p.a, at view.c and af.run. So according to the reports, we can see that malware has the behavior of Wi-Fi hotspot detection in three parts of the source code. And we can pick any parts for further analysis.

21:03

So we just pick a function p.a. So now let's see the source code. Let's first locate the function p.a. And we found out that two functions that contains the two native APIs respectively, there are ap.a and f.f.

21:21

And if we dive into the source code of function ap.a, we will see that it contains native API get active network info. And if we dive into the function of f.f, we will see that it contains native API get cell location. So this code here means after collecting information from function ap.a and f.f,

21:43

it sent the information as an input for function am.a. So now let's dive into the source code of function ap.a. As you can see in the source code, it tries to call native API get active network info and return the related information at some points.

22:01

So now let's dive into the source code of f.f. Native API get cell location is used to get the cell phone location data. And this information is processed with some other strings. And at the end of this function, it returns the strings with the information. As mentioned earlier,

22:20

after collecting information from function ap.a and f.f, they use the information as an input for function am.a. So, and we noticed one thing. The function am.a used byte arrayed output stream as one of its input parameter. And we know when seen byte arrayed output stream,

22:43

it means the function is probably trying to write the data into a file. This is amazing, isn't it? So with Quark Engine, malware analysis can really boost up your productivity. Okay, so now I will introduce our detection rule generate strategy.

23:03

So why do we need to develop the detection rule generate strategy? Because to make our engine practical and easy to use, we have to, we need to have more detection rules. However, the speed of rule generated by human is quite slow. And human generated rules is subjected to his

23:23

or her experiences of malware analysis. So we developed that rule generate strategy to boost up the production of detection rules. And since our goal is to find all kinds of behaviors in the malware, so that if we use permissions and native APIs to generate all possible rules,

23:41

we will have an amazing amount of rules. After generating the rules, we then use Quark Engine to find the intersection between those amazing amounts of rules and the malware we prepared. In other words, we find rules that match the malware behaviors. However, this is not a good way to generate detection rules

24:00

because it's time and resource consuming. So we developed a seven-step generate strategy. Step one, we curl down all native API information on Android official API reference. For example, this is the native API information of send text message.

24:21

You can see the input parameters, return value, and a description of this API. Okay, step two, we did a little bit modification to our engine. We ignore the permission checks in stage one of the Android malware crime order theory. And in step three, we find all kinds of API combination

24:42

and generate rules without permission information. And in stage four, we use the modified Quark Engine to find intersections of the rules and the malwares. We call the rules, we call rules in the intersection the first stage verified rules. In other words, these rules need to be verified again.

25:04

And since we don't need to generate rules with permission and verified permissions in Quark Engine, the whole process of rule production speed up. And in stage five, we try to generate rules with permissions. So inside the intersection,

25:20

we have first stage verified rules matched with malwares. We then use the first stage rules and permission in the matched malware to generate rules with permissions, which is the second stage rules. And in step six, we then use the Quark Engine, yes, the full function version,

25:42

to find again the intersection of the second stage rules, which are the one with permission and the malware we prepared. So after that, for each rule, we label the number of matched malware. For example, the behavior of rule number one

26:01

can be found in 100 malware, et cetera. And in step seven, after labeling the rules, we then sort the rules by number of matched malware. We will review the rules from the highest matched one to the lowest one.

26:20

All right, the last part, future works. As I mentioned earlier, we still have a lot of things to do. For example, we need to have more detection rules and we also need to deal with the .so files and the packed APKs. And we want to have more features of Dalvik Biker Loader,

26:41

for example, the downloader. And we also want to apply the scoring system to other binary formats. Yeah, it is in our to-do list. And we noticed that API changes in different versions of Android. So we will also take care of this problem in the next version of Quark.

27:02

Oh, we probably would change the core library since Android God is quite inactive recently. And one more thing. Actually, we're trying to make Quark itself easier to be integrated to other tools. For example, users can import Quark as a Python library and output the analysis result as a JSON file.

27:23

So now Quark is collected in Blackarc Linux and will soon be integrated to a threat intelligence analysis tool, Intel All. And there's one nugget that I would like to share. We work at the limits of our tools. When new tools come along, new things are possible.

27:42

Okay, so that's all for today. And if you have any question, please feel free to DM our Twitter accounts. Thank you.

Recommendations