GOD MODE unlocked: Hardware backdoors in x86 CPUs

Video in TIB AV-Portal: GOD MODE unlocked: Hardware backdoors in x86 CPUs

Formal Metadata

Title
GOD MODE unlocked: Hardware backdoors in x86 CPUs
Title of Series
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2018
Language
English

Content Metadata

Subject Area
Abstract
Complexity is increasing. Trust eroding. In the wake of Spectre and Meltdown, when it seems that things cannot get any darker for processor security, the last light goes out. This talk will demonstrate what everyone has long feared but never proven: there are hardware backdoors in some x86 processors, and they're buried deeper than we ever imagined possible. While this research specifically examines a third-party processor, we use this as a stepping stone to explore the feasibility of more widespread hardware backdoors.
Good afternoon. Are you enjoying this conference? Come on, come on. So, Christopher Domas: he's going to do his talk. He's going to be running completely long, he'll be using all of his time, so he's asked that if you've got any questions, you take them up after the talk. So with that, go ahead, Chris.

All right, welcome everyone. We'll go ahead and get started. What I wanted to talk to you about today is honestly something that I didn't think was possible, but it should be pretty interesting. The term "hardware backdoor" gets thrown around a lot these days, so much that it's kind of lost all meaning. But what I want to look at today is not the things that we're normally worried about. It's not the Management Engine, it's not the Platform Security Processor. This is something totally different that I think wasn't on anyone's radar, something we didn't really see coming.

But like any good presentation, I think we've got to start off with a disclaimer: basically, all of this research was done on my own. This does not reflect the opinion of my current employer or any previous employers. With that, my name is Christopher Domas. I'm a cybersecurity researcher. I've tinkered with a lot of different things in the past, but for the last few years what I've been focused on is the idea of low-level processor exploitation. It turns out there's a lot of fun things we can do with that.

So in order to frame the discussion for the rest of the day, I want to start off with a demo of exactly what we're going to be talking about. What we have here (I'll pause that): I'm logged into the system as a regular unprivileged user. I'm going to open up a very simple C file. All it does is load some values into the EAX register and then execute all of these bound instructions. Now, if you're not familiar with the x86 bound instruction, it is a pretty simple instruction: all it does is take two arguments and check whether your first argument is within the bounds provided by the second
argument. Now, if you look carefully at these bound instructions, what you'll see is that they use some kind of weird values; it basically looks like a random value on the right. What that means is that these instructions don't actually even have access to the memory that they're trying to use. Now in x86, when you don't have access to the memory that you're trying to use, you get a general protection exception. So every single one of these bound instructions is going to throw one of these general protection exceptions, and in Linux you see that as a segfault. Despite knowing that that's what's going to happen, at the end of this program we are going to try to execute a shell.

So I'll get out of here and we will run this program. We'll compile it, launch it, and exactly what we expect to happen will happen: we'll get a segmentation fault, because those bound instructions don't have access to the memory that they're trying to use. And I'm still the exact same user I was before, still logged in as delta.

But now I make one tiny change to this program: I'm going to add one x86 instruction. It's an instruction that's so obscure and unheard of that it doesn't actually have a mnemonic associated with it; it is going to have to be written in raw machine code, 0F 3F. And what this instruction is going to do is fundamentally change the nature of all the subsequent bound instructions, so that now, instead of their original functionality, instead of checking whether or not one value is within the
bounds of another, they are actually going to reach into kernel memory and modify the system itself. So I'll compile this and re-execute it, and all of a sudden we'll see I am now root.

So the rest of this presentation is going to be basically a very long, convoluted journey into how exactly I stumbled across this, and the whole thing begins with this idea of x86 rings. In the beginning, in x86 about 30 years ago, there was no concept of separation of privileges. All the code running on the processor executed with the exact same privileges as all the other code running on the processor, and it was basically chaos. But then we added the idea of rings of privilege to the processor in order to separate out permissions.

The idea behind rings of privilege was that at the lowest level we would have ring 0. That would be the most privileged realm on the processor, where the kernel would live. Only really trusted code would execute in ring 0, and it would have complete access to the hardware. A little less privileged than that was ring 1, less than that was ring 2, and then way out here in the outermost ring is ring 3. That's where the user code lives: basically all the code that we don't trust to have access to the hardware is going to execute way out here. So this was the fundamental idea behind privileges on x86: if ring 3 wanted to do anything interesting, it would have to go through very specific, carefully checked hardware privileged access mechanisms in order to ask ring 0 to do something for it. And all of our other security on x86 is based off this fundamental premise of separation of rings.

Then things got kind of interesting. People started digging deeper. We found out we needed things more privileged than ring 0: the hypervisor had more access than the kernel normally would, so we called that ring -1. We found out some things have to be more privileged than that: system management mode, which we called ring -2. A couple of years ago some researchers found another
core off the x86 core that they called ring -3, because it could do things that the x86 core couldn't do. So things were interesting. But if you've been following any of this research, in the back of everyone's mind is this question: can we go further? Are there even more layers to this puzzle?

So I set off to explore that idea. A lot of times when I'm starting out on new research, I find an interesting place to begin is with patents. Sometimes you can find some really interesting information that people don't document publicly, but you might be able to find some bits of useful ideas inside of patents. So, keeping this whole ring model of privileges in mind, imagine my surprise when I was reading through a patent and I saw, just nonchalantly buried in the middle, this quote: "Additionally, accessing some of the internal control registers can enable the user to bypass security mechanisms, for example allowing ring 0 access at ring 3."

That's a little bit alarming from a security perspective. You're telling me we've had 30 years of relying on rings to provide our privileges on x86, and there's just some way to circumvent all of this? They go on to say: "In addition, these control registers might reveal information that the processor designers wish to keep proprietary, and for those reasons the various x86 processor manufacturers have not publicly documented any description of the address or function of some of these control MSRs."

So this was really interesting. I was really, really excited about this and wanted to run with this idea. So like any rational person, I went out and bought fifty-seven computers to start doing some basic research on. Based on the patent time frame and the patent owner, I had some idea of what processor I might specifically want to focus on for this research. But patents are kind of a funny thing: IP gets bought and sold by different
companies, ideas trickle through the industry, so I wanted to cast a wide net and look at a bunch of different systems for this research. But what I eventually settled on was the VIA C3 processor.

These are interesting processors, targeted at an embedded low-power market. You can find these often in point-of-sale systems, kiosks, ATMs, gaming (since we're in Vegas, you might want to start poking around), digital signage, healthcare, digital media, industrial automation, and they're in PCs and laptops as well. So I eventually pulled this specific system off my shelf: it's a thin client running a C3 Nehemiah core, and this is what we're going to look at for the rest of the presentation. At the very end of the presentation I'll look a little bit more at what other processors might be affected by this issue, but this is the target of this specific research.

I wasn't able to find a developer manual for this processor. So when we're left in the dark, a good place to go is to follow a trail of patent breadcrumbs: see what information we can derive from patents, and see if there's anything useful in there that can give us hints as to how we should move our research forward. So I had to dive through a lot of patent literature, and I want to give a little bit of a glimpse into what that's like. This isn't actually a quote from the patents that I ended up using, but it is something I stumbled across during this research. This quote says:

"Figure 3 shows an embodiment of a cache memory. Referring to figure 3, in one embodiment, cache memory 320 is a multi-way cache memory. In one embodiment, cache memory 320 comprises multiple physical sections. In one embodiment, cache memory 320 is logically divided into multiple sections. In one embodiment, cache memory 320 includes four cache ways, i.e., cache way 310, cache way 311, cache way 312, and cache way 314. In one embodiment, a processor sequesters one or more cache ways to store or to execute processor microcode."

My head just exploded when I read stuff like this. It's so convoluted and so wrapped up in legalese, it's hard to understand what the heck you're even reading, and
just to put that into perspective, this one four-page patent contained the phrase "in one embodiment" 147 times. So following this just to start your research, when you really want to dive into things, is a really, really painful process. But if you're persistent, it can pay off.

I eventually settled on six different patents that seemed to have some basic ideas that could point me in the right direction for this research, and I want to highlight some of the key ideas that I got out of those six patents. One: it looked like at the time VIA was embedding a non-x86 core alongside the x86 cores in their C3 processors, and this non-x86 core was a RISC architecture. The patents didn't have a consistent terminology for this, so I just started calling it the deeply embedded core, or the DEC, on these processors. They also talked about something that they called the global configuration register. It's basically a register that's exposed to the x86 core through a model-specific register, and they say that this global configuration register can activate the RISC core. They also talked about an x86 launch instruction: basically a new instruction that they added to the x86 ISA where, once the RISC core is activated, you can start a RISC instruction sequence through the launch instruction, according to the patents.

So putting all these ideas together, what it looked like is that if our assumptions about what this deeply embedded core can do are correct, you could essentially use this as a sort of backdoor, a means of surreptitiously circumventing all of the processor security checks. So that's absolutely an idea worth exploring more for security purposes.

There are three patents that give us some initial ideas on how the overall mechanisms might work. One patent tells us that a model-specific register can be used to circumvent processor security checks, another patent tells us that a model-specific register can be used to
activate a new x86 instruction, and another patent tells us a launch instruction can be used to switch to a RISC instruction sequence. So if we put those three things together and try to fill in the gaps, we end up with a sequence that looks something like this: we have to find a model-specific register bit that, when toggled, will activate a new x86 instruction that didn't previously exist. Running that instruction will activate a RISC core on the processor, and that core should be able to bypass the processor's security protections.

So where do we begin? Well, let's look at the very first step in that chain: these model-specific registers. If you're not familiar with the idea of MSRs in x86, they are basically 64-bit control registers used for a wide variety of things. They're used for debugging, performance monitoring, cache configuration, feature configuration; basically anything that doesn't have to do with general computing can get tossed into the MSRs. And unlike the x86 registers you might be more familiar with, they aren't accessed by name, they're accessed by address. So instead of EAX and EDX, we have addresses from 0 to 4 billion; that's how we access our MSRs. Basically, you load an address into the ECX register and then execute the read MSR or write MSR instruction in order to access an MSR.

Now, there's some saving grace here. We think that maybe we can use this to circumvent the ring protections on the processor, but these MSRs are only accessible to begin with from ring 0 code. So maybe the rest of the stuff we can do from ring 3, but in order to activate that bit to begin with, we need ring 0 execution. So that's good news. Although maybe we don't need ring 0 execution; I'm going to revisit this idea later in the talk. But so that we can move the research forward, let's assume for now that we have one-time ring 0 access on that processor, just to enable this backdoor feature, and we'll revisit that concept later. So let's look at these model-specific registers
some more. Well, the patent basically comes right out and tells us that the various x86 processor manufacturers have not publicly documented any description of the address or function of some of the control MSRs. So that's a challenge for us: I think there's a bit in one of these MSRs that might activate something cool, but it's probably not going to be documented.

So step one, in order to explore this idea, is just to figure out which MSRs are actually implemented by the processor. Regardless of what the documentation says, which MSRs are actually there? This one's pretty easy to solve. Basically, you can set the general protection exception handler on the processor to point to some handler under your control; you can do that with the lidt instruction. Then you can load up an MSR address that you want to check. Let's say you want to figure out whether MSR number 1337 exists on your processor: load that number into the ECX register and then execute the read MSR instruction. Now, if you don't get a fault from that read MSR instruction, you know that that MSR exists. If you do get a fault, if your handler gets control, you know that that MSR doesn't exist. So this is a really easy way to figure out exactly which MSRs actually exist on your processor, regardless of what the documentation says.

When you do this on the VIA C3, you end up with an alarming number of MSRs, way more than would normally be on an x86 processor. We found 1,300 implemented MSRs on that processor, which is far too many to analyze. I think one of these is going to activate a new x86 instruction, but that's too many to go through by hand. So step two is figuring out which MSRs here are actually interesting: which ones should I be exploring for this research? So I came up with an idea.
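The existence check described above (read an MSR by address and see whether the read faults) can be sketched in a few lines of Python. This is only an illustration and not the tooling from the talk: instead of installing a custom ring 0 fault handler with lidt, it assumes a Linux system with the `msr` kernel module loaded and root access, where a faulting read surfaces as an `OSError` from the `/dev/cpu/0/msr` device. The names `read_msr_linux` and `find_implemented_msrs` are hypothetical.

```python
import os
import struct

# Assumption: Linux with the msr kernel module loaded, run as root.
MSR_DEVICE = "/dev/cpu/0/msr"


def read_msr_linux(address):
    """Read one 64-bit MSR; the file offset passed to pread is the MSR address."""
    fd = os.open(MSR_DEVICE, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def find_implemented_msrs(addresses, read_msr=read_msr_linux):
    """Return the MSR addresses whose read does not fault."""
    implemented = []
    for address in addresses:
        try:
            read_msr(address)
        except OSError:
            # The read faulted, so treat this MSR as not implemented.
            continue
        implemented.append(address)
    return implemented
```

The reader function is injectable so the loop can be exercised without touching real hardware. Walking the whole 32-bit address space this way would be slow; the sketch only shows the shape of the check.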
The idea is a sort of timing side-channel attack against the MSRs, where basically what I would do is calculate how long it took to access each of the 4 billion possible MSRs. What that looks like is: you've got this read MSR instruction, and on either side of it you've got a serialized read-timestamp-counter instruction. That lets you see exactly how long it took your read MSR to execute, which shows you how long it took to access the MSR.

If you run this code, the result looks something like this, where on the x-axis I've got the MSR numbers and on the y-axis I've got the access time for that MSR. Green is MSRs that are implemented on the processor; red is MSRs that are not implemented on the processor. So this side channel actually gives you some really interesting insights into how the processor works under the hood, insights that we normally would never have access to.

What we can do with this, then, is form a hypothesis. I'm going to theorize that this global configuration register, this really powerful register that the patents talk about, is probably unique. There are probably not several similar versions of this register within the MSRs, so there's probably exactly one of these on that processor. So what I can start to do is look for MSRs with unique access times, like these that are circled in red. And when we actually start to do that, what we find is that there are relatively few MSRs that are truly unique on this processor. In fact, out of the 1,300 MSRs that are there, we identify 43 that actually seem interesting and worth exploring more.

So that's good; that seems like we're making progress. I've whittled down the number of MSRs from initially 4 billion down to 43 to actually explore in this research. But it's still a lot to tackle by hand: that's 2,752 bits worth of MSRs to check. Now, my theory is that one of these bits is going to activate a new x86 instruction, so I want to figure out what happens when I toggle these bits.
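The filtering step above, keeping only the MSRs whose access time is shared by no other MSR, can be sketched as follows. This is an illustrative reconstruction, not the talk's code. It assumes the timings have already been collected and cleaned up into one cycle count per MSR (for example by taking the minimum over many runs), and the function names are hypothetical.

```python
from collections import Counter


def unique_timing_msrs(access_cycles):
    """Return MSR addresses whose access time is shared by no other MSR.

    access_cycles maps an MSR address to its measured access time in cycles
    (assumed to be already denoised, one value per MSR).
    """
    counts = Counter(access_cycles.values())
    return sorted(
        address
        for address, cycles in access_cycles.items()
        if counts[cycles] == 1
    )


def candidate_bit_count(msr_count, bits_per_msr=64):
    """Each MSR is 64 bits wide, so every candidate MSR adds 64 bits to test."""
    return msr_count * bits_per_msr
```

With 43 candidate MSRs, `candidate_bit_count(43)` gives the 2,752 bits mentioned in the talk.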
Did a new instruction appear on the architecture? Well, x86 is a really, really complicated instruction set, and it's hard to estimate how many instructions could actually be in x86, but a rough upper bound would be somewhere on the order of 1.3 undecillion possible instructions. So I want to figure out whether one of these MSR bits created a new instruction, but I've got to search through 1.3 undecillion instructions in order to find that new instruction, if it appeared at all.

Even taking a really optimistic estimate, if we could scan through a billion possible x86 instructions every second, you can do some quick Fermi calculations: 1.3 undecillion, divided by 1 billion, divided by 60 seconds a minute, divided by 60 minutes an hour, divided by 24 hours a day, divided by 365 days a year, means it's going to take about an eternity to scan that entire instruction set. So that's not reasonable. And making things worse, that's for one scan. We have to scan for each of these 2,752 bits, so that's about 2,752 eternities in order to find which bit creates a new instruction.

Fortunately, there's a better solution. I released a tool last year called sandsifter, which was kind of neat: it found a smart way of searching through the x86 instruction set, using page fault analysis and a depth-first search algorithm, in order to scan all of x86 for the most probable instructions in about a day. It's used to find things like undocumented instructions or new instructions appearing. But it still can't be run 2,752 times if it takes a day to scan for a single bit.

What we can do instead is try to toggle each of those 2,752 candidate MSR bits one by one. But these are configuration bits. They control the inner workings of the processor, and I have no idea what these bits actually do, so a lot of them are going to lock the processor, or freeze it, or panic the kernel, or reset the system entirely. So trying to toggle 2,000-some bits one by one can't be a manual process.
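The back-of-the-envelope estimate for an exhaustive scan, worked through earlier in this passage, can be checked with a few lines of Python. The figures (1.3 undecillion instructions, a billion instructions per second, 2,752 candidate bits) come from the talk; the function name is just for illustration, and "undecillion" is taken in its short-scale meaning of 10^36.

```python
UNDECILLION = 10**36  # short scale


def years_to_scan(instruction_count, instructions_per_second):
    """Convert a brute-force scan into years: seconds -> minutes -> hours -> days -> years."""
    seconds = instruction_count / instructions_per_second
    return seconds / 60 / 60 / 24 / 365


one_scan_years = years_to_scan(1.3 * UNDECILLION, 1e9)
all_bits_years = one_scan_years * 2752

print(f"one scan:      {one_scan_years:.2e} years")
print(f"all 2752 bits: {all_bits_years:.2e} years")
```

A single exhaustive scan comes out on the order of 10^19 years, which is the "eternity" in the talk, and that is before multiplying by the number of candidate bits.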
We need some sort of automation in order to make this work. The system I came up with looks something like this. We have a target, that's the C3 processor, hooked up to a relay: basically, a wire is soldered onto the power switch on the target, and that's hooked up to a relay. That relay is hooked up to a master system, so the master can power cycle the target through the relay. The target also network boots through a switch from the master, and the master can then SSH into the target. So the master can tell the target, "toggle this MSR bit," and the master is going to check: did something become unstable? Did the system stop responding? Did the kernel panic? If not, it's going to try to toggle the next MSR bit. If so, it's going to power cycle the target. This way the master can repeatedly go through and identify exactly which of those 2,752 MSR bits can be activated without the target becoming unstable.

So over the course of about a week, through hundreds of automated reboots, we were able to identify exactly which bits we can actually turn on on that processor. After that, what we do is try to turn on all the bits that we possibly can, get all of those bits on at once, and only then do we actually run sandsifter. Only then do we scan the processor for new instructions.

So that looks something like this, using sandsifter for this purpose. Like I said, sandsifter uses page fault analysis and depth-first search in order to search through the x86 ISA. We're watching sandsifter in the middle of a scan, probably about 12 or 15 hours in, and what you can see it doing is generating machine code in order to search through the feasible x86 instructions on this processor. And if you let sandsifter run for long enough, eventually it will spit out something new in that lower window there. After about a day, sandsifter finds exactly one new instruction on that architecture: 0F 3F. So this must be the launch instruction.
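The master/target loop described above can be sketched as a small Python function. This is a simplified illustration under stated assumptions, not the actual harness: the three callables stand in for the SSH command that toggles a bit, the health check, and the relay-driven power cycle, and all of their names are hypothetical. It also assumes each bit can be judged independently of the bits toggled before it.

```python
def find_stable_bits(candidate_bits, toggle_bit, target_is_healthy, power_cycle):
    """Return the candidate bits that can be toggled without destabilizing the target.

    candidate_bits     iterable of (msr_address, bit_index) pairs
    toggle_bit         asks the target to flip one MSR bit (e.g. over SSH)
    target_is_healthy  returns False on a hang, kernel panic, or reset
    power_cycle        forces the target off and on again (e.g. through a relay)
    """
    stable = []
    for bit in candidate_bits:
        toggle_bit(bit)
        if target_is_healthy():
            stable.append(bit)
        else:
            # The target locked up or crashed: recover it and move on.
            power_cycle()
    return stable
```

Keeping the hardware interactions behind plain callables means the control flow can be tested against a fake target before it is pointed at a real machine.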
It's the instruction that the patents are talking about, the new instruction that got activated by these MSR bits. Through gdb and some trial and error, I was able to figure out that the launch instruction is basically a jmp eax instruction.

From that, we can figure out what the global configuration register is. I activated all of these possible MSR bits in order to find a launch instruction; now I'm curious which bit was really responsible for activating it. Fortunately, now that I've identified the 0F 3F instruction, I no longer need to run sandsifter for additional scans. What I can do is activate each of those MSR bits one by one, and after activating one, I'll try to execute the launch instruction. If it doesn't work, then that was the wrong bit. If suddenly the launch instruction appeared on the architecture, then that means I found the bit that activated the launch instruction. Using that approach, we were able to find out that MSR number 1107 was the MSR that enabled the launch instruction; specifically, it was bit 0 inside of MSR 1107.

Now, I suspect that this is going to open the door to another architecture on the processor, which will let me bypass those ring protections once and for all and circumvent all the processor's security checks. So because of how much power that single bit has just potentially enabled, I started calling bit 0 of MSR 1107 the god mode bit.

So at this point we've figured out the god mode bit, and we've figured out this hidden launch instruction in the x86 ISA. The next question is: how do I actually execute instructions on this new RISC core
that we've enabled? So we can dive into the patents to try to figure this out, and the patents hint at this idea that instructions are fetched out of memory and then there are two different decoders, depending on whether you're in x86 or RISC mode. I had to go through a lot of trial and error to figure out exactly what that might look like under the hood, but this is what I ended up with.

Essentially, an instruction would be fetched out of the instruction cache and passed to some sort of pre-x86 decoder. That decoder is going to break apart the components of an x86 instruction, and then those components are going to get passed to a check. That check is going to determine: am I in RISC mode or not? Namely, has that launch instruction just been executed? If the answer is no, then those components are all passed on to a further x86 decoder and the instruction executes as x86. If the answer is yes, one of the components of that instruction, a 32-bit constant value, will be torn out and passed over to the RISC decoder in order to execute as a RISC instruction.

So basically, with this setup there needs to be some x86 instruction where, if the processor is in RISC mode, it can pass a portion of itself over to the RISC processor on this chip. And since this instruction, whatever it is, joins the x86 and RISC cores, I called it a bridge instruction. It basically gives me a way to feed instructions over to the RISC core.

So the next question is: how do we find this unknown x86 bridge instruction that will let me execute RISC instructions? It's not easy, but it should be sufficient just to detect that a RISC instruction has been executed. So how can we detect that a RISC instruction has been executed, given that I have no idea what these instructions can actually do or what their execution will look like? Well, an easy way would be this: if our theory is right, if this RISC core really can circumvent processor security checks, then there should be some RISC instruction, I don't know what, that even when executed in ring 3 should be able to corrupt the system. And corruptions are usually pretty easy to detect: that should look like a processor lock, or a kernel panic, or a system reset. Basically, if I observe any
of those behaviors, that means I've executed one of these mysterious RISC instructions, because none of those things should ever happen when executing ring 3 x86 instructions normally. So I sort of tore apart sandsifter in order to help me with this. I ripped out the core of sandsifter and changed it to run in a brute-force mode. It's still executing x86 instructions, but before each x86 instruction that it generates, it's going to execute the launch instruction in order to switch to RISC mode, and what it's trying to find is some x86 instruction that corrupted the system. That's actually what we just saw here: we saw the processor lock when sandsifter hit exactly the right combination of x86 and RISC instructions. Once we observe that processor lock, that means that we found this bridge instruction, this x86 instruction that can execute RISC instructions. It takes about an hour of fuzzing to get here, we just saw a short snippet, but it turns out that this bound EAX instruction is the bridge instruction. It's what's going to let me feed instructions over to the RISC core, and specifically the 32-bit constant value used in the bound instruction appears to be the RISC instruction that gets executed on the deeply embedded core. That's a pretty easy thing to check: basically, I can see that for some specific 32-bit values the processor locks every single time, and for other 32-bit values nothing seems to happen, very, very consistently. So now we've found the bridge instruction, and I know how to send instructions over to that deeply embedded core. The next question is: what do we want to execute on this alternate instruction set? What do these instructions even look like? What architecture am I even dealing with? Ideally, moving forward, I would just assume that this other architecture is probably some known, common architecture. It's probably something like ARM or PowerPC or MIPS. It doesn't make a whole lot of sense to invent an
architecture from scratch. So we could assume that this other architecture is some common architecture, and I could try to execute a common architecture's instructions on this other core. For example, if I thought maybe I'm dealing with ARM, I might try to execute an "add 1 to r0" instruction. The problem I encountered was that for some of these very, very simple RISC instructions I was generating, the processor would lock. If I generate an instruction like "add 1 to r0", and I try to execute that on this RISC core and the processor locks, that probably means I'm not dealing with the architecture I thought I was dealing with. I'm probably not dealing with ARM if that simple instruction locked everything up. I was actually able to rule out 30 different common architectures this way. I still think that most likely this other architecture, this RISC core that we're dealing with, is derived from some common architecture, but it seemed like it was modified enough that I couldn't identify it. That sort of forced me to move forward just assuming this thing was a black box, basically treating it as some totally unknown architecture that we've never seen before. That means that we have to reverse-engineer the format of these instructions for this deeply embedded core, and I spent probably the bulk of this research actually trying to reverse-engineer the format of these instructions. I started calling those the deeply embedded instruction set, or DEIS. So the question is: how do we begin reverse-engineering a totally unknown instruction set? Ideally, what I would do is execute one of these RISC instructions and observe its results. The challenge, though, is that I have absolutely no knowledge of this instruction set architecture that I'm dealing with, and I probably can't observe the results on the RISC core. So for example, if I did generate an instruction like "add 1 to r0", but I can only view the x86 side,
then how do I view r0 from my x86 core? How do I even detect that I've made any changes to the RISC core? Fortunately, there is an approach for this. The patents actually suggest that these two cores share a register file, or at least partially share a register file, which means I may not be able to view all the effects of my RISC instruction, but I should be able to view some of the
effects of the RISC instruction from the x86 core. So for example, the patents show an example where an ARM core and an x86 core share some of their registers. That means from the x86 core I can actually see some of the effects of
these instructions. So what this would look like, then, is I can generate an initial state for the processor. I'm going to generate a bunch of values for the processor registers, maybe random values, maybe set all those registers to be pointers to various buffers, and I'm going to record that information. I'm going to generate some buffers in userland memory and in kernel memory, and I'm going to record those buffers. Basically, I record all the information I can about the system state. Then what I'm going to do is execute that launch instruction, that 0F 3F instruction, in order to toggle the RISC core. After that, I'll execute the bridge instruction, the bound EAX instruction that lets me send a RISC instruction over to the RISC core, and I'm going to generate an arbitrary RISC instruction to try out. After that instruction executes, I'll record the resulting system state: all the registers, all the buffers, and everything else. What I want to see is: did something change on the system? Did that input state and output state differ in any way? Unfortunately, we run into even more challenges here. I'm dealing with a totally unknown instruction set that probably has unfettered access to ring 0, so it is really, really easy to accidentally generate instructions that cause kernel panics or processor locks or complete system reboots. In practice, I could only generate about 20 RISC instructions before the system became unrecoverably corrupted and I had to reboot the system and start over. Even after optimization, it took about two minutes for one of these systems to boot, so some rough approximations kind of indicated it was going to take months and months of fuzzing in order to gather enough data to reverse-engineer this instruction set. So I
expanded my initial setup. Instead of fuzzing one target system, I bought as many systems as I could on eBay, which turned out to be seven of these thin clients, and I hooked them all up. If you look carefully at this, what you'll see is each of these systems has a little green wire coming out of its chassis. That wire is soldered onto the system's power switch. All of those wires go to a relay module, and that relay module is hooked up to a master system over USB. All of these systems boot over the network from the master system, so the master system can boot up these systems, can SSH into them, and can assign each of them fuzzing tasks for this RISC architecture. It can then record all the results of those fuzzing tasks for offline analysis, and when it detects that one of these systems has become corrupted, when it stops responding or the kernel panics, the master can use those relays in order to reboot the
target. So we can actually see an example of this in action. What I'm going to do is tell the master to start up the fuzzing tasks. It's going to think for a little bit as it generates some tasks for the targets, and then if you listen to the relays you can hear them clicking on, you can see the relay lights, and if you look closely at those target systems you can see the green LEDs on them coming on one by one as each of those systems boots up. I think for brevity's sake I'm going to cut this video a little bit short, but if you'd like to see the full demo afterwards you can grab me. In about a minute, when these systems come online, we'll be able to see the master start tasking them with fuzzing jobs, and we'll be able to see logs coming into the master. Eventually the master will detect that some of these systems have frozen, and it will bring them offline through the relay so that this process can continue. So this was slow; the fuzzing work is fairly laborious. I let the system run for about three weeks. I collected about 15 gigabytes of logs across 2.3 million different state diffs, for about 4,000 total hours of compute time. But it's really exciting: once I had the final results, we can start sifting through these logs in order to try to find something interesting. Basically, I'm curious: can I actually do anything to ring 0 from ring 3, according to these logs? It turns out we can. We can pretty quickly start to find some things that shouldn't exist in the secure world. So for example, the instruction A7719563: if you look at that EDX register, I was able to read control register 0 into the EDX register. Control register 0 is supposed to only be accessible by ring 0, but we just read it from ring 3 using one of these RISC instructions. And we're not limited to reading ring 0 data. If we look here, instruction 8AD4 was actually able to write debug register 0. You'll see that
the EBP register got written to DR0. DR0 should only be accessible to ring 0; we're circumventing the ring protections through this RISC
core. So at this point it's kind of time to start thinking about a payload: what should we actually be having this RISC core do for us? Really, once you can reach through the ring boundaries, the sky's the limit; you can do whatever you want. I thought it'd be useful to have some sort of easy-to-demonstrate payload, so I thought a good demonstration would simply be elevating the current process to root permissions. That would look something like grabbing a structure called the global descriptor table and parsing out a field on the GDT called the FS segment. That FS register can actually point you to a structure in memory called the task structure, and if you grab a certain field from the task structure you can get a pointer to what's called the cred structure. From there you can actually set yourself to have root permissions, as long as you can reach directly into kernel memory and grab and modify all this information. Now, there's really only a few pieces here, the parts highlighted in red, that require us to cross ring boundaries. Things like addition and bit manipulation we can technically do on the x86 core; we don't need this RISC core to do that for us. But it was kind of fun having this unknown core execute instructions, so I thought it'd be a little bit more impactful if I wrote this entire payload for the deeply embedded core and never used x86. So I've
got 15 gigabytes of logs and I want to try to build this payload, so it's time to start sifting through those logs, trying to find some primitives to use. This sort of feels like building a ROP chain: you've got these tiny little pieces of functionality, you have an overall idea of what you're trying to accomplish, and you're trying to figure out how to piece together those little pieces in order to form your final payload. So we can start finding those little pieces that we need. For example, A313 was an instruction that was able to read the global descriptor table; that's the first piece of our payload. I'm able to find a kernel read instruction inside of those logs, D407: you'll notice the low byte of EBP got read out of a kernel memory buffer. I can actually modify kernel memory: E2B7 was able to write a single byte into kernel memory. Now this is really promising. If I can write a byte into kernel memory, it means I can do precision exploitation through the deeply embedded core. But in the end I sort of decided that sifting through logs like this by hand just doesn't scale. I want to be able to write more robust payloads for this deeply embedded core, so I wanted some sort of automated approach for doing this. What I wanted was some way to extract behavior patterns from these state diffs in order to identify the bit patterns inside of these instructions. So I built a tool called the collector. The collector basically helped us do automated reverse engineering of completely unknown instruction sets. The way the collector worked is it would look at those state diffs recorded by the fuzzer and try to identify some basic operations, like loading immediate values into registers, transferring one register to another, reading memory, modifying memory, shifting registers, any number of arithmetic and bitwise instructions. The collector would try
to identify those just by looking at differences between the input state and the output state across RISC instructions. Then what it would do is bin instructions based on what effects it saw those instructions having. So it might spit out: well, here is a bin of instructions that all exhibited this property of transferring one register to another. After it had these instruction bins, what it would look for is patterns within the bins. For example, one of the first things we might want to figure out is how the register values for each of these instructions are encoded inside of the instruction: for example, where is EAX encoded in this instruction? So it would look, for each of these instructions, at which bits might represent EAX and which might represent EDX. That's what I have highlighted in purple: basically, for each individual instruction, which bits might be encoding the registers that the collector saw being used by that instruction. Then what the collector does is try to identify patterns across the entire bin. We can see then that these two middle columns are the only consistent piece of this pattern. In other words, these two middle columns must be the thing that encodes the input and output registers for these instructions. We can do that for all sorts of different facets of an instruction encoding. You can try to figure out which bits are used to encode the opcode of the instruction. It's not a perfect process: ideally we would see totally consistent patterns across the entire bin, and that's not what we see here, but what the collector will do is try to pick the things that are most common. In other words, in this situation it'll say: well, these are the most likely bits that encode your opcodes. It can even try to find things like don't-care bits in the instructions, or bits that seem to follow some sort of statistical pattern where we can't necessarily tell what they are. Then it'll mash all this information together in order
to sort of automatically derive the bitwise encoding of each possible instruction on this deeply embedded core. So looking at the bins that it creates, and what sort of functionality we might want to get out of the deeply embedded instruction set, these are sort of the encodings that it came up with for different instructions. We've got instructions to move registers around and load the global descriptor table.
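The binning and pattern-matching idea described for the collector can be illustrated with a small sketch. This is not the actual collector code: the function names, the 3-bit register field width, the register numbering, and the synthetic instruction layout in `make_word` are all assumptions made up for the illustration. It diffs two recorded states, and then, for a bin of register-transfer instructions, keeps only the bit offsets at which the observed register number appears in every instruction of the bin.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Assumption: registers are numbered in the conventional x86 order.
REG_INDEX = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
             "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def state_diff(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {name: (old, new)} for every recorded value that changed."""
    return {name: (before[name], after[name])
            for name in before
            if name in after and after[name] != before[name]}


def candidate_offsets(word: int, reg: str, width: int = 3) -> Set[int]:
    """Bit offsets in a 32-bit word where a width-bit field equals reg's number."""
    value = REG_INDEX[reg]
    mask = (1 << width) - 1
    return {off for off in range(32 - width + 1)
            if (word >> off) & mask == value}


def consistent_offsets(bin_: Iterable[Tuple[int, str]], width: int = 3) -> List[int]:
    """Offsets that could encode the observed register in every instruction of a bin."""
    common: Optional[Set[int]] = None
    for word, reg in bin_:
        offsets = candidate_offsets(word, reg, width)
        common = offsets if common is None else common & offsets
    return sorted(common or [])


def make_word(src: str, dst: str, junk: int) -> int:
    """Hypothetical layout, for the demo only: src at bit 21, dst at bit 16."""
    return 0xA0000000 | (REG_INDEX[src] << 21) | (REG_INDEX[dst] << 16) | (junk & 0xFFFF)


# Synthetic "register transfer" bin: (instruction word, source, destination).
observations = [
    (make_word("eax", "edx", 0x1234), "eax", "edx"),
    (make_word("ebx", "ecx", 0xBEEF), "ebx", "ecx"),
    (make_word("esi", "edi", 0x0F0F), "esi", "edi"),
    (make_word("ebp", "eax", 0x7777), "ebp", "eax"),
]

src_offsets = consistent_offsets((w, s) for w, s, _ in observations)
dst_offsets = consistent_offsets((w, d) for w, _, d in observations)
print("possible source field offsets:", src_offsets)
print("possible destination field offsets:", dst_offsets)
```

With only a handful of observations, several offsets may survive by coincidence; a larger bin narrows the candidates, which matches the point in the talk that the process is statistical rather than exact.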
Basically, we have a primitive assembly language now, so that we can write payloads for this deeply embedded core. Now that we know the instruction encodings, we could write some payloads by hand if we wanted, but I thought it'd be cooler if I wrote an actual assembler for these things, so that I could write some of these programs at a higher level. So basically I wrote this DEIS assembler, a custom assembler just for this unknown instruction set, that will assemble these different primitives into their raw binary representation and then wrap each one in one of these x86 bridge instructions, so that the DEIS instruction can be sent over to the deeply embedded core for execution. So now we're ready to revisit that payload idea that we had. We can use the LGD instruction in order to read out the global descriptor table. We can use some of our other DEIS instructions in order to parse that descriptor field, to grab a pointer to the task structure, to grab a pointer to the cred structure, and to write to the cred structure, circumventing the ring protections and modifying kernel memory in order to give ourselves root-level access on the system, all using nothing but instructions written specifically for this unknown architecture embedded alongside our x86 core. The output of this, when you run it through the assembler, looks something like this. Basically, we activate that deeply embedded core with the launch instruction, that's on the left, and then we execute all these bound instructions. Each bound instruction sends a single instruction over to the deeply embedded core for execution by the RISC processor. These instructions are going to circumvent the processor security mechanisms in order to grant the current process root permissions, and then we launch a shell. So let's revisit that demonstration from the beginning and walk through it in a little bit more depth. Here we have our complete payload, doing the steps that we just
talked about. We load the address of the payload into the EAX register, and then we execute this launch instruction on the C3 processor. After that we've got our actual payload: all of these bound instructions, the x86 instructions that are able to communicate with the core. Each bound instruction sends a single RISC instruction over to the deeply embedded core, and at the very end of this sequence we launch a shell in order to hopefully gain new permissions. So if we exit out of here and run this program, we'll compile it first, we'll double-check exactly who we are again, so we are just a regular unprivileged user, but when we run the program we use that deeply embedded core in order to gain root permissions. So I want to toss this out here, only sort of tongue-in-cheek: researchers have sort of explored down to what we were calling ring -3, but I'm going to pitch this as sort of a ring -4. It is in some ways more powerful than our previously known ring -3. It is a core sort of co-located with the x86 core, and it has unrestricted access to the x86 core's register file. Unlike ring -3, it shares a lot of its execution pipeline with the x86 core, which gives it sort of more power than ring -3. But at the same time the whole thing is just nebulous. When we're this deep, the whole ring model has completely broken down and this is sort of nonsensical. But the point is we've now gotten direct ring 3 to ring 0 hardware privilege escalation on x86. That has never actually been accomplished before; everything else where we've come close has had to rely on code in the operating system or other bugs to sort of work as a launching point. This is a purely hardware approach for accomplishing this. So the good news is that, fortunately, we need initial ring 0 access. Everything else in this payload happens in ring 3, but we have to activate that backdoor using one-time ring 0 access to set the god mode bit, at least in theory. But I
wanted to poke
around with this a little bit more. So here we're looking at a different system; this is another system I had sitting on my shelf here. This is a VIA C3 processor, but it's a Samuel 2 core instead of the Nehemiah core that I was using. What we just saw is the system clean boot: this is a freshly booted operating system, totally unmodified. I'm going to log in as a regular unprivileged user. These aren't the fastest processors, but once we get to a prompt, what I'm going to do is insert the msr kernel module in order to gain access to the MSRs. You'll notice I'm using sudo here, but I'm not writing anything; I'm only
using this to read things out to show you. I don't have the rdmsr tool installed on here, so we're going to just use the hexdump tool in order to dump out that global configuration register, MSR 0x1107. It's the low bit of that register that's the god mode bit. If you look at that low byte, what you see is 1101 0111: the low bit there, the god mode bit, is on by default when the system boots.
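The same check could be scripted rather than read off a hex dump. The following is a minimal sketch, not the tooling used in the demo: it assumes a Linux system with the msr kernel module loaded (which exposes `/dev/cpu/N/msr`, readable as root), and it assumes that "the low bit" means bit 0 of the register. The function names are hypothetical.

```python
import os
import struct

GLOBAL_CONFIG_MSR = 0x1107  # MSR number quoted in the talk
GOD_MODE_BIT = 0            # assumption: "the low bit" is bit 0


def read_msr(msr: int, cpu: int = 0) -> int:
    """Read a 64-bit MSR through the Linux msr driver (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR number.
        raw = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def god_mode_enabled(msr_value: int, bit: int = GOD_MODE_BIT) -> bool:
    """Return True if the assumed god mode bit is set in the register value."""
    return bool((msr_value >> bit) & 1)
```

On a target machine this might be used as `god_mode_enabled(read_msr(GLOBAL_CONFIG_MSR))`. The pure bit test is kept separate from the device read so it can be exercised on any machine, for example with the low byte read out in the demo, `0b11010111`.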
That means on the Samuel 2 core, any unprivileged code that knows what it's doing can escalate to kernel-level privileges at any time. And when you totally break down x86 ring protections and privileges like this, all of our other protections fall like dominoes. Antivirus does nothing when you can just reach directly into ring 0. ASLR and DEP don't help you anymore when you can just circumvent the ring privilege model. Code signing, control flow integrity, kernel integrity checks: none of this helps once rings mean nothing. But there are mitigations. None of them are great, but there are opportunities here. One is we could update the processor microcode in order to lock down the god mode bit: just don't let people set the god mode bit to begin with. We can update the microcode in order to disable any microcode assists on the bridge instruction, so that you couldn't send instructions over to that RISC core. Or you could update the operating system and firmware in order to disable the god mode bit during boot, and then periodically make sure that that bit has stayed off, that nobody's enabled the backdoor. But at the end of the day the point's kind of moot. This is an older processor, it's not in widespread use, and I don't really want to throw VIA under the bus here. I think their target market is embedded, and I think this was probably a useful feature for their customers; it was just a very flawed implementation in some of their earlier iterations of the processor. So instead, let's take this as a case study. The reality is that backdoors do exist. This is not just conspiracy theory stuff. But they don't have to remain invisible: we do have the techniques necessary in order to find these. So I'm sort of looking forward. Even though this is not a common processor, I do think this is a big deal. This is not just a big deal for the C3, this is not just a big deal for x86; I think this is a big problem for all of computer engineering in general, because this hardware that we're using to
protect all of our secrets, to do all of our computation, we have no way to introspect it. So whether or not backdoors are real in a general sense is sort of irrelevant; that question is going to haunt us as long as we can't introspect our own hardware. I hope that's what people will take out of this. I hope when you see something that seems off-limits, when you see these weird hints in patents, because the patents that I saw were just the tip of the iceberg of some of the good stuff that's out there, I hope people will dive in deep and really see what's under the hood, because that's how security works; that's how we build trust in any system. So along those lines, I open-sourced all of the research that I used for this. All the tools, techniques, all the code and data I've demonstrated today are freely available. I hope people will use those to scan their own systems, and I hope people will use this as a starting point for much, much deeper processor security research. Along those lines, I want to quickly throw out this other idea that I didn't have time to cover in any depth during this presentation. I demonstrated this side-channel attack on MSRs, and I actually found some really cool uses for this totally outside of what we just looked at, and found some kind of unsettling results on a number of processors. That's a topic for another presentation: I'll be talking about this tool, Project Nightshift, specifically tomorrow at DEF CON if you're around, and I'll sort of show some of the totally different things I found using this approach. So I hope you'll check that out if you have time. In the meantime, all this stuff is available on my GitHub; that's github.com/xoreaxeaxeax. This project I'm calling Project Rosenbridge, and all the backdoor research is there. sandsifter, that processor fuzzer that I used heavily throughout this, is also available there. I wrote a single-instruction C compiler a few years back, I've written some cool control flow stuff,
there's a system management mode exploit on there, and a lot of other fun stuff I've just tinkered with over the last few years. It's all available there, so I hope you'll check it out. What I'd really love is some feedback and ideas on these topics. If you want to reach out, after this presentation I'll just be next to the stage, or you can ping me at @xoreaxeaxeax on Twitter, or the same thing at gmail.com, if you have any feedback or anything else you'd like to discuss about this. So thank you, everyone, for coming, and that's it. [Music]