Inside the AMD Microcode ROM

Formal Metadata

Title
Inside the AMD Microcode ROM
Subtitle
(Ab)Using AMD Microcode for fun and security
Title of Series
Number of Parts
165
Author
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Microcode runs in most modern CPUs and translates the outer instruction set (e.g. x86) into a simpler form (usually a RISC architecture). It is updatable to fix bugs in the silicon (see Meltdown/Spectre), but these updates are encrypted and signed, so no one knows how microcode works on conventional CPUs. We successfully reverse engineered part of the microcode semantics of AMD CPUs and are able to write our own programs. We also recovered the mapping between the physical readout (electron microscope) and the "virtual" addresses used by microcode itself. In this talk we present background on microcode, our findings, our open source framework to write custom microcode and our custom defensive measures implemented in microcode.
Transcript: English (auto-generated)
So, the next talk: Benjamin Kollenda and Philipp Koppe will refresh our memories, because
they already gave a talk at 34C3, where they talked about the microcode ROM, and today they're going to give us more insights into how microcode works, and more details on the ROM itself.
Benjamin is a PhD student and has a focus on software attacks and defenses, and together with Philipp, they will now abuse AMD microcode for fun and security. Please enjoy.
Thank you. So, as mentioned, we were able to reverse engineer the AMD microcode and the AMD microcode ROM, and I'm going to talk about our journey, what we learned on the way, and how we did it. This is joint work with my colleagues at Ruhr University Bochum. Here is a quick outline of how we're going to do this.
We're going to start with a quick crash course on microarchitectural basics and what microcode actually is. Then I'll talk about how we reconstructed the microcode ROM and what we learned along the way. Then I'll quickly give some examples of the applications we implemented with the knowledge we gained from the second step. And lastly I'll talk
about the framework we used, how it works, and what we can do with it. This framework is also available on GitHub along with some other tools, so you're free to continue our work. Okay, so, when I'm talking about microcode, you can think of it essentially as firmware for your processor. It serves multiple purposes. For example, you can use it to fix CPU bugs that you
have in silicon and want to fix later in the design phase. It is used for instruction decoding, and I'll cover this one a bit more. It is also used for exception handling: for example, if an exception or an interrupt is raised, microcode has the first chance of modifying this interrupt, ignoring it, or just passing it along to the operating
system. It's also used for power management and some other complex features like Intel SGX. And most importantly for us, microcode is updatable. This is used to patch errors in the field. Everyone remembers the Spectre and Meltdown patches, and that was done with a microcode update.
So your x86 CPU takes multiple steps to execute an instruction. The first step is decoding the x86 instruction into multiple smaller micro-ops. These are then scheduled into the pipeline, from where they are dispatched to the different functional units, like your ALU, AGU, or multiplication and division units.
For our purposes, the decode step is the most interesting one. In the decode step you have an instruction buffer that feeds instructions to some decoders. You have short decoders that handle really simple instructions, long decoders that can handle some more advanced instructions, and finally the vector decoder. The vector decoder handles the most
complex instructions with the help of microcode, so the microcode engine is essentially the vector decoder. The microcode engine, in essence, consists of a microcode ROM that stores the instructions for the microcode engine. Think of it as your standard instructions.
Then there's also a writable memory, the microcode RAM. This is where the microcode updates end up when you apply a microcode update. And of course around the storage there's a whole lot of things that make it actually run. For this talk you only need to know about the match registers. Match registers are essentially breakpoint
registers: if you write an address from inside the microcode ROM into a match register, then whenever this address is fetched, execution control is transferred to the microcode RAM, so our patch gets executed. The microcode updates are usually loaded by the BIOS or by the kernel.
Linux has an update driver, and sometimes the BIOS applies its own version. The updates have a pretty simple structure: a partially documented header, followed by the actual microcode that is loaded into the CPU. The microcode is organized in something called triads. Each triad has three operations,
essentially x86 instructions but with some differences, and lastly you have a sequence word. The sequence word indicates which microcode instruction should be executed next. We have the options of executing just the next triad, executing another one by branching to it, or just saying, okay, I'm done with decoding this instruction, continue with x86 code.
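To make the triad, sequence word, and match register description above concrete, here is a minimal Python sketch. It is only a toy model under stated assumptions: the class names, the single shared address space for ROM and RAM, and the opaque string micro-ops are all hypothetical and do not reflect the real encoding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class SeqAction(Enum):
    NEXT = "next"          # fall through to the following triad
    BRANCH = "branch"      # continue at an explicit triad address
    COMPLETE = "complete"  # instruction finished, return to x86 decoding


@dataclass(frozen=True)
class SequenceWord:
    action: SeqAction
    target: Optional[int] = None  # only meaningful for BRANCH


@dataclass(frozen=True)
class Triad:
    ops: Tuple[str, str, str]  # three micro-operations (opaque strings here)
    seq: SequenceWord


def run(store: Dict[int, Triad], match_registers: Dict[int, int],
        entry: int, max_steps: int = 64) -> List[str]:
    """Execute triads starting at `entry` and return the micro-ops that ran.

    `store` holds ROM and RAM triads in one hypothetical address space;
    `match_registers` maps a ROM address to the RAM address of its patch.
    """
    trace: List[str] = []
    addr = entry
    for _ in range(max_steps):
        # Breakpoint-style redirect: a matched ROM address runs the patch.
        addr = match_registers.get(addr, addr)
        triad = store[addr]
        trace.extend(triad.ops)
        if triad.seq.action is SeqAction.COMPLETE:
            return trace
        if triad.seq.action is SeqAction.BRANCH:
            addr = triad.seq.target
        else:
            addr += 1
    raise RuntimeError("step limit reached without a 'complete' sequence word")
```

With an empty `match_registers` dictionary the ROM triads run unchanged; adding one entry redirects the fetch of that ROM address to the patch triad, which can then branch back into the ROM.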
These updates are protected by some weak authentication, which we were able to break, so we can create our own updates, we can analyze existing ones, and we can apply these to your standard laptop and desktop. However, there can only ever be one update loaded at a time,
and when you reboot your machine this update will be gone. Also, for the talk we are going to look at some microcode, and we will present this microcode using a register transfer language. It is heavily based on x86, so I'm just going to cover the differences between the two. Most importantly, microcode can have three operands for an
instruction, in comparison to x86, which usually only has two, so you can specify a destination and two source operands. Also, microcode has certain bit flags that need to be set,
and we denote these with annotations. For example, .c means this instruction also updates the carry flag based on the result. Then you have the instruction jcc, which is a conditional branch. The first operand denotes the condition upon which this branch is taken, in this case branch if the carry flag is one,
and the second operand indicates the offset to add to the instruction pointer. Then we also have some sequence word annotations: next, complete, and branch. Also, it should be noted that the internal microcode architecture is a load-store architecture. You can't use memory operands in other instructions like you can on x86; you always need to load and
store memory explicitly. Now we're going to talk about how we managed to recover the microcode ROM. The microcode ROM is baked into your CPU, and you can't change it anymore. It is defined in the
silicon during the fabrication process. In this picture you can see a die shot taken with an electron microscope, and this is one of three regions that contain the bits for your microcode operations. If you zoom in a bit more, each of these
regions consists of four arrays, and these are further subdivided into blocks. Really interesting is array 2, which is a bit smaller than the other ones, but it has some structures above it that have a different visual layout.
This is SRAM, which stores the microcode update, so this is reprogrammable memory that is still pretty fast. The microcode RAM is located right next to the microcode ROM, which also makes sense from a design standpoint.
Just an overview of how we went about it: we started with pictures, and then we used some processing to transform them into bit strings, which we could then process further. These bit strings were arranged into triads. We could already tell that we got individual triads right, because there were data dependencies all over the place,
but between triads there were no or very few data dependencies, so the ordering of the triads was wrong. This was a major part of what we had to reverse engineer: the mapping from the physical address of a triad that we gathered from the ROM readout
to the virtual address that is used inside the microcode update or the microcode ROM. After reverse engineering this, you can just do a linear sweep disassembly of the microcode ROM and arrive at human-readable output. But this recovery was a bit tricky,
because we required physical-virtual address pairs, and gathering these is a bit harder. We worked through the available updates, but we could only find two such pairs. These pairs were actually easy to find, because every update replaces a certain triad inside your microcode ROM, and this triad is usually also placed in the microcode update,
so by matching the triad at the address this update replaces with the microcode ROM readout, you get your two data points. But we had to get more data points, so we generated these mappings by matching the semantics of triads in the microcode ROM readout with the semantics we observe when
we force execution of a certain microcode address. To gather the semantics of the read-out microcode, we implemented a simple microcode emulator. Essentially, it works on the triad level: you give it an input state and a triad, and it calculates the output state. Input and output state consist of
the x86 state, which is your standard registers, and also the internal microcode registers. There are multiple temporary registers that get reset for every new x86 instruction that is executed, but they can also be modified by microcode, of course.
Our emulator supports all known arithmetic operations, and we have a whitelist of operations that do not produce any observable change in state, just so that we could process more triads and gather more data points. In total, we gathered 54 additional address pairs, which turned out to be enough to recover the whole mapping.
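The semantic matching described above can be sketched roughly as follows in Python. This is an illustration under assumptions, not the actual tooling: the three-operation toy instruction set, the register names, and the function names `emulate` and `recover_pairs` are hypothetical. The idea is to emulate every triad from the readout on a known input state and pair it with a virtual address only when exactly one readout triad reproduces the state observed on hardware.

```python
from typing import Dict, List, Tuple

MASK32 = 0xFFFFFFFF
Op = Tuple[str, str, str, str]  # (mnemonic, dst, src1, src2)


def emulate(triad: List[Op], state: Dict[str, int]) -> Dict[str, int]:
    """Apply the operations of one triad to a copy of `state`."""
    regs = dict(state)
    for mnemonic, dst, src1, src2 in triad:
        a, b = regs[src1], regs[src2]
        if mnemonic == "add":
            regs[dst] = (a + b) & MASK32
        elif mnemonic == "xor":
            regs[dst] = a ^ b
        elif mnemonic == "and":
            regs[dst] = a & b
        else:
            raise ValueError(f"unsupported operation: {mnemonic}")
    return regs


def recover_pairs(readout: Dict[int, List[Op]],
                  observed: Dict[int, Dict[str, int]],
                  input_state: Dict[str, int]) -> Dict[int, int]:
    """Return virtual -> physical pairs for unambiguous semantic matches.

    `readout` maps a physical index to a triad from the ROM readout;
    `observed` maps a virtual address to the output state seen on hardware.
    """
    by_output: Dict[tuple, List[int]] = {}
    for phys, triad in readout.items():
        try:
            out = emulate(triad, input_state)
        except ValueError:
            continue  # skip triads the emulator cannot model
        by_output.setdefault(tuple(sorted(out.items())), []).append(phys)

    pairs: Dict[int, int] = {}
    for virt, out in observed.items():
        candidates = by_output.get(tuple(sorted(out.items())), [])
        if len(candidates) == 1:  # ambiguous matches give no data point
            pairs[virt] = candidates[0]
    return pairs
```

Discarding ambiguous matches is a deliberate choice in this sketch: two different physical triads with the same observable effect would otherwise produce a wrong address pair.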
In this mapping, you essentially have the four different arrays that map to individual blocks, the blocks in these arrays are then again permuted a bit, and the triads inside these blocks have some table-based permutations. This is not an obfuscation; this is just,
from a hardware design standpoint, it can make sense to read it out a bit differently. Also, now that we can actually map a certain address to the microcode readout, and we know the addresses of different x86 instructions from our earlier experiments,
we can look at the implementation of instructions. Let's start with a pretty simple one, shift right double, which essentially takes a register, shifts it by a given amount, and shifts in bits from another register. Of course, you would expect a lot of shifts and rotates in its implementation, and this is exactly what we're seeing here.
You have two shift-right operations, and you can see regmd6 and regmd4. These are placeholders. The microcode engine replaces certain bit combinations with the registers that are used in the x86 instruction. For example, this one would be replaced by ECX or EAX,
depending on what you wrote in x86. At this point, we can also already gather more information about microcode than we previously knew, because we know, okay, this is a source, this is also a source, and this is the destination. But this source, which indicates the shift amount, was previously unknown,
because it is a high temporary microcode register, and we found out that these usually serve a specific special purpose. If you write to them, sometimes the CPU behaves erratically, sometimes it crashes, sometimes nothing happens. But in this case, this seems to be the shift count, and the shift count is given by the third operand of the instruction.
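For reference, the architectural behavior that this microcode has to implement can be written down in a few lines of Python. This sketch shows the documented x86 semantics of a 32-bit double-precision right shift (SHRD), not the microcode itself; the function name `shrd32` is hypothetical, and flag updates are left out.

```python
MASK32 = 0xFFFFFFFF


def shrd32(dst: int, src: int, count: int) -> int:
    """Shift `dst` right by `count`, filling the top bits from `src`."""
    dst &= MASK32
    src &= MASK32
    count &= 0x1F  # for 32-bit operands the count is taken modulo 32
    if count == 0:
        return dst  # a zero count leaves the destination unchanged
    # Low bits of `src` move into the vacated high bits of `dst`.
    return ((dst >> count) | (src << (32 - count))) & MASK32
```

The two shifts combined with an OR correspond to the pair of shift operations visible in the disassembly, with the count coming from the third operand.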
So in this case, we already learned, okay, if we want to read the third operand of an instruction, we need to read T41. This is how we went about recovering more information about microcode. The rest of the implementation is essentially concerned with implementing the rest of the
semantics of the x86 instruction and updating the flags correctly. Okay, so now let's look at an instruction that is a bit more complicated: RDTSC. RDTSC returns the internal cycle counter in EDX and EAX. The upper part ends up in EDX, the lower part in EAX. So in the end, we want to see writes to these registers,
potentially with a shift somewhere in there. But somewhere the CPU needs to gather the cycle counter. In the beginning, we have two load-style operations. This one is a proper load, which we identified, and this one is unknown. But even though we do not know the
instruction, we know the target. The result of this instruction will end up in T9, and the result of this instruction will end up in T10, so we can follow the uses of these two registers. For simplicity, I'm going to start with T10. And T10, as we later found
out, holds the CR4 register, which is essentially read from a specific internal register. If you look further, T10 is then ANDed with this bit mask. And if you look in the manual, you find out
that this bit in CR4 determines whether RDTSC is available from user space or not. So this is the check whether this instruction should be executed.
So now let's just keep in mind that T9 holds some other value loaded from some other internal register, and we will come back to this one a bit later. For now, let's follow execution. This triad is essentially a padding triad. This is a common pattern we see. So let's look at where this branch takes us. And this branch takes us
to a conditional branch triad. And if you look a bit up, this AND instruction actually updated this flag. So this is a conditional branch that determines whether this check was successful
or not. So it branches either to the error triad or to the success triad. Here we already see the success case. We see a write to RDX, or EDX in this case, with a shift of T9 by 32 bits, which is exactly what you would expect in order to write the upper 32 bits of the
timestamp counter to EDX. And you have an unknown instruction, but we know, okay, we move something from T9 to EAX, which is the lower 32 bits. But we are not done here, because we can still look at the error path that is taken if the access is denied. So if you scroll a bit down,
we can see a move of an immediate into a certain internal register. This immediate actually encodes the interrupt code for a general protection fault and tells the exception handler that this is a general protection fault. And later, this triad branches to this address.
And if you look at the uses of this address, we can find other immediates that also correspond to x86 interrupts. So now we learned how we can actually raise our own interrupts. We just need to load the code we want into the specific register and branch to this address.
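The control flow that was just walked through can be summarized in a short Python sketch. It models the architectural behavior of RDTSC rather than the recovered microcode. The bit position of the CR4 time stamp disable flag (bit 2) and the general protection fault vector (13) are taken from the x86 architecture manuals. The privilege level check is the architectural rule and is an assumption here, since the talk only mentions the CR4 bit. The names `MicrocodeFault` and `rdtsc` are hypothetical.

```python
CR4_TSD = 1 << 2   # CR4 "time stamp disable" bit
GP_VECTOR = 13     # x86 general protection fault vector
MASK32 = 0xFFFFFFFF


class MicrocodeFault(Exception):
    """Stand-in for loading an interrupt code and branching to the fault path."""

    def __init__(self, vector: int) -> None:
        super().__init__(f"fault vector {vector}")
        self.vector = vector


def rdtsc(tsc: int, cr4: int, cpl: int) -> tuple:
    """Return (EDX, EAX) for a 64-bit counter value, or raise a fault."""
    # Error path: the counter is disabled for unprivileged code.
    if (cr4 & CR4_TSD) and cpl != 0:
        raise MicrocodeFault(GP_VECTOR)
    edx = (tsc >> 32) & MASK32  # upper half, the shift by 32 seen in the ROM
    eax = tsc & MASK32          # lower half
    return edx, eax
```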
And now we learned a lot about how we can actually write microcode. But it's also interesting to see how certain instructions are implemented. So let's look at a pretty complicated one, WRMSR. WRMSR essentially
writes some data it is given to a model-specific register. These model-specific registers differ between CPUs, between vendors, sometimes between revisions. And they implement non-standard extensions or pretty complex features. For example, you trigger a microcode update by writing to a model-specific register. The address you want to write to is given
in ECX. And now we can see ECX is read, and it is shifted by 16 bits into T10. So again, we follow the uses of T10. And we see it is XORed with this bit mask. And this bit mask is
C000, which actually denotes a namespace of the model-specific registers. In this case, this should be an AMD-specific namespace. And of course, this one again sets some flags. And you can see a conditional branch depending on these flags to what should be the handler
for this namespace. Next, we have another XOR that uses a different bit mask, in this case C001. C001 is the namespace in which the microcode update routine is actually located.
Again, we branch to this handler. And if we just continue on, there are more XOR operations followed by more branches. And this continues until everything is dispatched to the correct handler. This is how WRMSR is implemented internally. RDMSR is going to be implemented pretty similarly, because it does a similar kind of thing.
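The dispatch pattern described above (shift the MSR address right by 16 bits, XOR with a namespace prefix, branch if the result is zero) can be sketched in Python as follows. The handler names and the fallback value are hypothetical; only the two prefixes C000 and C001 come from the talk.

```python
# Namespace prefixes in the order they are tested; handler names are made up.
NAMESPACE_HANDLERS = (
    (0xC000, "handler_c000"),
    (0xC001, "handler_c001"),
)


def dispatch_wrmsr(ecx: int) -> str:
    """Pick a handler from the upper 16 bits of the MSR address in ECX."""
    prefix = (ecx & 0xFFFFFFFF) >> 16
    for namespace, handler in NAMESPACE_HANDLERS:
        # XOR yields zero exactly when the prefix equals the namespace,
        # which is the condition the conditional branch tests.
        if prefix ^ namespace == 0:
            return handler
    return "handler_other"
```

As a usage example, `dispatch_wrmsr(0xC0010020)` selects the C001 handler; 0xC0010020 is the address of AMD's microcode patch loader MSR, which matches the remark that the update routine lives in this namespace.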
Now I showed you how we actually went about reconstructing the knowledge we currently have. And now I'm going to show you what we can actually do with it. For this, I'm going to quickly cover the applications we wrote in microcode.
We wrote a simple configurable RDTSC precision. This means a bit mask is ANDed with the result of RDTSC, so you can reduce its accuracy, which can sometimes prevent timing attacks. We also implemented a microcode-assisted address sanitizer, which I'll cover in a second.
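The precision reduction amounts to a single AND with a configurable mask. A minimal Python sketch, assuming the configuration is expressed as a number of low bits to drop (the parameter name `dropped_bits` is hypothetical):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def coarse_tsc(tsc: int, dropped_bits: int) -> int:
    """Clear the lowest `dropped_bits` bits of a 64-bit counter value."""
    mask = ~((1 << dropped_bits) - 1) & MASK64
    return tsc & mask
```

Dropping n bits means two reads less than 2**n cycles apart can return the same value, which is what blunts fine-grained timing measurements.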
We also have some basic microcode instruction set randomization, and some microcode-assisted instrumentation. What this means is you can write a filter for your instrumentation in microcode itself. So instead of hooking an instruction, instead of debugging your code or emulating it, you can just say: whenever the instruction
is executed, filter out whether this is relevant for me, and if it is, call my x86 handler. This happens entirely in microcode, without changing the instruction in RAM. We also implemented some basic authenticated microcode updates. The usual update mechanism is weak, that's how we got our foot in the door in the first place, so we improved upon it a little bit.
Also, we found out that microcode actually has some enclave-like features. Once you're executing in microcode, your kernel can't interrupt you, your hypervisor can't interrupt you, and any state you want visible to the outside world, you actually need to write explicitly. So all these microcode-internal registers are not accessible from
the outside world, and any computation you perform in microcode cannot be interfered with. So you can implement a simple enclave on top of this. Our hardware-assisted address sanitizer variant is based on the work by the original authors. Address sanitizer is a software instrumentation that detects invalid memory accesses by using
a shadow map, shadow memory, that says which memory is valid to be read from and written to. The authors proposed a hardware-assisted address sanitizer, which essentially does the same checks but uses a new instruction. This instruction should raise a fault if
an invalid access is detected. They proposed this algorithm, and the details are not important. What is important is that, in a sense, it's pretty simple. You load from a certain address, perform some operations on it, and if the shadow value indicates a problem after these operations, you just report a bug. Advantages of a hardware-assisted address sanitizer are, for example, that you get better
performance out of it. Because you only have a single instruction, maybe you can do some fancy tricks inside your CPU that are faster than using x86 instructions. You get more compact code. And you have the possibility of runtime configuration, which is a bit hard with a software address sanitizer. We implemented our hardware-assisted address sanitizer variant
by replacing the BOUND instruction. BOUND is an old instruction that is no longer used by compilers because, in fact, it is slower to use BOUND than to perform the checks with multiple x86 instructions. We changed the interface. The first argument is
the register that holds the address you want to access. The second argument holds the size of the access, so one byte, two bytes, and so on. This instruction is a no-op if the check succeeds. So if there's no bug, it just continues on like nothing happened. However, if we detect an invalid access, we can take a configurable action.
We can, for example, just raise a normal page fault. Or we can raise the bound interrupt, which is an interrupt that only denotes this case. Or we can branch to an x86 handler that either performs additional checking, for example whitelisting, or generates a pretty error report for you. Most importantly, this is a single instruction.
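The check and the configurable reaction described above can be sketched in Python. The shadow lookup follows the classic software AddressSanitizer scheme (one shadow byte per 8 bytes of memory, found at `(addr >> 3) + offset`), which is an assumption about what the microcode variant computes. The offset value is the usual default on x86-64 Linux, and the dictionary-based shadow memory, the function names, and the action strings are hypothetical.

```python
from typing import Dict, Optional

SHADOW_SCALE = 3            # one shadow byte covers 2**3 = 8 bytes of memory
SHADOW_OFFSET = 0x7FFF8000  # common ASan default on x86-64 Linux (assumption)


def shadow_address(addr: int) -> int:
    """Map an application address to the address of its shadow byte."""
    return (addr >> SHADOW_SCALE) + SHADOW_OFFSET


def access_is_invalid(shadow: Dict[int, int], addr: int, size: int) -> bool:
    """Classic ASan check for an access of at most 8 bytes in one granule.

    A shadow value of 0 means all 8 bytes are addressable, a value k in 1..7
    means only the first k bytes are, and a negative value means none are.
    """
    k = shadow.get(shadow_address(addr), 0)
    return k != 0 and (addr & 7) + size - 1 >= k


def checked_access(shadow: Dict[int, int], addr: int, size: int,
                   action: str = "bound_interrupt") -> Optional[str]:
    """Return None (a no-op) for a valid access, else the configured action."""
    if access_is_invalid(shadow, addr, size):
        return action
    return None
```

Returning the action name instead of raising keeps the sketch neutral about which of the three reactions (page fault, bound interrupt, x86 handler) is configured.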
We also do not clobber any x86 registers. With software instrumentation there are some intermediate results that you need to store somewhere, and this is usually done in an x86 register. That increases the register pressure and may cause spilling, so overall your performance gets worse.
We also found out that we're actually faster than doing the checking using x86 instructions. So just by moving the implementation from the x86 level to microcode, which in some way is still kind of like software, we already improved the performance. On top of this, you get better cache utilization, because you have fewer instructions and thus fewer bytes in the cache,
so we get fuller cache lines. And also, it is really easy to tell which is testing code and which is your actual program code. Lastly, I'm going to give you just a rough overview of our framework, which we used during our development, and which you can also find on
GitHub. Early on, we found out that we were probably going to need to test a lot of microcode updates, because in the beginning you just throw everything at the CPU and see how it behaves, and we wanted to do this in parallel. So we developed a small custom OS called
Angry OS and deployed it to mainboards. These mainboards are just old AMD mainboards. All these mainboards are hooked up to a Raspberry Pi, via serial for communication and via GPIO. With the GPIO, you can reset the board, power it on, power it down, and just have remote control
of this mainboard. And then you can connect to the Raspberry Pi from anywhere on earth and just deploy and play around with it. This was the first version. In the beginning, we didn't really know much about electronics, so we used one Raspberry Pi per mainboard, and it turns out Raspberry Pis are more expensive than these old mainboards.
But we improved upon this, and now we are down to one Raspberry Pi for four or five setups. For example, you only need three GPIO ports per mainboard. You connect each of these to optocouplers just to separate the voltage levels, and then you connect one side of the
optocoupler to the GPIO, the other side to your reset pin, to your power pin, and for input to know whether your board is up or down, you connect the power LED. And that way, you can save a lot of space, a lot of money. And also, if you are really constrained, you can just remove the power LED sensing, because usually you know the state your setup is in.
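The three-pin wiring described here (reset, power, power-LED sense) could be driven from the Raspberry Pi roughly as follows. This is a minimal sketch, not the framework's actual code: the `Mainboard` class, the `write`/`read` callbacks, the pin numbers, and the pulse length are all hypothetical, and the pin access is injected so the logic can be tried without real hardware.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Mainboard:
    """One test board: two output lines (reset, power) and one input (power LED)."""
    reset_pin: int
    power_pin: int
    led_pin: int
    write: Callable[[int, bool], None]  # hypothetical: drive an output pin high/low
    read: Callable[[int], bool]         # hypothetical: sample an input pin
    pulse_s: float = 0.2                # assumed length of a button press

    def _press(self, pin: int) -> None:
        # Emulate a short press of the front-panel button via the optocoupler.
        self.write(pin, True)
        time.sleep(self.pulse_s)
        self.write(pin, False)

    def is_up(self) -> bool:
        # The power LED is the only feedback about the board state.
        return self.read(self.led_pin)

    def reset(self) -> None:
        self._press(self.reset_pin)

    def power_on(self) -> None:
        if not self.is_up():
            self._press(self.power_pin)

    def power_off(self) -> None:
        if self.is_up():
            self._press(self.power_pin)
```

On a real Pi, `write` and `read` would wrap whatever GPIO library is in use. Dropping the LED sensing, as suggested above, would mean tracking the expected board state in software instead of calling `is_up()`.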
And as I already said, we wrote our custom operating system, and it is intentionally really, really minimal, because the major feature we wanted is control over every instruction that is going to be executed from a certain point on, because we are playing around with instruction decoding. And if we execute an instruction that we did not intend, we might crash
the CPU, we might go into an invalid state, and we do not even know which instruction caused it. And Angry OS essentially only listens on the serial port for something to do. What it can do is apply an update. These updates are microcode updates, and they are streamed via serial. We can also stream
x86 code, which is then run by Angry OS. This is just so we do not need to reflash the USB stick every time we want to update our testing code. And of course, all the errors are reported back to the Raspberry Pi, and thus they are forwarded to us.
The framework we use, most importantly, has a microcode assembler and a pretty verbose disassembler. This disassembler generates the output I showed you earlier, and using this, you can just quickly write your own microcode. We also included an x86 assembler, because we wanted to rapidly test different x86 testing codes.
Using this framework, we were able to disassemble the existing updates, and we also used it to disassemble our ROM after we reordered it, and also during the process when we fed it to our emulator. And we can also create the program binary files that can be loaded by the Linux driver.
We modified the stock one to just load any update you give it without checking if it is the correct CPU ID and all these things, just for testing purposes. It is also available. And also, of course, the framework can control Angry OS to make testing easier.
And we implemented a really basic remote execution wrapper, so you can work on a remote Raspberry Pi as if you were using it locally. And this brings me to the end of the talk, and in conclusion, we can say reversing the ROM opened up a lot of new possibilities. We learned a lot about how
microcode works. We learned about how to actually use it properly instead of just inferring from the really small dataset that we have from the updates or, yeah, from the random bit strings we sent to the CPU and observed what happened. But there is a lot left to do. So if you really want to hack on it, just get in contact. We will be happy to share our findings with you.
And as I said, the framework, Angry OS, example programs that we implemented, and some other stuff are available on GitHub. So with that, I'll be happy to answer any questions you might have.
Thank you very much. So we have 10 minutes for questions. Please line up at the microphones. We start with this one, microphone number two. Hi, thanks for a nice talk. A few questions about your hardware address sanitizer.
As I understand, you don't need the source code instrumentation, because the microcode is responsible for checking the shadow memory, right?
The original hardware address sanitizer implementation is also based on a compiler extension that inserts a new instruction because it doesn't exist usually, and it also inserts a bootstrap code that inits your shadow map and also instruments your allocators to update the shadow map during runtime. And we essentially need the same component
but we do not need the software address sanitizer component that essentially inserts 10 or 20 x86 instructions before every memory access. So yes, we still need the compile-time component and we are still source code based in essence.
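To make the roles concrete: the allocator keeps a shadow map up to date, and every memory access is checked against it. The sketch below uses the classic AddressSanitizer shadow encoding (one shadow byte per 8 application bytes) as a stand-in. The encoding used by the prototype is not stated in the talk, and the `ShadowMap` class and its method names are hypothetical. In the microcode variant, the body of `check` is what a single instruction would perform instead of a sequence of inserted x86 instructions.

```python
SHADOW_SCALE = 3  # 2**3 = 8 application bytes per shadow byte (classic ASan convention)


class ShadowMap:
    """Sparse shadow memory; a granule that is not present counts as poisoned."""

    def __init__(self) -> None:
        self.shadow: dict[int, int] = {}

    def unpoison(self, addr: int, size: int) -> None:
        """Called by an instrumented allocator; addr is assumed 8-byte aligned."""
        full, rest = divmod(size, 8)
        base = addr >> SHADOW_SCALE
        for i in range(full):
            self.shadow[base + i] = 0        # 0: all 8 bytes addressable
        if rest:
            self.shadow[base + full] = rest  # k: only the first k bytes addressable

    def poison(self, addr: int, size: int) -> None:
        """Called on free; marks every touched granule as inaccessible."""
        base = addr >> SHADOW_SCALE
        for i in range((size + 7) // 8):
            self.shadow.pop(base + i, None)

    def check(self, addr: int, size: int) -> bool:
        """Check an access of at most 8 bytes that stays within one granule."""
        k = self.shadow.get(addr >> SHADOW_SCALE, -1)
        if k == 0:
            return True
        if k < 0:
            return False
        return (addr & 7) + size <= k


heap = ShadowMap()
heap.unpoison(0x1000, 13)          # allocator hands out 13 bytes
in_bounds = heap.check(0x100C, 1)  # last valid byte
overflow = heap.check(0x100D, 1)   # one past the end
```

A failed check is the point where the fault, the bound interrupt, or the branch to an x86 handler mentioned earlier would be raised.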
And I didn't see, maybe I missed the numbers, how much faster it is than this initial version. Do you mean the initial hardware sanitizer version or the software address sanitizer? I mean, let's say KASAN, the kernel address sanitizer for the Linux kernel, which is the usual one, and
your approach. We only performed a micro benchmark on Angry OS and we essentially took the instrumentation as emitted by the compiler for a certain memory access, which is your standard software sanitizer, and compared it to our version using only the modified bound instruction.
So I really can't talk about how it compares to KASAN or some real-world implementation because we only have the prototype and basic instrumentation. Thank you very much. Okay, microphone number four, please. Thanks for the talk. And did you find any
weird microcode implementations? I don't mean security-wise, just like you weren't expecting to see the implementation, for it to be implemented that way.
The problem is there's a lot of microcode to begin with. You have F000 triads, each of which has three opcodes, so you have a lot of code to cover and also we have readout errors.
Sometimes you're seeing bit flips, which kind of slows you down because you always need to consider, okay, maybe this register is something else, maybe this address is wrong, and also sometimes you have a dust particle that kind of knocks out an entire region. So we only looked at the components we were pretty sure that we
recovered correctly and we only looked at a really tiny subset compared to all of the microcode. It's just not feasible to go through it and look at everything. So no, we didn't find anything funny but we also wouldn't know what funny looks like because we don't know what the official spec for microcode is.
Interesting. We have one question from the internet, from the Signal Angel, please. This is still based on the work on our first talk and this only works on pretty old ones, K8, K10, CPUs produced until 2013. This was the last year AMD produced anything like that.
Newer ones use some public key-based cryptography from what we can tell and we haven't yet managed to break it. Same goes for Intel: they seem to be using public key cryptography and we haven't gotten a foot in the door yet. Thank you. We go on with microphone number three,
please. Yeah, thank you. I would like to know how complex could the microcode programs be that you could write. So what's the complexity of new operations you could implement?
The only limiting factor is the size of your microcode update RAM, but this one is really, really limited. For example, on K8 where we performed the majority of our experiments, we are limited to 32 triads, which comes down to 96 instructions, and you also have some constraints on these instructions. For example, the next triad will always be executed no matter
what. Some operations can only go in the second slot, some can only go in another slot. So it's really, really hard, and you're also limited, from our knowledge, to loading 16-bit immediates instead of 32-bit or even 64-bit immediates. So your whole program grows really
fast if you're trying to do something complex. For example, our authenticated microcode update mechanism is the most complex one we wrote. It nearly fills out the RAM, and we used TEA, the Tiny Encryption Algorithm, because that was the only one we managed to fit, mostly due to the S-boxes and other constants we would need to load for other ciphers. So it's really small.
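For reference, TEA needs only shifts, additions, XORs and a single 32-bit constant, which is why it fits where ciphers with S-box tables would not. The following is the standard TEA block function written in Python, not the microcode implementation from the talk. The helper `halves` is a hypothetical illustration of how a 32-bit constant splits into two 16-bit immediates under the limit mentioned above.

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9  # the only constant TEA needs (derived from the golden ratio)
ROUNDS = 32


def halves(value: int) -> tuple[int, int]:
    """Split a 32-bit constant into (high, low) 16-bit immediates."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def tea_encrypt(v0: int, v1: int, key: tuple[int, int, int, int]) -> tuple[int, int]:
    """Encrypt one 64-bit block (two 32-bit words) with a 128-bit key."""
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(ROUNDS):
        total = (total + DELTA) & MASK32
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
    return v0, v1


def tea_decrypt(v0: int, v1: int, key: tuple[int, int, int, int]) -> tuple[int, int]:
    """Inverse of tea_encrypt: run the rounds backwards."""
    k0, k1, k2, k3 = key
    total = (DELTA * ROUNDS) & MASK32
    for _ in range(ROUNDS):
        v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
        v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        total = (total - DELTA) & MASK32
    return v0, v1


key = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)  # arbitrary example key
block = (0xDEADBEEF, 0x0BADF00D)                        # arbitrary example block
cipher = tea_encrypt(*block, key)
plain = tea_decrypt(*cipher, key)
```

Each 32-bit value (the constant and the four key words) costs two 16-bit loads, which gives a feel for how quickly a budget of 32 triads is used up.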
Thank you. Microphone number one. Okay, so you said the microcode is used for issuing ops to the scheduler and the micro-op queue in some way. Did you find out how that works?
In essence, we are not actually executing code inside the microcode engine. From what we understand, the microcode engine is just some kind of a software-based recipe that describes how to decode an instruction. So you don't actually get execution, you just commit
instructions into the pipeline that do what you want. And because we have some control flow possibilities that are actually inside the microcode engine, because you can branch to different addresses, you can conditionally branch and loop, you kind of get an execution, but in essence you just commit stuff in the pipeline and the CPU does what you tell it to.
One more question. Microphone number two, please. How did you take the picture of the internal CPU? Did you open it? Yeah, we worked together with Chris. He's a hardware guy. He has access to the equipment to de-layer it, to take high-resolution optical shots, and to also take shots with a scanning
electron microscope. So I think about five or six CPUs were harmed in the making of this paper. So we have one last question. Microphone number two, please. Are you aware of the research done by Christopher Domas, where he mapped out the
instruction set for x86 processors? Yeah, we actually talked with him. We are aware that there is a map essentially of the instruction set, and also maybe you can
combine it, because in the beginning we reverse engineered where certain x86 instructions are implemented in the microcode. So if you plug these two together, you kind of map out the whole microcode ROM at the same time that you map out the whole instruction set. However, there are some components of the microcode ROM that are most likely not triggered by instructions. For example, things like power management or everything that is
behind a write MSR or read MSR. Write MSR is a single instruction, but depending on the arguments you give it, it just branches to totally different triads. And the microcode update itself is implemented in microcode, and this one is a huge chunk you wouldn't even find without brute forcing all combinations for all instructions, which is not really
feasible. Thank you. Thank you Benjamin.