
ERASER: Early-stage Reliability And Security Estimation for RISC-V


Formal Metadata

Title: ERASER: Early-stage Reliability And Security Estimation for RISC-V
Subtitle: An open source framework for resilience/security evaluation and validation in RISC-V processors
Number of Parts: 490
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract:
RISC-V processors have gained acceptance across a wide range of computing domains, from IoT to embedded/mobile-class and even server-class processing systems. In processing systems ranging from connected cars and autonomous vehicles to those on board satellites and spacecraft, these processors are targeted to function in safety-critical systems, where Reliability, Availability and Serviceability (RAS) considerations are of paramount importance. Along with potential system vulnerabilities caused primarily by random errors, these processors may also be sensitive to targeted errors, possibly from malicious entities, which raises serious concerns regarding the security and safety of the processing system. Consequently, such systems necessitate the incorporation of RAS considerations right from an early stage of processor design. While the hardware and software ecosystem around RISC-V has been steadily maturing, there have, however, been limited developments in early-stage reliability-aware design and verification.

The Early-stage Reliability And Security Estimation for RISC-V (ERASER) tool attempts to address this shortcoming. It consists of an open source framework aimed at providing directions for incorporating reliability and security features at an early, pre-silicon stage of design. These features may include what kind of protection to apply and which components within the processor to apply it to. The proposed infrastructure comprises an open source toolchain for early-stage modeling of latch vulnerability in a RISC-V core (SERMiner [1]), a tool for automated generation of stressmarks that maximize the likelihood of a transient-failure-induced error (Microprobe (RISC-V) [2]), and verification by means of statistical and/or targeted fault injection (Chiffre [3]). While the infrastructure targets any core that uses the RISC-V ISA, the repository provides an end-to-end flow for the Rocket core [4]. ERASER thus evaluates "RAS-readiness", or the effectiveness of protection techniques in processor design such that processor vulnerability in terms of Failures-In-Time (FIT) rate is minimized, for a specified power/performance overhead. FIT rate is defined as the number of failures in one billion hours of operation and is a standard vulnerability metric used in industry.

ERASER is an open source tool available for download at https://github.com/IBM/eraser. The tool currently supports analysis of all latches in the design across a single Rocket core and the generation of stressmarks that can be used to evaluate the vulnerability of these latches. In addition to radiation-induced soft errors, we plan to extend ERASER to also model errors due to voltage noise and thermal- and aging-induced failures, both in memory and logic, and to generate representative stressmarks. ERASER is an initial effort to devise a comprehensive methodology for RAS analysis, particularly for open-source hardware, with the hope that it spurs further research and development into reliability-aware design in both industry and academia.

References:
[1] K. Swaminathan, R. Bertran, H. Jacobson, P. Kudva, P. Bose, "Generation of Stressmarks for Early-stage Soft-error Modeling", International Conference on Dependable Systems and Networks (DSN), 2019.
[2] S. Eldridge, R. Bertran, A. Buyuktosunoglu, P. Bose, "MicroProbe: An Open Source Microbenchmark Generator, ported to the RISC-V ISA", 7th RISC-V Workshop, 2017.
[3] S. Eldridge, A. Buyuktosunoglu, P. Bose, "Chiffre: A Configurable Hardware Fault Injection Framework for RISC-V Systems", 2nd Workshop on Computer Architecture Research with RISC-V (CARRV), 2018.
[4] Krste Asanović, Rimas Avižienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Palmer Dabbelt, John Hauser, Adam Izraelevitz, Sagar Karandikar, Benjamin Keller, Donggyu Kim, John Koenig, Yunsup Lee, Eric Love, Martin Maas, Albert Magyar, Howard Mao, Miquel Moreto, Albert Ou, David Patterson, Brian Richards, Colin Schmidt, Stephen Twigg, Huy Vo, Andrew Waterman, "The Rocket Chip Generator", Technical Report UCB/EECS-2016-17, EECS Department, University of California, Berkeley, April 2016.

The attached figure shows a representative flow for the RAS estimation methodology. An initial characterization of all instructions in the RISC-V ISA is carried out via RTL simulation using an existing core model (e.g. the Rocket core). The simulation is configured to generate VCD (Value Change Dump) files for every single-instruction test case. The SERMiner tool parses these VCD files to determine latch activities across the core, aggregated at a macro (or RTL module) level. Based on these per-instruction latch activities, SERMiner outputs an instruction sequence, which forms the basis of the SER stressmark to be generated by Microprobe (RISC-V). Microprobe (RISC-V) is a microbenchmark generation tool capable of generating microbenchmarks geared towards specific architecture- and microarchitecture-level characterization. One of its key applications is the generation of stressmarks, or viruses, that target various worst-case corners of processor operation. These stressmarks may be targeted at maximizing power, voltage noise, temperature, or, as in the case of this tool, soft-error vulnerability. The generated stressmark is then used to produce a list of latches that show a high residency and hence a high SER vulnerability. These latches are the focus of fault-injection-based validation experiments using the Chiffre tool. Chiffre provides a framework for automatically instrumenting a hardware design with run-time configurable fault injectors. The vulnerable latches obtained from running the generated stressmarks through the Rocket core model, and then through SERMiner, are earmarked for targeted fault injection experiments using Chiffre. The objective of these experiments is to further prune the list of vulnerable latches by eliminating those that are derated, that is, those that do not affect the overall output even when a fault is injected in them. Focusing any and all protection strategies on this final list of latches would maximize RAS coverage across the entire core.

Ongoing and future work: ERASER currently only supports analysis of all latches in the design across a single Rocket core, and the generated stressmarks can be used to evaluate the vulnerability of these latches. Most on-chip memory structures, such as register files and caches, are equipped with parity/ECC protection and are as such protected against most radiation-induced soft errors. However, they are still vulnerable to supply voltage noise, thermal- and aging-induced failures, and other hard or permanent errors. We plan to extend ERASER to model such errors, both in memory and logic, and to generate stressmarks representative of worst-case thermal emergencies and voltage noise, in addition to soft errors.
Transcript: English (auto-generated)
Hi, I'm Karthik Swaminathan, also from IBM Research, and we're presenting some of the work we have done on developing an early-stage reliability and security estimation tool for RISC-V processors, called ERASER.
As I don't need to elaborate, reliable operation of a design is essential for pretty much every domain, ranging from servers and hyperscale systems down to embedded systems, autonomous driving systems in particular, mobile phones, and so on. Processors at either end of that spectrum, whether high-performance, server-class machines like IBM POWER9 or, as in this case, a RISC-V Rocket core that could be fitted into something like an autonomous driving system, are vulnerable to several sources of errors. One of the main issues is radiation-induced soft errors: cores deployed in the field in particular are exposed to alpha particles, beta and gamma rays, and so on, which can cause random bit flips and consequent errors. In addition, there can also be targeted errors, due to something like Rowhammer attacks, where specific bits, particularly in memory, can be flipped, and this can cause major security violations. So we need a methodology to incorporate protection and mitigation against these kinds of errors right from an early stage of design, and that is what we propose to do with the ERASER tool, an open-source framework for this kind of reliability and security evaluation.
For broader context, the preceding two talks from Luka and Schuyler also fall within the larger ambit of the DARPA-sponsored DSSoC program, which looks at an entire stack for building heterogeneous SoCs. In this talk we focus particularly on the security and reliability of a design, in this case CPUs, though this can easily be extended to a whole bunch of other hardware units, as shown here. Just a quick overview of some of the terms: FIRRTL, which Schuyler has already gone through, I won't repeat, and I would like to focus on just a couple of metrics here. One is RAS, which is how processors are usually qualified in terms of resilience and reliability. The other is residency, which is the amount of time for which a latch value remains unchanged. This is a key metric that we will be considering in our reliability evaluation, and we evaluate it as the total number of execution cycles divided by the number of data switches. Finally, we also have Failures In Time (FIT), the number of failures in a billion hours of operation, which is the industry-standard metric for quantifying processor vulnerability.
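As a minimal illustration of that residency definition (execution cycles divided by value changes), here is a small sketch; it is my own, not code from ERASER, and the latch names and toggle counts are made up:

```python
# Residency as defined above: total execution cycles / number of data switches.
# A latch that rarely switches holds its value longer and is more SER-exposed.

def residency(total_cycles: int, num_switches: int) -> float:
    # A latch that never switches holds its value for the whole run.
    return float(total_cycles) if num_switches == 0 else total_cycles / num_switches

# Hypothetical latch names and toggle counts, purely for illustration.
toggles = {"core.alu.result_reg": 250, "core.csr.mstatus_reg": 2}
cycles = 1000
for latch, n in toggles.items():
    print(f"{latch}: residency = {residency(cycles, n):.1f} cycles")
```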
It is possible to carry out this kind of evaluation at various stages of processor design, from an analytical model, to a cycle-accurate simulator, to RTL simulation, FPGA-based emulation, and finally processor fabrication. You will notice that at the first two stages there is not enough information about the physical design, particularly in terms of the latches, their sizes, and their vulnerabilities, while at the last two stages it is probably too late to effect any changes. It can be argued that you can still have some significant design input at the FPGA stage, but in this case we focus on the RTL simulation stage, where we can actually look at the latches, carry out simulations around this methodology to evaluate their vulnerability, and proactively make design changes to mitigate it. That is why we have the ERASER tool, which can evaluate the RAS readiness of a processor, as well as the effectiveness of existing protection techniques and whether even more protection is needed. It provides a comprehensive framework for this kind of vulnerability estimation at such a pre-silicon stage.
Here is an overview of some of the components used in ERASER. One of them is Microprobe, a tool developed primarily for IBM systems, POWER and Z systems in particular, as an automated microarchitectural test case generation methodology; it has been used heavily at various stages of design of these IBM systems. The SERMiner tool, which I developed in collaboration with colleagues, automates the generation of these kinds of SER stressmarks, originally for POWER, based on utilization- or clock-switching-based metrics; porting this flow to RISC-V is essentially what we present in this talk. SERMiner, as I mentioned, looks at the switching files and generates latch-level switching statistics. Finally, there is Chiffre, a fault injection tool developed by Schuyler, which performs statistical and targeted fault injection into latches within a RISC-V core. It leverages some of the FIRRTL passes that he talked about earlier and has a wide range of applicability in this space.
As an overview of the entire ERASER tool flow: we generate single-instruction test cases for all the instructions in the RISC-V ISA. These are run through a RISC-V core model; in this case we adopt the Rocket core, but this can easily be extended to other cores, since the flow depends only on the ISA. We generate VCD files from RTL-level simulations, in this case using the Rocket Chip emulator, extract macro-level (RTL module-level) switching information, and use that to obtain residency information, which is then used to generate a stressmark. The stressmark is run through a similar flow of emulation and macro-level switching analysis to produce a set of vulnerable latches. Finally, we apply targeted fault injection to these vulnerable latches using the Chiffre tool. This gives us a final set of latches that we deem vulnerable, for which we can determine what kind of protection needs to be adopted. Summarizing from the previous slide, the key features of ERASER are: analysis of latches by means of RTL simulation; switching and residency analysis aggregated at the RTL module, or macro, level; generation of stressmarks to evaluate the worst-case vulnerability, in particular to minimize the derating of latches in case of a soft-error or radiation strike; and the validation platform that I mentioned. We demonstrate the flow on the Rocket core and are in the process of extending it to other cores as well.
As for the exact methodology for generating the stressmarks: the basic idea is that a soft-error stressmark should minimize the derating, or maximize the exposure, of a bit-flip error. This happens when the maximum number of macros is vulnerable throughout most of the execution. For example, a set of macros with high residency spread across their latches is much more vulnerable than one where the residency is concentrated in only a few latches or a few macros. So we have two metrics that we need to maximize: the latch residency and what we term the macro coverage. We use a kind of greedy algorithm that selects instructions depending on the macro residencies, as I will show here. Assume that on the vertical axis we have the macros, and for every instruction we have the residencies corresponding to each macro: for example, R11 is the residency of macro 1 when instruction 1 is run, R12 is the residency of macro 1 when instruction 2 is run, and so on. We want to focus on the most vulnerable macros, so we use a parameter called rho, the residency threshold. Rho is a user-defined parameter between 0 and 1 that can be fine-tuned to maximize the effectiveness of the generated stressmark, and we only consider the residencies of those macros that exceed a rho fraction of the maximum residency. For example, if the residency of macro 2 for some instruction is less than rho times the maximum residency seen across all instructions, we simply set it to 0.
Based on this, we define a joint SER metric in terms of the macro coverage, the residency, and the instruction CPI. For the purpose of an initial evaluation we considered single-cycle (CPI of 1) instructions, since CPI obviously depends on the clock frequency, so the joint SER metric is simply the product of the macro coverage M and the residency R. This formulation considers the entire ISA and the entire processor, but we can adapt it to a subset of instructions or macros to focus on the targeted errors I spoke about: if you want to look at a particular set of vulnerable bits, latches, or macros, we can do that as well. As we go on selecting instructions one by one, we knock out the macros they cover and continue, successively selecting instructions until all macros are covered. The sequence generated in this manner is our skeleton sequence, which is used to generate the test case; the test case is basically an infinite loop running this sequence of instructions one after the other.
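The following is a sketch of that greedy selection as I understand it from the description above. It is not SERMiner's actual implementation; the shape of the residency matrix R and the candidate score (coverage gain times mean residency) are illustrative assumptions.

```python
def build_skeleton(R: dict, rho: float):
    """R maps instruction -> {macro: residency}; rho is the residency threshold in [0, 1]."""
    macros = {m for res in R.values() for m in res}
    # Thresholding: zero out residencies below rho * (peak residency of that macro).
    peak = {m: max(res.get(m, 0.0) for res in R.values()) for m in macros}
    R = {i: {m: (r if r >= rho * peak[m] else 0.0) for m, r in res.items()}
         for i, res in R.items()}

    skeleton, covered = [], set()
    while covered != macros:
        # Joint SER metric per candidate instruction: coverage gain * mean residency
        # over the macros it newly covers (single-cycle CPI assumed, as in the talk).
        def score(i):
            new = [m for m, r in R[i].items() if r > 0.0 and m not in covered]
            return (len(new) * (sum(R[i][m] for m in new) / len(new))) if new else 0.0
        best = max(R, key=score)
        if score(best) == 0.0:
            break                  # remaining macros never exceed the threshold
        skeleton.append(best)
        covered |= {m for m, r in R[best].items() if r > 0.0}
    return skeleton                # instruction sequence to loop as the stressmark

# Tiny made-up example: residencies per instruction and macro.
R = {"add": {"alu": 0.9, "lsu": 0.1}, "lw": {"lsu": 0.8, "alu": 0.2}}
print(build_skeleton(R, rho=0.5))  # -> ['add', 'lw']
```

In the real flow, a skeleton sequence like this is what Microprobe (RISC-V) turns into an executable stressmark.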
We have some sample results. We evaluate three metrics: the residency, the coverage, and the joint metric, which is the product of the two. For the evaluated workloads, we look at the entire ISA, around 140 single-instruction test cases, and use these as the baseline, reporting the average and peak metrics across all instructions. There are also ways to generate proxies of entire workloads like SPEC, which is ongoing work. Finally, we have the stressmark that we generated, and we compute the metrics for it as well. As you can see, the stressmark is clearly worse than the maximum of the single instructions on all three metrics. This is a single data point at a residency threshold rho of around 99 percent; as we vary rho we get different values, and we can obtain even higher values of these metrics for the stressmark.
This, as I mentioned, is initial work. It is publicly available, and we encourage people to contribute different test cases, different scenarios, and different algorithms. There are ways we would like to extend it beyond SER and soft errors, to voltage noise and thermal- and aging-induced errors; to look at further kinds of architectural enhancements; and to look at uncore parameters, interconnects, the memory controller, and other components as well. We would also like to incorporate application-level derating considerations into the fault injection. This is purely latch- and microarchitecture-level analysis at the moment, but there is obviously a lot of work at the architecture and application level that we would like to incorporate as well. Finally, the Chiffre-based fault injection methodology is fairly basic right now, in that we run single tests on latches. We would like to develop an infrastructure for large-scale fault injection simulation experiments so that we have a statistically significant number of results.
That is another part which is ongoing. To summarize, we have this early-stage vulnerability modeling tool called ERASER, which we use for characterizing processor vulnerability at the latch level. We use it to generate and evaluate stressmarks that maximize latch residency and determine the most vulnerable latches, and it also comprises a fault-injection-based validation toolchain. These are some of the key links; it is all available on GitHub, released under the Apache license, and free to use. Many of the other tools, like Microprobe and Chiffre, and of course Rocket Chip, which is our evaluation core, can also be accessed through this GitHub repository. I have a brief demo for this; hopefully the sound doesn't give up on me.
OK, I don't think the sound is working, but that's OK. All it shows is the way to set up the workflow, and we just run an example test case. Unfortunately, it seems to cause my laptop to hang for some reason. How am I doing on time? OK, might as well give it a shot. OK, I think it doesn't seem to like this. Sorry about this.
OK, maybe I can just run through this. The first task is to generate the single-instruction test cases, one for each instruction in the RISC-V ISA. We generate these test cases and compile them. It's stuck again. Yeah, this shows the entire workload being compiled. We then run these through the Rocket Chip emulator to generate VCD files, which we then parse to extract latch activities. These latch activities are aggregated to obtain macro-level statistics and the kind of 2D macro-versus-instruction residency profile that I had shown earlier.
Sorry about this.
And then finally we use these macro statistics to generate the stressmarks. These are examples of the macro- and instruction-level statistics: for each macro we have the residency value across the entire ISA, for instruction 1, 2, and so on, for every single instruction. A few of the values are 0 because they have been thresholded out, as I mentioned, depending on the value of rho. And finally we use this to generate the stressmarks.
Yeah, so according to the algorithm I described earlier, these are the instructions that were output: SC, SC.V0, FCVT, and so on. We use these as the basic skeleton to generate our test case, which is run in an infinite loop.
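Purely as an illustration of what "skeleton wrapped in an infinite loop" means (this is not Microprobe's actual output, and the operands below are placeholders):

```python
# Illustrative only: wrap a skeleton instruction sequence in an infinite loop.
# A real Microprobe-generated test case also takes care of register and memory
# initialization, alignment, and so on.
skeleton = ["sc.w t0, t1, (t2)",     # placeholder operands, for illustration
            "fcvt.s.w ft0, t3"]

def emit_testcase(insns):
    lines = [".globl _start", "_start:", "loop:"]
    lines += [f"    {i}" for i in insns]
    lines.append("    j loop          # repeat the skeleton forever")
    return "\n".join(lines)

print(emit_testcase(skeleton))
```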
These are then evaluated again, run through the flow, and the list of most vulnerable latches is obtained from this evaluation. We then carry out the fault injection methodology I described. I didn't include the fault injection in the demo, because we would like to do it in a larger-scale environment. So, sorry about the demo, but this is a basic overview of the way the tool works.
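Conceptually, the fault-injection pruning step looks something like the sketch below. This is not Chiffre's API; run_with_fault is a hypothetical wrapper around an instrumented simulation of the stressmark, and the outcome labels are assumptions.

```python
import random

def run_with_fault(latch: str, cycle: int) -> str:
    """Hypothetical: flip the given latch at the given cycle during a stressmark
    run on the instrumented design and report 'masked', 'sdc', or 'crash'."""
    raise NotImplementedError

def prune_derated(vulnerable_latches, runs_per_latch=100, total_cycles=100_000):
    """Keep only latches where at least one injected fault was not masked."""
    still_vulnerable = []
    for latch in vulnerable_latches:
        outcomes = [run_with_fault(latch, random.randrange(total_cycles))
                    for _ in range(runs_per_latch)]
        if any(o != "masked" for o in outcomes):
            still_vulnerable.append(latch)
    return still_vulnerable
```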
We would encourage you to contribute to it, and I would be happy to take any questions. Any questions?
Yep. Any questions?