
Automating Programming and Development of Heterogeneous SoCs with LLVM Tools


Formal Metadata

Title
Automating Programming and Development of Heterogeneous SoCs with LLVM Tools
Number of Parts
490
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Historically, programming heterogeneous systems has been quite a challenge. While programming support for basic general-purpose accelerators such as GPUs has become quite mature in many ways, general heterogeneous SoCs in particular can feature a much broader range of accelerators in their efforts to minimize power consumption while maximizing performance. Many SoCs, though, are designed with accelerators tailored for the domain -- such as signal processing -- in which they’ll be used: Domain-Specific SoCs. As SoC platforms become ever-more heterogeneous, we think that application developers shouldn’t need to waste time reading datasheets or APIs for SoC-specific kernel extensions just to take full advantage of their hardware. With this in mind, in this talk we will discuss strategies we are using to automate mapping of LLVM-compatible languages to heterogeneous platforms with no intervention (not even #pragmas) from the programmer. To this end, we present our prototype of a software stack that seeks to address both of these needs. To meet the first need, we developed an LLVM-based hybrid compile/run-time toolchain to extract the semantic operations being performed in a given application. With these semantic operations extracted, we can link in additional libraries that enable dispatch of certain kernels (such as a Fast Fourier Transform) to accelerators on the SoC without user intervention. To evaluate the functionality of this toolchain, we developed a runtime system built on top of QEMU+Linux that includes scheduling and task dispatch capabilities targeting hypothetical SoC configurations. This enables behavioral modeling of these accelerators before silicon (or even FPGA) implementations are available. The focus here will be on the LLVM-mapping aspects, but a brief overview of our SoC simulation environment will be presented as well.
Transcript: English (auto-generated)
So, hi everyone, I'm Joshua Mack, this is Nirmal Kumbari, and we're talking about automating programming and development of heterogeneous SoCs with LLVM tools.
So to give a quick background on who we are, we're a collaboration between Arizona State University, University of Arizona, Michigan, Carnegie Mellon, and two industry partners, ARM and General Dynamics. The people specifically behind the work in this presentation, other
than myself and Nirmal, are listed up here, because without the work of the team together we wouldn't actually have results to show today. And so, some background on what this collaboration of ours is trying to do:
historically, you know, there's always a performance gap between just general purpose CPUs and ASICs, and you can kind of span the gap there with different kinds of heterogeneous platforms, and ultimately all our collaboration is trying to do is kind of close that gap a bit. So we want to build something that is still pretty general, like still usable easily
by programmers, but can help approach kind of the ASIC performance trend line. And the way we want to do that is by building a heterogeneous chip. This heterogeneous chip will have a variety of different accelerators, standard traditional CPU cores, and what we want to do with this chip specifically is we don't want it
to try and be, you know, let's compute all the things. We want it to be focused and have kind of a purpose to it. So we're not trying to do a general purpose heterogeneous chip. We're trying to take kind of a domain of applications like signal processing and see
if by kind of restricting the set of applications you're using you can come up with some new clever optimizations on how you then build for that chip. And so where the real emphasis kind of is in this project is we want to focus on making the tooling as easy to use as possible, because from a custom hardware point of view,
you can pretty easily build something that no one wants to use, and so the real selling point is going to be not that you can build a chip, but that you can get a workflow that makes developers want to use it as well. And ultimately what we would want out of this collaboration isn't just a single chip
either. We want some kind of repeatable methodology for given a new domain, how do you go about finding what accelerators to include and building a tool chain that will let you target that. And so traditionally you can think of computing as like a three-layer cake where you have
the hardware on the bottom, and then there's some resource management like an operating system in between that provides nice interfaces for developer apps, and then the applications sit on top. And when you try to apply this to a DSoC, you kind of have some questions that come up. So in the hardware side, what accelerators do you want to include?
It's a bigger question now than just how many cores do you want to have, or what frequency are they going to run at. What fundamental accelerators, what pieces of your domain are worth accelerating in the first place? How do you launch work onto accelerators in some kind of standardized way because this
chip is going to need to handle a variety of accelerators, and we want the interfaces for that to be simple. What exactly are we scheduling? With a lot of heterogeneous programming, you might say we've optimized this for the GPU, and what that does is just statically at compile time links in support for the GPU
and it always runs there every time you run the binary. But in a world where everything, every application is heterogeneous and all using the resources of your chip, you might want a smarter approach where you have more of a flexibility that you see in CPU scheduling that you can fall back on to different implementations on
different devices depending on the system workload. And then another question that you get with this is how do you integrate new applications to this framework because, as we said, the usability is incredibly important. Along the same lines as usability worth mentioning is how do you debug accelerators?
There's no stack, there's no instruction pointer, there's no registers necessarily in a generic accelerator, and so how do you provide interfaces that make that easy? In this talk we're really only going to focus on just how do you integrate new applications,
but it's worth noting that we have people thinking about all of the questions on the previous slide. And so for the flow that we're presenting today, we have a prototype toolchain that uses dynamic tracing to collect an application trace and then uses that to try and recognize
relevant kernels in your code and perform some additional analysis that allows us to retarget them for different accelerators in the system without the user needing to intervene. And so to step through the process here, we start with there's an open source toolchain
called Trace Atlas that has been worked on heavily by Richard Ury at ASU. And what Trace Atlas does is it is a whole toolchain for collecting runtime-based dynamic application traces from LLVM code. And so the process of how it works is there's some libraries for implementing the tracing
methods, and you inject those tracing calls through an LLVM opt pass. You then compile the binary and run it, and that produces a complete application trace. And then because of some dynamic trace compression that uses Zlib that we've implemented, along
with some other clever techniques on choosing what and what not to trace, we're able to actually make this usable for a very wide range of applications and not just kind of trivial small examples without running out of disk space. So like I said, built on Zlib, and compared to state-of-the-art, it's between two and
a thousand X reduction in time to collect the trace, and two X reduction in on-disk trace size.
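The actual Trace Atlas trace format isn't described in the talk, but the compression idea can be pictured as streaming every trace event through zlib's gzip interface so the full, uncompressed trace never has to exist on disk. A minimal sketch, with illustrative hook names rather than the real ones:

```cpp
// Minimal sketch (illustrative, not Trace Atlas's real trace format): the
// tracing hooks called from the instrumented binary stream every event
// through zlib, so only the compressed trace ever touches disk.
#include <cstdint>
#include <zlib.h>

static gzFile g_trace = nullptr;

extern "C" void trace_open(const char *path) {
  g_trace = gzopen(path, "wb");                  // gzip-compressed stream
}

extern "C" void trace_bb_enter(uint64_t bb_id) { // basic-block entry event
  gzprintf(g_trace, "B %llu\n", (unsigned long long)bb_id);
}

extern "C" void trace_mem(char kind, const void *addr) { // 'L' load, 'S' store
  gzprintf(g_trace, "%c %p\n", kind, addr);
}

extern "C" void trace_close() {
  if (g_trace) gzclose(g_trace);
}
```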
And so once you have the trace collected, then you can analyze that and see not just statically what the program's behavior could be, but actually what the behavior was. And so to do that, we use this concept of kernels, which are basically just groups of basic blocks that are highly correlated, and they recur very frequently throughout the program. And there's this notion of kernel affinity, which is basically the transition probability between any two kernels in your original source program.
And what we can do with that is we can, because we have this application trace, we can collect or we can determine empirically what all of the affinities were in our given application execution, and then we can cluster basically all of the related basic blocks together.
And then, so this provides a series of collections of basic blocks where you end up with, say, these blocks here recur very frequently in the source program, so those are related to each other, and then these blocks recur very frequently. And then the other blocks are not considered kernels.
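The exact affinity math lives in the Trace Atlas paper rather than in the talk, but the empirical flavor of it can be illustrated as counting successive basic-block pairs in the trace and normalizing them into transition probabilities, which the clustering then works from. A rough sketch with made-up names:

```cpp
// Rough sketch: estimate block-to-block transition probabilities from a
// trace of basic-block IDs; clustering on these (plus recurrence counts)
// is what yields kernel candidates. Names here are made up.
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using BlockId = std::uint64_t;
using Edge = std::pair<BlockId, BlockId>;

std::map<Edge, double> transition_probabilities(const std::vector<BlockId> &trace) {
  std::map<BlockId, std::uint64_t> total;       // how often each block was left
  std::map<Edge, std::uint64_t> pair_count;     // block -> successor counts
  for (std::size_t i = 0; i + 1 < trace.size(); ++i) {
    ++total[trace[i]];
    ++pair_count[{trace[i], trace[i + 1]}];
  }
  std::map<Edge, double> prob;
  for (const auto &entry : pair_count)
    prob[entry.first] =
        double(entry.second) / double(total[entry.first.first]);
  return prob;
}
```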
And what we can do from there is, say if the original source program had looked something like this, where these were the two kernels that were labeled before, we can think of this as a directed acyclic graph representation, where the transitions between each kernel or non-kernel boundary kind of gets grouped together into a node of LLVM
basic blocks. And so what we can do on top of that is, because we know that the kernels are important sections of the code, we can add additional analysis on that, and if we can detect that, say, this kernel was a fast Fourier transform, then based on our knowledge of the DSoC that
we're targeting, we can add in support for other platform invocations automatically without the user having to define what those are. And similarly, if this is, say, forward error correction, we can do a similar process.
And the result of that is that you have this acyclic graph where the actual platform that you're dispatching on isn't determined at compile time, and so you build in support for any possible platform for each node into the binary, and that gives you kind of a fat binary structure.
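The fat-binary layout itself isn't spelled out in the talk; one way to picture it is that each DAG node carries one entry point per supported platform, and the platform choice is left to the runtime. A purely illustrative sketch:

```cpp
// Purely illustrative picture of the "fat binary" idea: each DAG node ships
// one entry point per supported platform, and the runtime scheduler (not the
// compiler) decides which one to call.
#include <cstddef>

enum class Platform { CpuCore, FftAccelerator };

// Stateless node entry point: all state arrives through memory references.
using NodeFn = void (*)(void **args, std::size_t nargs);

struct NodeImpl {
  Platform platform;
  NodeFn entry;
};

struct DagNode {
  const char *name;       // e.g. a hypothetical "node_7_dft"
  const NodeImpl *impls;  // one implementation per supported platform
  std::size_t nimpls;     // a plain CPU fallback is always among them
};
```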
And together with the DAG metadata, that is what we output to be considered an application. And then to run this application, we could try and target Linux directly, but it was more advantageous for us getting started to use a user-space-based application runtime.
And so the way this ends up working is, we essentially allocate pthreads to act as schedulable resources in our DSoC system. We manage a ready queue of tasks, so you inject some applications and then you can,
because of the graph structure, you can tell all the dependencies. And then we are able to implement custom resource management techniques that involve different heterogeneous scheduling heuristics that wouldn't be present in just a standard Linux-based kernel scheduler. And so the question is, why did we do this in user-space? Because it allows us to iterate
a lot faster while we're still in this early pre-silicon development phase. So it lowers turnaround time, and it allows for easy adding of accelerators, because you can just directly use memory-mapped registers or whatever interface you need to access whatever is in your design. And really it allows us to co-evolve all the software the developers are writing along with the hardware designs, all in one easy environment.
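As a sketch of that user-space runtime idea (not the project's actual code), worker threads stand in for the schedulable resources, and a mutex-protected ready queue holds tasks whose DAG dependencies have already been satisfied; std::thread is used here in place of the raw pthreads mentioned in the talk:

```cpp
// Minimal sketch of a user-space runtime: one worker thread per modeled
// resource (CPU core or accelerator proxy) pulling from a shared ready queue.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Task { std::function<void()> run; };

class ReadyQueue {
  std::queue<Task> q_;
  std::mutex m_;
  std::condition_variable cv_;
public:
  void push(Task t) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(t)); }
    cv_.notify_one();
  }
  Task pop() {                       // blocks until a task is ready
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return !q_.empty(); });
    Task t = std::move(q_.front());
    q_.pop();
    return t;
  }
};

void run_workers(ReadyQueue &rq, unsigned nresources) {
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < nresources; ++i)
    workers.emplace_back([&rq] {
      for (;;) {
        Task t = rq.pop();
        if (!t.run) break;           // empty task acts as a shutdown signal
        t.run();
      }
    });
  for (auto &w : workers) w.join();
}
```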
And so with that together, we end up with this overall flow, where the left-hand side is the dynamic tracing analysis and the right-hand side is the runtime. And to give an idea of what we want applications to look like, before we had the compilation process going, we had some handwritten applications.
And so this is one of the test apps that we had written, where essentially the source code on the left here doesn't actually contain any of the original knowledge about what the sequence of calls needed to be to recreate this original application.
It's just each section of the code has been segmented out into basically a stateless function with all of the arguments and state passed in as memory references. And so what we can do with this is we can have the different nodes in the DAG, and
then if a particular node supports multiple platforms, we can have multiple different functions that it can dispatch to. And then this couples with just a JSON-based DAG representation that includes information about essentially all the memory requirements of the application (what do variables need to be allocated as, what values do they need to be initialized with), along with the
dependency structure and which arguments each kernel requires. And so with that together we're able to run that through our system, but ultimately we don't want to rewrite everything to be like that. We want to just take a simple C code, which is used as the example here only because
it's small enough to fit on a slide, and we want to turn that into something that can run in this system. And so we're going to work through an example with this, and for reference this is the output of what this gives if you just compile and run it. So stepping through the high-level process from earlier, we start by just compiling to
intermediate LLVM IR, and once we do that we renumber the basic blocks with a very simple opt-pass that basically just allows us to coordinate which basic blocks belong to which kernels after we've done our analysis phase.
So we see here that these basic blocks are labeled, we instrument with the dynamic tracing calls, so we can see here that we do things like dump IDs of basic blocks on basic block entry, dump loads and stores, and then that gets linked in the back end.
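A heavily simplified sketch of what an instrumentation pass along these lines can look like (this is not the actual Trace Atlas pass, and it only handles block-entry events, not loads and stores): insert a call to an external tracing hook, like the trace_bb_enter above, at the top of every basic block.

```cpp
// Heavily simplified sketch of a block-entry tracing pass (not the real
// Trace Atlas pass): call an external hook with a block ID at the top of
// every basic block in the module.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct TraceBlocksPass : PassInfoMixin<TraceBlocksPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    LLVMContext &Ctx = M.getContext();
    // void trace_bb_enter(i64), provided by the tracing runtime library
    // that gets linked in afterwards.
    FunctionCallee Hook = M.getOrInsertFunction(
        "trace_bb_enter", Type::getVoidTy(Ctx), Type::getInt64Ty(Ctx));

    uint64_t NextId = 0;  // the real flow uses the renumbering pass's IDs
    for (Function &F : M) {
      if (F.isDeclaration())
        continue;
      for (BasicBlock &BB : F) {
        IRBuilder<> B(&*BB.getFirstInsertionPt());
        B.CreateCall(Hook, {B.getInt64(NextId++)});
      }
    }
    return PreservedAnalyses::none();
  }
};
```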
So we compile the tracer binary, we run that binary, and running that binary produces the output trace of the application. And what we can do then is use Trace Atlas' kernel detection mechanisms to detect which basic blocks in the original code are considered kernels by their definition, as well as what
the producer-consumer relationships were between those various kernels. So we can see here that we have two kernels, basic blocks 1, 2, 3, and 5, 6, 7, and then kernel 1 consumes from kernel 0. And so together with all of this information, we're able to take the original LLVM IR that's
been unchanged from the user, and we can use this to refactor the application to essentially outline each of the sections that need to be outlined and create a DAG-based application. So in this case, the kernels are 1, 2, 3, and 5, 6, 7, and then the non-kernel blocks
in between correspond to these sections of the original code, where these two nodes were essentially clustered as kernels because they were considered hot enough, but say this for loop here wasn't, and so it was grouped with all of the rest of this here.
At the same time, we are able to analyze the memory requirements in the application, so we essentially build kind of a symbol table where we just determine, okay, how big does each variable need to be, and if it's initialized with anything within a reasonable search distance of where the allocation happened, can we resolve that to be constant? If it's a pointer, can we try and resolve any malloc calls as constant, essentially? And from there, we outline each of those sections of the code into different nodes, and so the changed LLVM IR essentially looks like this, where nothing happens in
between each of the node calls, and then we can generate the JSON-based DAG that calls those same nodes in the same sequence, along with allocating all of the variables that are necessary.
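As a concrete (and entirely illustrative) picture of what comes out of this step, an outlined node and a matching DAG entry might look roughly like the following; the field names are assumptions made for this sketch, not the toolchain's actual schema:

```cpp
// Illustrative only: an outlined, stateless node plus the kind of JSON DAG
// entry that rides alongside it (names and fields are made up here).
#include <cstddef>

// All live state comes in through memory references; the node itself has no
// knowledge of what ran before or after it.
extern "C" void node_1_kernel(const double *input, double *output,
                              std::size_t num_samples);

// A matching DAG entry, shipped next to the compiled shared object.
static const char *kNode1DagEntry = R"json(
{
  "name": "node_1_kernel",
  "platforms": ["cpu"],
  "args": [
    { "buffer": "input",  "bytes": 4096, "producer": "node_0" },
    { "buffer": "output", "bytes": 4096, "consumer": "node_2" }
  ]
}
)json";
```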
And so with these two together, we're able to compile this new LLVM into a shared object, and then with that shared object coupled with the JSON, we can hand that off to our runtime and run it through the flow. So just to validate that all of this stuff we've done has actually preserved the functionality,
we'll run five instances of this app here, and we'll note that the output here matches the output from before. And so with that, I'll hand it over to Nirmal to explain how we then use this for more advanced applications. Thank you, Josh.
So as a part of this project, we have created our user-space scheduling framework to rapidly evaluate the different solutions that we are coming up with for the target DSoC. So this framework is designed to run in user space, which makes it portable across different virtual and hardware platforms.
So for today's discussion, I'm going to use this particular DSoC data flow as our target DSoC. And this particular target DSoC is composed of a quad-core ARM processor and two FFT accelerators. These FFT accelerators communicate with main memory using DMA IPs.
And these IPs are mainly used for bulk data transfer. And we have this ARM processor, which uses a memory-mapped interface to configure our accelerators and IPs.
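The talk doesn't go into how the accelerators are actually driven; the generic user-space pattern that "memory-mapped interface" usually implies on a Linux-based Zynq setup is to map the IP's physical register window and poke control registers directly. A sketch under that assumption, with made-up addresses and register offsets:

```cpp
// Illustrative user-space MMIO access pattern (addresses and register
// offsets are invented): map the accelerator's register window via /dev/mem
// and write its control registers directly.
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

volatile uint32_t *map_accel_regs(uintptr_t phys_base, size_t span) {
  int fd = open("/dev/mem", O_RDWR | O_SYNC);
  if (fd < 0) return nullptr;
  void *p = mmap(nullptr, span, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
                 static_cast<off_t>(phys_base));
  close(fd);                 // the mapping stays valid after close
  return p == MAP_FAILED ? nullptr : static_cast<volatile uint32_t *>(p);
}

void start_fft(volatile uint32_t *regs, uint32_t dma_src, uint32_t dma_dst,
               uint32_t num_points) {
  regs[1] = dma_src;         // hypothetical register layout
  regs[2] = dma_dst;
  regs[3] = num_points;
  regs[0] = 0x1;             // "go" bit
  while ((regs[0] & 0x2) == 0) { /* spin until "done" bit is set */ }
}
```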
So we use our framework to emulate this target DSoC on top of a real hardware platform, the ZCU102. And we also demonstrate its portability by running the same platform on top of Xilinx QEMU, which is a virtual platform. So the ZCU102 board consists of a Zynq MPSoC, which has an on-chip quad-core ARM processor and also on-chip programmable logic.
We use the programmable fabric of the Zynq SoC to implement our accelerators and IPs. And we use the ARM processor as-is to replicate the emulation of this particular target DSoC. So the benefit of using the ZCU102 is that it helps us in getting a more realistic performance estimate
compared to a virtual platform. And it also helps us in performing full-system functional validation, especially validating the implementations of our accelerators and IPs. We use Xilinx QEMU as our virtual platform. This is basically provided by Xilinx, and what it does is that it basically emulates the Armv8 ISA on top of the x86 ISA.
It runs as an independent process in the host operating system. And we implement our accelerators and IPs in SystemC, which also run as an independent process in the host operating system. These two processes communicate with each other using inter-process communication.
The benefit of using Xilinx QEMU over real hardware is that you don't really need the hardware: it can be used by multiple application developers in parallel. So basically it removes the need for real hardware. So next I will talk about how we use our framework to validate the toolchain that
Josh introduced before me. So we take the sample radar correlator code and this is the traditional flow of compilation and execution of a standard C code. We compile it and we generate its output, which is the lag value.
And then we send this particular C code through the toolchain: TraceAtlas, which is used for kernel detection, and then the fat-binary and code-refactoring stage, which basically takes the kernels identified by TraceAtlas and creates the shared object for the binary and the DAG representation of the application.
And then we send this particular file through our user space framework and then we compare the output of monolithic C code with the output generated by our user space framework and if they are equal, we assume that the application integration has been successful with our user space framework.
So once we complete the integration of the application with the framework, we want to see how we can use this particular framework and the applications to do the design space exploration of the target DSoC. So in order to do that, what we do is that we use the real hardware platform and
we create a real workload using more realistic benchmarks from the Wi-Fi and pulse-Doppler (radar) domains. Our emulation framework basically supports two operation modes: one is validation and the other one is performance. In validation mode, basically we can inject multiple instances of applications simultaneously,
whereas in performance mode, the applications are injected in a time-separated manner. The time separation can be periodic or random. We also have a feature where a user can provide an input trace file to create the workload. For our target DSoC, we assume our resource pool is composed of 3 CPU cores and 2 accelerators.
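As a small, purely illustrative sketch of those two injection modes (this is not the framework's actual interface): validation mode drops all application instances in at once, while performance mode spaces arrivals out periodically or at random intervals.

```cpp
// Simplified sketch of the two workload-injection modes (illustrative API).
#include <chrono>
#include <functional>
#include <random>
#include <thread>
#include <vector>

using App = std::function<void()>;   // stand-in for "inject this application"

// Validation mode: every application instance arrives simultaneously.
void inject_validation(const std::vector<App> &apps) {
  for (const auto &a : apps) a();
}

// Performance mode: time-separated arrivals, either periodic or random.
void inject_performance(const std::vector<App> &apps,
                        std::chrono::milliseconds period, bool randomize) {
  std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<long long> jitter(0, period.count());
  for (const auto &a : apps) {
    a();
    auto gap = randomize ? std::chrono::milliseconds(jitter(rng)) : period;
    std::this_thread::sleep_for(gap);
  }
}
```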
So in this slide, we use our platform and our toolchain to do design space exploration for different configurations and heuristics. So on the left-hand side, what we have is we run our user-space emulation framework in
validation mode and on the x-axis, we iterate over different configurations by changing the core count and accelerator count. The workload is created by injecting one instance of each application that I showed earlier and on the y-axis, we have the execution time of the given workload.
So basically, we do the design space exploration of which configuration will suit best for our target applications, and then, depending on the performance requirement and energy requirement, we can select any of these configurations
for doing further analysis. So for this particular plot, we have selected a configuration which is composed of 2 CPU cores and 2 FFT accelerators. So this particular analysis has been done in the performance mode where applications are injected in periodic manner for 100 milliseconds and on the y-axis, we have the execution
time of the generated workload trace. In this particular plot, we are basically evaluating different scheduling heuristics. So at the end of this slide, what I want to say is that we have a complete software stack and scheduler framework which can be used for performing design space exploration
during the initial phase of DSoC development. Other than that, I would also like to say that in this particular project, we have been trying to develop an ecosystem of tools for the early development of DSoCs, and one of the tools that we have developed is DS3, a domain-specific SoC simulation tool.
It's a discrete event simulation tool and it can be used for evaluating different scheduling algorithms, power management policies and design space exploration for energy performance and area trade-off. So benefit of this particular tool is that during the early phase when you don't have
applications ready for the target DSoC or even the accelerators are not being implemented for the target DSoC, this tool can be used for doing all those system level design decisions as soon as possible and depending on the outcome of this, we can target the required accelerators for creating emulation platform.
Okay, then I will hand it back to Josh to go over the demo. So I'm not going to trust the inline video to even play there.
So what we're going to see in this demo is stepping through the full compilation flow with the radar correlator example that we had mentioned earlier. So what we're going to do first is just go through the application and just verify that it is a very basic standard C application. So there's some file IO up at the top, there's some DFT calculations, there's some pairwise
multiplication, there's an IDFT, there's a maximum for loop, and then you print out the value at the end. No pragmas, no anything. And then what we do is we pass it through the compilation flow.
What it's doing here is it's collecting the execution trace and what this also happens to give us is some reference points for the DFT1 and the DFT2 execution as well as the output value of 0.2516. And then what it's doing now is it's going through the kernel detection phase
and analyzing the trace as well as now extracting the producer-consumer relationships. There's a little bit of time dilation here to save on time in the presentation. But yeah, now what it's done after that is it went and it refactored the application
into the blocks of non-kernel and kernel code. And as we can see, it just alternates kernel, non-kernel. And then what it did was it analyzed each node in the graph to see if it can swap in an optimized implementation. So in this case, nodes 7 and 9 were detected as DFT kernels.
And so we were able to swap in an FFT accelerator invocation that we can use instead. And so what we're doing here is we're copying the output shared object and JSON to our hardware platform. And then we run it and show that the modified application is now able to dispatch onto
our hardware accelerator without the user intervening. And we can see that those two DFT kernels are much faster. And just to show that this kind of scales, we do 10 back-to-back. And similarly, they are able to dispatch successfully.
They don't step on each other's toes with multithreading. And all of the outputs individually still remain correct. And so just to illustrate this with a diagram, we generate a Gantt chart that shows the activity on each core as well as accelerators.
And we can see that the FFT accelerator sees some activity as well as the existing standard CPU code. And so, yeah, that's the end of the demo here.
And so I think just the takeaway here is that this was no human-in-the-loop standalone integration from C code to running on an accelerator. And while it's definitely in an early phase, we're excited to see where it can go. So in conclusion, we're pretty happy with what we've accomplished so far.
Having any kind of vertically integrated software and hardware stack is a bit of a challenging task. But for the upcoming releases, we do hope to have more mature SystemC and/or RTL accelerators available in the GitHub repository for everyone to mess with.
The current version now is only CPU support. And then also improve our integration of our compiler toolchain with richer and richer applications. And with that, we'll take any questions.
The kernels, is that to remove cycles from the graph? So the question is: for determining kernels, is that to remove cycles from the graph? So are you asking whether determining the boundaries of the
start and end of some recursive process is so that the graph doesn't cycle back around? So the goal is actually to have kind of a parallelization of kernels.
Because with, I guess to the second part there, with this idea of the producer-consumer relationships, the DAG might be a lot more complicated than just a simple linear chain. And the hope is that by having this knowledge of producer-consumer, you can say, oh, this
kernel never consumes or produces anything that this kernel ever needs. They both consume from some common ancestor, and so we can schedule them simultaneously. But to your first question, I guess the clustering of the kernels is essentially more so that we can identify what areas of the code are important.
We're not necessarily trying to eliminate cycles entirely, but yeah, we just know that this area of the code is important and warrants further analysis. Yeah, I'm sure there's someone working on that.
Would I do what? Oh, how do we calculate the affinity values? Okay, I'd actually defer you to the Trace Atlas paper here up on arXiv.
It goes into excruciating detail about all of the calculations behind the process.
So the question is, how do you actually identify the kernel? And at this stage, the answer is, yeah, we're manually identifying them. The hope is that there are some other people working on better, more generalizable ways to do kernel recognition, and part of the hypothesis is kind of maybe by
restricting it to a domain of applications rather than all applications. You know, there are recurring ways that people code things within a particular area. You're using the dynamic traces, but then doing pattern matching on the traces to figure out what kind of kernel it is.
And that's a probabilistic match, so then you can figure out if you can get a high-probability match with some existing accelerator, then you know what kind of accelerator you use. Okay, and the comment here is that David Brooks' group at Harvard is taking a similar approach
to kernel detection and probabilistic pattern matching of kernels. So assuming you don't have a library of hand-tuned optimized kernels, I'm assuming there also exist some solutions where you start from almost some code,
and it lowers to a programmable accelerator. Maybe you can look at an FPGA as a programmable accelerator. Would that still be a useful option to go to, or would you just lose too much efficiency because the tools aren't as good as the hand-tuned kernels?
So the question is, assuming you don't have a library of existing optimized kernel implementations, and you still wanted to target an accelerator automatically, is there any kind of process where you could just gradually essentially generate an accelerator that's applicable for a given kernel?
I don't know that any work has been done in that, but I would assume that some of the HLS tool chains would have a role to fill in there, and I think it would probably be pretty interesting to see what could come out of that.