
OpenCSD, simple and intuitive computational storage emulation with QEMU and eBPF


Formal Metadata

Title
OpenCSD, simple and intuitive computational storage emulation with QEMU and eBPF
Subtitle
After all, why not turn your computer into a distributed system?
Title of Series
Number of Parts
542
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Computational storage devices (CSDs) are an emerging technology that allows offloading computation to storage devices. Computation is pushed to the storage device, close to the data, and only the final result is returned to main system memory. The efficiency and performance gains come from the reduction in data movement over the I/O interconnects, relieving pressure on memory bandwidth in the traditional Von Neumann architecture, where all data is first moved to main memory before processing. Despite lots of enthusiasm, proposals, and research publications, there are no readily available, open-source, ready-to-use CSDs. Due to the lack of such prototypes, it is very challenging to explore hardware, physical interfaces, and application APIs (block-level, file system, key-value stores) for CSD devices. In this talk, I will present OpenCSD, a completely open-source CSD exploration platform designed with QEMU. OpenCSD uses eBPF as the means to offload computation to the CSD and includes an accompanying file system, FluffleFS, which uses POSIX extended attributes to interact with the CSD device. The full open-source implementation, including a block device, programming toolchain, and file system interface, allows anyone to explore the paradigm of computational storage.
Transcript: English (auto-generated)
Okay. So, hello everyone. This presentation is about OpenCSD, which is a computational storage emulation platform. Why we are emulating it, I'll get into shortly.
But first, I think I owe you an explanation of computational storage and what it actually is, because I don't think many people are familiar with that, even in this dev room, though I'm pretty sure most people are familiar with QEMU and eBPF. You can email me, there's a link to the repo, and this has been a long-running collaboration as part of my master's thesis at the VU.
So, let's get started. I'm going to briefly explain who I am. I'm Corne Lukken and my handle online is mostly Dantalion. I'm also a licensed ham radio operator, Papa Delta 3 Sierra Uniform that is. My expertise is in parallel and distributed systems. I've been in academia for some while, associate degree, bachelor's degree,
master's degree, and I've had some work experience throughout that time. I've worked on health technology for visually impaired people, worked on cloud optimizations for OpenStack, did computational storage for my master's thesis, which is what this talk is about, and I currently work on SCADA systems for the LOFAR 2.0 radio telescope at ASTRON.
So, why do we actually need computational storage? And that's because we live in a data-driven society nowadays, so the world is practically exploding with data, so much so that we're expected to store 200 zettabytes of data by 2050, and these high data and throughput requirements pose significant challenges on storage interfaces and technologies that we are using today.
So, if you look at your traditional computer architecture, the one being used with x86, it's based on the Von Neumann architecture, and here we basically need to move all data into main system memory before we can begin processing. This creates memory bottlenecks
and interconnect bottlenecks, on networks or PCI Express, and it also drastically hinders energy efficiency, to an extent. So how much of a bandwidth gap are we talking about here? Well, if you look at a server from 2021, say one using EPYC Milan with 64 SSDs, the SSDs in tandem can offer about four and a half times
more bandwidth than we can actually utilize, simply because we can't move the data into memory that fast. So that's quite significant. Now, what is this computational storage, and how does it actually solve this? Well, we fit a computational storage device, so a flash storage device, with its own CPU and memory,
and now the user, via the host processor, can submit small programs to this device, let them execute there, and only the result data from that computation is returned over the interconnect into system memory, thereby reducing data movement and potentially improving energy efficiency, because these lower-power cores,
using more specialized hardware, are typically more energy efficient than your general-purpose x86 processor. If you then look at the state of current prototypes as of September 2022, we see three main impediments. Firstly, the API between the host and the device: there's no standardization here. People are building hardware prototypes,
but not so much looking at these software interfaces. Secondly, we have the problem of the file system: these flash devices have a file system on them, and we want to keep that file system synchronized between the host and the device. So how do we achieve that? We can't use cache-coherent interconnects or shared virtual memory, because by the time we round trip across the PCI Express interface,
we'll have lost all the performance that we set out to gain. And thirdly, how do we stick to existing interfaces? People that access file systems, they read, they write, they use system calls. They are very used to this. If you would suddenly need to link a shared library just to access your file system, people wouldn't be up for that. So we need some solutions here.
That's what OpenCSD and FluffleFS introduce. We have a simple and intuitive system. All the dependencies and the software itself run in user space; you don't need any kernel modules or anything like that. We managed to entirely reuse existing system calls that are available in, well, not all, but most typical operating systems:
FreeBSD, Windows, macOS, and Linux. So I'd say that's pretty good. And we do something that has never been done before in computational storage: we allow a regular user on the host to access a file concurrently while a kernel executing on the computational storage device is also accessing that file.
And we managed to do this using existing open-source libraries, namely Boost, Xenium, FUSE, uBPF, and SPDK. Some of you will be familiar with some of these. And this allows any user, like you, to try and experience this yourself in QEMU after this talk, without buying any additional hardware.
And I'll get into that hardware in a second, because if we want to have this physically in our hands, there is some specialized hardware involved. If you look at the design, then we see four key components, and a fifth one that I'll explain on the next slide. We're using a log-structured file system, which does no in-place updates;
everything is appended. And we have a modular interface with backends and frontends, which allows us to experiment and try out new things: we can basically swap the backend and keep the frontend the same. And we're using a new technology in flash SSDs called zoned namespaces.
They are commercially available now, but still pretty hard to get, although that's going to improve in the future. And the system calls that we managed to reuse are extended attributes. On most file systems, on the file system you are likely using right now,
you can set arbitrary key-value pairs on any file or directory. We can use this as a hint from the user to the file system to indicate that something special needs to happen; basically, we just reserve some keys and assign special behavior to them. Now let's get back to the topic of zoned namespaces,
because it deserves some explanation. Back when we had hard drives, we could perform arbitrary reads and writes to arbitrary sectors. Sectors could be rewritten all the time without requiring any erasure beforehand. This is what is known as the traditional block interface.
But there's a problem, and that is that NAND flash doesn't actually support this behavior. With NAND flash, your sectors are grouped into blocks, and a block needs to be written linearly. And before you can rewrite the information in a block,
the block needs to be erased as a whole. So in order to accommodate this, flash SSDs have to incorporate what is known as a flash translation layer, where all these requests that go to the same sectors are translated and put somewhere else physically, just so that the user can keep using the same block interface they have been used to since the time of hard drives.
So there's this translation between logical and physical blocks, and when we try to synchronize the file system on the host with the device while a kernel is running, it introduces a whole lot of problems. So how do we solve this? By now you know the answer: zoned namespaces. We basically present an interface
that's not the block interface, but one that fits NAND flash behavior. When you use a zoned namespaces SSD, you, as the developer of a file system or the kernel, need to linearly write each sector in a block, and you need to erase the block as a whole. So effectively, you become the manager of this SSD.
The flash translation layer and the garbage collection live on the host, and we call this whole system host managed. If you now combine this with a log-structured file system, which also doesn't do any in-place updates, then you naturally see that this becomes a very good fit.
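To make the host-managed model a bit more concrete, here is a small conceptual sketch in C of what zoned semantics look like from the file system's point of view. This is purely illustrative and not a real NVMe ZNS driver interface: writes only land at the zone's write pointer, and space is only reclaimed by resetting a whole zone.

```c
/* Conceptual model of host-managed zone semantics; illustrative only,
 * not a real NVMe ZNS driver API. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE  4096
#define ZONE_SECTORS 1024

struct zone {
    uint32_t write_pointer;                   /* next sector to be written */
    uint8_t  data[ZONE_SECTORS][SECTOR_SIZE];
};

/* Sectors can only be appended linearly at the write pointer;
 * rewriting an already written sector in place is not allowed. */
bool zone_append(struct zone *z, const uint8_t *sector)
{
    if (z->write_pointer >= ZONE_SECTORS)
        return false;                         /* zone full: pick another zone */
    memcpy(z->data[z->write_pointer++], sector, SECTOR_SIZE);
    return true;
}

/* Reclaiming space means erasing (resetting) the zone as a whole,
 * which is exactly what the host-side garbage collection has to manage. */
void zone_reset(struct zone *z)
{
    z->write_pointer = 0;
}
```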
And now, together with these two technologies, we can finally synchronize the host and device file systems, and we do that by making the file temporarily immutable while the kernel is running. We use a snapshot consistency model by creating in-memory snapshots. We create a representation of the file,
with its metadata, as it was on the host, put that into the computational storage device's memory, and we can assure that all the data that is there will remain immutable during the execution of the kernel. Meanwhile, the user can actually still write to the file, and the metadata of the file on the host will start to differ,
but that's not a problem. This is very powerful, and it also allows us to control kernel behavior in a way, because we can now send metadata to the computational storage device that says: well, if the kernel tries to do this,
and remember, it's a user-submitted program, it might be malicious, then we want to block those actions. So we have a security interface as well.
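As a rough illustration of the idea (these are not the actual FluffleFS data structures, just an assumed sketch), the snapshot handed to the device can be thought of as a frozen copy of the file's metadata plus a description of what the kernel is allowed to do:

```c
/* Conceptual sketch only; field names and layout are assumptions,
 * not the real FluffleFS snapshot structures. */
#include <stdint.h>

struct csd_snapshot {
    uint64_t  inode;          /* file the kernel operates on                 */
    uint64_t  size;           /* file size at the time the snapshot is taken */
    uint64_t *block_map;      /* logical-to-physical mapping, frozen         */
    uint64_t  num_blocks;
    uint32_t  allowed_ops;    /* bitmask of operations the user-submitted,
                                 possibly malicious, kernel may perform      */
};
```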
The final piece of this design is that we want to be architecture independent, and we achieve that through eBPF, the same technology you may know from network hooks and event hooks in the Linux kernel nowadays. With eBPF, you can define system calls and expose those in a header; the slide shows the format of how you would do that, and that's a real example. The vendor would then implement that code, and the behavior would be defined in a specification, but the vendor doesn't have to open source their implementation, which in the case of flash SSD vendors
is pretty important, because they don't seem to be that keen on that. This way, we can still have an interface where users write programs once and reuse them across all vendors without any problem. And the nice thing about eBPF is that this instruction set architecture, which is what eBPF essentially is,
is easily implementable in a VM. There are even pre-existing open-source implementations of it, and that's what we're using: uBPF.
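The shape of such a vendor ABI header might look roughly like the sketch below. The names and signatures are illustrative assumptions, not the actual OpenCSD header; the point is that the specification pins down the declared behavior while each vendor ships its own implementation behind it.

```c
/* csd_abi.h: hypothetical vendor ABI header, illustrative only.
 * The specification defines names, signatures and behavior; the
 * implementation behind them can stay vendor proprietary. */
#ifndef CSD_ABI_H
#define CSD_ABI_H

#include <stdint.h>

/* Read `size` bytes starting at flash sector `sector` into `dest`,
 * which must point into the kernel's stack or heap. Returns 0 on success. */
int bpf_read(uint64_t sector, void *dest, uint64_t size);

/* Return a pointer to, and the size of, the scratch heap available to
 * the kernel beyond its small eBPF stack. */
void bpf_get_mem_info(void **heap, uint64_t *heap_size);

#endif /* CSD_ABI_H */
```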
Now that I've explained all the key components of OpenCSD and FluffleFS, I want to start with a little demo and show you some of the actual practical use cases for this. So how can we use such a computational storage system in a way that makes sense in terms of data reduction and energy efficiency? For that, we're going to the example of Shannon entropy. This is heavily used by file systems
that can perform background compression, or by compression programs that compress in the background. What you basically do is quantify the randomness you have in a file. Typically it's expressed between zero and one, but for computers that doesn't directly make sense, so we use the log base b shown here
to normalize it for bytes. Then we can look at the distribution of bytes: because a byte has 256 different possible values, we create 256 bins, and we submit a program to calculate this. It runs in the background, and only the resulting bin counts are returned to the host operating system.
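Written out, with $c_i$ the count of byte value $i$ and $N$ the total number of bytes counted, the normalized per-byte Shannon entropy being computed is

$$ H_{256} = -\sum_{i=0}^{255} p_i \log_{256} p_i, \qquad p_i = \frac{c_i}{N}, $$

which is 0 for a file consisting of a single repeated byte value and 1 when all 256 byte values are equally likely.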
The host operating system is then free to decide whether or not this file should be compressed. So how does such a kernel look, the kernel that you actually submit to the computational storage device? You can just write them in C and compile them with Clang.
So you write them in C, and there are two individual interfaces that we are exposing here. The yellow calls are introduced by the system calls, the eBPF ABI that we are defining, and the purple ones are introduced by the file system.
What that means is that, using this system as it is now, it's not agnostic to the file system. It is agnostic to the vendor and to the vendor's architecture, whether that's ARM or x86 doesn't matter, but right now it is specific to the FluffleFS file system that we have written.
I will address some possible solutions for this at the end. Another thing we need to realize is that the eBPF stack size is typically very small, we're talking bytes here instead of kilobytes, so we need a way to address this. What you can do in uBPF is allocate a heap in addition to your stack.
And then we have this bpf_get_mem_info call that we have defined as part of the ABI, which allows you to get your heap pointer. Currently you have to offset into this heap manually, which is a bit tedious, if you will. You can see that actually being done here: to store the bins, we offset the buffer
by the sector size, so the data from the sector reads is stored at the top of the buffer and the bins are stored at an offset of precisely one sector size. When we later look at the file system interface, with all the helpers, data structures and additional function calls that we introduce,
we could also provide a basic implementation of malloc and free there and resolve this. But for now, for this example, it is a bit tedious.
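As a minimal sketch, a read-stream kernel along these lines could look as follows in C. The helper signatures (bpf_get_mem_info, bpf_read, bpf_get_call_info) and the sector size are illustrative assumptions rather than the exact FluffleFS/OpenCSD ABI; only the overall shape, with the manual heap layout and the bins offset by one sector size, follows the description above.

```c
/* Illustrative sketch of a byte-frequency (entropy) stream kernel.
 * Helper names, signatures and SECTOR_SIZE are assumptions, not the
 * exact OpenCSD/FluffleFS ABI. */
#include <stdint.h>

#define SECTOR_SIZE 4096u
#define BINS        256u            /* one bin per possible byte value */

/* Assumed ABI helpers, normally declared in a shared header. */
extern void bpf_get_mem_info(void **heap, uint64_t *heap_size);
extern void bpf_get_call_info(uint64_t *first_sector, uint64_t *num_sectors);
extern int  bpf_read(uint64_t sector, void *dest, uint64_t size);

int kernel_main(void)
{
    void *heap; uint64_t heap_size;
    bpf_get_mem_info(&heap, &heap_size);
    if (heap_size < 2 * SECTOR_SIZE)
        return -1;

    /* Manual heap layout: sector data at the top of the buffer,
     * the 256 bins at an offset of exactly one sector size. */
    uint8_t  *data = (uint8_t *)heap;
    uint64_t *bins = (uint64_t *)((uint8_t *)heap + SECTOR_SIZE);
    for (uint32_t i = 0; i < BINS; i++)
        bins[i] = 0;

    uint64_t first, count;
    bpf_get_call_info(&first, &count);

    for (uint64_t s = 0; s < count; s++) {
        if (bpf_read(first + s, data, SECTOR_SIZE) != 0)
            return -1;
        for (uint32_t i = 0; i < SECTOR_SIZE; i++)
            bins[data[i]]++;
    }

    /* The bins are what the host receives as the result of its read;
     * the entropy itself is computed host-side, since eBPF has no
     * floating point support. */
    return 0;
}
```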
Now, how do you actually trigger this? We had the extended attributes, we had all these systems in place, and now you just have this kernel: you have compiled it and stored it in a file, and you want to actually offload your computation, well, in an emulated fashion, but you want to see how to do that. The first thing you do is call stat on the kernel object, that is, your compiled bytecode, and from that you get its inode number.
You have to remember this inode number. You then open the file that you want to read from or write to; for these examples we're mostly using reads. Then you use set extended attributes with our reserved key and set it to the inode number of the kernel file.
And when you then actually issue read commands, those reads will go to the computational storage device and the kernel will run there. But when do you actually take these snapshots? The trick is: as soon as you set the extended attribute. That is just by design; it could also have been once you call the first read or once you execute the first write.
But we have decided to do it at the moment you set the extended attribute. That means that once you've set the extended attribute, any changes you make to your kernel no longer have an effect, and the same goes for the file.
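Put into code, the trigger sequence uses nothing but ordinary system calls. This is a minimal host-side sketch; the reserved extended-attribute key and the way the inode number is encoded are placeholders, not the actual names FluffleFS reserves.

```c
/* Minimal host-side trigger sketch; the xattr key and value encoding
 * are placeholders, not the actual reserved FluffleFS key. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

#define CSD_XATTR_KEY "user.csd.read_stream_kernel"   /* placeholder key */

int main(void)
{
    /* 1. stat the compiled kernel object to obtain its inode number */
    struct stat st;
    if (stat("entropy_kernel.o", &st) != 0) return 1;

    /* 2. open the file the kernel should operate on */
    int fd = open("data/testfile", O_RDONLY);
    if (fd < 0) return 1;

    /* 3. set the reserved xattr to the kernel's inode number;
     *    this is also the moment the in-memory snapshot is taken */
    char inode_str[32];
    snprintf(inode_str, sizeof(inode_str), "%lu", (unsigned long)st.st_ino);
    if (setxattr("data/testfile", CSD_XATTR_KEY, inode_str,
                 strlen(inode_str), 0) != 0)
        return 1;

    /* 4. reads now go to the computational storage device: the buffer
     *    comes back filled with the kernel's result instead of file data */
    static uint8_t buf[512 * 1024];
    ssize_t n = pread(fd, buf, sizeof(buf), 0);
    printf("offloaded read returned %zd bytes\n", n);

    close(fd);
    return 0;
}
```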
Now I want to briefly explain the different types of kernels that you can have. What the example here mainly shows is what we call a stream kernel. A stream kernel happens in place of the regular read or write request: the regular read or write doesn't happen, only the computational storage request happens, on the computational storage device.
With an event kernel, it's the other way around. First, the regular event happens normally, and then the kernel is presented with the metadata from that request and can do additional things. This is interesting for databases, for example. Say you're writing a big table and you want to know the average,
the minimum or the maximum, and you want to emit that as metadata at the end of your table write. You could use an event kernel to let the write happen as is; then the kernel is presented with the data, runs on the computational storage device, and emits the metadata afterwards, and you can store that like an index.
We have also decided to isolate the context of this computational storage offloading, so what is affected once you set the extended attribute, by PID, but we could also make this per file handle, or you could even set it for the whole inode. Moreover, we could use specific keys
for file-handle, PID or inode offloading. So it's just a matter of semantics here. Now, I have some source code in Python of the execution steps I've just shown, because there are a few details that I left out in the brief overview. The first is that you have to stride your requests,
and those have to be strided by 512K. Why is this so? Well, in FUSE, the number of kernel pages that are allocated to move data between the kernel and user space is statically fixed. If you go over this, your request will seem fine from the user's perspective,
but what the kernel will do is chop up your request. Why is that problematic? Well, then multiple kernels spawn, because from the context of the file system, every time it sees a read or write request, it's going to spawn the kernel and move it to the computational storage device.
Then here you can see how I set the extended attribute and get the kernel's inode number. And what I want to show here at the bottom is that I'm getting 256 integers, one for each of the buckets of the entropy read, while I'm issuing a request of 512K. That shows you the amount of data reduction
you can achieve using systems like this: 256 integers for 512K, pretty good. It could be better, though. The reason it's not better is that floating point support in eBPF is limited, to the point where you need to implement fixed-point math yourself. We could do this as part of the file system helpers, but that's not done for this prototype at the moment.
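For completeness, here is a hedged sketch of the host-side aggregation: stride the reads at 512 KiB, merge the 256 counters each request returns, and only then do the floating point entropy math on the host. The layout of the returned buffer (256 64-bit counters at its start) is an assumption for illustration.

```c
/* Host-side aggregation sketch; the result layout (256 uint64 counters
 * at the start of the returned buffer) is an illustrative assumption. */
#include <math.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIDE (512 * 1024)   /* stay within the FUSE transfer limit */
#define BINS   256

double file_entropy(int fd, off_t file_size)
{
    static uint64_t buf[STRIDE / sizeof(uint64_t)];
    uint64_t bins[BINS] = {0};

    /* one offloaded kernel is spawned per strided read request */
    for (off_t off = 0; off < file_size; off += STRIDE) {
        if (pread(fd, buf, STRIDE, off) <= 0)
            return -1.0;
        for (int i = 0; i < BINS; i++)
            bins[i] += buf[i];
    }

    /* floating point only on the host: eBPF lacks it */
    uint64_t total = 0;
    for (int i = 0; i < BINS; i++)
        total += bins[i];

    double h = 0.0;
    for (int i = 0; i < BINS; i++) {
        if (bins[i] == 0) continue;
        double p = (double)bins[i] / (double)total;
        h -= p * (log(p) / log(256.0));   /* log base 256 normalizes to [0,1] */
    }
    return h;
}
```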
Now, some limitations. This was master's thesis work, and it was my first time writing a file system ever. It is solely a proof of concept: there's no garbage collection, no deletion, no space reclaiming.
Please don't use it to store your files. eBPF has an endianness, just like any ISA would have, and there are currently no conversions. So if you happen to use something with a different endianness, all your data will come out the wrong way around, and you have to deal with that yourself for now.
But once again, we could make this part of the file system helpers, to help with data structure layout conversions and endianness conversions. As I mentioned briefly earlier, floating point support in eBPF is effectively non-existent, but we can implement fixed-point math.
And currently I haven't shown any performance examples, because I don't think they are that interesting: what currently happens when you emulate offloading is that the kernel just runs on the host processor, as is, in eBPF. So it isn't representative of the microcontrollers that you would find on SSDs,
and the runtime, the time it would take to execute these kernels, would be much too fast. That's something we still need to deal with, I think, because then we can more easily reason about what the actual performance would be if we offloaded these applications to SSDs. Frankly, these SSDs do have very capable microcontrollers,
typically even multi-core processors, because they need to manage the flash translation layer, so they are already fairly capable devices. Also, only read stream kernels have been fully implemented for this prototype, and that's mainly because event kernel performance is problematic:
remember, for an event kernel the I/O request happens regularly, so all the data is moved back to the host processor and only then is the event kernel started. What you really need is a two-stage system where you prevent the data from being moved back to the host. That requires some more tinkering.
And the final thing: we need to make this agnostic to the file system, and we can achieve this quite easily using a file system runtime, where, through an ICD, an installable client driver, much the same way that Vulkan, OpenCL and OpenGL work,
you can dynamically load a shared library that implements all the functions you have defined in a header. This runtime can also dynamically compile your programs and store cached versions of them. And using statfs, we can easily identify what file system it is running on. That allows users to write their programs once
and run them on any architecture and on any computational file system, which I think is pretty powerful and flexible.
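A sketch of how such an ICD-style runtime could pick its driver is shown below: identify the mounted file system with statfs() and dlopen() the matching library that implements the shared ABI header. The magic number and driver names are illustrative placeholders, not real values.

```c
/* Sketch of the proposed ICD-style runtime; magic number and driver
 * names are placeholders, not real values. */
#include <dlfcn.h>
#include <stddef.h>
#include <sys/vfs.h>

/* hypothetical magic a FluffleFS mount could report via f_type */
#define FLUFFLEFS_MAGIC 0x464c4653

void *load_csd_driver(const char *mountpoint)
{
    struct statfs fs;
    if (statfs(mountpoint, &fs) != 0)
        return NULL;

    /* pick the installable client driver for this file system */
    const char *driver = (fs.f_type == FLUFFLEFS_MAGIC)
        ? "libcsd_flufflefs.so"       /* placeholder driver names */
        : "libcsd_generic.so";

    /* the loaded library implements the functions declared in the
       shared ABI header, and may compile and cache kernels on demand */
    return dlopen(driver, RTLD_NOW);
}
```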
So that's it. I encourage you to try this. I've also written a thesis on this that does have some performance metrics, and it shows some interesting data structures we had to design for the file system to be able to support these in-memory snapshots. There's a previous work called ZCSD that also has some early performance information. And I've written quite an extensive survey on the last decade or so of computational flash storage devices, which is also quite interesting.
So thank you. Oh, that's quite good. I imagine this is quite difficult, right? Computational storage, what the fuck is that? So please don't hesitate to ask questions if anything is unclear.
The computational storage that's available today? There's one vendor that is selling a computational storage device that's not based on zoned namespace storage. It's using conventional SSDs, and it exposes computational storage through a network interface. So you have the normal PCIe interface, and then on top there's a transport that does TCP/IP.
And then you basically just connect to it over SSH, and then you can do things on the SSD. That one is commercially available; I don't know what they would ask for that product. Do you have to use zoned namespaces? Nothing in principle, but you need a way to synchronize the file system
between the host and the device, and zoned namespaces make that trivial, whereas with conventional SSDs the logical-to-physical block translation severely hinders this and makes it extremely difficult. Why didn't I include the performance metrics
from my thesis, or better ones? Oh, sorry, yes, I keep forgetting to repeat the question. So, why didn't I include any performance metrics if I have them? The answer is because I don't think I would have the time,
and I don't think they're interesting enough to include. This is a very complicated subject, and it's very new for most people; computational storage, most people have never heard of it. So I'd much rather spend the time explaining this properly and trying to show you that this is a very interesting concept to solve this bandwidth gap, rather than show you some metrics that are not representative anyway,
because the kernel is running on the host CPU, and you're not going to have an additional host CPU on the flash SSD. Can you talk about what kind of test setup you have for your metrics? Yeah, very good: what kind of test setup did I have to do all this analysis
and to try these things out. So I run QEMU on my own host machine, just a normal laptop, basically this one. QEMU then creates a virtual zoned namespaces device, which was actually introduced to QEMU quite recently. So you can now try zoned namespaces without owning a zoned namespaces SSD.
That's the whole reason QEMU comes into play, because otherwise people would need to buy a zoned namespaces SSD, which is still quite poorly available. And then you just run the prototype as is. So that's all you need, and you really don't need any special hardware. It could even be an ARM laptop, it doesn't matter.
Did you test that? Whether I tested if it works on ARM: the answer is no, I did not test it, but I'm pretty sure QEMU compiles on ARM, so I'm pretty sure we're good there. Because you have to remember, and maybe that's not intrinsically clear from this presentation,
that we didn't extend QEMU in any way. It's just a normal QEMU installation; you don't even need to custom install it, you can just get it from the package manager and use that. I have a lot of questions on that. The computational part, what are the limitations of what may run on these devices?
I think the main limitations... yes. What are the limitations of the kernels that you run on these devices? Well, first of all, you need to have data reduction. If you're going to read one gigabyte from the flash storage and you're going to return one gigabyte of data to the host, then there's no real point in offloading,
because the data is going to be moved anyway. So the first limitation is that you have to find an application that is reductive in nature: once you do the computation, you return less data. The nice thing is that that's 99% of all workloads, so that's pretty good.
And the second thing is that if it's timing critical and the computation takes a long time, then it's probably not that interesting, because the latency will be too bad; the performance of these cores is much lower than your host processor's. But you can implement specialized instructions that could be very efficient
in doing database filtering or things like this. And that is where the whole ASIC and FPGA part would come into play. But if it's not timing critical and it's in the background like the Shannon entropy compression, those are ideal cases. Reduction in data and not timing critical. To repeat the question whether or not it's just software
or whether we also program the hardware.
Of course, FPGAs can be reprogrammed on the fly, and we have seen prototypes in the past for computational storage devices that do just that: from the host, the user sends a bitstream that dynamically reprograms the FPGA, and then the kernel starts running. That's not what we're trying to achieve here. What I envision is that the FPGA
has specialized logic to do certain computations, and then, from the eBPF ABI, once the vendor code triggers those instructions, it utilizes the FPGA to do those computations. But it would be defined in the specification beforehand. Because typically reflashing an FPGA with a new bitstream
takes quite some time. So in the interest of performance it might not be that interesting. I'm gonna ask a question for the audience. So you might have mentioned it, but are there closed source competitors here?
If there are closed source competitors in the space of computational storage. Well, actually that's one of the things that's been growing really well in the scene. I'd say the vast majority of everything is open source. At least if you look at the recent things.
If you look at the past decade, then it's a bit worse, because there's a lot of research published that doesn't actually publish the source code. Or rather, the source code is published, but everything is a hardware prototype and they didn't publish the bitstreams, the VHDL or the Verilog, so you're stuck there as well. Or there were no PCB designs, so you can't reproduce the work, if you will.
I'd say this is a much bigger problem in academia than just computational storage, but it's also present here. Yes. Come back to the Python code, please. Which one?
The Python code. Complexity in terms of? The reason this is a nested loop.
In the face of performance, why do I have a nested loop here, and why Python? Yeah, the trick is, this is just for demonstration purposes. For one, you can easily make this example in C or C++, and you should if you care about performance. The thing is that this program is already spending 99%
of its time in I/O wait, because it's waiting for the kernel to complete, so in the face of that it's not that interesting. And the reason we have the nested loop is that floating point in eBPF is non-existent, or at least I didn't implement fixed-point math. So what I have to do afterwards,
at the bottom, which you don't see here, is compute the distribution from all these bins using floating point math in Python, which is why I don't get a single number from this kernel. If I had a floating point implementation in eBPF, I could already do that computation in eBPF
and only return a single 32-bit float as a result instead of these 256 integers. But the reason this is a loop at all is that I still have to stride the read requests, because I can't go above 512K even if my file is bigger than 512K.
What is the time in I/O wait? Well, the trick is... okay. Couldn't I implement multi-threading here?
Currently, the eBPF VM runs as a single process, so even if you submit multiple kernels, only one will execute at a time. Why? It's a thesis prototype, right? Time and things like this. Okay. Thank you very much. No worries.