
vhost-user-blk: a fast userspace block I/O interface


Formal Metadata

Title
vhost-user-blk: a fast userspace block I/O interface
Number of Parts
542
License
CC Attribution 2.0 Belgium:
You may use, change, and reproduce the work or its content, and distribute and make it publicly available in unchanged or changed form for any legal purpose, provided you credit the author/rights holder in the manner they specify.

Content Metadata

Abstract
vhost-user-blk is a userspace block I/O interface that has traditionally been used to connect software-defined storage to hypervisors. This talk covers how any application that needs fast userspace block I/O can use vhost-user-blk, and its advantages over network protocols. A client library called libblkio, available to C and Rust applications, will also be introduced. The protocol is summarized for those wishing to understand how it works or implement it from scratch. This talk is intended for developers interested in connecting applications to SPDK or qemu-storage-daemon and those who want to know more about software-defined storage interfaces.
- Local block storage interfaces
- Kernel vs userspace
- Notifications vs polling
- Message-passing vs zero-copy
- What is vhost-user-blk?
- Implemented by qemu-storage-daemon and SPDK
- virtio-blk and VIRTIO
- How to connect using libblkio (C/Rust)
- How to implement a server using libvhost-user (C) or vhost-user-backend (Rust)
- How to integrate with the Linux kernel block layer using VDUSE
Transcript: English (automatically generated)
Hi, my name is Stefan Hajnoczi, and I work on QEMU and Linux. And today I want to talk about vhost-user-blk, a fast user space block IO interface. So what is vhost-user-blk? vhost-user-blk allows an application to connect to a software-defined storage
system that is running on the same node. So in software-defined storage, or in storage in general, there are three popular storage models: block storage, file storage, and object storage. And vhost-user-blk is about block storage. So for the rest of this presentation, we're going to be talking about block storage.
And block storage interfaces, they have a common set of functionality. First of all, there's the core IO, reads, writes, and flushes. These are the common commands that are used in order to store and retrieve data from the block device. Then there's data management commands.
These are used for mapping and allocation of blocks. Discard and write zeroes are examples of these kinds of commands. There are also auxiliary commands, like getting the capacity of the device. And then finally, there can be extensions to the model, like zoned storage, that go beyond the traditional block device model.
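For concreteness, these command categories map onto a small set of request type constants in the virtio-blk device model that vhost-user-blk builds on (values as defined in the VIRTIO specification; the capacity lives in the device's config space rather than behind a command):

```c
/* virtio-blk request types, as defined in the VIRTIO specification
 * (see also linux/virtio_blk.h). */
#define VIRTIO_BLK_T_IN            0   /* core I/O: read */
#define VIRTIO_BLK_T_OUT           1   /* core I/O: write */
#define VIRTIO_BLK_T_FLUSH         4   /* core I/O: flush */
#define VIRTIO_BLK_T_GET_ID        8   /* auxiliary: device identifier */
#define VIRTIO_BLK_T_DISCARD       11  /* data management: discard */
#define VIRTIO_BLK_T_WRITE_ZEROES  13  /* data management: write zeroes */
```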
vhost-user-blk supports all of these things, and it's at a similar level of abstraction to NVMe or to SCSI. So let's start by looking at how vhost-user-blk is a little bit different from things like NVMe or SCSI,
things that are network protocols or hardware storage interfaces. vhost-user-blk is a software user space interface. So let's begin by imagining we have a software-defined storage system that is running in user space. And it wants to expose storage to applications. So if we're using the kernel storage stack, what will happen is we'll need
some way to connect our software defined storage to the kernel and present a block device. Ways of doing that might be NVMe over TCP, or as an iSCSI LUN, or maybe as an NBD server, and so on.
And so that's how a software defined storage system might expose its storage to the kernel. And when our application opens a block device, it gets a file descriptor and then it can read or write using system calls from that file descriptor. And what happens is execution goes into the kernel's file system and
block layers. And they will then talk to the software defined storage system. Now that can be somewhat convoluted because if we've attached, say, using NVMe over TCP, the network stack might be involved and so on. And at the end of the day, all we're trying to do is communicate between
our application and the software defined storage processes that are both on the same node. They're both running on the same operating system. User space storage interfaces, they leave out this kernel storage stack. And instead they allow the application to talk directly to the software defined storage process.
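To ground the comparison, this is what the kernel path looks like from the application's point of view: open a block device node and issue system calls against the file descriptor, with everything behind the descriptor being the kernel's business (a minimal sketch; the device path is just an example):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Example device node exposed by the kernel block layer. */
    int fd = open("/dev/nvme0n1", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    /* Each pread() enters the kernel: file system and block layers,
     * possibly the network stack (e.g. NVMe over TCP), before finally
     * reaching the software-defined storage process on the same node. */
    ssize_t n = pread(fd, buf, sizeof(buf), 0);
    if (n < 0) {
        perror("pread");
    }

    close(fd);
    return 0;
}
```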
Now there are a number of pros and cons to using a user space interface. And I'll go through them here. So I've already kind of alluded to the fact that if you have a user space interface and you don't go through the kernel storage stack, then you can bypass some of that long path that we discussed.
For example, going down into the kernel, coming back out using something like NBD or iSCSI in order to connect to another process on the same node. There must be a faster way of doing that, right? So with vhost-user-blk, it turns out we can actually get rid of system calls entirely from the data path.
So reads and writes and so on from the device don't require any system calls at all. And we'll have a look at how that's possible later on in this talk. But speed is one of the reasons why a pure user space interface for block IO is an interesting thing. Another reason is security.
Typically, in order to connect a block device to the kernel, you need to have privileges because it can be a security risk to connect untrusted storage to your kernel. And the reason for that is that there's a bunch of code in the storage stack that's going to run and it's going to process and be exposed to this
untrusted data. If you think about a file system and all its metadata, that can be complex. And so there's a security risk associated with that. And therefore, privileges are required to create block devices. An ordinary unprivileged process cannot attach and mount a block device. So in a scenario where you do have an untrusted block device and you
would like to remove the attack surface there, then using a user space interface allows you to avoid that. Also, if you don't have permissions, if you simply don't have permissions, then you won't be able to create a kernel block device. So then a user space interface is beneficial as well.
Now, those were the pros. Of course, there are drawbacks to having a user space interface. First of all, it's complex. Compared to simply opening a file and reading and writing from the file descriptor, you're going to have to do a lot more because all the logic for actually doing IO and
communicating is now the responsibility of the application and not the kernel. So there's that. In addition, if you think about existing programs that you might want to use to access your storage, they won't have support for any new interface that is user space only. They are probably using the POSIX system calls and read and
write and so on, and that's what they expect. So you'll have to port those applications in order to access your software-defined storage system if you rely on a user space interface. Another disadvantage is that if you have a user space interface, then the kernel storage stack isn't involved.
So if you decide you need a feature from the kernel storage stack, whatever that may be, or if you have a legacy application that you cannot port and that needs to talk to a kernel block device, then again, you have a problem because your software-defined storage system is isolated.
Its block devices aren't connected to the kernel. What we're going to do today is look at both these pros and cons, and also see how, with vhost-user-blk, we can actually overcome these cons. So let's start by looking a little at some of the performance aspects,
how this can be fast. I said no system calls are required, so how does that even work? If the software-defined storage system and the application need to communicate, how can they communicate without system calls? All right, so one of the important concepts in IO is how to wait for the completion of IO.
When you submit an IO request, maybe you have no more work for your process to do. Maybe the CPU is essentially idle until that IO request completes, and at that point, you'll be able to do more work. The normal thing to do in that case is to then de-schedule your application
and let other threads, other tasks on the system run. And maybe if there are no other tasks, then the kernel will just put the CPU into power-saving mode. It'll put it into some kind of low power state, and it will awake once the completion interrupt comes in. And you can see that at the top of this slide, at the top diagram,
you can see that there's the green part where we submit the IO, and at that point, we run out of things to do because we're going to wait for completion. So then there's this gray part where other tasks are running, power-saving is taking place, and during that time, the first portion is spent with the IO actually in flight. That's where we're legitimately waiting
for the IO request to complete so that we can proceed. But then what happens is that the IO request completes, and we need to somehow get back to our de-scheduled process. Now, depending on what other tasks are running, their priorities, the scheduler, and so on, our task might not get woken up immediately. Or maybe if the CPU is in a low power state,
it'll just take some time to wake up, handle that interrupt, restore the user space process, and resume execution. So this leads to a wake-up latency, an overhead that is added. And so this is why notifications, also sometimes called
interrupts, can be something that actually slows down your IO processing. An alternative is to use polling. So polling is an approach where, once you have no more work to do, instead of de-scheduling, you repeatedly check whether the IO is complete yet. And by doing that, you're not giving up the CPU.
So you keep running and you keep consuming CPU. The advantage is that you don't have this wake-up latency. Instead, your process will respond immediately once the IO is complete. The drawback, of course, is that you're hogging the CPU and you're wasting power while there's nothing to do.
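To make the contrast concrete, here is a minimal sketch of both strategies; the eventfd and the shared completion flag are illustrative stand-ins, not part of any particular API:

```c
#include <poll.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

/* Notification-based waiting: de-schedule until the kernel wakes us.
 * The CPU is free for other tasks, but wake-up latency is added. */
static void wait_notification(int efd)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    poll(&pfd, 1, -1);                 /* blocks; task is de-scheduled */

    uint64_t count;
    read(efd, &count, sizeof(count));  /* consume the event */
}

/* Polling: spin on a completion flag in shared memory.
 * Lowest latency, but the CPU is busy the whole time. */
static void wait_polling(_Atomic int *complete)
{
    while (!atomic_load_explicit(complete, memory_order_acquire)) {
        /* busy-wait: no system calls, no de-scheduling */
    }
}
```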
So these are two techniques, and I think we're going to keep them in mind because we'll see how they come into play later. The next performance aspect I want to mention, one that's important to understanding how vhost-user-blk is different from using a network protocol or an existing storage interface, is message passing versus zero copy.
As programmers, we learn that when we have a large object in our program, we shouldn't pass it around by value because it will be copied and that will be inefficient. And instead, what we do is we use references or we use pointers, allowing the function that receives the object to just go and access it in place
rather than taking copies. And in inter-process communication and in networking, there are similar concepts. By default, things are message passing. We build a message. It gets copied through various buffers along the network path. Eventually the receiver receives it into its buffer and then it parses it.
And so that model is the traditional networking model. It's also the IPC model. It has strong isolation. So for security, it's great because it means that the sender and the receiver don't have access to each other's memory. Therefore they cannot interfere or crash each other and do various things. But the downside is that we have these intermediate copies
and that consumes CPU cycles and it's inefficient. So the zero copy approach is an approach where the sender and receiver, they've somehow agreed on the memory buffer where the data to be transferred lives. And that way, the sender, for example, can simply place the data directly into the receiver's buffer
and all it then has to do is let the receiver know, hey, there's some data there for you. It doesn't actually have to copy the data. So this is another important concept that we're going to see with vhost-user-blk. So now that we've got those things out of the way, let's look at vhost-user-blk. What is it? It's a local block IO interface.
So it only works on a single node, on a single machine. It is not a network protocol. Two, it's a user space interface. It's not a kernel solution in itself. It's a pure user space solution. That means it's unprivileged. It doesn't require any privileges
for two processes to communicate in this way. It's also a zero copy solution, and the way it does that is it uses shared memory. And finally, vhost-user-blk supports both notifications and polling. So depending on your performance requirements, you can choose whether you want to de-schedule
your process and receive a wake up when it's time to process an IO completion, or you can just poll and consume CPU and have the lowest possible latency. And vhost-user-blk is available on Linux, BSD and macOS. And the implementations of this started around 2017.
It came from SPDK and QEMU working together. So those communities implemented vhost-user-blk, but there are also implementations in other hypervisors like crosvm and Cloud Hypervisor. So primarily this kind of came from virtualization,
from this problem of how do we do software-defined storage and let a virtual machine connect to it? But that's not all that vhost-user is good for. It's actually a general storage interface. It's generic, just like NVMe or SCSI is. So you could use vhost-user-blk if you had some kind of data-intensive application
that needs to do a lot of storage IO and needs high performance or needs to be unprivileged. And that's why I'm talking about vhost-user-blk today. So let's have a look at the protocol. The way that this is realized is that there's a Unix domain socket for our user space storage interface.
And we speak the vhost-user protocol over this socket. What the socket and the vhost-user protocol allow us to do is set up access to a virtio-blk device. So a block device that lives in the software-defined storage process. So when we have two processes running on a system,
a software-defined storage process and an application, the application is using vhost-user in order to communicate with the virtio-blk device. And that's how it does its IO. So what is virtio-blk? virtio-blk is a standard. You can check out the VIRTIO specification.
VIRTIO has a number of other devices, but it includes virtio-blk. Some of the other devices are virtio-net or virtio-scsi and so on. But virtio-blk is the one we'll focus on here. And it consists of one or more request queues where you can place IO requests. And each one of these has a little structure. You can do all the requests I mentioned
in the beginning of the talk: reads, writes, flushes, discard, write zeroes and so on. And you have multiple queues. So if you want to do multi-queue, say you're multi-threaded, you can do that as well. And it has a config space that describes the capabilities of the device: the disk size, the number of queues and so on.
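The request layout those queues carry is documented in the VIRTIO specification; paraphrased in C, it looks roughly like this (in the real layout the data buffers and status byte are separate queue descriptors rather than one flat struct):

```c
#include <stdint.h>

/* virtio-blk request, paraphrased from the VIRTIO specification.
 * The driver places these in a request queue; the device fills in
 * the data (for reads) and the status byte. */
struct virtio_blk_req {
    uint32_t type;      /* VIRTIO_BLK_T_IN, _OUT, _FLUSH, _DISCARD, ... */
    uint32_t reserved;
    uint64_t sector;    /* offset in 512-byte sectors */
    /* ... followed by the data buffer(s), and finally: */
    uint8_t  status;    /* VIRTIO_BLK_S_OK, _IOERR or _UNSUPP */
};
```

The config space is a separate structure; among other fields it holds the capacity (in 512-byte sectors) and the number of queues.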
So that's what you can think of virtio-blk as. That's the model we have here. And that's the block device that our application can interact with. If you think of any other storage interfaces or network protocols that you're familiar with, this should be more or less familiar. Most of the existing protocols also work in this way.
You can inquire about a device to find out its size and so on. And then you can set up queues and you can submit IO. So underneath virtio-blk, we have the vhost-user protocol. And the vhost-user protocol is this Unix domain socket protocol that allows the two processes to communicate.
But it's not the data path. So vhost-user is not how the application actually does IO. Instead, it's a control path that is used to set up access to these queues, these request queues that I've mentioned. And the IO buffer memory and the queue memory actually belongs to the application. And the application sends it over the Unix domain socket.
It sends that shared memory over so that the software-defined storage process has access to the IO buffer memory and the queue memory. The application and the software-defined storage process, they share access to that memory. That way we can do zero copy. So this is going back to the message passing versus zero copy thing.
We don't need to transfer entire IO buffers between the two processes. Instead, the software-defined storage process can just read the bytes out of the IO buffer that live in the application process. And it can write the result into the buffer as well.
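The mechanics behind this are ordinary Unix file-descriptor passing: the application backs its queue and I/O buffer memory with a memory fd and hands that fd to the server over the Unix domain socket, so the server can mmap the same pages. A simplified sketch of the application side (error handling elided; the real vhost-user messages carry more layout information than this):

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Create a shareable memory region for queues and I/O buffers. */
static void *create_shared_region(size_t len, int *fd_out)
{
    int fd = memfd_create("io-buffers", 0);
    ftruncate(fd, len);
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *fd_out = fd;
    return mem;
}

/* Pass the fd over the Unix domain socket with SCM_RIGHTS. The
 * receiving process mmaps the same fd, and both sides now see the
 * same pages -- no data copies are needed afterwards. */
static void send_fd(int sock, int fd)
{
    char data = 0;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    sendmsg(sock, &msg, 0);
}
```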
So if you want to look at the specification and the details of how vhost-user works, I've put a link on this slide. But really, if you're writing an application, I think the way to do it is to use libblkio. libblkio is a library that has both C and Rust APIs and that allows you to connect to vhost-user-blk
as well as other storage interfaces. So vhost-user-blk is not the only thing it supports, but for the purpose of this talk, we'll just focus on that. libblkio is not a framework. It's a library. It allows you to integrate it into your application regardless of what your architecture is. That means it supports blocking IO,
it supports event-driven IO, and it also supports polling. So no matter how you've decided to structure your application, you can use libblkio, and you won't have to change the architecture of your application just to integrate it. I have given a full talk about libblkio. So if you want to understand the details
and also some of the background and everything it can do, then please check out that talk. I put a YouTube link on this slide for you. I'll give you a short code example here. So this shows how to connect to a vhost-user-blk socket using libblkio.
And this is pretty straightforward. We essentially just need to give it the path of the Unix domain socket, and then we connect and start the blkio instance. And then in order to do IO, we can submit a read request. That's just a function call. That's straightforward as well. And notice here that we do get the queue.
We call the blkio_get_queue function in order to grab a queue. That's because libblkio is a multi-queue library. If you have a multi-threaded application, you could create one dedicated queue for each thread and then avoid any kind of locking and synchronization. All the threads can do IO at the same time. So for completion, what this example shows
is blocking completion. So here, the program is actually going to wait in the blkioq_do_io function until the IO is complete. But as I mentioned, the library also supports event-driven IO, and it also supports polling. So whatever you like, you'll be able to do that.
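Since the slide itself isn't reproduced in this transcript, here is a rough equivalent against the libblkio C API, based on my reading of the libblkio documentation (error handling elided; verify the driver and property names against your libblkio version):

```c
#include <blkio.h>
#include <stdio.h>

int main(void)
{
    struct blkio *b;
    struct blkio_completion completion;

    /* Connect to the vhost-user-blk Unix domain socket. */
    blkio_create("virtio-blk-vhost-user", &b);
    blkio_set_str(b, "path", "/tmp/vhost-user-blk.sock");
    blkio_connect(b);
    blkio_start(b);

    /* vhost-user-blk needs I/O buffers to live in memory regions
     * registered with (and shared with) the server. */
    struct blkio_mem_region region;
    blkio_alloc_mem_region(b, &region, 4096);
    blkio_map_mem_region(b, &region);

    /* Grab queue 0; a multi-threaded program could use one queue
     * per thread and avoid locking. */
    struct blkioq *q = blkio_get_queue(b, 0);

    /* Submit a read and block until it completes. */
    blkioq_read(q, 0, region.addr, 4096, NULL, 0);
    blkioq_do_io(q, &completion, 1, 1, NULL);
    printf("read completed: %d\n", completion.ret);

    blkio_destroy(&b);
    return 0;
}
```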
When you develop your application, you'll need something to test against. And I think the easiest way to test against a vhost-user-blk device is to use qemu-storage-daemon. It's packaged for all the main Linux distros as part of the QEMU packages. And you can just run the storage daemon. You can give it a raw image file
and tell it the name of a vhost-user-blk Unix domain socket that you want to have. And then you can connect your application to it.
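For example, something along these lines starts a vhost-user-blk export backed by a raw image (option syntax as in the qemu-storage-daemon documentation; the paths are placeholders):

```sh
qemu-storage-daemon \
    --blockdev driver=file,node-name=disk0,filename=disk.img \
    --export type=vhost-user-blk,id=export0,node-name=disk0,writable=on,addr.type=unix,addr.path=/tmp/vhost-user-blk.sock
```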
All right, so that's how you can do that. If you want to implement a server and you're already in the SPDK ecosystem, using Intel's Storage Performance Development Kit to write your software-defined storage system, then it's very easy because vhost-user-blk support is already built in. So I've put a link to the documentation. There are also RPCs if you want to invoke it from the command line.
And just for testing, you can create a vhost-user-blk server using this. Now, if you're not using SPDK, and instead you're writing your own C daemon, your own process, then one way of using vhost-user-blk is to use the libvhost-user library.
So this is a C library that implements the vhost-user protocol, the server side of it. So this will allow you to accept vhost-user connections. It doesn't actually implement virtio-blk. That's your job. That's the job of the software-defined storage system. But virtio-blk consists of basically just processing the IO requests
like reads and writes and so on, and also setting the configuration space so that the disk size is reported there. And you can find an example of a C program that implements vhost-user-blk using libvhost-user. I've put a link on the slide here for you. So that's how you can do it in C.
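What that job amounts to is mostly a dispatch over the request type from the layout shown earlier. A sketch, with hypothetical backend_* helpers standing in for your storage logic and the libvhost-user queue plumbing elided:

```c
#include <stddef.h>
#include <stdint.h>

/* Status codes from the VIRTIO specification. */
#define VIRTIO_BLK_S_OK      0
#define VIRTIO_BLK_S_IOERR   1
#define VIRTIO_BLK_S_UNSUPP  2

/* Hypothetical backing-store accessors, for illustration only. */
extern int backend_read(uint64_t sector, void *buf, size_t len);
extern int backend_write(uint64_t sector, const void *buf, size_t len);
extern int backend_flush(void);

/* Handle one virtio-blk request; called for each queue element
 * that libvhost-user hands us. Returns the status byte. */
static uint8_t handle_request(uint32_t type, uint64_t sector,
                              void *buf, size_t len)
{
    switch (type) {
    case 0: /* VIRTIO_BLK_T_IN: read into the application's buffer */
        return backend_read(sector, buf, len) == 0 ?
               VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
    case 1: /* VIRTIO_BLK_T_OUT: write */
        return backend_write(sector, buf, len) == 0 ?
               VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
    case 4: /* VIRTIO_BLK_T_FLUSH */
        return backend_flush() == 0 ?
               VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
    default: /* discard, write zeroes, ... if unimplemented: */
        return VIRTIO_BLK_S_UNSUPP;
    }
}
```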
In Rust, similarly, there is a library available for you. So there's the vhost-user-backend Rust crate, and it plays a similar role to the libvhost-user library for C. So this allows you to easily implement whatever vhost-user device you want. And in this case, it's your job to implement
virtio-blk, just as I mentioned. Okay. Now, I still wanted to touch on one con that we hadn't covered yet, because we've explained how, although a user space interface is complex and is more work than just using file descriptors and read and write,
I think that libblkio and libvhost-user and so on, these libraries that are ready for you to integrate into your applications or software-defined storage systems, they take away that complexity and they make the integration easier as well. You don't need to duplicate code or write a lot of stuff. But we're still left with one of the disadvantages.
How do we connect this back to the kernel if it turns out we want to use some functionality from the kernel storage stack, or if we have a legacy application that we can't port to use the user space interface? So for vhost-user-blk, there is a solution here.
There's a Linux VDUSE feature, which is relatively new, and what it does is it allows a vhost-like device to be attached to the kernel. So even though your software-defined storage system is in user space, this gives you a way of attaching your block device
to the kernel, and then in the kernel, the virtio-blk driver will be used to communicate with your device. And what happens is that a /dev/vda or /dev/vdb block device node will appear, and your application can open that like any other block device, and it can read and write and do everything through there.
One of the nice features of this is that because it's quite similar to vhost-user-blk, the code can be largely shared. I think the only difference would be that instead of having the vhost-user code, you would have the VDUSE code, which opens the character device
that the VDUSE driver in the kernel offers instead of a Unix domain socket. And the setup and the control path are a little bit different, but the actual data path, the virtio-blk part, is still the same, so you can reuse that code. So that's an effective way of doing it.
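As an example of how this looks in practice: qemu-storage-daemon can create a VDUSE export directly, and the vdpa tool from iproute2 then attaches it to the kernel's virtio-blk driver (option names as I understand them from the QEMU and iproute2 documentation; verify against your versions):

```sh
# Export a block node via VDUSE instead of a vhost-user socket.
qemu-storage-daemon \
    --blockdev driver=file,node-name=disk0,filename=disk.img \
    --export type=vduse-blk,id=vduse0,node-name=disk0,name=vduse0,writable=on

# Bind the VDUSE device to the in-kernel virtio-blk driver;
# a /dev/vda-style block device node then appears.
vdpa dev add name vduse0 mgmtdev vduse
```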
There's another new Linux feature that I wanted to mention that is interesting here, and also a little bit more general, even outside of vhost-user-blk, and that's ublk. ublk is a new Linux interface for user space block IO, so that your software-defined storage system can present host kernel block devices. So you can have your block device
and process it in user space, and it uses io_uring. It's an exciting feature, and it's pretty interesting, so I've left the link here. The only thing with this is that compared to VDUSE, it does not reuse or share any of the vhost-user-blk stuff. So if you already have vhost-user-blk support
in your software-defined storage system, or you just want to streamline things, then ublk is kind of a whole different interface that you have to integrate. So that's the only disadvantage, but I think it's pretty exciting too.
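For comparison, with ublk the user space server is driven through the kernel's ublk driver and io_uring; the ublksrv project's demo tool can expose a file as a block device roughly like this (command syntax from the ublksrv README; treat it as illustrative):

```sh
# From ublksrv: create /dev/ublkb0 backed by a file, with a user
# space daemon servicing the requests over io_uring.
ublk add -t loop -f disk.img
```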
Okay, so to summarize: if you need a user space block IO interface for performance, or because you need to be able to do unprivileged IO, or for security, then implement vhost-user-blk. There are open specs, code, and community. Please let me know if you have any questions, and thank you. Have a great FOSDEM.