Lightning fast networking in your virtual machine
Formal Metadata
Number of Parts: 26
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/19172 (DOI)
Transcript: English (auto-generated)
00:00
Good afternoon, everyone. My name is Luigi Rizzo, and I'm presenting this work on accelerating network I/O in virtual machines. This is something I've done with my colleague Giuseppe Lettieri and former student Vincenzo Maffione in Pisa. The pictures that you see here are, on the left, the view from my office in Pisa,
00:23
and on the right, another view since I spent a few months in Mountain View at Google. So what is this work about? What are our goals here? First of all, we want to accelerate the network IO within virtual machines.
00:42
Consider that the typical packet rates that you can achieve on FreeBSD, for instance, are about a million packets per second per core, which is much less than line rate at 10 gigabits per second. This is on bare metal, on real hardware. In a virtual machine, performance
01:01
is 10 times worse, possibly even more than 10 times worse, again on FreeBSD. So we wanted to find a solution to close the gap between these two performance levels. We wanted to accelerate network I/O, not just for bulk TCP connections, because that is a problem that is
01:21
relatively easier. We have large frames, we have TSO, hardware offloads that can relieve the work that is done on the CPU. So in a way, you can get close to line rate at one gigabit, and possibly even 10 gigabits, on a virtual machine using more or less standard techniques, and perhaps paravirtualized device drivers and devices.
01:45
However, there are more applications that might be interesting to run in virtual machines: software routers, middleboxes, applications that use short-lived connections. And so we would like to have the same level of performance
02:04
both on the real hardware and on the virtual machine. There is another thing that everybody tells you when you approach virtual machines, and that is that device emulation is inherently slow. You cannot do high-speed I/O with an emulated NIC.
02:21
You really need some sophisticated solution: Xen netfront, virtio, VMXNET. Actually, what we found out is that we got exactly the same result, and possibly slightly better, with an emulated e1000 within a virtual machine. So at least you probably don't need a special device.
02:41
And it's interesting that you don't need a special device, because there are cases where perhaps you have a well established code base that involves your standard NIC driver. You'd rather exercise that one, rather than a completely different device driver that you have
03:02
no way to use on the real hardware. And also, in doing our previous work on netmap, the VALE switch, et cetera, we have accumulated a few tricks that we would like to apply in other environments and see if they are effective. So that was a nice application for our previous work.
03:22
The main tools that we've been using are netmap, something that was done a couple of years ago and presented here at BSDCan last year. It is a framework for doing packet I/O very, very fast. It is a standard part of FreeBSD now, there is a loadable module for Linux, and you can do line rate on a 10 gigabit interface
03:43
very, very easily. A follow-up of this work was the VALE switch, which is basically a software Ethernet bridge that uses the same API and ports as netmap. It was designed as a generic interconnect; however, our goal was to use it
04:01
to interconnect virtual machines. And it's interesting for a number of applications, including testing things that you should run later on a 10 gig interface using NetMap. So you can really stress test your application, even without hardware. And possibly even at speeds faster than the 10 gigabit
04:20
limitation that the NIC would enforce. VALE runs at up to 20 million packets per second per port, or 70 gigabits per second with large packets; large, I mean, 1,500 bytes. The result that I'm presenting today is an accelerated version of QEMU
04:41
and matching patches for device drivers, both on FreeBSD and Linux. That can achieve over a million packets per second when you're using standard socket applications in the guest.
05:03
This is guest-to-guest communication. And about 6 gigabits per second with 1500 byte frames. And over 5 million packets per second when the clients in the guest are using the NetMap API. The changes that we have introduced in the device drivers and in QEMU are really small.
05:23
The order of a few hundred lines of code per subsystem. And that's also a nice thing, because it's easier to debug small pieces of code rather than an entirely new device driver or an entirely new subsystem. The FreeBSD driver patches are about to be committed to the LEM device driver.
05:41
We have an equivalent thing for Linux and QEMU patches that are probably going to be distributed as separate things from our website. So just a little bit of background. NetMap is an API to send and receive raw frames from user space. It relies heavily on batching, packet buffers,
06:01
and the descriptor rings are exposed to user space through a shared memory region. And you have a selectable file descriptor for synchronization. Netmap is implemented with very small modifications to device drivers and a kernel module, and there is a libpcap compatibility library that applications can use.
06:22
So in the best case, you don't even need to recompile your application: you just preload our version of the libpcap library, and you're then using the faster I/O path to send and receive traffic. This slide shows the basic protocol
06:41
to access a NIC using netmap. You open a special file, you get the file descriptor, issue an ioctl to bind the file descriptor to a given interface (or something else), map the memory region, and you're ready to send and receive packets, using select or poll for synchronization.
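As a concrete illustration, here is a minimal sketch of that open sequence in C (error handling omitted; the interface name "em0" is just an example):

    /* Minimal netmap open sequence: open the special file, bind it to an
     * interface with NIOCREGIF, and map the shared rings and buffers. */
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/if.h>
    #include <net/netmap_user.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>

    static struct netmap_if *
    netmap_open_sketch(const char *ifname, int *fdp)   /* e.g. "em0" */
    {
        struct nmreq req;
        int fd = open("/dev/netmap", O_RDWR);      /* the special file */
        void *mem;

        memset(&req, 0, sizeof(req));
        req.nr_version = NETMAP_API;
        strlcpy(req.nr_name, ifname, sizeof(req.nr_name));
        ioctl(fd, NIOCREGIF, &req);                /* bind fd to the interface */
        /* map the shared region holding descriptor rings and packet buffers */
        mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
        *fdp = fd;   /* use poll()/select() on fd, or NIOC*SYNC, to synchronize */
        return NETMAP_IF(mem, req.nr_offset);
    }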
07:02
Performance, as I mentioned, is really good. In this graph, you see how fast you can send or receive packets using one core. For instance, at less than one gigahertz, you're already able to saturate a 10 gigabit link with almost 15 million packets per second.
07:20
And it scales pretty well with multiple cores, up to four. I haven't tried more than that, because I didn't have the hardware, and also because at that point you're already saturating the network at 300 megahertz, so there's not much of a point in going further. As a comparison, the internal packet generator in Linux
07:41
can do about 4 million packets per second at maximum speed. On FreeBSD or even on Linux using sockets, you're running at about a million packets per second per core. A follow-up to NetMap was to extend the API and the kernel module to implement a virtual switch.
08:02
So basically, the netmap module in FreeBSD is now able to interpret port names of the form vale<x>:<y> as a request to create a virtual switch named x and a port named y on that virtual switch,
08:21
and to run the Ethernet learning bridge algorithm on that particular switch. You can create multiple switches, you can create multiple ports, and you can connect clients using the netmap API to a VALE port in the same way as you connect to a physical NIC working in netmap mode.
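As a sketch (using the same headers as the earlier example; the switch and port names "vale0" and "1" are arbitrary), attaching a client to a VALE port only changes the name passed to the registration ioctl:

    /* Sketch: the name "vale0:1" asks the kernel to create virtual switch
     * "vale0" (if it does not exist yet) and attach a port named "1" to it;
     * the rest of the netmap API is unchanged. fd comes from open("/dev/netmap"). */
    static int
    vale_attach_sketch(int fd)
    {
        struct nmreq req;

        memset(&req, 0, sizeof(req));
        req.nr_version = NETMAP_API;
        strlcpy(req.nr_name, "vale0:1", sizeof(req.nr_name));
        return ioctl(fd, NIOCREGIF, &req);
    }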
08:42
So you can test your application over a VALE switch and then run it over a real interface. Or these blocks could be virtual machines; the system is extremely flexible. The operation is entirely sender driven, and each incoming packet is dispatched to one or more destinations. So the throughput depends on how many copies
09:01
you need to make of the packet. In terms of performance, if you implemented VALE in a straightforward way, processing one packet at a time, and we have done this, you would only reach between 2 and 5 million packets per second, which is not very fast.
09:21
However, by exploiting batching and trying to reduce the accesses to locks, so that you acquire one lock for each batch per interface, we managed to reach a throughput of about 18 to 20 million packets per second
09:40
with minimum-size frames, and 70 gigabits per second with larger frames. And 70 gigabits per second is basically limited by the memory bandwidth of the system you're using. These curves show the performance of the VALE switch; this is the top curve, with a variable number of destinations in the case of broadcast.
10:00
This is the time that is spent per packet, in nanoseconds per packet, depending on the packet size, for the VALE switch and for other solutions; for instance, this is the Linux bridge using tap interfaces. Now, a little bit of background on virtual machines.
10:21
Probably in this room there are people who know more about this than me, so I apologize for mistakes and inaccuracies. How is a virtual machine implemented in modern systems? Basically, the CPUs that we have these days can run the entire guest operating
10:41
system on the actual CPU, running in virtual mode. And each virtual CPU appears as a thread to the host operating system. So you can create a guest machine with multiple virtual CPUs; these are threads sharing the same memory image.
11:03
There are things that cannot be done by the CPU in virtual mode specifically when you need to access registers or handle interrupts, et cetera. Sometimes you need to exit from this virtual execution mode and jump back to the real execution mode of the CPU and perform actions that, for instance,
11:23
emulate the peripherals that you're trying to access and so on. These VM exits and similar things that happens on interrupt dispatching are actually very, very expensive, much more expensive than on a real machine.
11:41
Accessing a register is probably 100 nanoseconds or a little more on a real machine. But on a virtual machine, due to the VM exits and further processing, you might spend anything between 5 and 10 microseconds on that single operation. So there is a big gap in performance.
12:00
And especially when you are accessing devices, there might be cases where you have a lot of these accesses on each packet. So if you are not careful in emulating the device, or at least in writing the device driver in a way that is performing well with the virtual machine,
12:20
you can take a very high performance hit. So even if you solve the performance problem in the device emulation, there are still possible bottlenecks in the host backend. For instance, the connection between the virtual machine
12:42
and the physical interface or another virtual machine goes through several stages in the hypervisor and in the kernel of the host operating systems. And again, if you are not careful in coding those
13:00
processing stages very carefully, you might end up doing expensive operation or redundant operation, copies, et cetera, that impact your performance a lot. Of course, if you start with the device driver emulation, which is very expensive, that's the main bottleneck. And so you might not even see that there
13:24
are other performance problems in your architecture. But doing this work, we actually solved the device emulation problem first, then hit a number of subsequent bottlenecks, and then we tried to solve all of them. So the paravirtualized device drivers, or sorry,
13:43
paravirtualized peripherals that have been introduced over time by Xen, by VMware, and by the QEMU forks tried to solve the problem of an efficient device model that would work well
14:06
under the control of a hypervisor. However, those solve only one part of the problem. All the other bottlenecks that I mentioned still exist, and so they need to be dealt with in the hypervisor
14:21
and in the software switch that connects virtual machines among themselves or to the physical NIC. So one of the things that we did was demystify the belief that device emulation is slow. Because once you look at a paravirtualized device, you see that the data representation
14:41
of those devices, packet buffers, rings, et cetera, is not too different from what you have in a physical NIC. What is different is the way you access registers, or the equivalent of registers, to get information on what is the current packet to transmit, or the current interrupt status, et cetera.
15:01
But that can be addressed without introducing a completely new device model. The real problem that paravirtualized devices try to solve is to reduce the number of virtual machine exits. And as I mentioned, those are related to interrupts and to accesses to I/O registers.
15:22
So if you have a way to reduce the number of interrupts and perhaps replace access to registers with information that is in a shared memory, you can reduce this number of exits and get decent performance even with an emulated E1000 or Realtek or similar devices. The second thing we did was improve
15:42
the throughput of the hypervisor, moving packets from between the front end and the back end. The front end is the emulated side of the network device. The back end is whatever is used to connect the hypervisor to the host network stack. However, if you have a more performant connection and more
16:06
performant back end, you might still have a slow switch inside the host, which is what happens, for instance, if you use FreeBSD bridging, Linux bridging, or Open vSwitch. And so we were forced to also replace this switch
16:22
with something faster to get better performance. Now, the first thing we did was look at something that is already implemented in most modern hardware, and that is interrupt moderation. Especially on the receive side, instead of sending one interrupt for every packet
16:42
that you receive, modern hardware tries to enforce a minimum interval of time between interrupts, so that you don't cause too much overhead in the processing of traffic. And the problem is that most hypervisors
17:01
don't really implement these features that exist in the hardware. We have seen that it doesn't exist in QEMU, it doesn't exist in VirtualBox, and of course we have no access to VMware, et cetera, but according to what we have read, moderation is not implemented there either. Implementing moderation is not terribly hard.
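A sketch of the receive-side idea, with hypothetical helper names (this is not the actual QEMU code):

    /* Receive-side interrupt moderation in an emulated NIC (sketch).
     * raise_guest_irq() and arm_timer_us() are hypothetical hypervisor helpers. */
    void raise_guest_irq(void);
    void arm_timer_us(unsigned usec);

    struct itr_state {
        int      timer_armed;   /* a moderation timer is currently pending */
        int      irq_pending;   /* an interrupt was suppressed while it ran */
        unsigned itr_us;        /* minimum interval between interrupts, e.g. 20-50 us */
    };

    static void on_packet_received(struct itr_state *m)
    {
        if (!m->timer_armed) {
            raise_guest_irq();          /* idle: interrupt immediately */
            arm_timer_us(m->itr_us);    /* and start the moderation interval */
            m->timer_armed = 1;
        } else {
            m->irq_pending = 1;         /* within the interval: just remember it */
        }
    }

    static void on_moderation_timer(struct itr_state *m)
    {
        if (m->irq_pending) {
            m->irq_pending = 0;
            raise_guest_irq();          /* deliver one coalesced interrupt */
            arm_timer_us(m->itr_us);    /* keep moderating while traffic flows */
        } else {
            m->timer_armed = 0;         /* idle again: next packet interrupts at once */
        }
    }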
17:23
And the amount of code that it takes is really limited. The only problem is that in order to implement interrupt moderation, you need to set some timers that tell the hypervisor: OK, I'm not interrupting you now, but I will interrupt you in 20 microseconds or 50 microseconds or something
17:42
like that. That's the order of magnitude of the delays that are implemented by real hardware. They are programmable. But basically, those are the numbers that you normally use. Now, quite often, you don't have this fine grained timer resolution in the operating system. And that might be a reason why moderation wasn't implemented
18:02
in the first place. Anyways, we implemented that and tried to use that. And we will see some performance number. And also, we will show that moderation by itself is not necessarily solving the performance problem. And in fact, here are the numbers
18:22
in one set of experiments that we made. The experiments that I'm reporting basically use KVM and QEMU as the hypervisor, running on top of Linux; Linux because we don't have KVM on FreeBSD, so we cannot really use hardware
18:41
support for virtualization. We had two guests, which are FreeBSD HEAD as of February, more or less. Those are PicoBSD builds, so standalone images. And they are connected either to tap interfaces, as in this particular experiment, or to a VALE switch,
19:01
as we will see later. And they're running on the same machine, which has about a 3 or 3.2 gigahertz CPU. So if we take an unmodified QEMU and we try to measure the guest-to-guest throughput,
19:23
the transmit rate is actually very, very low. It's about 24,000 packets per second on one virtual CPU. And if you have two virtual CPUs, you get a transmit rate of about 65,000 packets per second. By implementing interrupt moderation,
19:40
we actually get a little bit of improvement. Well, quite a substantial improvement, but we are still dealing with very low packet rates in absolute terms. We move from 24,000 to 80,000 packets per second with one virtual CPU, and from 65,000 to 87,000 with two. So why is that for two virtual CPUs,
20:00
the improvement is so modest? Well, the thing is, without interrupt moderation, when you try to transmit a packet, the guest virtual CPU does a VM exit, transmits the packet, and immediately generates an interrupt. And so by the time you return control to the guest operating system, you are hit by the interrupt,
20:22
and you have to serve the interrupt immediately. And so that's the source of overhead. In the case of two virtual CPUs, when you do the exit and return, one of the CPU will serve the interrupt, and the other one will be able to continue processing of your traffic. So you have a little bit of parallelism going on,
20:41
and that permits a little bit more batching and performance improvement. So having the interrupt moderation changes the situation for the one CPU case, but it doesn't change the situation too much for the two virtual CPU cases. On the receive side, we didn't see a lot of improvement
21:01
with interrupt moderation, but this is mostly because of the test configuration that we used. Basically, the sender itself transmits packets in batches to the other virtual machine. And so even without interrupt moderation, you are still getting batches of packets on the receive side, and not just one interrupt per packet.
21:26
These numbers are on FreeBSD and are pretty interesting: the receive side is faster than the transmit side, and we did not even manage to trigger receive livelock. Using Linux as a guest operating system, we actually had some livelock cases,
21:43
but those really depend on the operating system. The second technique that we use is called send combining. It's actually a very old idea; I was reading a 2001 paper from VMware where they documented, probably for the first time,
22:01
the techniques that they use in their initial emulators back in the late 90s. And they use a similar technique. The idea is that whenever you have a pending transmission, sorry, whenever you want to transmit a packet, you write to a register in the NIC to tell the hardware, if you have the hardware,
22:22
or the hypervisor if you are on a virtual machine, that you want to send out a packet. And that write to the register is very expensive; it is what causes VM exits. Now, if you are requesting an interrupt at the completion of the packet, then you can wait.
22:42
You can postpone writing to the register for subsequent packet transmission requests. And instead, just remember that there are pending transmissions to be sent out. And when you get the interrupt, you do the actual write to the register. So that batches writes to the register
23:00
and reduces the number by a significant amount, especially if you have interrupt moderation, of course. If you don't have interrupt moderation, you will get an interrupt immediately here after sending this request. In terms of performance, you see here, again, the implementation of send combining
23:23
requires a very modest amount of code. And it's only in the guest device driver, so you don't even need to modify the hypervisor. Whereas the interrupt moderation only needs to modify the hypervisor, assuming that the operating system supports the feature. So with moderation and one virtual CPU,
23:44
these are the numbers that I showed in the previous slide. With send combining alone without interrupt moderation, you basically have no gain in the case of one virtual CPU. You have a significant gain in the case of two virtual CPUs. And if you have both interrupt moderation and send combining, the speed up in the one virtual CPU case
24:03
is impressive. And two virtual CPUs goes about at the same speed. Just because you have reduced the interrupt load by a large factor, and the number of VM exits by a large factor. And so the second CPU in this particular test is doing almost nothing. Basically, it's just serving interrupts.
24:20
And the main CPU is doing most of the work. So we have a ten to fifteen times speedup in the case of one virtual CPU, and about five times with two virtual CPUs, and we are approaching pretty decent packet rates on the transmit side. Send combining only works on the transmit side.
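A sketch of the guest-side logic of send combining (the names are illustrative, not the actual LEM/e1000 patch):

    /* Send combining in a guest e1000-style driver (sketch).
     * write_tdt_register() and the fields are illustrative names. */
    void write_tdt_register(unsigned next_to_use);

    struct tx_state {
        unsigned next_to_use;   /* ring index of the next free TX descriptor */
        int      tdt_pending;   /* descriptors queued but not announced to the NIC */
    };

    static void start_xmit(struct tx_state *tx, int tx_interrupt_expected)
    {
        /* ... fill the next TX descriptor and advance tx->next_to_use ... */
        if (tx_interrupt_expected)
            tx->tdt_pending = 1;                  /* defer the expensive register write */
        else
            write_tdt_register(tx->next_to_use);  /* no interrupt coming: kick now */
    }

    static void tx_interrupt_handler(struct tx_state *tx)
    {
        if (tx->tdt_pending) {
            tx->tdt_pending = 0;
            write_tdt_register(tx->next_to_use);  /* one write flushes the whole batch */
        }
        /* ... reclaim completed descriptors ... */
    }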
24:41
So the next step we tried to implement was paravirtualization. And the idea of paravirtualization is to reduce the number of VM exits by making the host and the guest communicate through shared memory, instead of communicating through interrupts and writes to registers.
25:02
Of course, in order to communicate through shared memory, given that you have no synchronization mechanism, you need both entities to be active at the same time. But you cannot afford to have some thread in the host permanently running and polling the status of some shared memory,
25:21
because that would be too expensive. So the way it works is that you start from an initial state where both the guest and the host are idle. And then, whenever one of the two entities wants to start a communication, it sends a message, which is called a kick. A kick from the guest to the host
25:41
is typically sent by writing to a register, because the register write causes a VM exit, and so you transfer control to the host and can do operations in the context of the host. A kick in the other direction is typically sent through an interrupt, because that's the way the host can communicate with the guest operating system and tell it to start, for instance,
26:02
a polling thread of some kind. So what we implemented to do the paravirtualization of the e1000 device was to slightly modify the hypervisor so that the write to the transmit register (the TDT, that is the name of the register on the e1000)
26:24
is also interpreted as a kick by the hypervisor. And interrupts are also interpreted as kicks by the guest operating system.
26:41
The region that is used to exchange information we call the common status block, or CSB. Basically, it contains a couple of pointer indexes for each direction of the communication. So after the kick, what happens is that, for instance,
27:00
if you are transmitting, the guest will write in a shadow register, which is actually a memory location, which reflects the value of the transmit register on the transmit ring. And the host will poll the content of this shared memory location to see if there are more packets to be transmitted or not.
27:21
As long as there are new packets to be transmitted, there is no need to write to a register in order to send them out. The polling loop on the host will fetch the packets from the buffers and send them to the back end, whatever it is, and will notify completion to another shadow
27:43
register in the CSB. And of course, there is already an implicit notification through the status bits in the ring of descriptors that is used by the NIC. And the same happens in the other direction. When you have a new packet coming in
28:00
and everything is idle, you send an interrupt. The guest starts processing data. But instead of reading from the status register, if any, to get information on whether or not there are more packets, it will just get the information from the ring or from the CSB. And this way, it doesn't need to access registers.
28:22
And also, when the guest on the receive side frees buffers and returns them to the NIC to perform more receptions, it doesn't do it through registers, but just writes the information into the CSB. Again, another very small change to both the guest
28:42
and the host side, about 100 lines each. And the performance gains are also impressive in this case. In order to use paravirtualization, we don't need any interrupt moderation or send combining. And you see that we approach half a million packets per second on FreeBSD, both with one and two virtual CPUs.
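To make the shared-memory scheme concrete, here is a sketch of what a common status block and the guest transmit path might look like; the field and helper names are illustrative, not the exact layout used in our patches:

    #include <stdint.h>

    /* Common Status Block shared between guest and host (sketch, illustrative fields). */
    struct csb {
        /* guest -> host */
        volatile uint32_t guest_tdt;          /* shadow of the TDT transmit register */
        volatile uint32_t guest_rdt;          /* shadow of the receive tail register */
        /* host -> guest */
        volatile uint32_t host_txhead;        /* last TX descriptor consumed by the host */
        volatile uint32_t host_need_txkick;   /* host stopped polling: a real kick is needed */
    };

    void write_tdt_register(uint32_t val);    /* illustrative: a real write, causes a VM exit */

    /* Guest transmit path: update the shadow register in shared memory and only
     * touch the real register (the kick) when the host says it is not polling. */
    static void guest_transmit(struct csb *csb, uint32_t new_tdt)
    {
        csb->guest_tdt = new_tdt;             /* visible to the host polling loop */
        if (csb->host_need_txkick)
            write_tdt_register(new_tdt);      /* VM exit: wakes up the host side */
    }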
29:05
So now we get to a level of performance which is perfectly equivalent to that of virtio, while using a more or less standard e1000 device driver. Now, how can we go faster than this?
29:22
This is basically the throughput of the switch, the Linux bridge or whatever, using the tap interface as a communication channel. So we need to improve that part of the system. And that part of the system involves using a faster virtual switch at the bottom,
29:43
interconnecting the two virtual machines. And the VALE switch is a perfect thing to use in this particular case. All we needed to do was to write a back end for QEMU to talk to the VALE switch instead of the tap interfaces or the other mechanisms that QEMU has.
30:02
And the QEMU code is quite modular, so it wasn't a difficult task, about 350 lines of code. And another advantage of using this approach is that we can now connect a QEMU instance directly to a NIC using the netmap API. So we can do almost line rate without too much difficulty.
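The transmit side of such a backend is essentially the standard netmap ring manipulation; a sketch using the public macros from net/netmap_user.h, with all the QEMU glue omitted:

    /* Sketch: push one frame from the emulated NIC into TX ring 0 of a
     * netmap or VALE port. fd and nifp come from the open sequence shown earlier. */
    static void backend_send(int fd, struct netmap_if *nifp,
                             const void *frame, unsigned len)
    {
        struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);
        struct netmap_slot *slot;

        if (nm_ring_empty(ring)) {            /* no free slots: flush and retry later */
            ioctl(fd, NIOCTXSYNC, NULL);
            return;
        }
        slot = &ring->slot[ring->cur];
        memcpy(NETMAP_BUF(ring, slot->buf_idx), frame, len);
        slot->len = len;
        ring->head = ring->cur = nm_ring_next(ring, ring->cur);
        ioctl(fd, NIOCTXSYNC, NULL);          /* in practice, issued once per batch */
    }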
30:21
However, just improving the switch didn't get us much performance improvement. Because in fact, the data path within the hypervisor was really slow. And this figure shows you what happens in the communication between the guest and the software
30:41
switch at the bottom. The standard QEMU implementation consumed, as an amortized time per packet, about 200 nanoseconds to copy the data from the buffers that
31:01
are supplied by the guest operating system into the front end, then another 80 nanoseconds to transfer them to the back end, and another 500 nanoseconds per system call, because using the tap interface you can only send one packet per system call. So we had to clean up this data path in order to get better performance. And that was done by noting that, for instance,
31:25
part of these 200 nanoseconds was due to the fact that for every access to the descriptor, there was a call to the routine that mapped guest physical addresses into host virtual addresses.
31:40
Now, this mapping is there forever until the guest virtual machine migrates to somewhere else. So there is absolutely no need to repeat the check every time. So we just cached the result and reduced this time by four times. Then the data copy was done using more or less memcpy, which is quite slow. And so we replaced that with an optimized copy routine, which
32:06
also uses the same trick that we use in netmap and VALE. Instead of trying to copy exactly the number of bytes that you have, like 65 or some odd number, we just round the length up to a multiple of 32 or 64, which makes the entire process a lot more efficient,
32:23
and so reduces the time from 80 to 40 nanoseconds. And then, by replacing the back end with the VALE switch, we got an amortized time of about 50 nanoseconds per packet in the last part of this path.
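As an illustration of the rounding trick mentioned above, here is a sketch of such a copy routine (the real one in netmap is more careful; this assumes the buffers are sized and aligned for the rounded-up length):

    #include <stdint.h>

    /* Copy len bytes, rounding the length up to a multiple of 64, so the loop
     * works on whole 64-byte chunks and never has to handle an odd tail. */
    static inline void
    fast_copy(void *dst, const void *src, unsigned len)
    {
        uint64_t *d = dst;
        const uint64_t *s = src;
        int n = (len + 63) / 64;     /* number of 64-byte chunks */

        while (n-- > 0) {
            d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
            d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
            d += 8; s += 8;
        }
    }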
32:42
So now the interconnection between the guest and the switch is a lot faster, and we were able to push packets up to this point at about 10 million packets per second. Then, going into the switch, we have, of course, a slight reduction in performance, but it is still pretty fast. Overall, and this is almost the final table
33:02
that I have to show you, this is the performance using tap and the Linux bridge as the switch interconnecting the two virtual machines, in the various configurations. So you see that we started from about 24,000 packets per second in the worst, standard case with one
33:20
virtual CPU, and we got to between 300,000 and 400,000 packets per second, or 500,000 packets per second, depending on the type of optimizations that we implemented. This number here, ITR, is the delay that we used for interrupt moderation, and it is in microseconds, I believe.
33:44
So a delay of one microsecond. Of course, I mean, these delays are nominal. But when you implement them in the operating system, there is some granularity in the timer. You don't really have one microsecond. You have probably much larger delays. The problem is that increasing the interrupt moderation
34:03
improves performance, but it has an impact on the latency of your path. And so it might be something that you don't want to do if you want low latency, or high throughput
34:20
with small window sizes. And here, in red, is the situation if you use the VALE switch as a back end, with all the improvements that we included in the system. So you see that we almost doubled the peak performance, which is about one million packets per second in the case of paravirtualization.
34:42
But even without paravirtualization, we are getting pretty close, between 800,000 and 900,000 packets per second. Again, this test is done between two FreeBSD guests using netsend and netreceive, which are tools for sending and receiving UDP packets. And these numbers are for 64-byte packets.
35:04
With 1,500-byte frames, we reach about half a million packets per second, which amounts to five or six gigabits per second, I think. Now, what happens if we use NetMap within the guests
35:20
instead of using a socket-based application to send and receive data? Of course, in order to use netmap, you need a netmap-capable NIC. Fortunately, the e1000 is one of the NICs that are supported by netmap, so it was just a matter of running the test and seeing how fast it went. And the kind of throughput that we reached
35:41
was about five million packets per second, guest to guest, again. So it's pretty fast. And in terms of absolute throughput, with 1,500-byte packets we do about 25 gigabits per second on the receive side, and slightly faster on the transmit side. So I think that this is the kind of performance
36:01
that is really comparable with what we can achieve on real hardware. Of course, not on the e1000, which is a one-gigabit interface, but, for instance, on 10-gigabit interfaces, ixgbe or others, we are pretty much close to these values.
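For completeness, the receive loop of a netmap client like the ones used in these guest-to-guest tests looks roughly like this sketch (handle_frame() stands for whatever the application does with a packet):

    void handle_frame(const char *buf, unsigned len);   /* application-defined */

    /* Sketch: drain RX ring 0 of a netmap port after poll() reports POLLIN. */
    static void receive_all(struct netmap_if *nifp)
    {
        struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);

        while (!nm_ring_empty(ring)) {
            struct netmap_slot *slot = &ring->slot[ring->cur];

            handle_frame(NETMAP_BUF(ring, slot->buf_idx), slot->len);
            ring->head = ring->cur = nm_ring_next(ring, ring->cur);
        }
        /* the next poll() or NIOCRXSYNC returns the consumed slots to the kernel */
    }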
36:22
Now, what's the status of this stuff? There are three sets of changes involved. One is in the guest operating systems: we have patches for the e1000 device on FreeBSD that are going to be committed soon; I'm talking to Jack Vogel about including them in the driver.
36:42
And I think I've lost the microphone. There are also equivalent changes for the Linux e1000 driver. Since the mechanism for paravirtualization is completely general, we are actually using the same data structures to do the paravirtualization
37:01
of the Realtek device, which is not particularly interesting. But some hypervisors implement the Realtek and not the e1000, so why not try that? On the hypervisor side, we have a QEMU backend for netmap that we sent to the QEMU list a few months ago. We are improving that while it gets
37:20
accepted or rejected or whatever. But anyway, we can surely include that as a patch in our FreeBSD port of QEMU. Again, the changes that we are making on the hypervisor are completely general. So, for instance, it is feasible to write a backend for VirtualBox if you don't want to run QEMU
37:43
or if you want a solution on FreeBSD that has support for virtualization in the CPU. And on the other side, there is really nothing to change: all you need is to load the netmap and VALE module, which
38:01
is just a matter of recompiling it, on FreeBSD and on Linux. So the conclusion of this work is that I believe we have reached a point where we can get about the same performance for network I/O on a virtual machine as on the real hardware. And that's great because, for instance,
38:21
if you want to test optimization of the protocol stack now, you can do them on the virtual machine very, very easily without having to worry about the performance of the final part of the path, the NIC and the switch, et cetera. And I hope that this tool will help us improve the network
38:42
stack on FreeBSD. This work has been done with a contribution by my students listed here and funding from some European project and also companies NetApp and Google who supported my stay at Mountain View. I'd like to conclude with a few comments
39:02
on the status of netmap and the VALE switch since last year. So last summer, in August, I tried to implement a user space version of ipfw and dummynet, which talks to netmap interfaces rather than being
39:24
embedded in the kernel. And the filtering performance of this thing is pretty good: between two VALE switches, a single CPU can filter about 6 million packets per second. And I have reports from a user who said he can do about 10 million packets per second
39:42
between two physical interfaces, that is, connecting the user space ipfw to two physical interfaces through netmap. If we use dummynet, there is an additional data copy involved, and that reduces the performance
40:00
to between 2 and 3 million packets per second. But that's still much faster than the in-kernel version. Another thing that we implemented, in February, was a transparent mode for netmap. One of the issues with netmap is that your application grabs the interface and disconnects it
40:20
from the host stack. So the only way for traffic to reach the host stack is for your application to reinject the traffic, using another netmap file descriptor, into the host stack. With transparent mode, an application using netmap, as it gets a chance to see the packets, can
40:41
mark those packets that should be intercepted by the application itself, and all the others are automatically forwarded to the host stack. And the same goes for the other direction. So that makes the behavior of netmap a lot more similar to what you have with BPF, with the additional ability to filter out
41:00
packets, which might be interesting in some cases. Michio Honda, a colleague from NEC in Europe, started working on netmap recently, and in April he implemented a feature that allows you to hook network interfaces to a VALE switch.
41:21
So basically, now you have the same abilities that you have with the in-kernel bridge: to attach interfaces to a switch or to the host stack and move traffic between ports, completely transparently, without any user space process doing the switching for you.
41:41
And this should be committed to FreeBSD shortly. There is ongoing work to use the VALE switch as the data path for Open vSwitch. This will initially be a Linux-only thing, because the kernel version of Open vSwitch only runs on Linux at the moment. And I've been discussing with some of you
42:03
the option of supporting scatter-gather I/O in netmap. And that is useful for a number of things. For instance, it helps in implementing a software version of TSO, and it helps with reassembly if you want to move traffic through a switch
42:20
with different MTUs on the various ports. So that's all for now. If you have questions, I'll be glad to answer. Yes?
42:42
No, absolutely nothing. No, in general, I try to avoid the use of features that are specific to operating system or hardware version, et cetera, because that makes my work more portable.
43:08
I guess I killed you. Yeah. No, I don't have numbers on latency so far.
43:24
Yes, definitely, yes. One thing I have to say, I don't think I have a figure here, but in terms of latency, one thing that kills you is the fact that when you transmit a packet, you
43:40
need to send the packet back to the user space thread in order to communicate with the back end. Now, Linux has, for instance, some optimization with the vhost thing: on a VM exit, the packet is sent directly to the network stack, whatever it is, without going through the user space process.
44:01
And that's the way one should do things in order to reduce the latency. You have a question? To some degree. It's possible that you need code in the client device
44:32
driver to implement this system. That is true. Well, it is partly true.
44:41
For instance, interrupt moderation (where is the slide with the performance numbers?) doesn't need any change in the client, because the client typically already supports interrupt moderation. So the kind of gains that you can get are the ones here.
45:01
Not very much, though. Yes. That requires client-side changes.
45:20
Yes. So yeah, actually, that's true. In our test case, you could get up to 140,000 packets per second from the original 65,000, which is not a lot. I mean, it's a factor of 2. Of course, when you're used to seeing numbers that improve by a factor of 10, a factor of 2 doesn't seem a lot, but yes.
45:50
But the best thing is when you can combine interrupt moderation with send combining, at least on the transmit side, or you can do paravirtualization.
46:01
But that, of course, requires some changes on the guest side. Now, my point is that, assuming you have the ability to change a little bit on the guest side, it might be easier to add 100 lines of code than to write an entirely new device driver.
46:23
Unfortunately, not for you.
46:48
Something can be done there, yes.
47:02
Well, probably there are other bottlenecks on your system given that it's so old somehow. OK, we're on time. Oh.