
High Performance Network Function Virtualization with ClickOS


Formal Metadata

Title
High Performance Network Function Virtualization with ClickOS
Series Title
Number of Parts
199
Author
License
CC Attribution 2.0 Belgium:
You may use, modify, and copy, distribute and make publicly available the work or its content in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Middleboxes are both crucial to today's networks and ubiquitous, but embed knowledge of today's protocols and applications to the detriment of those of tomorrow, making the network harder to evolve. While virtualization technologies like Xen have been around for a long time, it is only in recent years that they have started to be targeted as viable systems for implementing middlebox processing (e.g., firewalls, NATs). Can they provide this functionality while yielding the high performance expected from hardware-based middlebox offerings? In this talk Joao Martins will introduce ClickOS, a tiny, MiniOS-based virtual machine tailored for network processing. In addition to the VM itself, this talk will hopefully help to clarify where some of the bottlenecks in Xen's network I/O pipe are, and describe performance improvements made to the entire system. Finally, Joao Martins will discuss an evaluation showing that ClickOS can be instantiated in 30 msecs, can process traffic at 10Gb/s for almost all packet sizes, introduces a delay of only 40 microseconds, and can run middleboxes at rates of 5 Mp/s. The audience is anyone interested in improving the network performance of Xen, including improvements to the MiniOS and Linux netfront drivers. In addition, the talk should interest people working towards running large numbers of small virtual machines for network processing, as well as those involved with the recent network function virtualization trend.
Transcript: English (automatically generated)
I come from NEC Labs, the research labs in Germany, and I'll be presenting ClickOS. This is work done by me and my colleagues at NEC, as well as some other people at the University of Bucharest. So, in one sentence, ClickOS is a VM to do network processing.
So what's the motivation behind it? A long time ago some folks designed the internet, and this is how it looks the way we are taught in school: a network containing switches and routers, pretty simple.
But in reality, middleboxes are commonplace in today's networks. They are very useful for a number of reasons, and they extend operators' networks with a lot of features: security, such as firewalls and intrusion detection systems; monitoring, such as DPI boxes; load balancers; boxes dealing with address exhaustion issues; performance, such as media gateways, transcoders and WAN accelerators; and some others of a more dubious nature, like the advertisement insertion box over there. I'm having some issues with the display port so you cannot see exactly how this looks, but these middleboxes are really useful. At the same time, there are a lot of disadvantages. First of all, they are really, really expensive.
One of the boxes I showed you, like the media gateway, costs around 200k. It's really difficult to add new features; often you get locked in to vendors to upgrade your firmware, add new features, and so on and so forth. They are difficult to manage: it often requires specialized personnel to manage the box. And normally you deploy these boxes to keep up with the peak rate, so they cannot be scaled by just adding more as you need, and it's difficult to share these devices among other parties.
And since it's a hardware market, it's hard for new players to come along with hardware. So clearly, if we had some way to virtualize these middleboxes, or shift these middleboxes to software instead of expensive boxes, we would mostly solve these issues. But it is still not clear whether we can do it in software while really coping with the performance that hardware middleboxes can achieve. We believe so, and that's what I'm talking about here today, and that's what we propose
with ClickOS. ClickOS is a MiniOS-based Xen virtual machine that runs the Click modular router. So let me first talk a little bit about Click. Click is a framework to design routers and process packets, and it has this concept of elements.
Each of these elements performs an operation on packets, or receives packets, for example. The configuration on the slide basically decreases the TTL and forwards the packet back out a different interface. These elements are contained in a configuration, and basically you install it either as a user-space application or as a kernel module. The elements' state, such as counters on an interface, is exposed via variables on the /proc file system: if you have a Count element, for example, the counter is read-only, and to reset it you just write one. For ClickOS we compile 262 of these elements, excluding the ones that deal with the file system.
And if you want to further extend Click, you can just write new elements; it's quite easy. So an example of a Click configuration looks something like this: you have a FromNetfront to receive packets and a ToNetfront to send packets out, and you have an IPFilter element that allows UDP packets from 10.0.1 with destination 10.1.0.1 and otherwise drops everything. So you can see that it's quite easy just to filter packets, for example.
So what's ClickOS? Typically in Xen we have a paravirtualized operating system, slightly modified, running on top of Xen, with the kernel and the application running on top of that. In the case of ClickOS, we use MiniOS as our base operating system, and we just run Click, and we call the whole thing ClickOS.
So it kind of follows the trend of OSv in the previous presentation, or Erlang on Xen, or Mirage, that is, application-specific operating systems. The work we did was to have a build system to build these MiniOS apps, which we use to build ClickOS, and what we get is a five-megabyte image in total. We had to emulate the control plane: we don't have a /proc file system shared between VMs, so we had to emulate this over Xen via the Xen store. And we further reduced boot times as well: we started with something like one second, and we got it down to 30 milliseconds boot time. And the most important of the contributions is actually driving 10 gigabits for almost all packet sizes.
So in these VMs, what matters is not bandwidth, but the ability to process really, really high packet rates. The talk will be focused on this last bullet point, but let me just give you one slide on what the toolchain and the toolstack consist of.
We compile newlib, which is a slightly modified libc normally used for embedded systems (and it's already quite an old version), and we also compile lwIP as our networking stack, and some other utilities, in order to compile Click against MiniOS. We also use this toolchain to build, for example, iperf, and the fast packet generator based on netmap called pkt-gen. We expect to port other applications on top of it as well. As for the toolstack, we have our own: we don't use XCP, xl, or xapi on Xen; we designed our own "Cosmos" toolstack, which basically uses the Xen libraries, and we use SWIG to generate the bindings for it.
For now we support Python, JavaScript, and OCaml; it just compiles, for now. The optimizations, okay, optimization was like a learning process, but most of the contributions to avoiding slow boot times were: preferring oxenstored, the new xenstored implementation shipped in Xen 4.2, if I'm not wrong; a different network attach, since the hotplug scripts really slow things down, like 200 milliseconds; and, further, booting an uncompressed image as opposed to a gzipped image. Actually, uncompressing the image takes longer than booting the VM itself.
So that's it for the control plane. The rest of this talk is about how we actually improved the data path, and our main objective is to drive line-rate packet rates: 14.88 million packets per second for minimum-size packets (for reference, a TCP ACK is a minimum-size packet), or, for maximum-size packets, around 810,000 packets per second.
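For reference, the arithmetic behind those two numbers (and behind the 67-nanosecond per-packet budget mentioned a bit later for netmap) is the standard line-rate calculation for 10 Gb/s Ethernet; the following back-of-the-envelope sketch is just an illustration, not code from the talk:

```python
# Back-of-the-envelope packet rates for a 10 Gb/s Ethernet link. On the wire,
# each frame also carries a 7-byte preamble, a 1-byte SFD and a 12-byte
# inter-frame gap on top of the Ethernet frame itself.
LINK_BPS = 10e9
WIRE_OVERHEAD = 7 + 1 + 12  # extra bytes per frame on the wire

def packets_per_second(frame_bytes):
    wire_bits = (frame_bytes + WIRE_OVERHEAD) * 8
    return LINK_BPS / wire_bits

print(packets_per_second(64))        # ~14.88 million packets/s (minimum-size frames)
print(packets_per_second(1518))      # ~0.81 million packets/s (maximum-size frames)
print(1e9 / packets_per_second(64))  # ~67 ns of processing budget per minimum-size packet
```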
So we did a performance analysis of the whole pipe. Normally we have our network driver attached to an Open vSwitch or a Linux bridge, and a vif for each VM, which is managed by netback. Then we have shared memory used with the Xen ring API and event channels to exchange notifications, our netfront, and our Click environment. Additionally, we have two Click elements for our network interface, which is netfront, and that's how the configurations actually pull packets in and send them out. So when we plug all of this together, there were several bottlenecks. First, the Linux bridge; just a disclaimer, numbers change all the time, so these are actually a bit better with Open vSwitch, but a long time ago, when we did this analysis and started working on improvements, it was something like this: for maximum-size packets, the bridge alone was able to forward a bit more than 300,000 packets per second. When we place netback in, 350,000, and with the whole thing together, just roughly 225,000 packets per second, which corresponds to two to three gigabits of throughput.
So there are a lot of issues in this pipe. First of them all is the bridge that we use. Then, it relies on the Linux host TCP/IP stack, which drives 10 gigabits, but not with really high packet rates. Copying: each time you want to send a packet you need to copy the page to the backend domain, and this copy, although it's done in batches, is really, really expensive. Relying on skbuffs: it uses the Linux stack, so the allocation and manipulation of these socket buffers are really expensive. And also, the MiniOS netfront driver is not exactly as mature as the Linux one; it's functional, but it's also slow, roughly two times lower performance than Linux. And RX was a mere 100 megabits of performance when we got it.
So, for starters, we did not, you know, kill the whole backend and frontend, but started doing small optimizations on the backend driver. We started by replacing the switch with one called VALE; I will talk a bit about VALE and netmap right after. We did slight modifications to Xen netback to support multi-page rings (this is work basically ported to netback, but it's work from Wei Liu), and, because we are relying on VALE, which doesn't use skbuffs, we removed the packet metadata manipulation. Overall the results were pretty nice: we got three times better performance, 1.2 million packets per second for minimum-size packets. So, before getting to the very last piece, let me give you just a bit of background on netmap.
Netmap, available in FreeBSD and in Linux as an out-of-tree build, is a fast packet I/O framework that drives line rate with a CPU downclocked to 900 megahertz; that roughly corresponds to 67 nanoseconds per packet to fill 10 gigabits with minimum-size packets. It requires changes to the device driver, but these are minimal changes to support the netmap mode, and the NIC registers, physical memory addresses and packet descriptors are not exposed to user space in any way: these must be validated by the kernel when you poll for packets and cannot be accessed by user space. It's similar to the netchannel approach in the previous presentation, where you map the software ring into user space, you use those data structures, copy packets in, and basically poll, and it flushes them out, bypassing the Linux host stack.
VALE, on the other hand, was born as a small extension to netmap; it is its software switch, and between virtual ports it drives around 18 million packets per second; lately, the latest release does something like 22 million packets per second. The graph on top shows a comparison of the three switches, the FreeBSD bridge, Open vSwitch and VALE, but just a small detail: the VALE case is between virtual ports, whereas Open vSwitch and the bridge are between two NICs. So, we did a number of extensions to VALE: to let us attach NICs to the switch, and to modularize the switch, which means you can implement your own switching function and have kernel modules that extend the switch; and basically the way you use the switch is like a normal application using the netmap API. So, plugging all of this together, we got rid of netback and netfront and we implemented
our own: a really small netback that tries to mimic what the kernel does for the application. We removed the extra copy, which means the buffers are granted to the VM. The ring protocol is not the same as in Xen, where you have a request and expect a response; you just copy packets to the ring, notify the backend and forget. Event channels are used to proxy the poll: when you send a packet in user space you need to poll the ring to send the packets out, and in this case we emulate this with event channels. The biggest problem with this is that we break Linux and other frontends, but to show that these optimizations are viable,
we also implemented a compliant Linux frontend. Since we eliminate the copy, I thought it would be good to mention how much memory we share. The netmap buffers are allocated in contiguous pages, and buffers are around 2 KB, which means that for the normal netmap ring size, which is 1024 slots, you share a total of two megabytes per ring, as of netmap API version 3. We provide different ring sizes because, depending on the throughput our ClickOS VM or Linux guest requires, you may want a smaller batch. So basically the way netmap works is that you issue an ioctl to open the netmap device, and then you do another ioctl to register the interface to netmap, which detaches it from the Linux host stack; then our backend grants the frontend the ring and the buffers, including the grant references of the buffers inside the ring slots. The frontend, on the other side, grabs the grants for the ring and reads the slots in order to map the rest of the buffers. So to run our ClickOS VM, the minimum memory requirement, for 64 ring slots, is six megabytes in total, and we drive 10 gigabits with this ring size as well.
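To put those shared-memory numbers in perspective, here is the simple arithmetic implied by the figures above (roughly 2 KB buffers); the 6 MB minimum presumably covers the whole guest, image included, not just the ring buffers. This is an illustration, not the talk's own code:

```python
# Rough memory arithmetic for the shared netmap rings, using the figures from
# the talk: each ring slot points to a buffer of about 2 KB.
BUF_SIZE = 2048  # bytes per netmap buffer (approximate figure from the talk)

def shared_mb_per_ring(num_slots):
    return num_slots * BUF_SIZE / (1024.0 * 1024.0)

print(shared_mb_per_ring(1024))  # 2.0 MB shared per ring at the default ring size
print(shared_mb_per_ring(64))    # 0.125 MB per ring for the minimal 64-slot setup
```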
So let me just show you some performance evaluation of our system. I told you that our VMs boot in 30 milliseconds; this is what happens when we boot 400 of them and how the boot times evolve. You may ask yourselves why we want 30-millisecond boot times. The main motivation is that you are able to react as the traffic reaches a peak, and you can expand the number of virtual machines onto other machines in order to cope with higher rates; when I say higher rates, I mean 10, 20, 30, 40 gigabits, for example. The time to create a VM evolves from 22 milliseconds up to 220 milliseconds, and the time to instantiate a Click configuration takes between 5 and 20 milliseconds. Regarding the
packet performance, we got huge improvements: we basically achieve around 9 million packets per second on the receive side, and for TX we achieve 14.38 million packets per second, which is 95% of line rate for minimum-size packets. This has changed a little bit with the latest netmap API version, though, where we are down to 11, 12 million. And you can see that as we decrease the ring size, we decrease the throughput, as expected; the cost of event channels starts to be a bit more visible. But as you can see, for all the other packet sizes we fill up the pipe. This was tested on a low-end server with four cores at 3.2 gigahertz.
We used dual-port 10 gigabit NICs; one core is assigned to the VM, the rest to dom0. So after getting 10 gigabits, we asked ourselves how we would scale to more NICs. We actually ran some experiments to see if we were able to drive 40 and up to 80 gigabits of throughput, and we actually did. The bars are TX only, just VMs doing packet generation, and the lines represent forwarding, so VMs actually processing packets and forwarding them back to the outside world. We achieve 40 gigabits, roughly 30 gigabits for three ports, and this is only the rate you receive, so bear in mind that it's full duplex, roughly double that performance: if we combine TX with RX, it's actually around 60 gigabits. And we don't scale that well for minimum-size packets over multiple NICs. Again, this was on the same setup, with one core for dom0 and three cores for the VMs. Each of the cores has all the NIC interrupts assigned to it.
Otherwise, the NICs start to starve and they aren't able to process all the packets, so there was a bit of fine-grained tuning of the interrupts. The Linux guest performance also gained some huge improvements. But a disclaimer: this is not exactly a fair comparison, since the highest bars are using the netmap API, which Xen and KVM don't support, and you still get all the bottlenecks of the host TCP/IP stack. So this is actually my favorite graph: when we actually run middleboxes on ClickOS.
So the first one is just packet forwarding, so it doesn't touch the packets. The second group of bars is an Ethernet mirror: basically, it swaps source and destination MAC address and forwards the packet back. The third one is a standards-compliant IP router. The fourth is a firewall loaded with 10 rules. Then a carrier-grade NAT, which has, I think, 10 flows with randomized ports. A software BRAS, which, for those who don't know, is the first hop of a DSL subscriber; it does PPP termination, LCP and IPCP handling, and session management as well. A load balancer, a flow monitor that gathers statistics about the flows, and an intrusion detection system, just checking the contents of the packets with five rules.
After that, since this is a middlebox and not an end system, it's actually important to have really low delay. Our ClickOS machines add just 40 microseconds of delay to the traffic. We compared it with other systems, and we are on par with dom0's delay. We did our best efforts to set the KVM setup as low as possible, and we also got a huge improvement compared to the Linux domU delay. So this is ClickOS, a really tiny virtual machine to do packet processing.
We drive 10 gigabits. I will show you a demo right after the presentation. I will be at the Xen booth tomorrow between 10 and 11, so feel free to go there and ask any questions; I will show you more demos. As for future work, we are exploring the performance on NUMA systems, which is not exactly that good; doing high consolidation of these VMs, something like running 1,000 or 2,000 VMs (we did some experiments and already ran up to 2,000 VMs on a single machine); and doing service chaining of these virtual machines, chaining these network functions with each other to do more packet operations. Last but not least, at the Xen Summit we said that we would be open source, but we weren't clear how long it would take. There is a presentation at NSDI 2014 about ClickOS and we will be open sourcing around that time, which is April 4th. And that comprises patches to MiniOS, requests for comments on the backend and frontend drivers, and more things to come. So that's it. Hope you guys like it. Any questions?
Yeah.
The configuration is pretty simple. It just decreases the TTL on the ICMP request, just that; it's just to visualize how the configuration works. It just decreases the TTL and forwards the packet back out. Paint is just a Click-internal thing to mark the packet, so you can tell if you see it again; yeah, not actually something on the packet.
He was asking what the configuration does with the TTL and what "Paint" means in Click.
This is for just one VM. Actually, right now it's on FreeBSD 10.
It's already on FreeBSD 10. Yeah, exactly. So netmap normally refers to NICs. VALE is a switch, so they kind of work together and use the same API. And the API is really quite simple. So this is from Luigi Rizzo, and we made some contributions which are already on FreeBSD 10: you can attach a NIC, you can attach the host stack, you can extend the switch with your own switching functions, and all of that is already on FreeBSD 10. Any more questions? Yeah, yeah, yeah, yeah.
The rates I'm showing here, so for example for the middleboxes, that's at the top. So the middleboxes run on one Xen machine, and the two ends are machines with normal Linux stacks and all that. Sorry? Yeah, yeah, yeah. So, I still have nine minutes, and I would like to show you the demo. So this demo, the
resolution is not exactly great, but this is just to show the boot times. So basically you have the create-VM command over there, and Cosmos is very similar to xl, for example. Basically we create the VM over here, and we start the Click configuration, and let's see how it does. It basically boots 100 VMs. Now, on the other machine (they are connected back to back) we just ping the VMs to see if they're actually working. So it's a ping that goes to every single VM, and every single VM runs the same configuration and answers the ping, so that's it. There are more demos, like on-demand 10 gigabits, a single HTTP transfer with the VM booting on demand, and you can see all that. Just go to the booth; I'll be there at lunchtime, 10 to 11. Thank you.
No, you put it on. When you speak in, I'll go up, and you say something, and then if it's whatever,
and whatever, and then if it's, if I, if I say, if I do that, it's fine, and if not, you may, you know, you may have to adjust the microphone. Okay. Okay. And what resolution are you on your
laptop? I think I configured it. Yeah, I've configured it already. Wait a second.
Okay, no problem.
What do I need to do? Yeah, just talk it up. I'll go up. Okay, start talking. Why don't you take a ping gun to show off the slides?
Okay. Okay.
Just, just tell me when to start.
I'll do the timing for this one. If you occasionally look up, I cannot give you 10
minutes to five. Okay. I have a clock. Oh, you have a clock. Perfect.
Okay. Okay. Can you hear me? Okay, I want to start. Hi, my name is Gilad. I work for Red Hat. I'm a co-maintainer in the oVirt project, and I'm working on the SLA and scheduling team. Today I'm going to talk about VM scheduling within oVirt, how it evolved, etc. Basically, we'll talk about the problem of scheduling a little bit, then we'll go ahead and talk about the Nova filter scheduler concepts, and then we'll deep dive into oVirt scheduling, giving a lot of samples; I think the best way to understand stuff is through samples. Okay, can you raise your hand if you have heard about oVirt? Yeah, I guess a lot of you have.
It's a home crowd. So oVirt, for those of you who don't know what oVirt is, is a management platform for VMs based on KVM hypervisors; it can handle thousands of VMs, and has live snapshots, live storage migration, live VM migration; everything is live, basically. And, sorry, we also support advanced network configuration for those hosts, and a lot of storage connections: SAN, NFS, Gluster, etc. Basically, let's see what we've got. A long time ago, we got this question on the users list (I urge you to use that list if you have any questions about oVirt, it's quite active): "How can I define a maximum number of running VMs per host?" It's pretty trivial, but we didn't support that yet back then, so we'll get back to it later. I just wanted to say that you should use the users list. Okay, what we had in oVirt a long time ago: basically, we had two distribution algorithms for
running and migrating VMs. When we ran a VM, we selected a host based on CPU load policies, either even CPU load distribution or power saving, and that's pretty, you know, limited: we have only two distribution algorithms, and we can't construct a user-defined one. Basically, that was it. Then we took a look at the Nova scheduler, which brought us filters and weights, a very simple and easy way to schedule VMs on hosts. Basically, the filters are very cut-and-clear logic that gets a set of hosts on the left-hand side; we run a filter on them, then we apply weights on top of them. Another cool slide from the Nova scheduler: we collect a set of weights, then we sum them up, and then we order the hosts that we got.
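As a rough illustration of that weigh-then-order step (this is a toy sketch, not Nova's actual weigher code; the host fields and multipliers are made up):

```python
# Toy version of the weigh-then-order step: every weigher scores each host,
# scores are combined with per-weigher multipliers, and hosts are sorted by
# the resulting total (lower total = preferred, in this sketch).
def order_hosts(hosts, weighers):
    """hosts: list of dicts; weighers: list of (score_fn, multiplier) pairs."""
    def total(host):
        return sum(mult * score_fn(host) for score_fn, mult in weighers)
    return sorted(hosts, key=total)

weighers = [
    (lambda h: -h["free_ram_mb"], 1.0),  # more free RAM -> lower total
    (lambda h: h["running_vms"], 10.0),  # fewer running VMs -> lower total
]
hosts = [{"name": "a", "free_ram_mb": 8192, "running_vms": 3},
         {"name": "b", "free_ram_mb": 4096, "running_vms": 1}]
print([h["name"] for h in order_hosts(hosts, weighers)])  # -> ['a', 'b']
```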
Let's see a simple sample from the Nova scheduler. This is a RAM filter written in Python: basically, a very simple method that gets a single host's data and a set of properties, and then runs really simple code that either keeps the host in the scheduling process or excludes it, according to the requested RAM for that VM.
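A minimal sketch of what such a RAM filter looks like, modeled on Nova's filter interface; treat the exact attribute names as assumptions rather than a copy of current Nova code:

```python
# Nova-style RAM filter sketch: keep a host only if it has enough free RAM
# for the requested flavor. Attribute names are illustrative assumptions.
class RamFilter(object):
    def host_passes(self, host_state, filter_properties):
        requested_mb = filter_properties["instance_type"]["memory_mb"]
        # Returning True keeps the host in the scheduling process,
        # False excludes it.
        return host_state.free_ram_mb >= requested_mb
```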
When we looked at it and took it to data center virtualization, which is what oVirt does, we saw that we have a little bit of a problem with that concept, because in Nova each filter and weight is applied to a single host, whereas in oVirt we have a larger concept of clustering, which is called a migration domain. In a migration domain, each VM on a host can be migrated to another host in that migration domain, that cluster; we have another concept of load balancing for that cluster, and also a policy container, where a user can create their own
policy and apply it to that cluster. So let's take a look at what we now have in oVirt. Basically, we can create internal and external filters and weights. The internal part is used for better performance: we are within the server, so we get quick access to the DB, and all the filtering and weighting logic from before, back when we didn't have filters and weights, was migrated into that internal scheduler. The external part is what all users can extend. We apply the filters and weights on all the hosts in the cluster at once, for better performance and to have a better grasp of how the cluster behaves. We have containers we call cluster policies: for each cluster we can define a set of filters, a set of weights, and a single load-balancing module, and we support custom properties that are kind of passed through to the filters and weights. This is a diagram of the new model: on the left-hand side we see the set of hosts within the cluster, then we apply each filter, chaining the filters one on top of the other, and then we
construct a weight table, which gives us as a result the selected host that we want to schedule the VM on. We took the same concept as the Nova filters: the existing logic that we had previously, which was basically validation, was migrated into internal filters, and we can extend it in Python using the external scheduler that I will get into later.
I want to show you a really easy sample of how we can use filters. Basically, this is a filter in Python: the name of the class will be the name that the server sees, and the properties validation is basically the set of properties that you can add within the server, which the filter will then get. This is the name of the filter, and this is actually the signature that you need to override in order to extend a filter, to add a filter to the system. This is how you get the custom properties within the filter. I didn't tell you what the filter is all about, but here you can see we get the time, and if the time is within the interval that we get from the user, then we print the hosts, we return all the hosts that we got; if not, we just exclude all the hosts in the filter. That's kind of a bank example.
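To make that concrete, here is a rough sketch of such an external filter in the spirit of what's described; the hook names (properties_validation, do_filter) and the print-based result passing are assumptions based on the talk, not a verbatim copy of the oVirt scheduler-proxy API:

```python
# Hypothetical external filter in the style described in the talk: outside a
# configured window of allowed hours, exclude every host so no VM can start.
import time

class AllowedHoursFilter(object):
    # Custom properties the admin can set on the cluster policy (assumed syntax).
    properties_validation = "start_hour=int;end_hour=int"

    def do_filter(self, hosts, vm, args_map):
        start = int(args_map.get("start_hour", 8))
        end = int(args_map.get("end_hour", 20))
        hour = time.localtime().tm_hour
        # Within the allowed window return all hosts, otherwise exclude them all.
        accepted = hosts if start <= hour < end else []
        # The external scheduler reads the result from stdout in this sketch.
        print(accepted)
```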
Let's talk about the weights. Weighting ranks all the hosts that pass through the filters. We have predefined weights (the first two are CPU load based), and then in 3.4 we added a lot more weight modules. It's kind of easy to add them now, because we have the new architecture: each filter can have factors, so we can prioritize the filters; each weight can have factors, so we can prioritize the weights using factors; and we can also add external weights. Let's see another sample.
In this sample we use a connection to the server using the Python SDK that we have; this SDK is backward compatible and stable. We connect through the SDK, and then the logic of the weight is basically that we iterate over all the hosts and append to a list a tuple of the host ID and the weight of that host. Here it's a little bit cut off, but the weight of the host is the number of active VMs on that host, so basically the hosts would be ordered by the number of running VMs on them.
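A sketch of that weight module's shape; the do_score hook name is an assumption, and count_running_vms stands in for the oVirt Python SDK query the speaker describes:

```python
# Hypothetical external weight module: score each host by its number of
# running VMs, so less loaded hosts end up with a lower (better) weight.
class RunningVmsWeight(object):
    def do_score(self, hosts, vm, args_map):
        # Return (host_id, weight) tuples; the engine sums the weights across
        # modules and then orders the hosts.
        scores = [(host_id, self.count_running_vms(host_id)) for host_id in hosts]
        print(scores)

    def count_running_vms(self, host_id):
        # Stand-in for the oVirt Python SDK query described in the talk
        # (listing VMs and matching them to this host). Stubbed out here.
        raise NotImplementedError("query the engine via the oVirt Python SDK")
```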
Let's continue and talk about the load balancing; that's the third module we have in the cluster policy. Each cluster policy can have only one load-balancing logic. Basically, you can do whatever you want within the load balancing (connect to the SDK and you can shut down everything, basically), but what we support internally is that the load-balancing logic returns a VM and a set of hosts (I will show it in a sample later on), and we will migrate that VM, a single VM, within the server. It's basically safer to migrate a single VM within a period of time, so as not to cause a migration rush and, I don't know, use all our resources for that; it's pretty unsafe to do that.
We also have predefined load balancing within oVirt: the two legacy ones based on CPU, and now, added in 3.4, an even VM distribution one that we didn't have before. The balancing sample that I want to show, as I showed earlier, uses the same numbers (we want to shut down all the VMs), but in this example we will actually do that, and not just exclude hosts to prevent users from running VMs on them like in the filter example. Here we will show how, at the wake-up hour, we basically get all the hosts and activate them, and if it's time to sleep, we will connect using our SDK, get all the VMs from each host, stop all the VMs and deactivate the hosts. Same bank example:
this is how we use it internally to migrate VMs. According to some logic you get the overloaded host (it's a code snippet), then you select, here it's random, the first VM on that host; then we actually print it, because we're using stdio to get the data from the module, and we return it together with a whitelisted set of hosts, which is kind of a filter: the first filtering we do before we run the filters and weights and the normal scheduling process.
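The balancing hook then has roughly this shape; again, the structure and return convention are assumptions read out of the description, and the VM picking is deliberately naive:

```python
# Hypothetical load-balancing module: pick one VM from the most loaded host
# and propose a whitelist of destination hosts; the engine then runs its
# normal filter/weight scheduling for that single VM.
import random

def do_balance(host_loads, vms_by_host):
    """host_loads: {host_id: load}; vms_by_host: {host_id: [vm_ids]}."""
    overloaded = max(host_loads, key=host_loads.get)
    candidates = [h for h in host_loads if h != overloaded]
    vm_to_move = random.choice(vms_by_host[overloaded])
    # One VM per balancing pass, to avoid a migration rush.
    print((vm_to_move, candidates))
```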
Basically, as I said, we have a cluster policy, which is a container for all the filters and weights and a single load-balancing logic, and we have two optimizations for a cluster policy: speed and overbooking. Basically, each time we schedule a VM we run the scheduling one VM at a time, because we want to prevent overbooking; we want to guarantee the same resources for each VM, and if we tried to schedule two VMs together, we could fail because they both see the same resources. So basically the speed optimization is to skip the weighing of the hosts, so that later on the load balancing will do the weighing for us and balance out the cluster, and the overbooking optimization is to omit the, I don't know, just be able to parallelize the
scheduling process. Let's see if we can show those things within oVirt. Okay, this is oVirt. The VM is, you know, because of wi-fi and VPN, the VM is unknown; it's running somewhere. Okay, here I configure, that's it, here I can configure a new cluster policy. I can give it a name, like the shutdown one; I can add the external filter that I've added to the system, the shutdown-host filter, to the enabled filters; as a weight I will pick the one optimal for power saving, so let's try to aggregate all the VMs onto a single host as much as possible; then I will select the balancer that I created earlier, and I can give it a wake-up hour like 8 am and a shutdown hour at 8, press okay, and it should be created. Okay, let's go back to the... no, no... it shouldn't take this long... what, I'm not connected to the VPN? No, no, something happened to the VPN, it doesn't like me. Yeah, I have screenshots, but you know, maybe it will work. One second. Okay, let's take a look at the screen; it works, believe me. Okay, we were here... second... still no... never mind, I will show you that one. Then you go to the cluster... still not... forget about it. Okay, I created the cluster policy that I showed you, then I attach it to a cluster that I already defined, so all the hosts within that cluster will act according to that cluster policy.
Okay, let's talk about how it's implemented in the back end. It's disabled by default; whoever wants to extend it, to add filters, needs to know how to install the external scheduler. The external scheduler is a separate process written in Python. We externalized it because we want to guarantee the engine's safety (you know, if a user writes code, it can be dangerous to the system), and because we want to allow other languages as well: as you know, the engine is written in Java, while this module is written in Python. And going forward, we want to support SaaS, which is kind of scheduling as a service. It's a separate RPM; you need to install it manually. How it works: basically, it's initialized, it reads from a local directory all the filters, weights, and balancing logic that you wrote, then it publishes an internal API; the engine, the server, reads it, and then it waits for calls from the engine for filtering, weighing, and balancing. This is how it looks when it's loaded, the filter and load balancing here. OK, back to the users list.
Now we can easily write a filter that enforces a maximum number of running VMs per host. Pretty easy.
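For example, the filter the users-list question asked for could look roughly like this, using the same hypothetical hooks as in the earlier sketches:

```python
# Hypothetical filter answering the users-list question: enforce a maximum
# number of running VMs per host by excluding hosts already at the limit.
class MaxRunningVmsFilter(object):
    properties_validation = "max_vms=int"

    def do_filter(self, hosts, vm, args_map):
        limit = int(args_map.get("max_vms", 10))
        allowed = [h for h in hosts if self.count_running_vms(h) < limit]
        print(allowed)

    def count_running_vms(self, host_id):
        # Same SDK stand-in as in the weight sketch above.
        raise NotImplementedError("query running VMs via the oVirt Python SDK")
```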
OK, to sum it up: we support easy Python plugins for you to use for VM scheduling; you can manage a separate policy for each cluster, for each group of hosts; and every version of oVirt that comes out gets new modules for scheduling. Questions? So the question is that you have the possibility to read what the hypervisor provides: memory utilization, CPU utilization.
Since you have a Python-extensible API, if I got it right, you may have the possibility to know if the storage, or the storage framework, behind is constrained. So you have the possibility to say, all right, do a storage migration of these virtual machines to the other storage, which is SSD-based or whatever. So you would just want, for example, to have a good talk with all the storage big guys and find out more things about their IOPS and what we are doing on their storage; we would have to give that information, right? Basically, yeah: we have the memory, we have this view, maybe we have some network information about throughputs and stuff. So what's left is the IOPS and the quality of the storage? You can think of whatever. Did you all hear the question? I think so. Yeah. I think when you extend, you
can do whatever you want: when you extend a filter you can basically connect to every provider or use any SDK that you like. What we provide within the engine is what you ask about, memory and CPU load; if you want to connect to other external providers, it's your own choice. So, we had a few guys at the oVirt workshop in Bangalore.
They were asking us to connect the scheduler to a BMC system that is monitoring extra parameters. For example, they can monitor the CPU temperature and the fan speed. And they can actually predict that if the fan
speed is constant or zero and the CPU temperature is high, that host is going to crash and burn in a few minutes. So what they asked from us is to give them a list of hosts. And they can actually blacklist some of them because they are aware of more information than what
oVirt has. And there are so many other examples which are very similar. For example, Cisco has very similar concepts: they actually want to blacklist some of the hosts because the networking is going to go down there. There are a lot of very similar scenarios in very big enterprises. This is actually highly wanted.
Absolutely, sounds promising. So now you have the power. You can actually do it yourself. Well, my first simple question was, all right, I know about my CPU, my memory. Maybe I know about my networking because we do the networking. The next good thing is if I had the storage IO information or information that had to do
with the quality of my storage, perhaps I could utilize multi-tiering in the storage, or storage-vMotion a virtual machine from one place to the other. That could be nice with this engine.
Thank you very much. Go drink a beer. Thank you.