Lightning fast networking in your virtual machine

Formal Metadata

Lightning fast networking in your virtual machine
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
High speed network communication is challenging on bare metal, and even more so in virtual machines. There we have to deal with expensive I/O instruction emulation, format manipulation, and handing off data through multiple threads, device drivers and virtual switches. Common solutions to the problem rely on hardware support (such as PCI passthrough) to make portions of the NIC directly accessible to the guest operating system, or on specialized drivers (virtio-net, vmxnet, xenfront) built around a device model that is easier to emulate. These solutions can reach 10 Gbit/s and higher speeds (with suitably large frames), one order of magnitude faster than emulated conventional NICs (e.g. Intel e1000). Despite popular belief, NIC emulation is not inherently slow. In this paper we will show how we achieved VM-to-VM throughputs of 4 Mpps and latencies as low as 100 us with only minimal modifications to an e1000 device driver and frontend running on KVM. Our work relies on four main components, which can be applied independently: 1) proper emulation of certain NIC features, such as interrupt mitigation, which greatly contributes to reducing the emulation overhead; 2) modified device drivers that reduce the number of I/O instructions, which are much more expensive on virtual machines than on real hardware; 3) a small extension of the device model, which permits shared-memory communication with the hypervisor without requiring a completely new device driver; 4) a fast network backend (VALE), based on the netmap framework, which can sustain multiple millions of packets per second. With the combination of these techniques, our VM-to-VM throughput (two FreeBSD guests running on top of QEMU-KVM) went from 80 Kpps to almost 1 Mpps using socket based applications, and 4 Mpps with netmap clients running on the guest. Similarly, latency was reduced by more than 5 times, reaching values of less than 100 us.
It is important that these techniques can be applied independently depending on the circumstances. In particular, #1 and #4 modify the hypervisor but do not require any change in the guest operating system. #2 introduces a minuscule change in the guest device driver, but does not touch the hypervisor. #3 relies on device driver and hypervisor changes, but these are limited to a few hundred lines of code, compared to the 3-5 Klines that are necessary to implement a new device driver and its corresponding frontend on the hypervisor.
Good afternoon everyone, my name is Luigi Rizzo. I'm presenting this work on accelerating network I/O in virtual machines. It is something that I have done with my colleague Giuseppe Lettieri and my former student Vincenzo Maffione. The pictures here show, on the left, the view from my office in Pisa and, on the right, the view from where I spent a few months in Mountain View, at Google.
What is this work about? What are our goals here? First of all, we
want to accelerate networking within a virtual machine. Consider that the packet rates that you can see on FreeBSD, for instance, are about a million packets per second, much less than line rate, and this is on bare metal, on real hardware. If you run in a virtual machine, performance is 10 times worse, possibly even more than 10 times worse. So we wanted to find solutions to close the gap between these two performance levels. And we wanted to
accelerate the network not just for bulk TCP connections, because that problem is relatively easy: you have large frames, and you have TSO and hardware offloads that can do the work that would otherwise be done on the CPU. So in a way you can get close to line rate at 1 Gbit/s, possibly even 10 Gbit/s, in a virtual machine using more or less standard techniques, such as paravirtualized device drivers and devices. However, there are more applications that might be interesting to run in virtual machines, such as software routers, middleboxes, and applications that use short-lived connections, and so we would like to have the same level of performance both on real hardware and in the virtual machine. There is another thing that everybody tells you when you approach virtual machines: that device emulation is inherently slow, that you cannot do high-speed I/O with an emulated NIC, and that you really need some sophisticated solution such as virtio or vmxnet. Actually, what we found out is that we get exactly the same results, and possibly slightly better, with an emulated e1000 as with virtio. So you probably don't need a special device. It is interesting that you don't need a special device, because there are cases where perhaps you have a well-established QA story, a well-established code base that involves your standard driver, and you would rather exercise that one than a completely different device driver that you have no way to use on real hardware. Also, from our previous work on netmap and VALE we had a few tricks that we wanted to apply in other environments to see whether they are effective, so this was a nice application for our previous work. The main tools
that we have been using are these. One is netmap, which is something that I presented here last year. It is a framework for doing packet I/O very, very fast. It is a standard part of FreeBSD now, there is a loadable module for Linux, and you can reach line rate on a 10 Gbit/s interface. A follow-up
of this work was the VALE switch, which is basically an in-kernel software Ethernet bridge that uses the same API on its ports as netmap. It was designed as a generic interconnect; however, our goal was to use it to interconnect virtual machines. It is interesting for a number of applications, including testing things that you will later run on a 10 Gbit/s interface using netmap, so you can really stress your application even without the hardware, and possibly at speeds faster than 10 Gbit/s. In addition, VALE is really fast: it runs at up to 20 million packets per second per port, or 70 Gbit/s with larger packets.
The result that I'm presenting today is an accelerated version of QEMU, and matching patches for device drivers for FreeBSD and Linux, that can
achieve over 1 million packets per second when you are using standard socket applications in the guest, that is, guest-to-guest communication; about 6 Gbit/s with 1500-byte frames; and over 5 million packets per second when the clients in the guests are using the netmap API. The changes that we have introduced in the device drivers and in QEMU are really small, on the order of a few hundred lines of code per subsystem. That is also a nice thing, because it is easier to validate a small piece of code than an entirely new device or an entirely new subsystem. The FreeBSD driver patches are about to be committed to the lem device driver, and we have an equivalent thing for Linux. The QEMU patches are probably going to be distributed as a separate tarball from our website.
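The packet rates and bit rates quoted in the talk can be related with a small back-of-the-envelope helper. This is only a sketch: the function name is made up for this example, and the 20 bytes of per-frame overhead (Ethernet preamble plus inter-frame gap) are an assumption of the example, not something stated in the talk.

```python
def packets_per_second(bit_rate, frame_bytes, overhead_bytes=20):
    """Packet rate needed to fill bit_rate bits/s with frames of frame_bytes.

    overhead_bytes models the Ethernet preamble and inter-frame gap
    (assumed to be 20 bytes per frame); pass 0 to count frame bytes only.
    """
    return bit_rate / ((frame_bytes + overhead_bytes) * 8)


if __name__ == "__main__":
    # Minimum-size frames on a 10 Gbit/s link.
    print(f"{packets_per_second(10e9, 64) / 1e6:.2f} Mpps")
    # 6 Gbit/s of 1500-byte frames, counting frame bytes only.
    print(f"{packets_per_second(6e9, 1500, overhead_bytes=0) / 1e3:.0f} Kpps")
```

Under these assumptions a 10 Gbit/s link carries about 14.88 million minimum-size frames per second, while 6 Gbit/s of 1500-byte frames corresponds to about half a million packets per second.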
Just a little bit of background. The netmap framework is an API to send and receive raw frames from user space. It relies heavily on batching. The packet buffers and the descriptor rings are exposed to the application in a shared memory region, and you have a selectable file descriptor for synchronization. It is implemented with very little modification to device drivers and the kernel, and there is a libpcap library that supports pcap applications. So in the best case you don't even need to recompile your application: just load our version of the pcap library and you are using the faster API to send and receive traffic. As
this slide shows, the basic protocol to access the NIC using netmap is this: you open a special file, you get a file descriptor, you issue an ioctl to bind the file descriptor to a given interface, you mmap the shared memory region, and you are ready to send and receive packets, using select or poll for synchronization.
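Why a shared ring plus one synchronization call per batch is cheap can be illustrated with a toy model. The class below is not the netmap API: `ToyTxRing`, its fields and `send_all` are hypothetical names invented for this sketch, and the "system call" is just a counter.

```python
class ToyTxRing:
    """Toy transmit ring shared between an application and a simulated kernel.

    NOT the netmap API: names and behaviour are simplified for illustration.
    """

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.head = 0    # next slot the application will fill
        self.tail = 0    # next slot the simulated kernel will send
        self.syncs = 0   # number of simulated system calls
        self.sent = []   # packets handed to the (imaginary) NIC

    def space(self):
        # Free slots: ring size minus slots filled but not yet sent.
        return len(self.slots) - (self.head - self.tail)

    def put(self, pkt):
        # Application side: fill a slot without entering the kernel.
        if self.space() == 0:
            return False
        self.slots[self.head % len(self.slots)] = pkt
        self.head += 1
        return True

    def sync(self):
        # One simulated system call pushes out everything queued so far.
        self.syncs += 1
        while self.tail < self.head:
            self.sent.append(self.slots[self.tail % len(self.slots)])
            self.tail += 1


def send_all(ring, packets):
    """Queue packets, synchronizing only when the ring is full and at the end."""
    for pkt in packets:
        if not ring.put(pkt):
            ring.sync()
            ring.put(pkt)
    ring.sync()
```

In this toy, sending 1000 packets through a 256-slot ring takes 4 synchronization calls rather than 1000, which is the amortization that batching buys.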
Performance, as I mentioned, is really good. In this graph you see how fast you can send or receive packets using one core: for instance, at less than 1 GHz you are already able to saturate the link, with almost 15 million packets per second. It scales pretty well with multiple cores, up to four; I haven't tried more than that because at that point you are saturating the NIC, so there is not much point in going further. As a comparison, the in-kernel packet generator on Linux can do about 4 million packets per second at the maximum clock speed, and if you are using sockets you are running at about 1 million packets per second per core. A follow-up
to that was to extend the API and the kernel module to implement a switch. Basically the netmap module in FreeBSD is now able to interpret port names of the form valeX:Y as a request to create a virtual bridge named X and a port named Y on that bridge, which runs a learning bridge algorithm. You can create multiple bridges, you can create multiple ports, and you can connect clients using the netmap API to a VALE port in the same way as you connect to a physical NIC with netmap. So you can test an application on a VALE switch and then run it over a physical interface, or these clients could be virtual machines. I mean, the system is extremely flexible. Operation is simple: each incoming packet is passed to one or more destinations, so the throughput depends on how many copies you need to make. In terms of
performance, if you implemented VALE as a standard forwarding loop, processing one packet at a time, you would only reach between 2 and 5 million packets per second, which is already fast. However, by exploiting batching and reducing the accesses to locks, so that you take a lock once per batch rather than once per packet on each destination interface, we managed to reach a throughput of about 18 to 20 million
packets per second with minimum-size frames, and 70 Gbit/s with larger frames. The 70 Gbit/s figure is basically limited by the memory bandwidth of the system used.
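The effect of taking locks once per batch instead of once per packet can be sketched as follows. This is an illustration of the idea only, not VALE's code: the function names are hypothetical, and a lock acquisition is modelled as a simple counter.

```python
from collections import defaultdict


def forward_per_packet(packets):
    """Naive forwarding: take the destination lock once for every packet."""
    lock_acquisitions = 0
    delivered = defaultdict(list)
    for dst, payload in packets:
        lock_acquisitions += 1           # lock(dst) ... unlock(dst)
        delivered[dst].append(payload)
    return lock_acquisitions, delivered


def forward_batched(packets, batch_size):
    """Batched forwarding: group each batch by destination, lock once per group."""
    lock_acquisitions = 0
    delivered = defaultdict(list)
    for start in range(0, len(packets), batch_size):
        by_dst = defaultdict(list)
        for dst, payload in packets[start:start + batch_size]:
            by_dst[dst].append(payload)
        for dst, payloads in by_dst.items():
            lock_acquisitions += 1       # one lock covers the whole group
            delivered[dst].extend(payloads)
    return lock_acquisitions, delivered
```

With 1024 packets spread over 4 destinations and batches of 256, the naive loop takes 1024 locks while the batched one takes 16, and both deliver the same packets in the same per-destination order.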
In this graph I show the performance of the VALE switch, which is the throughput divided by the number of destinations in the case of broadcast. This is the time that is spent per packet, in nanoseconds per packet, depending on the packet size, for the VALE switch and for other solutions, for instance the Linux bridge using tap interfaces. Now, a bit of
background on virtual machines. For those in the audience who know more than me, I apologize for mistakes and inaccuracies in what I am going to say. So, how is a virtual machine implemented in modern systems? Basically, the CPUs that we have these days can run the guest code, the guest operating system, directly on the host CPU, running in a virtualized execution mode. Each virtual CPU appears as a thread to the host operating system, so you can create a guest machine with multiple virtual CPUs, and they are threads sharing the same memory image. There are things that cannot be done in this virtualized mode. More specifically, when you need to access registers, handle interrupts and so on, sometimes you need to exit from this virtual execution mode, jump back to the real execution mode of the CPU, and perform actions that, for instance, emulate the peripheral that you are trying to access. These are called VM exits, and similar things happen on interrupt dispatching. They are actually very, very expensive, much
more expensive than on a real machine. Accessing a register takes probably 100 nanoseconds or a little more on a real machine, but in a virtual machine, due to the VM exit and the further processing, you might spend anything between 5 and 10 microseconds on that single operation. So there is a big gap in performance, especially when you are accessing devices. There might be cases where you have a lot of these register accesses on each interrupt, so if you are not careful in handling the device, or at least in writing the device driver in a way that performs well in a virtual machine, you can get a very large performance hit. Even if you solve the performance problem in the device emulation, there are still possible bottlenecks in the host. For instance, the connection between the virtual machine and a physical interface, or another virtual machine, goes through several stages in the hypervisor and in the kernel of the host operating system. And again, if you are not careful in coding those processing stages, you might end up doing expensive operations, extra data copies, etc.
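The impact of the register access costs quoted above can be made concrete with a small calculation. It is only a sketch: the function name is hypothetical, and the assumption that register accesses are the only per-packet cost (and the choice of 1 or 3 accesses per packet) is made purely for illustration.

```python
def max_packet_rate(accesses_per_packet, access_cost_s):
    """Upper bound on packets/s if register accesses were the only cost."""
    return 1.0 / (accesses_per_packet * access_cost_s)


if __name__ == "__main__":
    costs = [("real machine, ~100 ns", 100e-9),
             ("virtual machine, ~5 us", 5e-6),
             ("virtual machine, ~10 us", 10e-6)]
    for label, cost in costs:
        for n in (1, 3):   # accesses per packet: an assumed, illustrative value
            rate = max_packet_rate(n, cost)
            print(f"{label}, {n} access(es)/packet: {rate / 1e3:.0f} Kpps")
```

With a single access per packet, 5 to 10 microseconds per access caps the rate at 100 to 200 thousand packets per second, against about 10 million at 100 nanoseconds, which is why reducing the number of exits per packet matters so much.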
All of this impacts your performance a lot. Of course, if you start with a device emulation which is very expensive, that is the main bottleneck, and so you might not even see the other performance problems in the architecture. But doing this work we actually solved the device emulation problem first, and then hit a number of subsequent bottlenecks and had to solve them too. So paravirtualized device drivers, or paravirtualized interfaces, have been introduced over time by Xen, by VMware and by QEMU with virtio, to try to solve the problem of an efficient device model that works well under the control of a hypervisor. However, those solve only one part of the problem. There are all the other bottlenecks that I mentioned, which exist and need to be addressed in the hypervisor and in the soft switch that connects the virtual machines among themselves or to the physical NIC. So one of the things that
we did was to demystify the belief that NIC emulation is slow. If you look at paravirtualized devices, you see that the data representation in those devices is not too different from what you have in a physical NIC. What is different is the way you access registers, or the equivalent of registers, to get information on what the current packet to transmit is, or what the current interrupt status is, etc. But that can be addressed without introducing a completely new device model. The real problem that paravirtualized devices try to solve is to reduce the number of VM exits and, as I mentioned, those are related to interrupts and to accesses to I/O registers. So if you have a way to reduce the number of interrupts, and to replace accesses to registers with information that is in shared memory, you can reduce the number of exits and get the same performance even with an emulated e1000, or a Realtek, or whatever. The second thing we did was to improve the throughput of the hypervisor in moving packets between the front end and the back end. The front end is the emulated side of the network device; the back end is whatever is used to connect the hypervisor to the host network. However, even if you have a more performant connection, a more performant back end, you might still have a slow switch inside the host, which is what happens, for instance, if you use FreeBSD bridging or the Linux bridge. So we were forced to also replace the switch with something faster in order to get better performance. Now, the first thing we did was
to look at something that is already implemented in most modern hardware, which is interrupt moderation. Especially on the receive side, instead of sending one interrupt for every packet that you receive, modern hardware tries to enforce a minimum time between interrupts, so that you don't put too much of a dent in the processing of traffic. The problem is that most hypervisors don't really implement this feature. As far as we have seen, it doesn't exist in QEMU and it doesn't exist in VirtualBox. Of course we have no access to the VMware sources, but according to what we have read, moderation is not implemented there either. Implementing moderation is not hard, and the amount of code that it takes is really limited. The real problem is that, in order to implement interrupt moderation, you need to set a timer that says: I am not interrupting you now, I am interrupting you in 20 microseconds, or 50 microseconds, or something like that. That is the order of magnitude of the delays; they are programmable parameters, but basically those are the numbers that you normally use. Quite often you don't have fine-grained timer resolution in the host operating system, and that might be a reason why moderation wasn't implemented in the first place. Anyway, we implemented it and tried to use it. We will see some performance numbers, and we will also show that moderation by itself is not necessarily enough to solve the performance problem. And in
fact, here are the numbers from one set of experiments that we did. The experiments that I am reporting basically use KVM and QEMU as a hypervisor running on top of Linux, because we don't have KVM on FreeBSD, so that we can use the hardware support for virtualization. We have two guests, which run FreeBSD HEAD as of February or so; those are PicoBSD images. They are connected through tap interfaces, as in this particular experiment, or through a VALE switch, as we will see later, and they are running on the same machine, which runs at a bit over 3 GHz. So if we take an unmodified QEMU and try to measure the guest-to-guest throughput, the transmit rate is actually very, very low: about 24 thousand packets per second with one virtual CPU, and with two virtual CPUs you get a packet rate of about 65 thousand packets per second. By implementing interrupt moderation we actually get a bit of improvement, well, a substantial improvement in relative terms, but we are still dealing with very low packet rates in absolute terms: we move up from 24 thousand packets per second with one virtual CPU, and from 65 to 87 thousand with two. So why is the improvement with two virtual CPUs so small? The thing is that without interrupt moderation, when you try to transmit a packet, the guest virtual CPU does a VM exit, transmits the packet and immediately generates an interrupt. So by the time you return control to the guest operating system, you are hit by an interrupt and you have to serve it immediately. That is a source of overhead with one virtual CPU. With two, when one CPU does the exit and return, the other one is able to continue processing, so you have a little bit of parallelism going on, and from that a little bit more batching and performance. So having interrupt
moderation changes the situation for the one-virtual-CPU case, but doesn't change the situation too much for the two-virtual-CPU case. On the receive side we don't see a lot of improvement due to moderation, but this is mostly because of the test configuration that we used: basically the sender itself transmits packets in batches to the other machine, and so even without moderation you are still getting batches of packets on the receive side. One more thing on the receive side is pretty interesting: the receive side is faster than the transmit side, and we didn't even manage to get receive livelock. Instead, using Linux as a guest operating system we actually had some livelock cases, but those really depend on your configuration. The second technique
that we use is called send combining. It is actually a very old thing: I was reading a 2011 paper from VMware where they documented for the first time the techniques that they used in their initial implementation back in the late nineties, and they used a similar technique. The idea is this. Whenever you want to transmit a packet, you write to a register in the NIC to tell the hardware, or the hypervisor if you are in a virtual machine, that you want to send out a packet. That write to the register is very expensive; it is what causes the VM exit. Now, in cases where you are expecting an interrupt for the completion of a previous packet, you can postpone writing to the register for the subsequent transmission requests, and just remember that there are pending transmissions to be sent out. When you get the interrupt, you do the actual write to the register. That batches the writes to the register and reduces the number of VM exits by a significant amount, especially if you have interrupt moderation. Of course, if you don't have interrupt moderation you would get the interrupt immediately after sending. In terms of performance, you see the numbers here. Again, the implementation of send combining requires a very modest amount of code, and it is only in the guest device driver: you don't even need to modify the hypervisor, whereas interrupt moderation only needs modifications to the hypervisor, assuming that the guest operating system supports the feature. These are the numbers that were shown in the previous slide. With send combining alone, without interrupt moderation, you basically have no gain in the case of one virtual CPU, and you have a significant gain with two virtual CPUs. If you have both interrupt moderation and send combining, the speedup in the one-virtual-CPU case is impressive, and two
virtual CPUs go at about the same speed, just because you have reduced the interrupt load by a large factor, and the number of VM exits by a large factor, and so the second CPU in this particular test is doing almost nothing: it is basically just serving interrupts, while the first CPU is doing most of the work. So we have a 10-fold speedup, 15 times in the case of one virtual CPU, and we are 5 times faster with two virtual CPUs, and we are approaching pretty decent packet rates on the transmit side doing this.
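How interrupt moderation and send combining interact on a single virtual CPU can be explored with a small event-driven model. This is a sketch under stated assumptions, not the code or the measurements from the talk: the function name, the cost parameters (1 us to prepare a packet, 5 us per VM exit, 2 us per interrupt) and the simplified rule that every register write schedules one completion interrupt are all invented for this example.

```python
def simulate_tx(n_packets, moderation_us, send_combining,
                t_pkt_us=1.0, t_exit_us=5.0, t_irq_us=2.0):
    """Toy single-vCPU transmit loop; returns exits, interrupts and throughput."""
    now = 0.0
    produced = exits = irqs = pending = 0
    irq_due = None                  # time of the next completion interrupt, if any
    last_irq = float("-inf")
    while True:
        if irq_due is not None and irq_due <= now:
            # Completion interrupt: serve it, then flush postponed transmissions.
            irqs += 1
            last_irq = now
            now += t_irq_us
            irq_due = None
            if pending:
                exits += 1          # one register write covers all pending packets
                now += t_exit_us
                pending = 0
                irq_due = max(now, last_irq + moderation_us)
        elif produced < n_packets:
            now += t_pkt_us         # guest prepares one packet
            produced += 1
            if send_combining and irq_due is not None:
                pending += 1        # an interrupt is expected: postpone the write
            else:
                exits += 1          # write the transmit register: VM exit
                now += t_exit_us
                if irq_due is None:
                    irq_due = max(now, last_irq + moderation_us)
        elif irq_due is not None:
            now = irq_due           # nothing left to send: wait for the interrupt
        else:
            break
    return {"exits": exits, "irqs": irqs, "pkts_per_us": n_packets / now}


if __name__ == "__main__":
    for mod, comb in [(0, False), (0, True), (50, False), (50, True)]:
        print(f"moderation={mod} us, send combining={comb}:",
              simulate_tx(1000, mod, comb))
```

In this toy model, send combining without moderation changes nothing, because the interrupt arrives right after each send and there is never anything to combine. Moderation alone removes interrupts but not exits. Only the combination cuts the number of exits sharply, which is qualitatively the pattern described for the one-virtual-CPU case.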
Send combining only works on the transmit side, so the next thing we tried to implement was paravirtualization. The idea of paravirtualization is to reduce the number of VM exits by making the host and the guest communicate through shared memory, instead of communicating through interrupts and writes to registers. Of course, in order to communicate through shared memory, given that you have no synchronization mechanism, you need both entities to be active at the same time. And you cannot afford to have a thread in the host or in the guest always running and polling the status of the shared memory, because that would be too expensive. So the way it works is that you start from an initial state where both the guest and the host are idle, and then, whenever one of the two entities wants to start the communication, it sends a message, which is called a kick. A kick from the guest is typically sent by writing to a register, because the register write causes a VM exit, so you transfer control to the host and it can do operations in the context of the kick. A kick in the other direction is typically sent through an interrupt, because that is the way the host communicates with the guest operating system, to start, for instance, a polling thread of some kind. So what we implemented to do the paravirtualization of the e1000 device was to modify the hypervisor so that a write to the TDT register (that is the name of the register on the e1000) is interpreted as a kick by the hypervisor, and interrupts are interpreted as kicks by the guest operating system. There is a shared memory region that is used to exchange information; we call it the communication status block, CSB, and basically it contains a couple of pointers, or indexes, for each direction of the communication. So after the kick, for instance if you are transmitting data, the guest will write to a shadow register, which is just
a memory location which reflects the value of the transmit the register on the transmit drink and the hostel with all of the content of the shared memory location to see if there are more products to be transmitted over as long as there are pockets new but has to be transmitted there is no need to write register in order to send them out of the polling group on the hostel with vastly pocket from the buffers and and send them to the back and whatever it is and will notify completion to another version registered in the In the CSP and of course there is already an implicit notification status beats in the in the in the range of descriptors that is used by and the same happens in the other direction when you have a new product coming in and everything is I don't use and any directed the gas that starts processing the data but instead of reading of from the status register if any to to get information on whether or not there are more packets it with just get information from the single or from the CSP and this way he doesn't need to access registers and also when the gas on the receive side freeze laughter and returns to to the Nikkei to perform more receptions it doesn't do it through the registers but that just as the information in the CSP again in order at very small change to to bulleted PS and they also side above 100 lines each and the
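The kick-and-poll scheme on the transmit side can be sketched as a small single-threaded simulation. This is only an illustration of the idea, not the actual driver or hypervisor code: the class and field names (`CSB`, `guest_tdt`, `host_tdh`, `host_polling`) are made up for the example, and the memory barriers and the re-check needed to avoid lost kicks on a real multi-core system are left out.

```python
from dataclasses import dataclass, field

RING_SIZE = 8  # illustrative; real rings are larger


@dataclass
class CSB:
    """Shared-memory block; all field names here are hypothetical."""
    guest_tdt: int = 0           # shadow transmit tail, written by the guest
    host_tdh: int = 0            # shadow transmit head, written by the host
    host_polling: bool = False   # assumption: lets the guest skip the kick


@dataclass
class Host:
    csb: CSB
    ring: list
    sent: list = field(default_factory=list)

    def on_tdt_write(self):
        """Runs when a guest write to the real TDT register traps (the kick)."""
        self.csb.host_polling = True

    def poll(self):
        """One pass of the polling loop: drain the ring, or go idle if empty."""
        if self.csb.host_tdh == self.csb.guest_tdt:
            self.csb.host_polling = False   # nothing pending: stop polling
            return 0
        drained = 0
        while self.csb.host_tdh != self.csb.guest_tdt:
            self.sent.append(self.ring[self.csb.host_tdh % RING_SIZE])
            self.csb.host_tdh += 1          # completion becomes visible to the guest
            drained += 1
        return drained


@dataclass
class Guest:
    csb: CSB
    ring: list
    host: Host
    register_writes: int = 0  # each one would cost a VM exit

    def transmit(self, packet):
        if self.csb.guest_tdt - self.csb.host_tdh == RING_SIZE:
            raise BufferError("transmit ring full")
        self.ring[self.csb.guest_tdt % RING_SIZE] = packet
        self.csb.guest_tdt += 1             # plain memory write, no trap
        if not self.csb.host_polling:
            self.register_writes += 1       # kick: trapped write to TDT
            self.host.on_tdt_write()


def demo():
    csb, ring = CSB(), [None] * RING_SIZE
    host = Host(csb, ring)
    guest = Guest(csb, ring, host)
    for i in range(4):
        guest.transmit(f"pkt{i}")   # only the first one kicks
    host.poll()
    for i in range(4, 6):
        guest.transmit(f"pkt{i}")   # host still polling: no kick
    host.poll()
    host.poll()                     # ring empty: host goes idle
    guest.transmit("pkt6")          # kicks again
    host.poll()
    return guest.register_writes, host.sent
```

In this run seven packets go out with only two trapped register writes, which is the effect the shared-memory scheme is after: the register is touched only when the other side has gone idle.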
Performance gains are also interesting in this case. When we use paravirtualization we do not need any interrupt moderation or send combining, and you see that we approach half a million packets per second, with one vCPU as well as with two. So now we get to a level of performance which is perfectly equivalent to that of virtio, using this more or less standard e1000 device.

How can we go faster than this? This is basically the throughput of the switch in the host, a bridge or whatever, using the tap interfaces as the communication channel. So we need to improve that part of the system, and that part of the system involves using a faster switch. The VALE switch is a software switch for interconnecting virtual machines, which is a perfect thing to use in this particular case. All we needed to do was to write a backend for QEMU to talk to the VALE switch instead of the tap interfaces or the other mechanisms that QEMU has. The QEMU code is quite modular, so it was not a difficult task: about 350 lines of code. Another advantage of this approach is that we can connect virtual machines directly to a NIC using the netmap API, so we can do almost line rate without too much effort.

However, just improving the switch did not get us much of a performance improvement, because in fact the path through the hypervisor was the limiting factor. This figure shows you what happens in the communication between the guest and the software switch. The original QEMU implementation consumed (these are amortized times per packet) about 100 nanoseconds to access the descriptors and buffers supplied by the guest operating system, then another 80 nanoseconds to copy the data to the backend, and another 500 nanoseconds per system call, because using the tap interface you can only send one packet per system call.

So we had to clean up this data path to get better performance. That was done by noting that, for instance, part of these hundred nanoseconds was due to the fact that for every access to a descriptor there was a call to the routine that maps guest physical addresses into host virtual addresses. Now, this mapping stays there forever, until the guest machine reboots or migrates somewhere else, so there is absolutely no need to repeat the check every time. So we just cached the result and reduced this time by four times. Then the data copy was done using more or less memcpy, which is quite slow, so we replaced that with an optimized copy routine which uses the same trick that we use in netmap: instead of trying to copy exactly the number of bytes that you have, like 60, 65 or some other number, we just round the number up to a multiple of 64, which makes the entire process more efficient, and so we reduced this time from 80 to 40 nanoseconds. And by replacing the tap backend with the VALE one we got an amortized time of about 15 nanoseconds for the last part.

So now the interconnection between the guest and the switch is a lot faster, and we were able to push packets up to this point at about 10 million packets per second. Then, going into the switch, we have of course a slight reduction in performance, but it is still pretty fast.
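The two data-path optimizations, caching the guest-to-host address translation and rounding the copy length up to a multiple of 64 bytes, can be sketched as follows. This is a minimal illustration in Python, not the code used in QEMU or netmap: `TranslationCache`, `slow_translate` and `padded_copy` are hypothetical names, the 4096-byte page size is an assumption, and in Python the rounding brings no speed-up by itself (in native code the gain would come from copying fixed-size chunks with no tail handling).

```python
CACHE_LINE = 64  # bytes; the rounding unit mentioned in the talk


def round_up(length, unit=CACHE_LINE):
    """Smallest multiple of `unit` that is >= length."""
    return (length + unit - 1) // unit * unit


def padded_copy(dst, src, length):
    """Copy `length` bytes rounded up to whole 64-byte chunks.

    Both buffers must be at least round_up(length) bytes long, so this
    only suits fixed-size packet buffers, where copying a few extra
    bytes past the end of the packet is harmless.
    """
    n = round_up(length)
    if n > len(src) or n > len(dst):
        raise ValueError("buffers must hold a whole number of chunks")
    for off in range(0, n, CACHE_LINE):
        dst[off:off + CACHE_LINE] = src[off:off + CACHE_LINE]
    return n


class TranslationCache:
    """Memoizes guest-physical to host-virtual lookups, one entry per page."""
    PAGE = 4096  # assumed page size

    def __init__(self, slow_translate):
        self._slow = slow_translate   # hypothetical expensive lookup routine
        self._pages = {}
        self.slow_calls = 0

    def translate(self, gpa):
        page, offset = divmod(gpa, self.PAGE)
        if page not in self._pages:
            self.slow_calls += 1
            self._pages[page] = self._slow(page * self.PAGE)
        return self._pages[page] + offset

    def invalidate(self):
        """Call when the guest memory map changes (reboot, migration)."""
        self._pages.clear()
```

For example, `round_up(65)` gives 128, so a 65-byte packet is copied as two full chunks, and repeated `translate` calls for addresses in the same page reach the slow routine only once.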
Overall, this is almost the final table that I am showing. Here you can see the performance using the Linux bridge as the switch connecting the virtual machines, in the base configuration. You see that we started from about 24 thousand packets per second in the worst case, with one vCPU, and we got to between 300 and 400 thousand packets per second, depending on the type of optimization that we implement.

This number here, ITR, is the delay that we use in the interrupt moderation, and it is in microseconds, I believe. So this is a delay of one microsecond. Of course these delays are nominal: when you implement them in the hypervisor, on top of an operating system, there is some granularity in the timers, so with a nominal one microsecond you probably have much larger delays. The problem is that by increasing the interrupt moderation delay you get improved performance, but it does impact the latency of your path, so it might be something that you do not want to do if you want low latency or have to put up with small window sizes.

And here is the situation using the VALE switch as a backend, with the improvements that we included in the system. You see that we almost doubled the peak performance, which is a million packets per second in the case of paravirtualization, but even without paravirtualization we are getting pretty close, between 800 and 900 thousand. Again, this test is done between two FreeBSD guests using standard FreeBSD programs, the ones in tools/tools/netrate, for sending and receiving UDP packets, and these numbers are for 64-byte packets. With 1500-byte frames we do about half a million packets per second, which means about six Gbit/s, I think.

Now, what happens if we use netmap within the guest instead of a socket-based application to send and receive? Of course, in order to use netmap in the guest you need a netmap-capable NIC. Fortunately the e1000 is one of the NICs that are supported by netmap, so it was just a matter of running the test and seeing how fast it went. The rate that we reached was a few million packets per second, guest to guest, so it is pretty fast. In terms of absolute throughput, with 1500-byte packets we do about 25 Gbit/s on the receive side and slightly faster on the sender. So I think this is the kind of performance that is really comparable with what we can achieve on bare metal. Of course not on the e1000, which is a 1 Gbit/s interface, but for instance on the 10 Gbit/s interfaces we are pretty close to this.
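The relation between packet rates and bit rates quoted in the talk is simple arithmetic, and a few helper functions make it easy to check. This is a sketch with made-up function names; it counts only the frame bytes and ignores Ethernet preamble and inter-frame gap, which a wire-level calculation would add.

```python
def gbit_per_s(packets_per_s, frame_bytes):
    """Bit rate in Gbit/s for a given packet rate and frame size."""
    return packets_per_s * frame_bytes * 8 / 1e9


def packets_per_s(gbit, frame_bytes):
    """Packet rate needed to carry `gbit` Gbit/s with frames of this size."""
    return gbit * 1e9 / (frame_bytes * 8)


def ns_per_packet(pps):
    """Time budget per packet, in nanoseconds, at a given packet rate."""
    return 1e9 / pps


# Half a million 1500-byte frames per second:
socket_rate = gbit_per_s(500_000, 1500)        # 6.0 Gbit/s

# A million 64-byte packets per second is a far smaller bit rate:
small_packet_rate = gbit_per_s(1_000_000, 64)  # 0.512 Gbit/s

# 25 Gbit/s with 1500-byte packets, expressed as a packet rate:
netmap_pps = packets_per_s(25, 1500)           # about 2.08 million

# At one million packets per second each packet has a 1000 ns budget:
budget = ns_per_packet(1_000_000)
```

The small-packet case is why results are usually quoted in packets per second: with 64-byte packets the per-packet cost dominates, and even a million packets per second is only about half a gigabit.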
As for the status: there are sets of changes in both the hypervisor and the guest operating systems. We have patches for the e1000 device on FreeBSD; they are going to be committed soon, and we are in talks about getting them included upstream. There are also changes for another e1000 driver [unclear]. Since the mechanism for paravirtualization is completely general, we are actually using the same data structure to paravirtualize a second emulated device, which is particularly interesting because not every hypervisor provides an emulation of the e1000.

On the hypervisor side, we have a QEMU backend for netmap that was sent to the QEMU list a few months ago. We are waiting to see whether it gets accepted or rejected or whatever, but anyway we can surely include it as a port if it does not become part of QEMU. The changes that we are making on the hypervisor are completely general, so it is feasible to write backends for VirtualBox if you do not want to run QEMU, or to produce a solution that has support for paravirtualization in another hypervisor. On the host side there is not really a change to the operating system: all you need is to load the netmap kernel module, which is just a matter of recompiling it, on FreeBSD and on Linux.

So, a conclusion for this thing is that I believe we have reached a point where we can get about the same network performance on the virtual machine as on the real hardware. That is great because, for instance, if you want to test optimizations of the protocol stack, now you can test them on a virtual machine without having to worry about the performance of the final part of the path, the hypervisor and the switch. I hope that this tool will help us improve the network stack.

This work had contributions from my students, listed here, and funding from some European projects and also from companies that supported the work and my stay at [unclear]. I would like to conclude with a few comments
on the status of netmap and the VALE switch since last year.

Last summer, in August, I implemented a userspace version of ipfw and dummynet, which talks to netmap interfaces rather than being embedded in the kernel. The performance of this thing is pretty good: using a single CPU it does about 6 million packets per second, and there are reports from users who said that you can do about ten million packets per second between two physical interfaces, connecting the userspace ipfw to the two physical interfaces. When you use dummynet there is an additional cost involved, and that reduces the performance to between 2 and 3 million packets per second, but that is still much faster than the in-kernel version.

Another thing that we implemented, in February, was a transparent mode for netmap. One of the issues with netmap is that your application grabs the interface and disconnects it from the host stack, so the only way for traffic to reach the host stack is for your application to re-inject it. To make up for this we introduced the transparent mode: an application using netmap has a chance to see the packets and mark those that should be intercepted by the application itself, and all the others are automatically forwarded to the host stack. The same goes for the other direction. That makes the behavior of netmap a lot more similar to what you have with BPF, where you get a chance to filter packets, which might be interesting in some cases.

Michio Honda, a colleague from NEC Europe, started working on netmap recently, and in April implemented the feature that allows you to attach network interfaces to a VALE switch. So basically now you have the same capabilities that you have with the in-kernel bridge: you can attach interfaces to a switch, or to the host stack, and move traffic between ports completely transparently, without any userspace process doing the switching for you. This should be committed to FreeBSD shortly.

Ongoing work is to use the NIC offload features in the VALE switch. Initially this would be a Linux-only thing, because the in-kernel version of this feature only runs on Linux at the moment. And I have been discussing with some of you the option of supporting a userspace network stack together with netmap. That is useful for a number of things, for instance implementing a software version of segmentation, fragmentation and reassembly, if you want to switch traffic between ports with different MTUs.

So that is all for now, but if you have questions, go ahead.
No, absolutely not. In general I try to avoid using features that are specific to one operating system or one hypervisor, because that makes my work more [unclear]. No, I don't have that number. Yes, definitely.

One thing I have to say, and I don't have the figures here, is that in terms of latency one thing that hurts you is the fact that when you process a packet in QEMU you need to hand the packet back to the userspace thread in order to communicate with the backend. Some solutions do an optimization here and send packets directly to the network stack, or whatever it is, without going through the userspace process, and that is the way one should do things in order to reduce the delay. [unclear]

So for instance interrupt moderation does not need any change in the guest, because the driver typically already supports moderation. The kind of gains that you can get here are not much: in this case you could get up to 140 thousand starting from the original 65, which is not a lot. It is a factor of two, and of course when you are used to seeing numbers that are a much bigger factor, that seems small. But the best thing is when you can combine interrupt moderation with send combining, at least on the transmit side, or you can do paravirtualization, but that of course requires some changes on the guest.

No, my point is that, assuming you have the ability to change a little bit on the guest side, it might be easier to add a hundred lines of code than to write an entirely new device driver. [unclear] Well, probably there are other bottlenecks in the system. [unclear]

