Lightning fast networking in your virtual machine

Transcript
Good afternoon everyone. My name is Luigi Rizzo, and I'm presenting this work on accelerating network I/O in virtual machines, something done with my colleague Giuseppe Lettieri and our former student Vincenzo Maffione. The pictures you see here: on the left is the view from my office, and Vincenzo, on the right, spent a few months in Mountain View at Google.
What is this work about, and what are our goals? First of all, we want to accelerate networking within a virtual machine. Consider the rates that you can reach on the bare hardware, on FreeBSD for instance: if you run the same test in a virtual machine, performance is ten times worse, possibly even more than ten times worse, than on the real hardware. So we wanted to find solutions to close the gap between these two performance levels. We wanted to
accelerate the network not just for bulk TCP connections, because that problem is relatively easy: with large frames you have segmentation offload and other hardware features that can do the work that would otherwise be done by the CPU, so you can get close to line rate at 10 Gbit/s, possibly even in a virtual machine, using more or less standard techniques, perhaps paravirtualized device drivers and devices. However, there are other applications that might be interesting to run in virtual machines, such as software routers, middleboxes, or applications that use lots of short-lived connections, and we would like to have the same level of performance both on the real hardware and in the virtual machine. There is another thing everybody assumes when approaching virtual machines: device emulation is inherently slow, you supposedly cannot do high-speed I/O with an emulated NIC, and you really need sophisticated solutions in the form of virtio, vmxnet and the like. Actually, what we found out is that we get exactly the same results, possibly slightly better, with an emulated e1000, so you probably don't need a special device. That is interesting, because there are cases where you have a well-established, well-tested code base around your standard driver, and you'd rather exercise that one than a completely different device driver that you have no way to use on real hardware. Also, in our previous work on netmap we developed a few tricks that we wanted to apply in virtualized environments and see if they are effective, so this was a nice application of our previous work. The main tools
that we have been using: the first is netmap, which I presented here last year. It is a framework for doing packet I/O very fast; it is a standard part of FreeBSD now, there is a loadable module for Linux, and it can run at line rate on a 10 Gbit/s interface. A follow-up
of this work was the VALE switch, which is basically a software switch. It uses the same netmap API but was designed as a generic interconnect; our goal was to use it to interconnect virtual machines. It is interesting for a number of applications, including testing things that you will later run on a 10 Gbit/s interface: using VALE you can really stress your application even without hardware, possibly even faster than a 10 Gbit/s interface, because VALE can forward up to 20 million packets per second per port, or 70 Gbit/s with larger packets.
The result I'm presenting today is an accelerated version of QEMU, and matching patches for the device drivers in the guest, which can achieve over a million packets per second with standard socket applications in the guest (this is guest-to-guest communication), about 6 Gbit/s with larger frames, and over 5 million packets per second when the clients use the netmap API. The changes that we have introduced in the device drivers and in QEMU are really small, on the order of a few hundred lines of code per subsystem, and that is also a nice thing, because it is easier to validate a small piece of code than an entirely new device or an entirely new subsystem. The FreeBSD driver patches are about to be committed to the em device driver, we have an equivalent thing for Linux, and the QEMU patches are probably going to be distributed separately from our website.
Just a little bit of background: netmap is an API to send and receive frames from user space. It relies heavily on batching, on preallocated packet buffers, and on descriptor rings that are exposed to the application through shared memory, and you have a selectable file descriptor for synchronization. This means it can be implemented with very little modification to device drivers and to the OS. There is also a libpcap emulation library that supports existing applications, so in the best case you don't even need to recompile the application: just preload our version of the pcap library and you are often using the faster path to send and receive traffic. As
this slide shows, the basic protocol to access the NIC using netmap is: you open the special file /dev/netmap and get a file descriptor, then an ioctl binds the descriptor to a given interface (or something else, as we will see), you mmap the shared memory region, and you're ready to send and receive packets, using select or poll for synchronization.
Performance, as I mentioned, is really good. In this graph you see how fast you can send or receive packets using one core: for instance, at less than 1 GHz you are already able to saturate a 10 Gbit/s link with almost 15 million packets per second, and it scales pretty well with multiple cores, up to four. I haven't tried more than that, because at that point you are already saturating the link, so there's not much of a point in going further. As a comparison, a highly optimized packet generator bypassing the socket layer can do about 2 to 4 million packets per second at the maximum clock speed of about 3 GHz, and using sockets you're running at about one million packets per second or below. A follow-up
to that was to extend the API and the kernel module to implement VALE. Basically, the netmap kernel module is now able to interpret port names of the form valeX:Y as a request to create a virtual switch named X with a port named Y, attach to it, and then run a learning-bridge algorithm on it. You can create multiple switches with multiple ports each, and you can connect clients, using the netmap API, to a VALE port in the same way as you connect to a physical network interface. So you can test an application on a VALE switch and then run it over a physical interface. The attached blocks can be virtual machines or ordinary processes; the system is extremely flexible. On each incoming packet, the forwarding decision selects one or more destinations, so the throughput depends on how many copies of the packet need to be made. In terms of
performance: if you implement VALE in a naive way, with one-packet-at-a-time processing, by the time you are done you only reach between 2 and 5 million packets per second, which is already fast. However, by exploiting batching, and by amortizing the access to the locks (so that you acquire the lock once per batch for each destination interface), we managed to reach a throughput of about 18 to 20 million packets per second with minimum-size frames, and 70 Gbit/s with larger frames; the 70 Gbit/s figure is basically limited by the memory bandwidth of the system we used.
This graph shows the performance of the VALE switch: it is the throughput divided by the number of destinations in the case of broadcast, that is, the time spent per packet, in nanoseconds, depending on the batch size, for the VALE switch and, for comparison, for the equivalent solution, which is the Linux in-kernel bridge using tap interfaces. Now, here is a bit of
background on virtual machines; for the people in the audience who know more about this than me, I apologize for mistakes and imprecision in what I say. How is a virtual machine implemented in modern systems? Basically, the CPUs that we have these days can run the guest operating system natively most of the time, and each virtual CPU appears as a thread to the host operating system, so you can even create a guest machine with more virtual CPUs than physical cores, and they are scheduled like threads sharing the same memory image. There are, however, things that cannot be done by the guest directly: when it needs to access device registers, handle interrupts, and so on, it sometimes needs to exit from this virtualized execution mode, jump back to the real execution mode of the CPU, and perform the actions related to the peripheral it is trying to access. These are the VM exits, and similar things happen on interrupt dispatching; they are very, very expensive, much more expensive than on a real machine. Accessing a register costs perhaps 100 nanoseconds or a little more on a real machine, but on a virtual machine, due to the VM exit and the processing in the hypervisor, you might spend anything between 5 and 10 microseconds on that single operation. So there is a big gap in performance, and especially when you are accessing devices there might be cases where you have a lot of these register accesses on each packet. So if you are not careful in emulating the device, or at least in writing the device driver in a way that performs well inside a virtual machine, you cannot get very high performance. And even if you solve the performance of the device emulation, there are still possible bottlenecks in the hypervisor and in the host: the connection between the virtual machine and a physical interface, or another virtual machine, goes through several stages in the hypervisor and in the kernel of the host operating system, and again, if you are not careful in coding those processing stages, you might end up doing these expensive operations, or extra copies, and so on,
which impact your performance a lot. Of course, if you start with a device driver and emulation which are very expensive, that is the main bottleneck, and you might not even see the other inefficiencies of the architecture. In doing this work we actually solved the device emulation problem first, and then hit the subsequent bottlenecks one by one. Paravirtualized device drivers, such as the various ones that have been introduced over time by VMware, by Xen, and by virtio, try to solve the problem of an efficient device model that works well under the control of a hypervisor. However, those solve only one part of the problem: all the other bottlenecks I mentioned still exist, in the hypervisor and in the software switch that connects the virtual machines among themselves and with the physical network. One of the things that
we did was to demystify the need for paravirtualized devices, because if one looks at, say, virtio devices, you see that the internal representation of those devices is not too different from what you have in a physical NIC. What is different is the way you access registers, or the equivalent of registers, to get information on what is the current transmit tail, or the current interrupt status, and so on. But that can be addressed without introducing a completely new device model. The real problem that paravirtualized devices try to solve is to reduce the number of VM exits, and as I mentioned, those are related to interrupts and to accesses to I/O registers. So, if you have a way to reduce the number of interrupts, and to replace register accesses with information kept in shared memory, you can reduce the number of exits and get the same performance even with an emulated e1000 or Realtek device. The second thing we did was to improve the throughput of the hypervisor in moving packets between the frontend and the backend; the frontend is the emulated side of the network device, the backend is whatever is used to connect the hypervisor to the host network. However, even if you have a more performant frontend and a more performant backend, you may still have a slow switch inside the host, which is what happens, for instance, if you use the FreeBSD bridge, or the Linux bridge with tap interfaces. So we were also forced to replace this switch with something faster to get better performance. The first thing we did was
looking for something that is already implemented in most modern hardware: interrupt moderation. Instead of sending one interrupt for every packet that you receive or transmit, moderation tries to enforce a minimum interval between interrupts, so that interrupt processing does not put too much of a dent in the processing of traffic. The problem is that most hypervisors don't really implement this feature that exists in real NICs: it doesn't exist in QEMU, it doesn't exist in VirtualBox, and of course we have no access to VMware's code, but from what we could measure, moderation does not seem to be implemented there either. Implementing interrupt moderation is not hard at all in terms of the amount of code it takes. The one problem is that in order to implement moderation you need to set timers: the device guarantees that it will not interrupt you again for the next 20 microseconds, or 50 microseconds, or something like that; those are the orders of magnitude of the delays (they are configurable through driver parameters, but basically those are the numbers you normally use). Now, quite often there is no fine-grained timer resolution in the host operating system, and that might be the reason why moderation wasn't implemented in the first place. Anyway, we implemented it and tried to use it; we will see some performance numbers, and we will also see that moderation by itself does not solve the performance problem. In
fact, here are the numbers from one set of experiments. The experiments I'm reporting basically use QEMU with KVM as the hypervisor, running on top of Linux, because we don't have KVM on FreeBSD and we wanted to reuse the hardware support for virtualization. The guests are FreeBSD HEAD as of February, run from FreeBSD images, and they are connected through tap interfaces and the in-kernel bridge, as in this particular experiment, or through a VALE switch, as we will see later; everything runs on the same host, a multicore machine at about 3 GHz. So, if we take an unmodified QEMU and try to measure guest-to-guest throughput, the transmit rate is actually very, very low: about 24 thousand packets per second with one virtual CPU, and with two virtual CPUs you get a packet rate of about 65 thousand packets per second. By implementing interrupt moderation we actually get a bit of improvement, a substantial relative improvement, but we are still dealing with very low rates in absolute terms: we move only a little above 24 thousand packets per second with one virtual CPU, and from 65 to 87 thousand with two. Why is the improvement with one virtual CPU so small? If we dig a little deeper into interrupt moderation: when the guest tries to transmit a packet, the guest virtual CPU does a VM exit, the hypervisor transmits the packet and immediately generates an interrupt, so by the time control returns to the guest operating system you are hit by the interrupt and have to serve it immediately. That is a source of overhead which is most visible with one virtual CPU: when you have two, while one does the exit and return, the other one is able to continue processing, so you have a little bit of pipelining going on, and that buys a little more batching and performance. So, having interrupt
moderation changes the situation for one of the two cases, but it doesn't change the situation too much for the other. On the receive side you see a lot of improvement even without moderation, but this is mostly because of the test configuration that we used: basically the sender itself transmits packets in batches to the other machine, so even without interrupt moderation you are still getting batches of packets on the receive side, and just one interrupt per batch rather than one per packet. One thing we found pretty interesting is that the receive side is faster than the transmit side; we didn't even manage to trigger receive livelock when using Linux as the guest operating system, though we actually did hit some livelock cases, but those really depend on the particular configuration. The second technique
that we used is called send combining. It's actually a very old thing: I was reading a 2011 paper from VMware where they documented, for the first time, the techniques that they used in their initial products back in the late nineties, and they used a similar technique. The idea is that whenever you want to transmit a packet, you write to a register in the NIC to tell the device, or the hypervisor if you are in a virtual machine, that you want to send out a packet; and that write to the register is very expensive, because it is what causes the VM exit. Now, in cases where you are expecting an interrupt for the completion of a previous packet, you can postpone writing to the register for the subsequent transmit requests: you just remember that there are pending transmissions to be sent out, and when you get the interrupt you do the actual write to the register. This batches the writes to the register and reduces the number of VM exits by a significant amount, especially if you have interrupt moderation; of course, if you don't have interrupt moderation, you get the interrupt almost immediately after sending the first packet. In terms of implementation, send combining requires a very modest amount of code, and it lives only in the guest device driver: you don't even need to modify the hypervisor. Interrupt moderation, conversely, only needs to modify the hypervisor, assuming the guest driver supports the feature, which it normally does because moderation exists in real hardware. Looking again at the numbers shown in the previous slide: with send combining alone, without interrupt moderation, you basically have no gain in the one-vCPU case, but a significant gain with two vCPUs; and if you have both interrupt moderation and send combining, the speedup in the one-vCPU case is impressive, and two
vCPUs go at about the same speed, just because you have reduced the load, and the number of VM exits, by a large factor, so the second virtual CPU in this particular test is doing almost nothing: basically it is just serving interrupts while the first one is doing most of the work. So we have a ten-fold speedup in the case of two virtual CPUs, a fifteen-times speedup in the case of one virtual CPU, and we are approaching pretty decent packet rates on the transmit side. Send
combining only helps the transmit side, though, so the next thing we tried to implement was paravirtualization of the e1000. Paravirtualization reduces the number of VM exits by making the hypervisor and the guest communicate through shared memory, instead of communicating through interrupts and writes to registers. Of course, to communicate through shared memory, given that you have no notification mechanism there, you need both entities to be active at the same time; and you cannot afford to have a thread on the host, or in the guest, always running and polling the state of the shared memory, because that would be too expensive. So the way it works is that you start from an initial state where both the guest and the host are idle, and whenever one of the two entities wants to start a communication it sends a notification, which is called a kick. A kick from the guest OS is typically sent by writing to a register, because the register write causes a VM exit, so you transfer control to the host, which can operate in the context of the kick; a kick in the other direction is typically sent through an interrupt, because that is the way the hypervisor communicates with the guest operating system and wakes it up. So what we implemented, to do the paravirtualization of the e1000 devices, was to modify the hypervisor so that the write to the TDT register (TDT is the name of the transmit-tail register on the e1000) is interpreted as a kick by the hypervisor, and interrupts are interpreted as kicks by the guest operating system. The shared memory region used to exchange information we call the Communication Status Block, CSB; basically it contains a couple of indexes for each direction of the communication. After the kick, if for instance the guest is producing data, it will write to a shadow register, which is simply
a memory location that reflects the value of the transmit-tail register of the transmit ring; and the host will poll the content of that shared memory location to see if there are more packets to be transmitted. As long as there are new packets to be transmitted, there is no need to write the register to send them out: the polling loop on the host will fetch the packets from the buffers, send them to the backend, whatever it is, and notify completions through another shadow register in the CSB; and of course there is also the implicit notification through the status bits in the ring of descriptors used by the device. The same happens in the other direction: when a new packet comes in and everything is idle, the host sends an interrupt and the guest starts processing the data, but instead of reading the status register to find out whether or not there are more packets, it just gets the information from the CSB, and this way it doesn't need to access registers. Also, when the guest on the receive side frees buffers and returns them to the NIC to allow more receptions, it doesn't do it through the registers: it just writes the information in the CSB. Again, this is a very small change to both the hypervisor and the guest side, about 100 lines each. The
performance gains are also interesting: in this case, in order to use paravirtualization, we don't need interrupt moderation or send combining, and you see that we approach half a million packets per second, both with one and with two virtual CPUs. So now we get to a level of performance which is perfectly equivalent to that of virtio, while using a more or less standard e1000 device model. How can we go faster than this? This is basically the throughput of the host bridge using tap interfaces as the communication channel, so we need to improve that part of the system. That part of the system
involves using a faster interconnect between the virtual machines, and the VALE switch is a perfect thing to use in this particular case. All we needed to do was to write a backend for QEMU to talk to the VALE switch instead of the tap interfaces or the other mechanisms that QEMU has; QEMU's backend code is quite modular, so it wasn't a difficult task, about 350 lines of code. Another advantage of this approach is that we can connect netmap clients directly to the VALE switch using the netmap API, so we can do load generation and measurement without too much effort. However, just improving the switch didn't buy us much performance, because at that point the bottleneck was QEMU itself. This figure shows what happens in the communication between the guest and the software switch: the original QEMU implementation consumed, in amortized terms, several hundred nanoseconds to copy the data from the buffers supplied by the guest operating system into the frontend, then about 800 nanoseconds to transfer them to the backend, and another 500 nanoseconds per system call, because with the tap interface you can only send one packet per system call. So we had to clean up these stages to get better performance. That was done, for instance, by noting that part of that copy time was due to the fact that for every access to a descriptor there was a call to the routine that translates guest physical addresses into host virtual addresses. Now, this mapping stays the same until the virtual machine migrates somewhere else, so there is absolutely no need to repeat the lookup every time: we just cache the result, and that reduced this time by a factor of four. Then, the data copy was done using a generic memcpy, which is quite slow, so
we replaced it with an optimized copy routine, which uses the same trick that we use in netmap: instead of trying to copy exactly the number of bytes in the packet, say 60 or 65 or some other odd number, we just round the length up to a multiple of 64, a cache line, which makes the entire process more efficient and substantially reduced the copy time. Then, by replacing the tap backend with the VALE backend, which can batch packets through a single system call, the last stage went down to an amortized time of a few tens of nanoseconds. So now the interconnection between the guest and the switch is a lot faster: we are able to push packets up to this point at about 10 million packets per second, and then going into the switch there is of course some reduction in performance, but it is still pretty fast
overall. This is almost the final table that I can show you: the performance when using the Linux in-kernel bridge as the switch connecting the virtual machines, in the base configuration. You see that we started from about 24 thousand packets per second in the worst case, guest to guest with one virtual CPU, and we got to between 300 and 400 thousand packets per second, depending on the type of optimization that we implemented. This number here is the delay that we used in the interrupt moderation, in microseconds; a delay of one microsecond is of course nominal, because when you implement timers in the operating system there is some granularity in the timing, so even if you ask for one microsecond you probably get much larger delays. The problem is that increasing the moderation delay improves throughput, but it has an impact on the latency of your traffic, so it might be something you don't want to do if you care about latency, or you may have to put up with small window sizes. And here is the situation when using the VALE switch as the backend, with the improvements that we introduced in the system: you see that we almost doubled the peak performance, which is now about a million packets per second in the case of paravirtualization; but even without paravirtualization we are getting pretty close, between 800 and 900 thousand packets per second. Again, this test is done between two FreeBSD guests, using the standard netsend and netreceive tools for sending and receiving UDP packets, and these numbers are for 64-byte packets; with 1500-byte frames we do about half a million packets per second, which means about 6 Gbit/s, I think. Now, what happens if we use netmap within the guest, instead of a socket-based application, to send and receive? Of course, in order to use netmap you need a netmap-
capable NIC, and fortunately the e1000 is one of the NICs supported by netmap, so it was just a matter of running the test and seeing how fast it went: the rate we reached was about 5 million packets per second, guest to guest, so it's pretty fast between sender and receiver. In terms of absolute throughput, with 1500-byte packets we do about 25 Gbit/s on the receive side, and slightly faster on the transmit side. I think this is the kind of performance that is really comparable with what we can achieve on real hardware: of course not with the e1000, which is a 1 Gbit/s interface, but for instance on 10 Gbit/s interfaces and others we are pretty much close to this. Now, what's the status
of this work? There are essentially three sets of changes. The first one is in the guest operating systems: we have patches for the e1000 driver on FreeBSD, and they are going to be committed soon; I am talking with the people in charge about including them in our tree. There are also equivalent changes for the Linux e1000 driver, and since the mechanism for paravirtualization is completely general, we are actually using the same data structures to paravirtualize other device models as well. On the hypervisor side, we have the QEMU patches; they were sent to the QEMU list a few months ago, and we are waiting to see whether they get accepted, rejected, or whatever, but in any case we can surely include them in the QEMU port on FreeBSD. And again, the changes that we are making to the hypervisor are completely general, so it is feasible to write the same support for VirtualBox, if you don't want to run QEMU, and produce a solution that has support for e1000 paravirtualization there too. On the host side there is not really a change: all you need to do is load the netmap/VALE kernel module, which is just a matter of recompiling it on FreeBSD or Linux. The
conclusion of all this is that I believe we have reached a point where we can get about the same network performance in the virtual machine as on real hardware, and that's great, because, for instance, if you want to test optimizations to the protocol stack, you can now do it in a virtual machine without having to worry about the performance of the hypervisor and the switch, and I hope that these tools will help us improve the network stack. This work would not have been possible without the contributions of my students, listed here, and funding from some European projects and companies; part of the work was supported by my stay at Google in Mountain View. I'd like to conclude with a few comments
on the status of netmap and the VALE switch since last year. Last summer, in August, I implemented a userspace version of ipfw and dummynet which talks to netmap interfaces rather than being embedded in the kernel, and the performance of this thing is pretty good: between two VALE ports, with a single CPU, it can do about 6 million packets per second, and there are reports from a user who can do several million packets per second between two physical interfaces, connecting the userspace ipfw to two physical interfaces. When you run dummynet there is additional work involved, and that reduces the performance to between 2 and 3 million packets per second, but that's still much faster than the in-kernel version. Another thing we implemented, in February, was a transparent mode for netmap. One of the issues with netmap is that your application grabs the interface and disconnects it from the host stack, so the only way for traffic to reach the host stack is for your application to explicitly reinject it. To make up for this, with transparent mode an application using netmap still has a chance to see the packets, and it marks the packets that should be intercepted by the application itself, while all the others are automatically forwarded to the host stack; the same goes for the other direction. This makes the behavior of netmap a lot more similar to what you have with BPF: you get an additional way to filter packets, which might be interesting in some cases. A colleague from NEC in Europe who started working on netmap recently implemented, in April, a feature that allows you to attach network interfaces to a VALE switch, so basically now you have the same capabilities that you have with the in-kernel bridge: attaching interfaces to a switch, or to the host stack, and moving traffic between ports
completely transparently, without any userspace process doing the switching for you, and this should be committed to FreeBSD shortly. There is also ongoing work to support Open vSwitch on top of netmap; this would initially be a Linux-only thing because the in-kernel version of that feature only runs on Linux at the moment. And I have been discussing with some of you the option of supporting a host network stack attached to netmap, which is useful for a number of things, for instance implementing software versions of NIC features, say fragmentation and reassembly, if you want to build a software switch with different kinds of ports. So that's all for now, but if you have questions, please go ahead.
[Question from the audience, partly inaudible.] No, absolutely not. In general I try to avoid using features that are specific to one operating system or to one hypervisor version, because this makes my work more portable. [Question.] No, I don't have that number. [Question.] Yes, definitely, and one thing I have to say — I don't have the figures here, but in terms of latency, one thing that hurts you is the fact that when you receive a packet you need to hand it to a userspace thread in order to communicate with the backend. There are optimizations we are working on: for instance, we can attach the VALE switch directly to the network stack, whatever it is, without going through userspace processes, and that is the way one should do things in order to reduce the delay. [Question: does this require changes in the guest?] Partly. For instance, the interrupt moderation work requires none: emulating moderation does not need any change in the guest, because the driver typically already supports it, so those gains come for free. In this case you could get up to around 140 thousand packets per second from the original 65 thousand, which is not a lot — it is a factor of 2, and of course when you are used to seeing numbers that are apart by a factor of 20 this seems small — but it helps. The best thing is when you can combine interrupt moderation with batching, at least on the transmit side, or with paravirtualization, but that of course requires some changes on the guest side. My point is that, assuming you have the ability to change a little bit on the guest side, it might be easier to add a few lines of code to an existing driver than to write an entirely new device driver and, unfortunately, its frontend too. [Question.] Well, probably there are other bottlenecks in the system. Okay, thank you for your time.
Metadata

Formal Metadata

Title Lightning fast networking in your virtual machine
Title of Series The Technical BSD Conference 2013
Author Rizzo, Luigi
License CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI 10.5446/19172
Publisher Berkeley System Distribution (BSD), Andrea Ross
Release Date 2013
Language English

Content Metadata

Subject Area Information technology
Abstract High speed network communication is challenging on bare metal, and even more so in virtual machines. There we have to deal with expensive I/O instruction emulation, format manipulation, and handing off data through multiple threads, device drivers and virtual switches. Common solutions to the problem rely on hardware support (such as PCI passthrough) to make portions of the NIC directly accessible to the guest operating system, or specialized drivers (virtio-net, vmxnet, xenfront) built around a device model that is easier to emulate. These solutions can reach 10 Gbit/s and higher speeds (with suitably large frames), one order of magnitude faster than emulated conventional NICs (e.g. Intel e1000). Despite popular belief, NIC emulation is not inherently slow. In this paper we will show how we achieved VM-to-VM throughputs of 4 Mpps and latencies as low as 100us with only minimal modifications to an e1000 device driver and frontend running on KVM. Our work relies on four main components, which can be applied independently: 1) proper emulation of certain NIC features, such as interrupt mitigation, which greatly contribute to reduce the emulation overhead; 2) modified device drivers that reduce the number of I/O instructions, much more expensive on virtual machines than on real hardware; 3) a small extension of the device model, which permits shared-memory communication with the hypervisor without requiring a completely new device driver 4) a fast network backend (VALE), based on the netmap framework, which can sustain multiple millions of packets per second; With the combination of these techniques, our VM-to-VM throughput (two FreeBSD guests running on top of QEMU-KVM) went from 80 Kpps to almost 1 Mpps using socket based applications, and 4 Mpps with netmap clients running on the guest. Similarly, latency was reduced by more than 5 times, reaching values of less than 100 us. 
It is important that these techniques can be applied independently depending on the circumstances. In particular, #1 and #4 modify the hypervisor but do not require any change in the guest operating system. #2 introduces a minuscule change in the guest device driver, but does not touch the hypervisor. #3 relies on device driver and hypervisor changes, but these are limited to a few hundred lines of code, compared to the 3-5 Klines that are necessary to implement a new device driver and its corresponding frontend on the hypervisor.