VNF development made easy with netmap


Formal Metadata

Title
VNF development made easy with netmap
Subtitle
High-speed packet processing with QEMU VMs
Alternative Title
A flexible framework for high performance packet processing within QEMU VMs
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
2019
Language
English

Content Metadata

Abstract
The Netmap framework provides a simple and efficient user-space API for direct access to Ethernet NICs and other fast software interfaces (e.g., VALE switches, pipes and monitors). Because of its flexibility, performance and ease of use, Netmap is an attractive solution for implementing high-speed, portable Virtual Network Functions. This talk shows how to write packet processing applications using the Netmap API and run them inside QEMU VMs, over passed-through Netmap interfaces. With Netmap, applications running in two VMs or containers can exchange up to 20-30 Mpps (per core) at minimum packet size. The need for alternative mechanisms and APIs for network I/O has been recognized by several O.S. bypass projects (DPDK, PF_RING), and comes from the performance limitations of the traditional socket API (and the associated O.S. implementation) in terms of maximum packet rate. Using a traditional socket API, a single processor core cannot send or receive more than 1-2 million packets per second (Mpps) at minimum packet size (60 bytes), even though much faster modern NICs support 10-100 Mpps. These limitations are largely due to per-packet, size-independent costs: system calls, packet copies across the user/kernel boundary, VFS layer overheads, dynamic (de)allocation of packet metadata (e.g. sk_buff on Linux), NIC register accesses and interrupts. Moreover, moving networking to user space facilitates experimentation and improves portability. The bypass solutions overcome these limitations by pre-allocating packet buffers, mapping those buffers into the application address space, and allowing applications to send and receive multiple packets with a single operation (e.g. a system call or a NIC register access). They also use simple packet representation structures optimized for raw packet I/O rather than for a full-fledged protocol stack. Combined, these techniques allow user-space applications to send and receive tens of millions of packets per second, saturating the NIC capacity even with short packets. Exploring Netmap is a good introduction to these topics, which are common to all such frameworks. However, Netmap brings some additional benefits that are not found elsewhere: it does not force applications to resort to busy-polling, it protects devices from uncontrolled user-space access, and it introduces a common API which can also be used for fast VM networking and inter-process communication. Netmap is available on both Linux and FreeBSD. The Netmap framework has evolved significantly since its inception as a user-space packet I/O interface to NIC hardware in 2011. It is now a flexible network I/O tool that supports many backends (in addition to NICs) and virtualized environments, all accessible with the same API. The VALE programmable switch (part of Netmap) acts as a virtual switch for Virtual Machines (VMs) and physical NICs, supporting hundreds of virtual ports and over 20 Mpps per core between its ports. Netmap pipes are point-to-point virtual links that connect processes or VMs at over 40 Mpps, useful for service function chaining. Netmap has been integrated as a fast network backend into hypervisors such as QEMU, bhyve and VirtualBox. Accelerated network I/O is also possible for lightweight virtualization (containers) by means of native support for Linux veth devices (over 40 Mpps). Finally, a virtual pass-through device allows any Netmap interface (e.g. a VALE port, NIC or pipe endpoint) to be safely exposed inside a VM, enabling unprecedented packet rates (20-40 Mpps) between VMs.
These Netmap features constitute the datapath building blocks for Network Function Virtualization (NFV) deployments. We are not aware of other technologies that allow applications running in two VMs or containers to exchange up to 20-30 Mpps at minimum packet size. With such powerful I/O capabilities, we believe Netmap is the preferred candidate to implement NFV applications such as load balancers, Intrusion Detection Systems, firewalls, etc.
[Music] Okay, hi. This talk is a very short introduction to netmap and how you can use it to implement virtual network functions. So what is netmap? We've been talking about DPDK, eBPF, XDP and so on; netmap is just yet another independent API for direct access to NIC transmit and receive functionality from user space. In this sense it does the very same thing as DPDK, PF_RING and so on: you have a NIC, you open it in netmap mode, and once you do that you temporarily steal it from the network stack and drive it with a very efficient, batch-oriented API for fast networking applications. It's very important to note that this is implemented inside the operating system kernel, differently from DPDK — should I talk louder? OK, sorry — this is implemented in the operating system kernel, and we will see why this is important later. It's included in FreeBSD, and on Linux it's available as an out-of-tree kernel module.

These are the design principles behind netmap. I think they are important because the very same design principles are also behind things like DPDK, PF_RING and XDP, to some extent. The first and most important one is batched operation: talking to the NIC, for instance to trigger transmission or to receive packets, is expensive, so you tell the NIC to transmit many packets at once. In general, when you do packet processing, whenever you have fixed costs like locking or system calls, try to take the lock once and process many packets under it, so the fixed cost is amortized over many operations. The second principle is preallocation of packet buffers: in essence, avoid dynamic allocation of packet metadata; on Linux, for instance, you would normally have to allocate and deallocate an sk_buff metadata structure for each packet you process. Third, zero-copy access to packet buffers: your application should be able to directly read and write packets in user space, and the NIC can DMA packets directly into the application address space, so you don't need the traditional copy across the user/kernel boundary. Fourth, the kernel provides protection: your application cannot crash the NIC or the system, because with netmap the application has no direct access to NIC registers and hardware rings; all the protection and isolation you need is provided directly by the kernel. And last, you have the possibility of using normal synchronization: frameworks like DPDK rely on busy waiting (even if there are other options), but with both netmap and XDP you can use standard synchronization means like the poll or select system calls, to wait for packets to arrive or for egress space; more on this later.

The data structures used by netmap are very simple. There is a netmap interface, which is just a bunch of pointers to netmap rings. Rings are abstract representations of hardware queues, so you can have one or more receive rings and one or more transmit rings for each netmap interface, and a ring is just a circular array of descriptors with producer and consumer indexes. All of these data structures are contained in a so-called netmap allocator. The idea is that you may have multiple NIC ports on your machine, and a single netmap allocator may serve more than one port. An allocator is a domain of trust, meaning that applications working on the same allocator must trust each other; if you don't want that, just use separate allocators. To access those data structures — basically rings and buffers — you open a special device and then use an mmap operation to make them available in your application address space.
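A minimal sketch of that open-and-mmap sequence, using the classic raw netmap API from net/netmap_user.h (the interface name "eth0" is just an example, and error checking is omitted):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    int fd = open("/dev/netmap", O_RDWR);

    /* register eth0 in netmap mode (detaches it from the host stack) */
    struct nmreq req;
    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    strncpy(req.nr_name, "eth0", sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);

    /* map the shared memory region (the netmap allocator) */
    void *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* locate the interface descriptor and its first rings */
    struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
    struct netmap_ring *rxring = NETMAP_RXRING(nifp, 0);
    struct netmap_ring *txring = NETMAP_TXRING(nifp, 0);

Helper wrappers such as nm_open() hide these steps behind a single call.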
A netmap ring, as I was saying, is an abstraction of a real hardware ring. Applications operate on the abstract ring and then use a special sync operation to synchronize the state of the abstract ring with the state of the hardware ring. There are two pointers in the abstract ring, head and tail, and the meaning is that everything between head and tail is owned by the application: for receive rings, those are new packets that are ready to be read; for transmit rings, it's free space that you can use for new egress operations. The rest of the descriptors in the ring — everything between tail and head — is owned by netmap, that is, by the kernel. Here is an example of how you would process a receive ring. Say your application has many descriptors available, many new packets that it can read; it can, for instance, process seven new packets and then increment the head index, while the tail is read-only for the application. After incrementing the index it can sync: there is a special ioctl to sync a receive ring. This has two effects. First, everything between the previous position of head and the new position of head is returned to the kernel, so that it can be used to receive more packets. Second, if any new packets arrived since the last time we synced, tail is incremented accordingly; in this case we received three new packets. So this is a very simple synchronization protocol between your application and the NIC.
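Continuing the sketch above, the receive-ring scan just described could look like this (process_packet() is a hypothetical per-packet handler, not part of the netmap API):

    /* drain all the slots currently owned by the application */
    while (rxring->head != rxring->tail) {
        struct netmap_slot *slot = &rxring->slot[rxring->head];
        char *buf = NETMAP_BUF(rxring, slot->buf_idx);

        process_packet(buf, slot->len);  /* hypothetical handler */

        /* return the slot to the kernel and advance (cur follows head) */
        rxring->head = rxring->cur = nm_ring_next(rxring, rxring->head);
    }
    /* sync: give consumed slots back to the kernel, pick up new arrivals */
    ioctl(fd, NIOCRXSYNC, NULL);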
A very important thing I was mentioning before is blocking versus busy waiting. The sync operations, for both receive and transmit rings, are synchronous and non-blocking, and they operate on all the rings that are bound to a specific netmap file descriptor. The basic idea is that when you open an interface in netmap mode you bind certain rings: you can bind just one receive ring, just the transmit rings, or everything, whatever you like, and the sync operations then operate on all the rings you bound. You can use the sync operations to implement busy waiting, if you don't want to block, but this is actually not the usual way to use netmap, because you may want to actually block — for instance, waiting for more packets to come, or waiting for more space to transmit. For that you can use the poll or select system calls, and on Linux even epoll or kevent is supported. If, for instance, you want to wait for more egress space, you would poll with the POLLOUT event. This is just a standard sequence, very similar to what you would do if you were using sockets.
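For instance, blocking until a bound transmit ring has free slots is just a standard poll() call on the netmap file descriptor (a sketch, reusing the fd from above):

    #include <poll.h>

    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    poll(&pfd, 1, -1);  /* sleep until the kernel reclaims tx slots */
    /* on wakeup, head..tail on the tx ring is free space for new packets */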
So far I've talked about NICs, but netmap actually supports many kinds of virtual ports. Virtual ports are important because they can be used to implement very fast local IPC. For instance, we have zero-copy pipes: the idea is very similar to UNIX pipes, you have two ends and you can let processes communicate through the pipe, but the point is that you access netmap pipes using the netmap API, so you can transmit and receive packets in batches, which means you can be very efficient. And it's zero-copy, because you can just swap descriptors, as I will show later, which means that independently of packet size you can have very fast communication, over 100 million packets per second. Of course that is a benchmark where you are not touching the packets, but it's still an interesting upper bound. We have VALE, a software switch designed for virtual machines: by definition, with virtualization you want isolation between two virtual machines, which means the switch must copy packets when forwarding from one port to another, but again, because of the netmap API and the ability to work in batches, you can still transmit 20 million packets per second per port. We also support monitor ports for sniffing: sometimes you have a netmap application using some ports and you want to see what's happening from a separate process; if you want to sniff the traffic, you can do that with a special monitor port. Today I'm going to talk mostly about the last one, the pass-through port. The idea is that you have a network port in the host machine — it can be a NIC, a port of the VALE software switch, a pipe end, whatever you like — and you want to expose that port inside a virtual machine. This is basically the idea of network function virtualization, where you want to run your application within a virtual machine. This is possible with netmap pass-through.
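All of these port types are opened through the same API, only the name changes. A sketch of the naming scheme using the nm_open() helper (the exact pipe and monitor syntax may vary slightly between netmap versions; check the netmap(4) man page for yours):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>

    struct nm_desc *nic = nm_open("netmap:eth0", NULL, 0, NULL);   /* hardware NIC */
    struct nm_desc *sw  = nm_open("vale0:1", NULL, 0, NULL);       /* VALE switch port */
    struct nm_desc *pm  = nm_open("vale0:p{1", NULL, 0, NULL);     /* pipe, master end */
    struct nm_desc *ps  = nm_open("vale0:p}1", NULL, 0, NULL);     /* pipe, slave end */
    struct nm_desc *mon = nm_open("netmap:eth0/r", NULL, 0, NULL); /* rx monitor */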
There are two main use cases. You can have a KVM/QEMU guest and pass through a port of the VALE software switch: this is very interesting for implementing very fast local interconnection — think of two VMs on your machine that are able to exchange up to 20 million packets per second at minimum packet size, which is pretty impressive if you want to implement some sort of fast packet processing application on your machine. Or you can pass through a hardware port. This is a sort of direct assignment that you could of course also implement using standard PCI pass-through techniques, but it may still be interesting because you can do the direct assignment without IOMMU support in the hypervisor and without actual support for PCI pass-through; it's just a different way to do the same thing. From the point of view of the guest, the guest operating system sees a virtual NIC, and the virtual NIC has the very same configuration as the underlying network port: if you pass through a hardware NIC with eight receive rings, you will see a virtual NIC with eight receive rings inside the virtual machine. And again, there is no overhead in terms of copying, because the guest has direct access to the buffers in the rings of the real port, so you can do basically zero-copy from within the virtual machine; any sync commands are simply forwarded to the host.
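For reference, attaching a QEMU guest to a VALE port uses QEMU's netmap network backend; the pass-through device itself comes from the netmap QEMU fork, and the device name shown in the comment below is an assumption, so check the netmap tutorial for the exact flags:

    qemu-system-x86_64 -enable-kvm ... \
        -netdev netmap,id=data0,ifname=vale0:1 \
        -device virtio-net-pci,netdev=data0    # paravirtual NIC over a VALE port

    # with the netmap QEMU fork, pass-through would instead use the ptnet
    # device, e.g. -device ptnet-pci-device,netdev=data0 (name is an assumption)

Here is an example of an application you could implement with this system. It's a very simple two-port application: we have an external port — think of it as a public port on some network — and an internal port. You want to forward packets from the receive rings of the external port to the transmit rings of the internal port, and the other way around; and when going from external to internal you also want to apply some rules, so depending maybe on destination IP or destination port you may want to drop or select some packets, while in the other direction you don't filter. How do you do that in a few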
lines of code? This is the main synchronization logic. First we open the two ports, the internal port and the external port, using a very simple helper library, and then we have a simple poll-based loop. We have two ports, so we have two file descriptors, one per port, and what we need to do in this simple forwarding application is decide which events we want to wait for. The logic is very simple; let's take the external port, for instance. If we have no packets ready to be received on the external port, what we want to do is poll in (POLLIN), to wait for them on that file descriptor. Otherwise, if we do have packets, since we want to forward them to the other port, we just wait for egress space there — that's why we poll out (POLLOUT) on the second port. And the same happens, mirrored, in the other direction: if I don't have any packets ready to be received on the internal port I wait for them, otherwise I wait for egress space on the opposite port. Then I call the poll function; when poll returns, it means that some events are ready, so I can forward, and I forward in both directions, from external to internal and the other way around.
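Put together, the loop just described might look like the following sketch (the port names and the forward() helper, whose core is shown in the next snippet, are illustrative):

    struct nm_desc *ext = nm_open("netmap:eth0", NULL, 0, NULL);  /* external */
    struct nm_desc *in  = nm_open("netmap:eth1", NULL, 0, NULL);  /* internal */
    struct pollfd pfd[2] = { { .fd = ext->fd }, { .fd = in->fd } };

    for (;;) {
        pfd[0].events = pfd[1].events = 0;
        /* nothing pending on ext? wait for input; else wait for room on in */
        if (nm_ring_empty(NETMAP_RXRING(ext->nifp, ext->first_rx_ring)))
            pfd[0].events |= POLLIN;
        else
            pfd[1].events |= POLLOUT;
        /* and the mirror image for the other direction */
        if (nm_ring_empty(NETMAP_RXRING(in->nifp, in->first_rx_ring)))
            pfd[1].events |= POLLIN;
        else
            pfd[0].events |= POLLOUT;

        poll(pfd, 2, -1);

        forward(ext, in, 1 /* apply the filter rules */);
        forward(in, ext, 0 /* no filtering */);
    }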
This is the function that implements zero-copy forwarding, and it's interesting: what I'm doing here is just a parallel scan of two rings. I have the receive ring of the source port and the transmit ring of the destination port, and I want to forward a bunch of packets from the receive ring to the transmit ring, so I do a parallel scan. The nice thing is that you can implement zero-copy forwarding: each descriptor in a ring contains a buffer index, and the buffer index is the identifier of a buffer within the allocator, so all you need to do to forward from one ring to the other is swap the buffer indexes of the two descriptors. That's what I do here, and I also copy the length. Of course I also need to tell netmap that the buffer has changed, because it may need to update the DMA mapping inside the kernel. But otherwise, what I wanted to show you is that you can implement a simple forwarding rule in a very elegant and simple way.
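The core of the forwarding function, as just described, is the buffer-index swap; a sketch under the same assumptions as above:

    /* zero-copy forward: move packets from a receive ring to a transmit
     * ring by swapping buffer indices between the two slots */
    static void
    zcopy_forward(struct netmap_ring *rx, struct netmap_ring *tx)
    {
        while (!nm_ring_empty(rx) && nm_ring_space(tx) > 0) {
            struct netmap_slot *rs = &rx->slot[rx->head];
            struct netmap_slot *ts = &tx->slot[tx->head];
            uint32_t free_idx = ts->buf_idx;

            ts->buf_idx = rs->buf_idx;  /* full buffer goes out on tx */
            rs->buf_idx = free_idx;     /* free tx buffer recycled into rx */
            ts->len = rs->len;

            /* tell netmap the slots now point to different buffers, so it
             * can update the DMA mappings in the kernel */
            ts->flags |= NS_BUF_CHANGED;
            rs->flags |= NS_BUF_CHANGED;

            rx->head = rx->cur = nm_ring_next(rx, rx->head);
            tx->head = tx->cur = nm_ring_next(tx, tx->head);
        }
    }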
This is the example you can run: I have a QEMU virtual machine and two pipes. I could have built a different example with hardware ports, but pipes are easy to demonstrate because everything is in software. So I have two netmap pipes, and I pass through one end of each pipe to the virtual machine; the guest, which is a QEMU/KVM machine, sees two pass-through netmap ports, while the other ends of the pipes are used for packet generation: on one end I generate a stream of packets, and on the other end I receive them. What I measured here with short packets, 64 bytes, is about 17 to 20 million packets per second, which is pretty impressive considering this application is implemented with just one thread. With full-sized packets, 1500 bytes, I get about 8. I actually tried both zero-copy and copy, and it's interesting to see that in the copy case, for very short packets, the overhead of changing the DMA mapping is actually higher than the copy itself — this matches what was said in the earlier talk: with short packets that are already in cache, copying with the CPU is very cheap.

I also prepared a very short comparison between netmap and DPDK. There is no time to actually go through all of it, but I wanted to show a few comparison items. One advantage of netmap, and also of XDP actually, is that it's very easy to set up: with DPDK you need to care about huge pages, you need to care about the IOMMU, you need to unbind interfaces from the kernel driver and bind them to a different driver, while with netmap you have to do nothing, and with AF_XDP you just need a very small eBPF program to redirect your packets to an AF_XDP socket. Another advantage you get by keeping the kernel drivers is that you can reuse the standard tools, like ethtool, while with DPDK you of course need to rewrite those tools. Also, the threading model is a little more flexible: with DPDK you have lcores, so you write your code and it runs in the context of a per-core callback, while with netmap and AF_XDP you can basically open AF_XDP sockets or netmap ports wherever you wish and run your packet processing code in any thread. Another advantage of netmap over DPDK, shared with XDP, is that you get standard synchronization tools, so poll and select; with DPDK you can use receive interrupts, but that's a bit harder than just using the standard system calls. Of course, DPDK gives you extreme performance: when I prepared this comparison it became very clear that if you want the best performance you must use DPDK, because both netmap and AF_XDP are still using system calls, and that has an overhead. It comes with advantages in terms of isolation and standard synchronization, but if all you want is performance, you should use DPDK.
In conclusion: I showed you a very simple example of how to write a simple but efficient netmap application. I think the design principles behind netmap are important — they inspired XDP, and in the comparison it's very evident that many choices taken by netmap and XDP are similar. Why would you want to use netmap? Its biggest advantages, I think, are that it's easy to set up, it has a very simple API with standard synchronization, it's a smaller project than the other projects, and it's easy to integrate with existing applications, whereas with DPDK you usually need to write your application from scratch and make it fit within the DPDK framework; in exchange, of course, you get the highest performance. If you want to reproduce this simple setup, just follow the tutorial link: there are detailed instructions on how to reproduce all the code and get those numbers. Thank you, I'm ready to take your questions. [Applause] [Music]

Thank you so much. Before the next presentation starts, we'll just use the opportunity to
remind people: could you please leave Vincenzo feedback through the conference website; there's a link at the end of the talk.