
Is your elephant a gazelle?


Formal Metadata

Title
Is your elephant a gazelle?
Subtitle
How to accelerate IPsec elephant flows
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Elephant flows appear irregularly, can consume almost half of the available bandwidth and are consequently associated with a host of issues. Securing elephant flows with IPsec is a well-known challenge for SDN and SD-WAN solutions on commodity hardware. The key problems for those developing solutions are:

- How to seamlessly enable dedicated HW to accelerate IPsec processing when available?
- How to distribute workloads to more CPU cores while maintaining packet ordering, in order to scale?
- How to scale compute resource usage up and down as the elephant flow appears and disappears?

In this talk we will discuss our recent work on the open-source project FD.io/VPP to address the above problems. We will describe how we utilized and enriched the VPP architecture to accelerate on-demand IPsec elephant flow processing in a unified and seamless way.
Transcript: English (auto-generated)
Hello everyone, thank you very much for joining this talk. My name is Fan Zhang, I'm a network software engineer at Intel. I've been working on crypto acceleration for over 10 years, and on DPDK and VPP crypto and IPsec for over five years. Today's topic is "Is your elephant a gazelle? How to accelerate IPsec elephant flows".

This is the agenda for today's talk. First, I will briefly describe the elephant flow concept and point out the bottlenecks in securing elephant flows with existing open-source IPsec solutions. Then we will bring up our answer to those bottlenecks. The proposal is built on top of FD.io VPP and VPP IPsec, so we will describe those, and we will also describe the crypto infrastructure and the engines used underneath VPP IPsec. With the asynchronous crypto engine we are able to accelerate a single VPP IPsec flow; to push the performance even higher, we will describe our ongoing project to scale a single IPsec flow to 100 gigabits per second. In the end, we will recap the presentation with a summary.

First, what is an elephant flow? It is an extremely large continuous flow on the internet. Elephant flows account for only about 4.7% of the packets on the internet in total, but when one is active, it can take over 40% of the bandwidth.

Leaving the elephant flow aside for a moment, let's look at how user-space IPsec data plane solutions process flows today, securing them with IPsec. First, we isolate and limit per-core processing resources, either CPU core resources or dedicated hardware resources such as Intel QAT, so that every core has the capacity to process an IPsec flow. Secondly, we apply flow-to-core affinity, which ensures a flow is processed by one and only one core. This helps maximize per-core cache utilization and also eliminates core-to-core race conditions.
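To make the affinity idea concrete, here is a minimal sketch in C of hashing a flow's 5-tuple to pick a core. The flow_5tuple type and the flow_to_core() helper are hypothetical, for illustration only; real data planes typically get this mapping from NIC RSS (receive side scaling):

    /* Illustrative sketch of flow-to-core affinity (hypothetical types,
     * not VPP/DPDK code): every packet of a flow hashes to the same
     * core, so per-core caches stay warm and no cross-core
     * synchronization is needed. */
    #include <stdint.h>

    struct flow_5tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    static uint32_t
    flow_hash (const struct flow_5tuple *f)
    {
        uint32_t h = f->src_ip;
        h = h * 31 + f->dst_ip;
        h = h * 31 + (((uint32_t) f->src_port << 16) | f->dst_port);
        h = h * 31 + f->proto;
        return h;
    }

    /* All packets of a given flow land on one and only one core. */
    static inline uint32_t
    flow_to_core (const struct flow_5tuple *f, uint32_t n_cores)
    {
        return flow_hash (f) % n_cores;
    }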
Now bring the elephant flow back: securing an elephant flow with this existing IPsec data plane approach is very difficult. The reason is, first, that crypto processing requires a large number of cycles, and an elephant flow consists mostly of large packets, as large as the MTU allows. Second, flow-to-core affinity will always make one core extremely busy. Because you don't have many elephant flows happening in your system at once, the other cores remain relaxed, and provisioning one extremely powerful core to handle the flow also means wasting its cycles most of the time, for the same reason: the elephant flow doesn't happen often. And if a big flow comes in and you want to load-balance it across multiple cores, the same way regular flows are processed with IPsec, you will have a problem: it causes a race condition, with multiple cores fighting to update the sequence number of the same security association. That needs the special treatment we describe later.

To overcome these problems, we propose our answer built with FD.io VPP IPsec. First, what is FD.io VPP? To explain that, we have to introduce DPDK a bit. DPDK, the Data Plane Development Kit, is a Linux Foundation open-source project. It provides a framework and libraries, plus a number of drivers from different vendors, for fast packet I/O. People who want to use DPDK either write their own application on top of the libraries DPDK provides, or directly use existing applications that run on top of DPDK, which include OVS, Tungsten Fabric and FD.io VPP.

FD.io VPP, in comparison, is also a Linux Foundation open-source project, but different from DPDK, it is a network function application: it provides a packet processing pipeline that is configuration driven, composable and extensible. Because it is built on top of DPDK, it inherits DPDK's rich libraries, functionality and driver support, and it also has its own native drivers. FD.io VPP has wide protocol support, and users who need something beyond the existing protocol support can easily write a plugin and plug their code into FD.io VPP seamlessly. FD.io VPP is already widely deployed in OpenStack, Kubernetes and several commercial appliances.

FD.io VPP IPsec is a very important component inside FD.io VPP. It is an open-source, production-grade IPsec implementation, already capable of one terabit of IPsec processing on a single dual-socket server with the latest Intel processors and Columbiaville NICs. It supports a wide range of protocols: Authentication Header, ESP, ESP over UDP and ESP over GRE. It supports the major crypto algorithms, and it supports multiple crypto engine plugins running underneath: whether it is CPU-based crypto acceleration or lookaside hardware acceleration, VPP can seamlessly enable either on the same machine. Most importantly, it is efficient and cloud friendly.
Before VPP 20.05, VPP IPsec ran on top of the native crypto infrastructure. The native crypto infrastructure is the generic infrastructure that provides symmetric crypto services within VPP. It provides a generic API for user graph nodes to consume the crypto capability; the API covers key management and crypto operations. It has the advantages of performance, availability and flexibility, but it has no hardware offload support, which means the maximum throughput of a single IPsec flow is bounded by one core; you cannot scale beyond that.

To scale single-flow VPP IPsec throughput, we can offload the crypto. The reason is that the packet processing cost for an IPsec packet is fixed, independent of the packet size, but the crypto cost is not. So we can offload the crypto workload either to dedicated hardware, such as Intel QAT, or to dedicated CPU cores. Each approach helps the RX core gain cycles for packet I/O and stack processing. However, dedicated hardware and dedicated CPU cores are two different things; we need a generic asynchronous crypto infrastructure to support both. That's why we upstreamed the VPP asynchronous crypto infrastructure in VPP 20.05. It shares the same key management as the synchronous crypto infrastructure, so you don't need any extra coding on top of that, and it provides generic enqueue and dequeue handlers, so different crypto engines can plug their handlers into the infrastructure. User graph nodes, such as the ESP encrypt node shown in the graph, enqueue their packets, and a dedicated crypto dispatch node, shown as the orange node in the graph, continuously calls the dequeue function to retrieve the processed packets and push them back into the VPP mainline pipeline of graph nodes, for example esp4-encrypt-post. With this, you achieve an asynchronous way of doing crypto inside VPP.
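As a rough illustration of that plug-in model, here is a minimal sketch in C. The names are simplified and hypothetical, not the actual VPP API; the point is that an engine only supplies an enqueue and a dequeue handler, while key management stays in the common infrastructure:

    /* Hypothetical sketch of an asynchronous crypto plug-in model like
     * the one described above (not the real VPP API). */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct crypto_frame crypto_frame_t;   /* a batch of packets */

    typedef int (crypto_enqueue_fn) (crypto_frame_t *frame);
    typedef crypto_frame_t *(crypto_dequeue_fn) (void);

    typedef struct {
        crypto_enqueue_fn *enqueue;  /* called by e.g. the ESP encrypt node */
        crypto_dequeue_fn *dequeue;  /* polled by the crypto dispatch node  */
    } crypto_engine_ops_t;

    static crypto_engine_ops_t engines[8];

    void
    crypto_register_engine (uint32_t idx, crypto_enqueue_fn *enq,
                            crypto_dequeue_fn *deq)
    {
        engines[idx].enqueue = enq;
        engines[idx].dequeue = deq;
    }

    /* Hand a completed frame to the next graph node (stub). */
    extern void graph_node_feed (crypto_frame_t *f);

    /* Inner loop of the crypto dispatch node: keep dequeuing finished
     * frames and push them back into the graph, e.g. esp4-encrypt-post. */
    void
    crypto_dispatch_poll (uint32_t idx)
    {
        crypto_frame_t *f;
        while ((f = engines[idx].dequeue ()) != NULL)
            graph_node_feed (f);
    }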
The first thing we did was add lookaside hardware acceleration with the DPDK cryptodev. DPDK cryptodev has long been known as one of the most performant crypto infrastructures. What we found, however, is that the DPDK cryptodev API and data structures are different from VPP's, and the cost of adapting the way of working and translating the data structures is actually significant. That's why, in DPDK 20.11, we proposed the new DPDK cryptodev raw API. It has a more compact data structure and accepts raw buffer pointers and physical addresses as input. In the end it helped gain 15% more performance between VPP 20.05 and VPP 20.09, and it is now officially the default cryptodev engine inside VPP.
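The gist of that change can be sketched as follows. Every name here is hypothetical and only illustrates the idea of passing raw buffer pointers plus DMA addresses instead of fully translated crypto-op structures; see the DPDK 20.11 cryptodev raw data-path API for the real definitions:

    /* Hypothetical sketch of the raw-API idea: hand the device compact
     * descriptors with virtual + DMA addresses, skipping per-packet
     * structure translation. Not the actual DPDK definitions. */
    #include <stdint.h>

    typedef uint64_t iova_t;            /* I/O virtual (DMA) address */

    struct raw_crypto_vec {
        void    *va;                    /* CPU virtual address       */
        iova_t   iova;                  /* address the device DMAs   */
        uint32_t len;                   /* bytes to process          */
    };

    struct raw_crypto_req {
        struct raw_crypto_vec data;     /* payload to encrypt/decrypt */
        struct raw_crypto_vec digest;   /* where the ICV is written   */
        const uint8_t *iv;              /* per-packet IV              */
    };

    /* One call submits a burst of compact requests to a device queue
     * pair; stubbed here to keep the sketch self-contained. */
    static inline uint16_t
    raw_crypto_enqueue_burst (uint16_t qp_id, struct raw_crypto_req *reqs,
                              uint16_t n)
    {
        (void) qp_id; (void) reqs;
        return n;                       /* pretend all were accepted */
    }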
What if you don't have QAT? We can use multiple CPU cores with the software crypto scheduler engine instead. What is the software scheduler crypto engine? It is a pure software crypto engine that utilizes dedicated CPU cores to process crypto workloads. Consider the picture here: you have the RX thread in the middle, which enqueues the packets into a dedicated queue. The crypto workers, the red and blue ones, continuously scan this queue. If they find a packet in the queue whose status is not-yet-processed, they update it to work-in-progress and process it; once done, they mark the status as complete. So from beginning to end the crypto workers never dequeue any packets, they only update statuses. It is the same RX thread, running the crypto dispatch graph node, that scans the same queue and retrieves the first n packets with complete status back from the queue, thereby maintaining the packet order. With this, all three cores can work harmoniously and efficiently, helping one another. In this design, every core can act as the RX thread or as a worker core, and they help each other achieve the maximum throughput.

Also, when we were upstreaming this, we considered running the crypto dispatch node in polling mode, which would surely give the best performance, but it is unfriendly to cloud-native use cases and wastes a lot of cycles when there is no crypto workload to process. That's why we made it support interrupt mode, where every graph node can signal other threads and be woken up by their signals. Once woken, a thread does active polling: the crypto worker cores, the orange and blue ones, will try to process as many packets as possible before they fall back to sleep again. And we do precise signalling when a crypto frame is enqueued and processed: on enqueue, the RX thread signals crypto workers one and two, and when they finish processing, they signal the RX thread to run the crypto dispatch. This helps maximize efficiency.
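A minimal sketch of that status-based queue, using C11 atomics (illustrative only, not the actual VPP software scheduler source): workers claim elements with a compare-and-swap and never dequeue, and the dispatch side removes only the leading run of completed elements, which preserves order by construction.

    #include <stdatomic.h>
    #include <stdint.h>

    enum { ST_FREE, ST_PENDING, ST_IN_PROGRESS, ST_COMPLETE };

    #define RING_SZ 256                 /* power of two */

    typedef struct {
        void      *frame;               /* batch of packets to process */
        atomic_int status;
    } slot_t;

    static slot_t   ring[RING_SZ];
    static uint32_t head, tail;         /* touched only by the RX thread */

    /* RX thread: enqueue a frame of packets. */
    static int
    enqueue (void *frame)
    {
        slot_t *s = &ring[tail & (RING_SZ - 1)];
        if (atomic_load (&s->status) != ST_FREE)
            return -1;                  /* ring full */
        s->frame = frame;
        atomic_store (&s->status, ST_PENDING);
        tail++;
        return 0;
    }

    /* Worker cores: scan, claim with CAS, process, mark complete. */
    static void
    worker_poll (void (*process) (void *))
    {
        for (uint32_t i = 0; i < RING_SZ; i++) {
            int expected = ST_PENDING;
            slot_t *s = &ring[i];
            if (atomic_compare_exchange_strong (&s->status, &expected,
                                                ST_IN_PROGRESS)) {
                process (s->frame);
                atomic_store (&s->status, ST_COMPLETE);
            }
        }
    }

    /* RX thread (crypto dispatch node): dequeue the leading run of
     * COMPLETE slots; stopping at the first unfinished slot is what
     * keeps the packets in their original order. */
    static uint32_t
    dequeue (void **frames, uint32_t max)
    {
        uint32_t n = 0;
        while (n < max && head != tail) {
            slot_t *s = &ring[head & (RING_SZ - 1)];
            if (atomic_load (&s->status) != ST_COMPLETE)
                break;
            frames[n++] = s->frame;
            atomic_store (&s->status, ST_FREE);
            head++;
        }
        return n;
    }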
With asynchronous crypto, we can achieve up to 40 gigabits per second for a single IPsec flow. Why can't we get higher? Because even with the crypto offloaded, there is still heavy packet I/O processing and heavy stack processing left. To push the performance higher still, we need to offload more of the workload to the dedicated CPU cores. But then it is no longer a single crypto workload, and we have to think about how to do the load balancing and the reordering cheaply. Intel DLB can help with that. Intel DLB is dedicated hardware that can do packet distribution and aggregation while maintaining the order from RX to TX. In the picture here, from zero to five, the packets stay in order no matter how many worker cores in between share the workload processing.

So we have the hardware; how can we utilize it for IPsec? First, look at the graph here. For a single packet, these are the stages that have to be executed before the encrypted packet goes out (this is the encryption direction, by the way). You have RX and packet classification, then an SA lookup, then the sequence number update, which, as I said, is where the race condition happens if you offload to multiple cores. Then you have the heaviest stage, adding the tunnel header and ICV plus the synchronous crypto, and at the end you have TX. So if we accept that updating the SA sequence number cannot be handled by multiple cores, then for a single IPsec flow we can use one core to do RX only: it does the pre-IPsec workload, the SA lookup, and the SA sequence number update. Once updated, we enqueue the packets into the DLB, and the DLB distributes them to multiple worker cores. The worker cores do the heavy lifting of adding the tunnel header and ICV plus the synchronous crypto. Once done, they enqueue the packets back to the DLB, and the packets are handled by the TX core for the post-IPsec processing. With this, we should be able to get a single IPsec flow up to 100 gigabits per second.
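A conceptual worker-core loop for this pipeline, sketched with the DPDK eventdev API (which the summary below also mentions as a route to Intel DLB). The queue identifiers and the do_esp_encrypt() helper are hypothetical; what matters is that events pulled from an ordered queue are put back in their original RX order by the device when they are forwarded:

    /* Conceptual worker loop, not the actual in-progress VPP patch. */
    #include <rte_eventdev.h>
    #include <rte_mbuf.h>

    #define CRYPTO_QID 0  /* ordered queue fed by the RX core (hypothetical) */
    #define TX_QID     1  /* queue drained by the TX core (hypothetical)     */

    /* Heavy lifting: tunnel header + ICV + synchronous crypto (stub). */
    extern void do_esp_encrypt (struct rte_mbuf *m);

    static void
    worker_loop (uint8_t dev_id, uint8_t port_id)
    {
        struct rte_event ev;

        for (;;) {
            /* Pull one packet event; the device tracks its position. */
            if (!rte_event_dequeue_burst (dev_id, port_id, &ev, 1, 0))
                continue;

            do_esp_encrypt (ev.mbuf);   /* runs on any worker core */

            /* Forward to the TX stage. Because the source queue is
             * RTE_SCHED_TYPE_ORDERED, the device restores the original
             * RX sequence before the TX core sees the packets. */
            ev.queue_id   = TX_QID;
            ev.op         = RTE_EVENT_OP_FORWARD;
            ev.sched_type = RTE_SCHED_TYPE_ATOMIC;
            while (!rte_event_enqueue_burst (dev_id, port_id, &ev, 1))
                ;                       /* retry on backpressure */
        }
    }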
This is ongoing work; we estimate finishing and upstreaming it by the end of 2021.

Now the summary. Today we talked about the VPP synchronous crypto infrastructure, which is performant but fails to scale. We presented a way of scaling single-flow IPsec throughput by offloading the crypto workload to dedicated CPU cores or a dedicated hardware accelerator: with the asynchronous crypto engine we can achieve 40 gigabit IPsec elephant flow processing. To scale the performance even higher, we can utilize Intel DLB or DPDK eventdev to offload most of the workload to the workers. And that's the end of my talk. Thank you very much, and back to the host.
Thank you. And I'd say we're live, or at least I hope we're live. So Fan, thank you for a great talk, really interesting. I want to go back to a question that Vincent Jardin asked earlier: whenever you use asynchronous crypto worker cores, how do you maintain packet ordering?

Thank you, Rick. The way it works is that the core contributing the packets, instead of doing the crypto itself, acts as a producer core: it enqueues the crypto ops into a queue it owns. The crypto worker cores, the ones other than the producer core, voluntarily process crypto on behalf of the producer core, but instead of dequeuing from that queue, they only update the status of the queue objects inside it, using an atomic operation. Yes, the atomic operation costs some cycles, but compared to the RX and TX work it is, let's say, not that big. And once a worker core finishes processing the bunch of packets inside a queue object, it again updates the status, this time without an atomic, directly writing it to say "hey, I'm done". Then the same thread that enqueued the packets scans the queue from the first object to the last, finds the first n queue objects whose status is done, and dequeues them. So the ordering of the packets is actually naturally preserved: a packet never really leaves the queue, it always stays in the same queue, so there's no concern there.

At this juncture I'll just say we have a minute left. If you have any additional questions for Fan, because I'm conscious we're about to run out of time, you can switch over to the hallway discussion; the link for the hallway discussion will appear momentarily. And Fan, I don't know if you want to tackle the second question, which was: what are your future plans?