Userspace Networking with libuinet
Formal Metadata
Number of Parts: 24
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/15365 (DOI)
Production Year: 2014
Production Place: Ottawa, Canada
Transcript: English (auto-generated)
00:00
portability goals of WANProxy, which is pretty much anything POSIX and then a few other systems. So long story short, the decision was made to begin by porting the FreeBSD stack to userland to get that done, to get that transparent TCP proxy for a large number of connections done.
00:24
And we figured that was a really good platform to build upon, both for transparent proxies and other features that were of interest later. And the main reason we went with the FreeBSD stack is it's stable, obviously, been developed for a long time, used in a lot of mission critical systems.
00:41
It's very widely used. There's a lot of active development work being done on it today. And of course, the license. This made it really of wider commercial interest than, say, looking at Linux or something with a GPL-related license.
01:00
Just to make sure we're on the same page with the context here, this is one definition of a transparent TCP proxy. So that's one that can proxy connections between a client and the server and maintain server addressing from the client's point of view and client addressing from the server's point of view. So at the frame level, you shouldn't
01:23
be able to tell if there's a proxy involved. One of the easiest-to-understand motivations for wanting a transparent proxy is if you have a transparent proxy, you don't have to worry about the details of the protocol that you're proxying on top of it. Some protocols, we can argue whether they're poorly designed
01:44
or whatever, but the fact is they send as part of their protocol addressing information from the server side or client side to the other end. Those types of things are going to cause trouble with NAT or any other sort of address translation layers. That's just one example.
02:02
Another way to look at the utility of this is if you have something that can impersonate the addresses of the servers on the one side of the proxy and the clients on the other side of the proxy, it then becomes easier to think about how you're going to build proxies that handle large numbers of subnetworks.
02:22
And plug into different network addressing schemes, and it's easier to think about how you're going to architect that product. So that's a transparent proxy in a nutshell. By scalable, we simply mean it can do this for a large number of connections with arbitrary addressing.
02:40
So tens of thousands of connections, hundreds of thousands of connections. Maybe some of those connections are coming in on VLANs or nested VLANs. Some aren't. Different subnets, what have you. So the decision was made to port the TCP/IP stack from FreeBSD to userland. And the overarching goals were we're
03:02
going after scalability clearly. The choice was made to go with a non-blocking, event-based API. WAN proxy is based on an event system. And that's also just sort of the way to go if you're looking at scaling to things like handling many, many, many connection contexts.
03:21
We also looked at scaling out in terms of how do you scale across interfaces? How do you handle all those connection contexts? And so the targets there were to be able to scale through multi-threading within the application using libuinet, and also take advantage of the fact that now that we've got a stack ported into user space, you can run multiple instances.
03:42
Every application instance that's linked with the library that you boot is its own isolated stack. So you can then architect whatever your application is to take advantage of that and split up your traffic for management that way. So as a library, it's tightly coupled to the application. We want to keep everything in process.
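Here is a rough sketch of what that tight, in-process coupling looks like from the application's side. The uinet_* names follow the libuinet sources, but the exact signatures shown are assumptions, not the authoritative API:

    #include "uinet_api.h"      /* header name per the libuinet repo */

    int
    main(void)
    {
            struct uinet_socket *so;
            int error;

            /* boot this process's private copy of the FreeBSD stack;
             * the real argument list is elided here */
            uinet_init(/* ... */);

            /* sockets come from the library, not the host kernel */
            error = uinet_socreate(UINET_PF_INET, &so, UINET_SOCK_STREAM, 0);
            if (error != 0)
                    return (1);

            /* bind/listen/connect via other uinet_* calls, then drive
             * everything from a non-blocking event loop */
            return (0);
    }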
04:02
There's been other ports of TCP stacks to user land that have had slightly different goals and wanted to expose all their functionality as a service to other clients through sockets or some other IPC mechanism. None of those that I'm aware of really
04:21
comport with the performance aspect and the scaling aspect that we're going after. So that wasn't a goal. We're just focused on keeping everything in process. And since we have a callback and an event-based API, there's other opportunities for enhanced functionality and performance. We're not emulating the syscall layer.
04:42
We're not saying we're trying to make a drop-in replacement library that you could just link an existing application to. What we're going after is something that provides functionality you can't get anywhere else. And for that, you're willing to rewrite your application to it. But once you've crossed that river,
05:00
you then have other opportunities for building features that wouldn't be really feasible if you were trying to deliver them through something like the existing syscall API between user land and kernel for networking. On the portability side, the initial target is POSIX environments. When we look at re-implementing kernel facilities
05:20
a little later in userland, that's what we're looking at initially is using POSIX to give us portability across FreeBSD, Linux, Mac OS. And we want to do this in a maintainable way, right? Because from my point of view, the saddest thing that can happen is you do all this exciting stuff,
05:42
there's this burst of development, and the effort's been organized in such a way that it sort of becomes hopelessly stale and unmaintainable, hard to bring up to speed with a new version of the network stack, and then that's sort of the death knell for a lot of these types of projects. So definitely wanted to avoid that. As I said, we weren't going after providing
06:02
a drop-in replacement sockets library where you could just relink and it would look like the BSD sockets API exactly at the binary level. Not even looking at having a set of headers that you can compile against that would give you exactly the same API, because there's things we're looking at delivering that really just don't fit
06:20
in that model, and we didn't think there was any value in emulating syscalls and getting the exact same behavior that you currently get running a userland program against the kernel network stack's existing interface.
06:41
And I think I also, I already mentioned, we're not interested in creating a daemon process that exposes the userland networking stack services to other processes. Of course, you can write that into your application if it suits, but it's not a goal for libuinet to provide natively. So there were some alternatives that were considered at the outset.
07:03
It was talked about, just maybe start out with something that seemed a little less complex than the FreeBSD code base. So a lightweight, independent TCP/IP stack, there are some out there. So the problems are that they're lightweight and independent, so you tend to run into the issues
07:22
with them being feature poor, or there's a small user base, they're not mature, the project could wind up implementing features for TCP/IP that are already in the FreeBSD stack but weren't there because we went lightweight to begin with, or we basically hitched ourselves to a project
07:40
that really has small exposure, and we've become the maintainers of it. The benefit, the huge benefit of going with a stack like FreeBSD's is that we're reusing all of the TCP IP functionality that's already there in the kernel. None of our work has to focus on improving or maintaining that. We're gonna build new features on it, we're gonna make it available
08:00
in a slightly different package, but all the tremendous engineering effort that's already gone into having a modern, full-featured stack is reusable to us. There was another project called libplebnet. I'm not sure if that's how you pronounce it, I've only ever seen it in print. That was a userland port of an 8-series stack that had slightly different goals, like having a daemon service to export networking,
08:23
to fully emulate the syscall API, things we weren't interested in. It was apparently abandoned when I found it; I haven't seen any new development on it in several years, but parts of it were used to seed the libuinet port. It actually served as a pretty decent roadmap for what facilities would have to be re-implemented,
08:43
what kernel facilities would have to be really re-implemented in user space to support the stack. There's also the Rump kernel, which is a NetBSD project. It's a framework for running NetBSD kernel components in user land, including the stack, but it's not focused on just the stack.
09:01
It also includes file systems and other kernel components. It has slightly different API goals. On paper, when you list the things we were after with libuynet, there's a lot of overlap, but when we looked at it, it seemed like it was gonna need non-trivial work anyway. And since it had a lot of aspects to its framework
09:23
that really weren't relevant to libuynet, we figured, well, we'll put that non-trivial adaptation effort into something based on the FreeBSD stack instead, and get something that's really tailored to our goals. So the approach to porting the stack was, of course,
09:41
to re-implement what kernel facilities we needed. This is a similar approach that would be taken by any other of the efforts you'll find, like even the Rump kernel, that are out there. And what the idea is, the kernel facilities like threading, memory allocation, locking, get re-implemented.
10:00
You have the same API, you have a user land implementation underneath, then we can reuse the network code untouched, until we get to the point where we're adding new features to it, because again, we're trying to leverage as much as possible all of the wisdom and hard-won experience and effort that's gone into the stack as it is today
10:20
for libuynet. Another one of the goals was, whatever new features we put in should be able to completely disappear from the source base by turning off an ifdef. We want to, at any point, be able to say, I'm compiling the stock kernel stack by not defining a set of guard defines that wrap all the new feature code that we've introduced.
10:41
And we wanted to target some of these features so that it could actually be used inside the kernel. That doesn't necessarily serve directly the goals, the initial motivations of bringing the stack to user land, but as we're implementing some of the new functionality, it's clear that there could be a use case for having those built into the kernel directly.
11:02
So I kept that in mind as I went and tried to introduce as few new interfaces between the application and the stack as possible to get the job done. And I made it possible to include libuinet in FreeBSD base, just in terms of how I structured the port, trying to do all the userland support work
11:21
to the side of the kernel source tree, even within my project structure. For no other reason than to make it possible to perhaps one day ship libuinet with FreeBSD base. So as you can see, the source structure has all the kernel sources under the sys directory in the project. If you go to the current GitHub location for libuinet,
11:42
this is what you'll find. Under the sys subdirectory are all the subtrees that contain files needed by libuinet. Not all the files that you'll find in there under the project are actually needed, but the approach taken was to pull in whole subtrees to make merging future versions easier. So instead of cherry-picking files out of directories,
12:03
out of kernel directories for libuinet, we've just imported whole subtrees just to make the merge to later releases of the stack a bit easier. Under the lib directory in the project, libuinet contains all the userland re-implementations of the kernel facilities, the UINet API,
12:20
any other support code that's part of the library. Anything else you find in the lib directory is application support code or just support for some of the example programs. Probably the most interesting thing in there is a fork of libev, which is an event system
12:40
that's been around for a while and is pretty highly performant and widely used. And that fork contains a new watcher type, so in one event loop you can combine access to kernel sockets in the host OS, as well as UINet sockets, as well as timers and whatever else the event system gives you access to.
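For flavor, driving a UINet socket from that forked loop looks roughly like this; the ev_uinet watcher names are modeled on the fork's description and should be treated as assumptions:

    #include "uinet_ev.h"       /* the forked libev's header (name assumed) */

    /* invoked by the loop when the watched UINet socket is readable */
    static void
    readable_cb(struct ev_loop *loop, ev_uinet *w, int revents)
    {
            /* read via the UINet API, then re-arm or stop the watcher */
    }

    /* setup, given an accepted UINet socket `so` and a loop `loop`: */
    ev_uinet watcher;
    ev_uinet_init(&watcher, readable_cb, so, EV_READ);
    ev_uinet_start(loop, &watcher);
    ev_run(loop, 0);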
13:03
And then bin has all the sample programs for exercising functionality. Pretty straightforward stuff. So here are the layers. Mostly, it's hard to draw these diagrams and really show every single relationship between the subcomponents, so you can be pedantic
13:22
and say I'm missing something here, for sure. But I think I've captured all the major relationships here. So at the top of the stack is the UINet API. So that's what applications written using this library will use. There's a couple things going on there. One is, of course, we're building API entry points
13:41
to give you access to the features of interest that are available inside the networking stack. But one of the other big purposes of the API and the way it's built is to give you a clean namespace, right? Because we want this to be portable to be used in applications of other operating systems.
14:01
We also wanna be able to have a different version of the FreeBSD TCP/IP stack inside UINet than perhaps is on the host operating system. So just reusing the stock networking API headers for constants, structure definitions, that sort of thing,
14:21
is sort of a non-starter, because if you take that API and then bring it over to Linux or take it over to Mac OS, things aren't gonna build, because although there's a lot of similarity in different implementations of the BSD sockets API and the associated constants and structures, they're not identical everywhere.
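To make that concrete, the cleaned-up namespace described next looks roughly like this; the definitions are modeled on the libuinet headers but should be treated as assumptions:

    #include <stdint.h>

    /* clean-namespace versions that match the stack inside the library,
     * whatever the host's <netinet/in.h> happens to look like */
    struct uinet_in_addr {
            uint32_t s_addr;
    };

    struct uinet_sockaddr_in {
            uint8_t              sin_len;    /* FreeBSD-style, even on Linux */
            uint8_t              sin_family;
            uint16_t             sin_port;
            struct uinet_in_addr sin_addr;
            char                 sin_zero[8];
    };

    #define UINET_AF_INET 2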
14:40
So one of the things that's going on in the API is just namespace laundering, just giving you a clean, generic version of all those constants, structures, and entry points that comports with what's inside the library. That API, though, is built on top of kernel sockets. And the goal here is to integrate with an event system
15:02
of one sort or another. So our main focus was on non-blocking sockets and running things in an event-based manner. So the kernel sockets API, I didn't have to go any further than that, right? Because you can run kernel sockets non-blocking
15:22
and you have upcalls, which are sort of a bare-bones event interface for kernel sockets. So that's pretty much everything that you'll find in uipc_socket.c in the stack, in that kernel sockets layer that we're using. The only difference I would say
15:42
is that we've also pulled in some code from uipc_syscalls.c. In particular for accept: if you look at soaccept() in the kernel, it's really bare bones. It does a really minimal amount of work. If you look at kern_accept(), that's handling the syscall; it's handling taking new sockets off the queue
16:03
and doing some other error checks and housekeeping details that you really need to get done any time you're accepting a new socket. So what's the UINET API? It's pretty much kernel sockets exactly with some amount of the kernel side of the syscall interface,
16:21
just with file descriptors removed, because we completely have avoided file descriptors here. So below that, net plus netinet, that's the stack, right? So that's just my shorthand for everything in the TCP/IP stack. So that's just kernel sources. We'll see in the next slide, I'll show you where everything comes from, what's been re-implemented, and then on down we've got relevant kernel facilities
16:44
that we need to make that all work. These little legs up here are just showing that there's some things outside of the kernel sockets API that show up on the API for the application. There's some network interface configuration entry points that are exposed in the API, and there's currently access to the UMA zone allocator
17:03
through the API, so applications can access those pool allocators if they wish. Underneath all those kernel facilities is something called the UINet host interface. So that's an abstraction layer that's going between those re-implemented
17:21
or partially re-implemented userland versions of those kernel facilities. That's serving two purposes. One is it's giving us portability, so even using POSIX threading,
17:42
and POSIX locks, and standard C library routines to re-implement some of these kernel interfaces, there are inter-platform differences. And we also have a similar issue there with namespace. Everything in the UINet host interface is being called into from kernel code context.
18:02
I have another slide where we'll highlight this in detail, but we can't have the pthreads header pulled into a file that's also got the kernel kthreads header in kernel mode, because in general, you're just gonna have namespace collisions. It's just not buildable.
18:21
It's the architecturally wrong thing to do. So one thing that gets done in the UINet host interface is another namespace cleansing process. Every symbol and constant used in the interface is just completely based on basic C types, and doesn't pull any baggage from the host OS.
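A minimal sketch of that arrangement, assuming illustrative uhi_* names: the interface header uses only basic C types, and the implementation is ordinary host code:

    /* uhi.h -- the interface uses only basic C types; it includes no
     * kernel and no host headers, so either side can include it */
    void *uhi_mutex_create(int recursive);
    void  uhi_mutex_lock(void *mtx);
    void  uhi_mutex_unlock(void *mtx);

    /* uhi_posix.c -- ordinary host code; pthreads is fine here */
    #include <pthread.h>
    #include <stdlib.h>

    void *
    uhi_mutex_create(int recursive)
    {
            pthread_mutexattr_t attr;
            pthread_mutex_t *m;

            if ((m = malloc(sizeof(*m))) == NULL)
                    return (NULL);
            pthread_mutexattr_init(&attr);
            if (recursive)
                    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
            pthread_mutex_init(m, &attr);
            pthread_mutexattr_destroy(&attr);
            return (m);
    }

    void uhi_mutex_lock(void *mtx)   { pthread_mutex_lock(mtx); }
    void uhi_mutex_unlock(void *mtx) { pthread_mutex_unlock(mtx); }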
18:41
The remaining piece here is on the left, the new packet interfaces. So once we've got this userland stack, we've got an API of the application to talk to it, the question is where are the packets coming in and out. So there's a set of packet interfaces that are just interfaces of IFnet, just like they would be in the kernel, except they're tying into other things
19:01
that are available to userland. So for example, you can have a packet interface that's talking to NetMap, and then you can access anything that NetMap can access for packet IO. You can talk to PCAP, you can talk to DPDK, you can talk to a Unix domain socket, clay tablets, I mean, whatever suits the application. If there isn't a packet interface that suits,
19:22
it's a pretty straightforward exercise to write one. Right, so this shows where the sources for all these things are coming from. That nice shade of FreeBSD red is unmodified kernel code. Everything in blue is showing things
19:41
that are created new, entirely new for LibUI.net, and this is my attempt at a purple that's halfway between that red and blue. And those are all the kernel facilities, and what you'll find if you look in there is depending on the facility and the subroutine in that facility, there's either been a wholesale re-implementation,
20:02
there's been a copy, essentially a copy made of what's in the kernel with some slight modifications, or in some cases almost entire reuse of the kernel code because some of these facilities are built on top of exclusively other existing kernel facilities, so once we've ported the other ones, we get those other kernel facilities for free.
20:25
All right, this is just a summary of the namespace issues I was talking about. From a development standpoint, when you're working in any of these layers, you have to keep in mind what environment you're really in.
20:41
You're technically writing user-land code except that the build environment for everything here in red is the same as if you were running kernel code, because everything underneath the UINet API is written as if it was written in the kernel
21:00
because you're talking to the kernel sockets API, you're talking to the kernel networking stack, you're talking to kernel facilities, and then on down. If you're coding inside the UINet host interface, that's a host OS environment. All the code you write in there is like a normal user-land program. You've got pthreads, clibrary, everything at your disposal.
21:20
You'll see the packet interfaces are typically split between the two because they have to implement an ifnet interface. You have to interact with the kernel facilities, but to get your packets in and out at the end of the day, you're plugging into some host OS facility, whether it be netmap or a Unix domain socket or something else. What you'll find in the typical ifnet implementation
21:43
is the driver split into two separate files. One is built in a kernel environment and one is built in the host environment, and then there's an API, a clean API that doesn't have any external dependencies for symbol definitions that goes between the kernel and host parts.
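A sketch of that two-file split for a hypothetical driver, with invented names; the point is that the boundary header depends on neither kernel nor host headers:

    /* if_xyz_api.h -- the clean boundary: plain C types only, no
     * kernel and no host header dependencies (names illustrative) */
    void *xyz_host_open(const char *ifname);
    int   xyz_host_transmit(void *ctx, const void *frame, unsigned int len);

    /* if_xyz.c is built in the kernel-context environment: it implements
     * the ifnet interface and calls xyz_host_*().  if_xyz_host.c is built
     * as host code: it implements xyz_host_*() with netmap, pcap, a Unix
     * domain socket, or whatever the backend is. */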
22:03
And also, there's things that go into the host side of one of these packet drivers that could be pulled into the UINet host interface if they're generic enough. Sometimes they're just driver-specific and they live in there. The character of everything that falls in here inside one of these drivers
22:21
is exactly the same as the character in there. And in some cases, it's just sort of a purely discretionary call as to whether a routine can be found in, say, the host portion of a netmap driver, or whether it was pulled into the more general UHI. The UHI itself, of course, can be used anywhere because it has a completely generic interface.
22:41
It's not dependent on any host or kernel headers. In general, though, if you're inside kernel code, you use a kernel facility first and only use the UHI interface if you have no other option. An example of that would be you're writing a new feature
23:02
inside the kernel part of the lib UINet and you need a thread. You could call the UHI create thread interface, which would work, but the more proper thing to do is call the kthread interface, which is using UHI underneath because the kthread interface is doing additional things to keep that thread properly initialized and organized
23:24
within the kthread kernel context that wouldn't be happening just by calling the UHI thread-create routine directly. All right, so as I've said, the API itself is intended for use
23:42
in non-blocking event-driven applications. You get pretty much, just by virtue of the fact that it's based on the kernel sockets interface, you get blocking support almost by default because it's already there, but there's currently no way to wait on groups of sockets in this implementation
24:00
because we've completely done away with file descriptors. We're not emulating file descriptors. A UINet socket is really just an opaque pointer to a socket structure. It's not wrapped in anything else. And because we're only interested initially in event-driven applications, there's been no facility equivalent to poll or select
24:20
implemented for UINet sockets. It's a direction that could be gone into, but it's just not on the direct roadmap for UINet right now.
24:40
So the initial goal of the API was to integrate with WANProxy, right, because that's where the whole project started. WANProxy has its own event system, so the API was tailored where necessary to interact with that, but the idea is that we provide enough tools to in general integrate it with any event system. We're trying to capture generically what's required
25:01
for integration with event systems. I think I've done it because we've integrated not only with WANProxy, but also with Libv, and they're both event systems, but the integrations, they just look very different. There's a number of details that can differ
25:21
in the implementation of event systems that have different requirements they place on things like a library that provides non-blocking sockets and a callback-based interface for events, and I think between these two integrations that have been completed so far, we have a pretty general interface.
25:40
If we want to integrate with another event system that was application-specific for another project, or with libevent or one of the other extant event systems, I think we have all the tools in place already to do it. You know, one of the motivations for integrating
26:01
with libev was that although we expose essentially all the kernel sockets API functionality, which includes non-blocking sockets and upcalls, and you can write your application directly to it, I think most people would be happier not having to do that, because the kernel upcall
26:22
mechanism has quite a bit of a learning curve to it, and there's a common hazard involved with the way kernel upcalls are invoked. You know, a kernel upcall is a callback that you can attach to a socket and say, you know, when there's read activity, call this function, when there's write activity, call this function.
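Here's the shape of that, modeled on the kernel's soupcall interface that this layer reuses; the application-side names are hypothetical:

    /* kernel-context code in libuinet's build environment */
    #include <sys/param.h>
    #include <sys/socketvar.h>

    struct my_conn;                               /* hypothetical app context */
    void schedule_read_event(struct my_conn *);   /* hypothetical hand-off */

    static int
    conn_readable(struct socket *so, void *arg, int waitflag)
    {
            /* runs with the receive sockbuf lock held -- see below */
            schedule_read_event(arg);
            return (SU_OK);
    }

    /* attaching the upcall, e.g. right after accepting `so`: */
    SOCKBUF_LOCK(&so->so_rcv);
    soupcall_set(so, SO_RCV, conn_readable, conn);
    SOCKBUF_UNLOCK(&so->so_rcv);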
26:42
Those functions are called with the lock on the sockbuf inside the socket held, and typically what you want to do in integrating with your application through one of these callbacks is, inside that callback routine that you've supplied, you'll grab some other application-specific lock, right, to then do something in your application under that lock,
27:02
and then what typically happens is there's some other part of the application that wants to hold that lock while doing some sort of socket operation. Well, the socket operation's gonna grab the sockbuf lock to do its work, and you have a lock order reversal then, because when the upcalls are called, you've got the sockbuf lock, then your application lock.
27:20
When somewhere else in your application you lock your application lock then call into an API routine, you've got the locks being acquired in reverse order. There's usually a way around that, but what seems to happen is it's common for the initial implementation of trying to integrate with up-calls to run into the lock order reversal problem and then have to solve it, and then you find out when you solve it,
27:41
what you're really doing is writing an event system because you're like, oh, I have to queue these event notifications to some other thread or context and then deal with them with the right locking order. Okay, you're starting to write an event system, so why not just provide integration with a widely-known and used event system so you can hit the ground running
28:00
and not have to worry about details like that. There's currently two packet interface implementations in the source base. One is for NetMap. It was written to an earlier version than current, so it does zero copy on receive
28:22
up to some fraction of the available ring buffers, so when it's feeding packets to the bottom of the TCP/IP stack, it'll do that zero copy up until a half or three quarters or some fraction that you can choose of the NetMap ring buffers are outstanding to the stack, and once you pass that threshold, it'll start doing copies so that you don't wind up
28:42
using all the ring buffers up, handing them all to the stack. The stack hangs onto them, and now you've stalled your receive path. So I say it builds with the current NetMap, but it's not yet taking advantage of some of the more recent features, which are the ability to expand the number of receive buffers beyond the ring size for the adapter.
29:03
That gives us a much wider zone where we can stay in zero copy receive mode when feeding packets to the stack, and there's also functionality on the transmit side that allows us to do zero copy coming from the stack. Currently, everything coming out on the transmit side
29:20
is copied from an mbuf into a ring buffer in NetMap and then sent, but the functionality is currently there in NetMap. I just have to update that packet interface to use it, and we'll see some increased performance on that packet interface. There's also a PCAP interface, which I've used less widely.
29:42
I've mainly used it for feeding PCAP files to the stack for testing. It's a really handy feature to have because you can take a capture and then feed it to the stack to develop a new feature, reproduce a bug, et cetera. Of course, you can also use it to deal with real network interfaces, not just PCAP files,
30:02
so it's useful for doing portability work to operating systems that don't currently have NetMap support, say, Mac OS. But packet interfaces in general could be anything. You just have to put an ifnet interface on the top of it, and what's underneath could literally be anything.
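A sketch of the receive path such an interface might have, folding in the zero-copy threshold described for the netmap case; ifp->if_input and the mbuf header field are real kernel-side interfaces, while the helper names are invented:

    static void
    pktif_deliver(struct ifnet *ifp, void *buf, unsigned int len)
    {
            struct mbuf *m;

            if (bufs_loaned_to_stack() < zero_copy_limit)
                    m = mbuf_wrap_external(buf, len);  /* loan: zero copy */
            else
                    m = mbuf_copy_in(buf, len);  /* copy: keep the ring free */

            m->m_pkthdr.rcvif = ifp;
            ifp->if_input(ifp, m);     /* feed the bottom of the stack */
    }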
30:23
And that's not a comprehensive list of possible backends. So there's a couple of, I'd say, major open issues with the port so far. One is with locking. There's a pretty diverse set of locking primitives
30:45
in the kernel that have different semantics and different behavior in certain circumstances. In user land, going with POSIX, we've got a much smaller set of primitives, and in some cases, the behavior differs. I think that one of the most relevant issues
31:03
in this port has to do with read-write locks, because right now, all the re-implementations of kernel lock facilities in libuinet are mutexes. You know, they're POSIX mutexes. They're either configured for recursive operation or not, depending on what the remapped kernel call
31:21
was really asking for, but there's not a real read-write lock. So right now, the built-in expectation is there's a lot bigger chance for lock contention in libuinet than you'd find in the kernel for a similar traffic flow through the stack. One of the issues with the pthread interface
31:42
is that while they have a read-write lock, it doesn't support the recursion semantics that are defined for the FreeBSD kernel read-write lock, and the recursion behavior that's not supported
32:01
by pthreads is an optional feature of the kernel rwlock interface, but the locks of interest, like the inpcb locks that are used for managing connection context in the stack, want to use that recursive behavior. So something has to be done there; it remains on the to-do list to build something. It might be possible to build something
32:21
around the pthreads rwlock, or it might have to build something from mutexes and queues and things. PCPU is a kernel facility for per-CPU data, right? There's a number of optimizations in the kernel today that use PCPU. The whole point is that you can keep context
32:40
on a per-CPU basis, and so components of the kernel can be implemented to cache things or maintain state on a per-CPU basis, so when you're trying to access a certain facility like allocate memory, or allocate memory in a UMA zone allocator,
33:01
or do packet processing work through netISR, that you can keep that processing or context on a given CPU and take advantage of warm cache effects, or you can avoid, for the memory allocator, you can keep local caches of objects that are per-CPU,
33:22
so wherever the allocation happens, it can try and allocate it out of the local cache first and not have to grab a lock that might be contended across CPUs, which is an expensive operation. In general, what PCPU is providing and how it's used in the kernel is of value, and it's something we wanna be able to take advantage of in libuinet,
33:41
but we really can't currently. There's more work to be done there, but part of the issue is there's no userland way to disable preemption. You know, in the kernel, there's two routines called critical enter and critical exit, right, and those are really inexpensive ways to keep a currently running thread from being preempted on whatever CPU it's running on.
34:01
There's just no userland equivalent. That's not required for all the uses of the PCPU infrastructure in kernel code, but it is for some important ones, like the UMA zone allocator. It uses critical enter and critical exit to protect access to its per-CPU cache of zone objects.
34:21
So, you know, currently, you know, you could emulate that by saying, okay, critical enter and exit are gonna be a mutex, but now you're sort of defeating the purpose. You know, the per-CPU caching and preemption disable going on in the kernel is in part done to prevent acquiring and contending on mutexes across processors, right?
34:41
So that's not really a reasonable way to emulate it and expect performance. And although we'll talk a little bit more about performance later, it's certainly clear that some of the PCPU optimizations, you know, aren't able to deliver their intended benefits
35:01
with the current state of the port as it is. But I think it's actually worse in some cases, where they become pessimizations. Like I said, the UMA zone allocator is heavily used throughout the stack. You have a question? Yeah, yeah.
35:35
Right, because, well, really, right, so there's two things that we can do, I think.
35:43
Let me back up. In thinking about this problem, there's different approaches that may not 100% give you what PCPU does in the kernel, right? But it can still get the benefits or some portion of the benefits that it's trying to provide, right, using the same infrastructure.
36:00
And the suggestion of saying, well, what we're really trying to do is keep threads from contending with each other for the same resources across CPUs, right? So we don't have to quite literally keep everything on a per-CPU basis, right? We can keep, we can reduce inter-thread contention of resources by having per-thread resources instead of per-CPU.
36:21
And that's what the bold item is saying, oh, can we make things per-thread instead of per-CPU? An example that many people might be familiar with is the way jemalloc uses arenas for its allocation. Right, so that's certainly one idea. It might be an answer or the answer here. Another way to look at it is using thread pinning.
36:40
You know, that's currently how things are in the functional port that's done today. It will work better when you have threads pinned than not. So the way that currently works is if your thread is pinned to a CPU, any of these per-CPU accesses
37:00
will go to the context for the CPU that that thread is pinned to. If the active set or the CPU set for a thread doesn't have a single CPU in it, in other words, you're not pinned to a single CPU, it'll just go to context zero. So it's kind of like, anything you haven't explicitly pinned will all fight for the same PCPU context, context zero.
37:23
But if you start pinning your threads, all the accesses to PCPU data will go to a CPU-specific location that corresponds to the CPU that you're pinned to. So that's another way to kind of, you know, you're getting some slice of the functionality,
37:41
but you're not getting the full benefit. And that's, you know, the analogy there for the thread pinning approach would actually be the way that NetISR uses, actually currently uses PCPU stuff. Because NetISR is really a pool of threads where, full of worker threads with each one pinned to a CPU.
38:01
And it uses PCPU data, you know, in that way, because it's accessing that data from inside of a worker thread. And it relies on the fact that it's already pinned. Okay, so now onto the extras.
38:22
So, the first new feature work was done once we had the stack ported to userland and functioning and delivering all of its existing functionality. You know, TCP, UDP, anything that works in the stack in the kernel works in userland after the port.
38:42
There's some things we haven't tried yet. Like we haven't stood up SCTP support. So there's probably some corner cases in the kernel abstraction, you know, re-implementations that we need to fix. But in general, you know, we're using the whole kernel source unmodified. We've provided a lot of facilities; everything that works in the stack
39:01
in the kernel should work in userland. We've just been most heavily exercising the TCP part of it. But the first new bit of functionality is aimed at the original motivating case, which is building transparent proxies that can handle large numbers of connections, that can handle a huge diversity in addressing across those connections,
39:23
and also handle, you know, lots and lots of VLANs in that diversity of addressing. So the promiscuous sockets term is what I'm using to describe the ability to set up a listen socket that has a lot more control than usual over what kind of connections it'll capture.
39:40
By capture, I mean if it sees a SYN come in, it'll say, it'll match that SYN and say, that's a connection for me. It'll do the three-way handshake and establish the connection with the client. So a promiscuous socket allows you to listen on any IP address, any port, any VLAN tag stack. And when I say any VLAN tag stack, you can specify, I'm only gonna match connections
40:02
that have, you know, this many levels of VLAN tag stack with these specific tags in each level. You can say, I wanna capture connections that match all the other criteria on any VLAN, or you can say, I wanna insist that there's no VLAN tags. I only wanna match untagged traffic.
40:21
And you can wildcard on any of them. So you can say, any combination of, like I said, listen on specific VLAN or any VLAN or no VLAN, you can say, I wanna listen on a specific IP or any IP. You can say, I wanna listen on a specific port, I wanna listen on any port. And all the combinations are supported.
40:40
So you can really go from very targeted listens to wide open, you know, I'm gonna create a connection, do a three-way handshake for any SYN that reaches this interface. By the way, I'll just make a quick side note. If you're actually building something that is going to respond to every SYN that reaches it,
41:01
you should think carefully about the network that you're plugged into. Because even on switched networks, right, address table misses will cause SYNs going between two other endpoints that are unrelated to your network segment to show up on your interface. That's something I became aware of early in development when all my X terminals disappeared.
41:24
So that's on the listen side. On the active side, promiscuous sockets give you control over your personality on the network. So you can say, all the frames that I send are gonna have this VLAN tag stack, this source and destination MAC address, this source IP and source port.
41:43
Of course, the destination is controlled through connect, so there's nothing new there. So that's promiscuous sockets. It's those specific pieces of functionality plus any supporting infrastructure for that. The supporting infrastructure includes an ability to bypass routing on network interfaces,
42:01
on input and output. We'll talk about why that's interesting a little later. That's done using something called connection domains, another invented term, abbreviated CDOMs. We'll talk more about those. There's a new interface mode for implementations of network interfaces, for the drivers,
42:20
that provide additional handling of L2 info. I mean, normally that stuff is stripped off the packets after they're passed up from the Ethernet layer to a higher layer, but some of this functionality, like having access to L2 info at a higher level in the stack, at the socket level, as you'll see later,
42:41
requires preserving some of that information. And also some of the steering for connection domains gets handled by that mode. And then finally, something called SYN filters, and that's an ability to do some analysis on each arriving SYN packet to decide how you wanna handle it,
43:00
whether you wanna feed it to the stack or not. I'll, of course, go into detail on that. So, connection domains are really a way to map connection handling, TCP/IP socket behavior, to physical ports in the machine. Well, I say, to network interfaces.
43:21
These network interfaces I'm talking about are, in the stack, of course, a virtualized concept. They may or may not map to physical ports in the machine, depending on what kind of packet interface you're using and what your architecture is. But, you know, normally on packet output, there's gonna be a routing lookup that's gonna decide which interface to send to. Connection domains is a way to bypass that, to say that, you know, to set everything up
43:42
so that you can say, all traffic sent from this socket is gonna go out through this network interface all the time. So, that fits, that kind of behavior fits architecturally with some of the use cases for this stack enhancement. But it's also something that's desirable
44:00
from the standpoint of, you know, if you weren't originally architecting things that way, but you wanna handle, you know, hundreds of thousands or more simultaneous connections, you know, you need to start considering whether you really wanna feed all that stuff through the routing infrastructure for performance and scalability reasons. Because a lot of times, it seems to me, at least from my perspective of how this is being used,
44:21
you know, the routing infrastructure isn't actually necessary. So it's, you know, it didn't make sense to feed everything through there, even though it would, quote, work, because we'd have these huge routing tables. You know, if you've got 100,000 sockets, there's a diversity of addresses
44:40
and such that they basically all wind up with their own individual routes. You've got 100,000 routes in the box. But your architecture might be that all the traffic's coming in these two interfaces and going out those two interfaces. There's no purpose for doing all that. So that's one of the motivations for this whole connection domain approach. So the way it works is every packet interface
45:00
belongs to a connection domain, as do all established connection contexts, inpcbs, in the kernel. You can think of those roughly as equivalent to sockets. They all belong to a connection domain. On receipt of a packet, that packet is tagged with a connection domain depending on the interface. It just inherits the connection domain of whatever interface it arrived on.
45:23
And a given received packet can only match a connection context within its connection domain. So that's where the whole term comes from. So when that packet's received, it makes it up, you know, it's a TCP/IP packet, and it makes it up through the IP layer, and the first thing that's done is a lookup is performed
45:41
to figure out which existing connection, if any, that packet matches. Is it something of interest that should be processed further by the stack? Connection domains are a way to segment the scope of that matching. So instead of matching against all connections that are present in that entire instance of libuinet,
46:02
associating interfaces with connection domains gives you the ability to have an independent pool of connection contexts when it comes to matching. And that lets you do things like build a box that can be connected to multiple, fully independent networks that may be using the same addressing. All right, apparently I'm gonna speed things up here.
46:25
That's something, you know, you can't currently do very easily without promiscuous sockets. So the way this shakes out is CDOM0 is the default. If you don't configure anything, everything will be in CDOM0. It's special from the standpoint of connection domains,
46:40
but really it's the default behavior of the stack. You can have multiple interfaces in CDOM0. Outbound packets in CDOM0 are actually routed. They don't go through this, you know, fixed-interface transmit path. And the non-zero CDOMs are what are used for promiscuous sockets. So when you create a promiscuous socket,
47:01
you assign it a CDOM, and that basically identifies which packet interface in the system that socket is going to handle traffic for. And as I said before, once you do that, no outbound packets are passed through the routing infrastructure; they always go to that interface.
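As a rough sketch of what binding a promiscuous socket to a connection domain could look like with a setsockopt-style interface (I mention later that the functionality is exposed through additional socket options; the option names SO_PROMISC and SO_CDOM below are invented, so check the libuinet headers for the real ones):

    /* Hypothetical sketch: SO_PROMISC and SO_CDOM are invented names. */
    #include <stdint.h>
    #include <sys/socket.h>

    #define SO_PROMISC 0x1001   /* invented: mark the socket promiscuous */
    #define SO_CDOM    0x1002   /* invented: bind the socket to a domain */

    int make_promisc_socket(int s, uint32_t cdom)
    {
        int on = 1;

        /* Mark the socket promiscuous... */
        if (setsockopt(s, SOL_SOCKET, SO_PROMISC, &on, sizeof(on)) < 0)
            return -1;

        /* ...and pin it to a non-zero connection domain. That choice also
         * fixes the packet interface all outbound traffic will use, so no
         * route lookup ever happens on output for this socket. */
        if (setsockopt(s, SOL_SOCKET, SO_CDOM, &cdom, sizeof(cdom)) < 0)
            return -1;

        return 0;
    }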
47:22
For people familiar with the internals: a lot of these characteristics, maintaining these properties for interfaces and connection contexts (the inpcbs), are already in place in the form of the FIB number properties that are in there. So we leverage that plumbing
47:41
in the implementation to make CDOMs work. This is just a quick sketch of what I'm talking about. You can see CDOM 0 is the normal case. The green rectangles represent connections; think of each like a socket. Inbound traffic from the interface comes in, goes through a connection lookup,
48:01
and gets mapped to a connection on input. Anything going outbound goes through the routing infrastructure and then gets shunted out to the proper interface. For non-zero CDOMs, it's much simpler: all inbound traffic goes through lookup, but only within that connection domain, and everything outbound always maps to a single interface.
48:25
I mentioned there's a new interface mode that works in concert with promiscuous sockets. There's a new flag, call it IFF promiscuous-inet, that when set on the interface causes the VLAN tag stack to be removed. It supports an arbitrary stack depth, up to some defined constant.
48:42
I think it's currently 16, which is more than necessary, in the sense that it'll handle anything you'll see in the wild. It'll remove the tag stack, remove the MAC addresses, and save them in an inbound tag on the packet, so they'll be available for analysis at any other layer in the network stack,
49:01
up to and including the SYN filter. A SYN filter is something you can install on a promiscuous listen socket; it's a callback that gets invoked for every SYN that arrives matching that socket's criteria. It provides two main pieces of functionality.
49:22
One, it allows you to do more complex matching than is possible with just the specific-or-wildcard ability of the individual fields of the promiscuous listen. So you can look for complex subsets of VLANs or IP addresses or ports using a SYN filter. The SYN filter gets called on the SYN packet.
49:41
You then return a status that says: yes, accept this and pass it through the existing machinery; reject it silently; reject it with a reset; or defer the decision, which is basically saying, don't submit that to the existing SYN cache machinery yet, I'm going to hold it aside and at some later point resubmit it with a disposition.
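In code, a SYN filter is conceptually a callback returning one of those four dispositions. Here's a sketch using invented names throughout (struct syn_info, the verdict constants, and the helper functions are stand-ins, not the real libuinet signatures):

    /* Invented names throughout; the real callback API differs in detail. */
    enum synf_verdict {
        SYNF_ACCEPT,          /* pass to the normal SYN cache machinery  */
        SYNF_REJECT_SILENT,   /* drop the SYN, send nothing              */
        SYNF_REJECT_RST,      /* drop the SYN, answer with a reset       */
        SYNF_DEFER            /* hold it; resubmit a verdict later       */
    };

    struct syn_info {         /* what the filter might get to look at    */
        unsigned vlan_id;
        unsigned src_ip, dst_ip;
        unsigned src_port, dst_port;
    };

    extern int  already_deferred(const struct syn_info *syn);  /* assumed */
    extern void start_upstream_connect(const struct syn_info *syn);

    enum synf_verdict my_syn_filter(const struct syn_info *syn, void *arg)
    {
        (void)arg;

        /* Matching more complex than specific-or-wildcard fields allow,
         * e.g. rejecting a whole range of VLANs: */
        if (syn->vlan_id >= 100 && syn->vlan_id < 200)
            return SYNF_REJECT_RST;

        /* The same SYN can arrive again via retransmit while a deferred
         * decision is pending, so deduplicate before acting. */
        if (!already_deferred(syn))
            start_upstream_connect(syn);   /* e.g. a proxy; see below */
        return SYNF_DEFER;
    }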
50:02
And that last piece of functionality is particularly useful in building well-behaved proxy applications. There are a couple of things to keep in mind when you're implementing a SYN filter. You have to take into account the fact that
50:21
the same SYN may arrive multiple times due to retransmits, especially if you've told the first instance you're going to defer the decision. If it's taking you a while to make that decision, you may get another copy from the client in the meantime, so you have to structure your SYN filter implementation accordingly. And nothing in promiscuous sockets or SYN filters directly defeats
50:42
or precludes the use of the SYN cache or SYN cookies. They both still work fully in the way they were intended, except that, depending on your implementation of the SYN filter, you can subvert their benefits. Say the first thing your SYN filter does, every time it runs, is allocate a large amount of context for whatever decision-making you're going to do.
51:00
You've then pretty much subverted the point of the SYN cache, which was exactly not to allocate a lot of memory on every SYN that arrives. This slide shows the enhanced proxy behavior you can get with a SYN filter. What it's showing, first, is a normal proxy without a SYN filter:
51:20
the green dots represent connection establishment. Normally, a client sends a SYN; it goes to the SYN cache; the SYN cache sends the SYN-ACK. When the final ACK in the handshake arrives from the client, the socket is created and becomes available through accept() in the proxy. At that point you say, oh, I've got a new connection from a client
51:41
in my application; that's your first point of awareness. You then initiate the connection with the server. The problem is, if this part doesn't work out, then between here and whenever a timeout fires or a FIN arrives from the server saying this isn't going to work out, you've had an open connection from the client.
52:02
The client may be sending you things because it thinks you're the server. Architecturally, that presents some challenges: do I queue the data? Do I ignore it? Does the client care how fast I respond once the connection is opened, et cetera? With a SYN filter, your SYN filter runs when the SYN comes in.
52:21
At that point, you can initiate the connection to the server you're proxying to. And when the server responds with its SYN-ACK, that's when you get connection establishment on your outbound connection. At that point you can say, ah, that worked out; I'm going to submit my deferred decision
52:42
for the SYN filter invocation back to the stack, and that will submit to the SYN cache, which will send the SYN-ACK and complete the connection with the client. If anything goes wrong in here, maybe you got a reset from the server or it timed out, you have no established client connection, and the client application hasn't proceeded in any way.
53:04
Your emulated behavior, in terms of connection establishment, is now exactly what the client's experience would have been if it were connected directly to the server instead of the proxy.
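Putting the deferral together, the improved proxy flow looks roughly like the sketch below. It reuses the hypothetical verdict constants and syn_info type from the earlier sketch; the helper names and the resubmission call are likewise invented stand-ins for the real API:

    /* Sketch of the deferred-decision proxy flow; names are invented. */
    #include <errno.h>

    struct deferred_syn;                              /* a held-aside SYN */
    extern void begin_connect_to_server(const struct syn_info *syn);
    extern void synf_resubmit(struct deferred_syn *d, enum synf_verdict v);

    /* 1. In the SYN filter: don't answer the client yet; try the server. */
    enum synf_verdict proxy_syn_filter(const struct syn_info *syn, void *arg)
    {
        (void)arg;
        begin_connect_to_server(syn);   /* non-blocking connect upstream */
        return SYNF_DEFER;              /* no SYN-ACK to the client yet  */
    }

    /* 2. Later, when the upstream connect completes or fails: */
    void on_server_connect_done(struct deferred_syn *d, int error)
    {
        switch (error) {
        case 0:              /* server completed the handshake          */
            synf_resubmit(d, SYNF_ACCEPT);  /* SYN cache now SYN-ACKs   */
            break;
        case ECONNREFUSED:   /* server reset us: relay a reset          */
            synf_resubmit(d, SYNF_REJECT_RST);
            break;
        default:             /* timeout and the rest: emulate silence   */
            synf_resubmit(d, SYNF_REJECT_SILENT);
            break;
        }
    }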
53:25
Yeah? [Audience member] I can see that in one case we'd have gotten the connection closed, and in the other case we'd have gotten the connection reset,
53:41
so if you have your SYN filter, you can actually convey the reset all the way down to the client. That's a different... [Speaker] Well, in this case, you're saying that if you sent the server a SYN and it sent you a FIN in response, or
54:05
if it sends you a reset, then you actually know that, because you get that status from your connection attempt. You know how the server responded to you. You've done a connect call here, and based on the way the connect effort terminates,
54:22
you can distinguish, through the normal machinery, without any special changes, whether you got a reset, a close, or a timeout. And then, when you submit this for decision, you can tell the SYN filter: reject that SYN silently, which would emulate the timeout, or reject it with a reset,
54:41
which would emulate a reset. I don't think, in the flow here, you can get anything other than a reset or a timeout in this part of the sequence. [Audience member] Maybe my question is: you get a reset from the server,
55:02
but by the time you get your reset, you've already established the client connection. So when you get the reset from the server, let's say after you... [Speaker] When you send your SYN, like, the first time? We're talking about this case here, yeah. [Audience member] Then the only thing you can do is close. [Speaker] Oh, right, yes, exactly.
55:20
I mean, that's part of the benefit of having the SYN filter implementation. Right: in that case, without the SYN filter, you can't actually emulate all the behaviors of the server back to the client, yeah. [Audience member] And I personally have never seen an application
55:46
where that's actually useful. The web browser doesn't care whether you're resetting or closing. I guess my question to you is: have you ever seen a case where this kind of behavior is important? [Speaker] Yeah, I'd say the actual answer to that
56:01
is that I don't know of a specific real-world installation example where that's the case. All I can tell you is that this was one of the required features when I was approached to do the work. Because I'm doing this work under contract,
56:20
and I don't have exposure to the customer side of this project, I don't have a real-world use case where I can tell you: oh, in this installation, with this client and this protocol, this makes all the difference. I think the way to look at it is that the main value is in what it allows you to do:
56:41
if you're talking about inserting a proxy into some situation to provide some sort of behavior, just look at it from the point of view of a startup creating a new product and saying, I can deliver feature XYZ by proxying all your traffic. This is just a basic product-development benefit, right?
57:00
Someone can nitpick and say, ah, well, you're changing things, you're visibly getting in the way between my clients and my servers. And then you have to have an argument about whether that really matters. Like you're saying, web browsers don't care. Well, do they? Technically, this is transparent. [Audience member] But then it never ends, right? Because then you have your sequence numbers, which you'd have to align.
57:21
[Speaker] Well, that's a question of how transparent you want to be. Then there's timing, then there's... do you want to be invisible? This is true: this is transparent down to addressing and connection-establishment details. Beyond that, nothing else is being aligned: not timing, not sequence numbers,
57:41
not all the odd behaviors of stacks that can fingerprint them. This isn't invisible sockets; that's not the goal. So, I think we're one minute over now. Okay, I apologize. Give me two more minutes
58:00
and I'll wrap it up, because this is really, I think, the bulk of the interesting material. I'll skip the walk through the API; the flavor of it is that the interface for doing all this looks a lot like the interface for normal socket operations. I've tried to implement all the functionality through additional socket options as opposed to new API calls. The reasoning is: what if we wanted to use this in the kernel
58:21
as opposed to in userland? That makes it more possible. Just a comment on scalability. The way scalability was handled is that all the connection contexts are still kept in one giant hash. But everything that comes through promiscuous interfaces and
58:41
promiscuous sockets is hashed with a more expensive hash that takes into account the VLAN tag stack and the source and destination, in order to get good distribution. So if you're handling a million connections, those connections can be on a million different VLANs with the same IP, or on a million different IP and port combinations; you can slice it however you want.
59:00
You'll still get good hash distribution, and you won't get performance degradation from long chains. The way it's done is that everything in connection domain zero, which isn't being processed in promiscuous mode, still uses the existing smaller, faster hashes. It goes in the same hash table, but the lookup path is different because it knows whether it's doing CDOM-zero or promiscuous-socket lookups.
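Here's a self-contained sketch of the idea behind the more expensive promiscuous-path hash: fold the VLAN tag stack into the key alongside addresses and ports, so the table stays well distributed no matter which dimension varies. The key layout and the use of FNV-1a are my illustration, not the actual kernel code:

    /* Illustrative only; the real code uses the kernel's own hashing. */
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_VLAN_STACK 16   /* the talk mentions a limit around 16 */

    struct promisc_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint16_t vlan_stack[MAX_VLAN_STACK];
        int      vlan_depth;
    };

    /* FNV-1a over addresses, ports, and however many VLAN tags are
     * present, so a million connections differing only by VLAN (or only
     * by address, or only by port) still spread evenly. */
    uint32_t promisc_hash(const struct promisc_key *k)
    {
        const uint8_t *p = (const uint8_t *)k;
        size_t len = offsetof(struct promisc_key, vlan_stack)
                   + (size_t)k->vlan_depth * sizeof(uint16_t);
        uint32_t h = 2166136261u;

        for (size_t i = 0; i < len; i++)
            h = (h ^ p[i]) * 16777619u;
        return h;
    }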
59:20
So you only pay the penalty when you want the promiscuous functionality; you can still use all the normal stack functionality without additional expense. And this has been tested up to basically 2 million sockets in the box: 1 million active connections and 1 million listening connections, addressed however you like,
59:41
and all those sockets function and pass data correctly. The second feature that was added is a lot simpler to talk about. It's called passive receive, and it's the ability to run TCP reassembly and socket operations on a copy of the packet stream
01:00:00
between two endpoints. So you can be connected to a span port, or some other layer in your architecture that delivers a copy of the packet stream, and you get a pair of sockets you can read from that will carry the fully reassembled TCP streams present in the actual connection between the two endpoints you're monitoring.
01:00:25
We'll have to save the rest of that, I think, for online reading. But a couple of things to think about before we part ways: when you're doing passive receive, you're not actually participating in the TCP protocol; you're only monitoring it. So if there's any packet loss in the path
01:00:41
between you and the packet stream you're observing, you have a problem. If you're using a span port, this is a big problem, because span ports are really lossy, at least in the equipment I have access to. Even between virtual machines, where it's all in memory, they're really lossy. And there's no way to get a retransmit of a packet you missed: the client isn't going to retransmit a packet because it went missing to you, the passive receiver,
01:01:02
if the other side saw it and ACKed it. So there's additional functionality built in to handle the case where there are missing packets. You get holes in the data, and you might have missed the FINs closing connections down. So you need support in the mbuf system for hole data, support in the receive path to tell you where hole boundaries are
01:01:22
and whether you've had any, so your application can give up or continue accordingly. And you also need timers to make sure connections don't live indefinitely: you can either use the existing idle timers to kill connections after inactivity, or the application can run its own timer to handle the case where you missed the connection close-down.
01:01:43
Okay, so that was the API. And now performance: time performance, as opposed to the scaling performance we've already talked about. We're just really getting into the phase where we're starting to look at time performance, but in quick-and-dirty tests,
01:02:00
doing something like a netcat-style transfer of a large file, throughput through the userspace stack is currently about 70% of what you get doing it in the kernel with everything else the same. So on the one hand, it's not 10x slower, or even 2x slower; on the other hand, there's clearly a gap we have to close. And I think there are some obvious places to look.
01:02:23
We talked about locking and per-CPU issues earlier; those will account for part of the difference. And using the netmap interface will give us some benefit from reduced packet copies once I bring it up to the current revision of netmap. And we'll close with the list of future work.
01:02:44
Maybe the most relevant item is that, right now, promiscuous sockets are not fully plumbed through IPv6. There's no technical reason why not, other than that there was an IPv4 requirement. This was one of those scratch-an-itch projects: there was an IPv4 requirement,
01:03:00
and then a requirement to get passive receive working, so plumbing everything through IPv6 for promiscuous sockets was low on the list. But if you actually look at the code, it's partially done. You just have to replicate exactly what's going on on the IPv4 side into some of the v6-specific equivalent functions. There's technically nothing new going on; it's just code that has to be written.
01:03:23
Currently, the TCP/IP code in there is from 9.1. So a near-term effort we're looking at, maybe over the summer, is to upgrade that to the 10 series to get some of the improvements that have happened since 9.1. And that's probably all that's of interest there.
01:03:41
And a couple of quick acknowledgements. Of course, to Juli Mallett, who was really persistent in connecting me to this work, was a good sounding board throughout, and certainly suffered through the first libuinet integration with WANProxy. And of course, to the sponsors who paid for all this; those willing to be named are listed on the slide. If you click on a logo,
01:04:01
it'll take you to their website. For some reason, nobody wants to talk publicly about work like this. The good news is that it's open source, so we can talk about the details. I think it's the same old story: the source code is perceived to have no marketing value, but what it's used for is another matter. So we can talk about the source code all day and night,
01:04:21
but nobody wants to advertise what they're building yet. And this was a nice Canadian scene from west of town... well, from Vancouver Island. But I think we've covered the Q&A and run out of time. All right, well, thanks for sticking around, guys.
01:04:41
Sorry I ran over. This deck will get posted; there'll be a link on the speaker page at some point.