
Multipath TCP for FreeBSD


Formal Metadata

Title
Multipath TCP for FreeBSD
Subtitle
An overview of the protocol, stack architecture & performance analysis
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Come with me on a journey to learn about the Multipath TCP (MPTCP) protocol and the first publicly released FreeBSD implementation. This talk will examine MPTCP's 'wire' characteristics, the architecture of the modified FreeBSD TCP stack, observations from the development process and results of both performance analysis and empirical research conducted using the stack. Multipath TCP (MPTCP) transparently retrofits multi-pathing capabilities to regular TCP and is a work-in-progress Internet Draft being developed within the IETF. The Cisco University Research Program funded the Centre for Advanced Internet Architectures to develop an interoperable implementation of MPTCP for FreeBSD as part of a research project to study mixing loss-based and delay-based congestion control in a multipath context. As a researcher on the funded project and lead author of the FreeBSD MPTCP implementation, I have data and insights to share with you about the process of going from stock FreeBSD and an IETF Draft to an interoperable MPTCP implementation that is being used in ongoing research programmes.
Transcript: English (auto-generated)
Thanks for coming in. I know I'm up against rocket science at the moment, so it's good to know some people are still interested in TCP. So I guess I'll preface this by saying that I only got invited to speak on Thursday of last week. And so between traveling halfway across the world
and organizing all of that and getting here today, I had to write up some slides. So it's going to be a little bit higher level, possibly, than what people may have expected. But I'm here for the next couple of days, so if there's anything that you're interested in
or whatever, hopefully this gives you a good overview of what's happening and you can come up and talk to me about details or anything like that at any point. So basically, the talk's going to be divided into two parts. So firstly, I'll talk about MPTCP itself. I guess there's a couple of different multipath kind
of solutions out there, and I guess people have different ideas about how that might work and whatnot. So I'll give you a good idea of how this particular implementation works, and then I'll go through the high level architecture of what we've done. Where our patch is at the moment, it's not completely 100% quite yet,
and just some of the experiences that we had in following an internet draft and now it's an experimental RFC, but some of the pitfalls and fun that was had there. So who was working on it? It was myself and Lawrence Stewart, who's kind of a bit of a more experienced
FreeBSD developer than I am. In fact, I'm completely new to it. I drew the short straw, so I get to present today, but basically it was us with me doing kind of the legwork programming, and Lawrence was kind of coming up with the architecture and kind of doing the heavy thinking.
That's our website there. There's a link section at the end. I guess you can grab the notes afterwards if you really want to check it out. So about me, I'm pretty much an unknown quantity, I'm sure, to everyone here, so there's nothing particularly there. I know you're all probably thinking I have a fantastic resume,
but that's pretty much about it there. I've done previous research work, mostly in kind of the field of traffic classification and machine learning and quality of service and kind of automating that process for home users. Like I said, I'm pretty much new to FreeBSD.
It was almost a year or so, pretty much, where I sat down with the developer's handbook and sort of decided to try and find out what this kernel thing is that people keep talking about, and I guess it's a little bit lucky, given the keynote presentation that we had, because as I'm a newbie to FreeBSD, you all have to be nice to me. Okay, so MPTCP. Basically, it's an experimental RFC, and there's a few details about it, but I guess the two main things that are gonna be of interest to people
are it allows a standard TCP connection on a multi-home host to use multiple interfaces, okay? So you're not bound to a single interface when you start up your connection, and that can change dynamically over the life of the connection. And the other main thing, of course, is that it's backwards compatible with existing TCP,
so with TCP applications. So basically, if you have an application that's using TCP already, it doesn't need to change at all. You basically run it as you would normally, and in the kernel underneath, MPTCP does all the magic to make sure that that connection can work over multiple interfaces.
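To make the transparency point concrete, here's a minimal sketch: an ordinary TCP client with nothing MPTCP-specific in it. With an MPTCP-capable kernel on both ends, the stack negotiates multipath beneath this unchanged code (the address and port are just placeholders).

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	int s = socket(AF_INET, SOCK_STREAM, 0);	/* plain TCP socket */

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(80);
	inet_pton(AF_INET, "192.0.2.1", &sin.sin_addr);	/* placeholder */

	/* The kernel may negotiate MP_CAPABLE here; the app never knows. */
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == 0)
		write(s, "hello", 5);	/* may be striped across subflows */
	close(s);
	return (0);
}
```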
Okay, so why do we need multi-path TCP? I guess one easy answer is that we've got a lot of devices which are multi-homed now. Of course, all our mobile phones are multi-homed. If you have a fancy netbook, it may have 3G in it as well, or LTE.
And of course, data centers, they have plenty of interconnects and lots of interfaces everywhere. And with traditional TCP, you're not taking advantage of this. You've got this kind of extra capacity that's not being used all the time. You're setting up a connection. If your interface disappears at some point
during the connection, then you have to tear that connection down and start a new one, particularly if you're mobile, moving between WiFi access points or, I don't know, WiFi and 3G, then TCP, the traditional TCP is not gonna help. Of course, like I've said, there's lots of kind of solutions out there already.
SCTP, I suppose, is something that you might use to use multiple interfaces. But of course, NAT kind of makes that difficult to use on the internet. So this particular protocol was made with the internet in mind.
So basically, it'll function over the internet. It kind of looks a lot like regular TCP, so middle boxes won't mess about with it. So it's sort of a pragmatic solution that allows applications to stay the way they are and for things not having to change on the network.
It'll work over the internet. So here's kind of like a really basic sort of example, I guess, of where it may come in useful, I guess. You've got a mobile phone and you have a WiFi and a 3G connection. You set up a TCP connection to something in the internet.
And of course, if your WiFi disappears or you go out of range, something like that, the connection's gonna drop, it's gonna break, and you're gonna have to establish a new connection. So you've got no persistence, basically, in that scenario. Of course, with MPTCP, it'll automatically detect that you've got these different interfaces available
when you start your connection. So it's not necessarily using, say, 3G at this period in time, but in the event that your WiFi connection drops or if you move out of range, say, you hop on a train or whatever, it'll automatically sort of shift the traffic on that connection across to 3G.
So I guess, basically, the benefits are it adds redundancy and persistence. It'll keep your connection. Even if one of your interfaces fails, it'll start using another connection, another interface, sorry, and they've got this kind of idea of break before make in terms of you can have
an MBTCP connection and your interfaces can all drop away from underneath it, but it'll keep that alive for some period of time, just in case another interface does pop up, at which point the signaling can negotiate transferring the connection to that new interface.
Another, I guess, benefit is reducing congestion in the network. Now, this is mainly down to congestion control, so whether it reduces congestion or not will come down to what kind of congestion control you use, but if you are using kind of the coupled congestion control that's recommended with the spec,
it works in a sense of moving data away from congested links and it'll kind of favor links that are being underutilized at that particular time. Okay, efficiency, of course. Parallel paths, you can use capacity that wasn't there before. Well, that's in your system,
but you couldn't use with traditional TCP, and like I said, it works with standard TCP applications. You don't need to change anything. You can just call a socket as you would generally. So this is kind of what it looks like from a kind of really basic standpoint. So you've got your socket, you've got MPTCP kind of sandwiched
above where TCP would have been, and you've got these subflows in here, which are basically acting like TCPs on the network. So what does your application see? Well, it just sees the top of the MPTCP connection, doesn't see any of that stuff that's going on underneath.
Of course, there's a whole bunch of signaling involved to negotiate all this sort of stuff, and I don't really go into the details for it. The spec kind of really talks about why this was done, but the signaling is done in TCP options, so there is an MPTCP option and a whole bunch of subtypes, which are used to provide signaling.
Okay, and of course, the decisions about which subflows to send data on and things like that, that's basically decided upon by congestion control, but there's also other methods. So at the moment, I'm working on a vehicle to infrastructure project,
and as part of that, we're kind of interested in using MAC layer kind of Wi-Fi statistics to kind of help with these decisions. And so basically, there's a couple of, broadly speaking, there's sort of a bunch of logical components that make up MPTCP as it is.
You've got your path manager, so that kind of relates to identifying the interfaces that you've got available on your host, but also picking up messaging from the other end host and saying, okay, well, they've got this interface available to the connection and so forth, and it's all kept in the path manager, which doesn't necessarily need to be in the kernel.
That's why it's kind of a little box on the side. We use sysctls to kind of hard-wire our interfaces at the moment, but technically, this is kind of done separately from the main kind of MPTCP implementation, and you can kind of talk to it when you wanna know what interfaces you've got and what interfaces are available.
You've got the congestion controller and the packet scheduler, which are kind of interrelated in that the congestion control really is keeping all your statistics about your different subflows, about your congestion windows or RTTs, things like that, and then the packet scheduler, which makes kind of the decisions about
when a flow is able to send something, it'll talk to the congestion controller and say, what are the statistics at this point in time? And based on that, it'll say, okay, I'll use subflow one or subflow two to send this particular bit of data. That'll probably become clearer, hopefully,
as I go along, but if it's confusing, just interrupt me. So this is the signaling, basically. So you have TCP options and you have an MPTCP option kind, which is, I can't remember the exact number, but it's like an official type
in TCP now, and then you have a bunch of subtypes, depending on what you wanna do. This is basically all of the signaling. So you've got signaling for establishing connections, adding new addresses to an existing connection, a bit of sort of metadata relating to sequence spaces,
which I'll go into in a little bit, and this stuff I kind of won't cover, but that's basically connection teardown, priority changes and this and that. Okay, so just an example, quick example on how the messaging works. This is just a connection setup. So basically you send a standard TCP three-way handshake.
It's kind of a four-way handshake because we need this extra little ACK on the end, but basically you add your TCP option for MPTCP, which is MP_CAPABLE. The passive opener, if they recognize that option and parse it, then they'll also put an MP_CAPABLE. Now there's a bunch of keys and whatnot which are associated with this
that I'm not gonna talk about, but basically the options are in there. You detect an option, you go through the handshake, and at this point here, it basically identifies it as a multipath connection, and after which you can go through and use add address and join and all that sort of stuff to bring up additional interfaces.
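For reference, the signaling lives in a single TCP option kind (30, assigned to MPTCP in RFC 6824) with a subtype nibble selecting the message. Here's a rough sketch of the layout; the struct is illustrative, not the stack's actual parser:

```c
#include <stdint.h>

#define TCPOPT_MPTCP	30	/* IANA-assigned TCP option kind */

/* MPTCP option subtypes (high nibble of the byte after the length). */
#define MP_CAPABLE	0x0	/* connection setup handshake */
#define MP_JOIN		0x1	/* join a new subflow to a connection */
#define DSS		0x2	/* data-sequence mapping + data-level ACK */
#define ADD_ADDR	0x3	/* advertise an additional address */
#define REMOVE_ADDR	0x4
#define MP_PRIO		0x5	/* change subflow priority */
#define MP_FAIL		0x6
#define MP_FASTCLOSE	0x7	/* connection teardown */

struct mptcp_opt_hdr {		/* sketch only; real code parses bytes */
	uint8_t	kind;		/* TCPOPT_MPTCP */
	uint8_t	len;		/* total option length */
	uint8_t	subtype_ver;	/* subtype in high nibble, version low */
	/* subtype-specific fields follow */
};
```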
Well, there's actually tokens and keys, and so there's a key exchange in this initial exchange here. And then you have random numbers and hashes that you take of that. If someone's able to, well, I guess this is a security issue
that hasn't been resolved yet, but if someone's able to take, to sniff out your initial setup, then they can get all your keys, calculate all your hashes. No private keys. No, there's no, well. There's no pre-shared key. Yeah, there's no pre-shared key. Yeah, it needs to be exchanged, and it's exchanged in the clear. So it should be immaterial.
Yes, yeah, yeah, so yeah. And basically if someone can get the keys at the beginning, then they can insert stuff into your flow. If they don't, then everything then is exchanged as a hash. So you've observed this first four-way handshake that someone's doing. Yeah, yeah, that's the danger. And I think that's, obviously that's something in the security group
in the IETF is gonna be looking at or. Sure. Yeah, so yeah. That's basically a big security problem at the moment. How they fixed that, I'm not sure. Righto, so an important part of multipath TCP is, I guess.
Oh, yeah, sorry. So I wanted to ask you about all of the other options. There's not that much option space in TCP. Like, now you can set up. Yeah. Do you wind up wiping out the scaled windows?
Or that, you know, or that? Yeah, it does take up space. It's, it does squeeze in and doesn't squeeze out too many existing things. It's about 20, it's about 20 bytes. Sorry, 12 bytes I think off the top of my head. So it's large, but it's not horrendously large.
It doesn't take the entire option space yet. I think the biggest of the options is about 20 bytes. And yeah, and the options will at times squeeze on, I guess, SACK and things like that. SACK options, I've kind of prioritized MPTCP over that.
But for the most part, timestamps, things like that, it's not gonna push them out. Right, so in TCP, you got the idea of data sequence space. Sorry, you have the sequence numbering. So basically sequence numbering lets you,
you chop up your data into different segments and then you have attached sequence numbers to them, which lets you, you know, do acknowledgement and retransmits and all that stuff that makes TCP TCP. So the interesting thing with MPTCP is that now you're getting a whole bunch of segments and then spreading them out
across a bunch of different subflows. And you need to aggregate that back at the other end somehow. And there was some discussion about this. Again, I won't cover it, but about splitting up sequence space, you know, the first X bytes here and then the next X bytes here and then the next blah bytes on this side,
but that leaves big gaps in your sequence space. And you know, you can't do that because middle boxes won't like gaps in the sequence space. So eventually the solution was to add another layer of sequence numbering on top. So it's a 64-bit sequence number space. And basically this helps with all the reordering that's probably gonna take place
because if you're using different paths with different RTT, different bandwidths, things are probably possibly gonna be arriving out of order that need to be reassembled. We're maintaining all of the sequence numbering at the TCP level. So each subflow has just a standard TCP sequence space.
It looks normal, but to aggregate it, we need to add extra data on top of that, the data sequence space. So it kind of looks like this. So obviously that's regular TCP across the top there. Not too much to talk about, but over here we've got two subflows.
And so these are obviously randomly generated, whatever, and these sequence spaces are unique to each subflow. But then this data sequence space, which is kind of applied globally and used at the sender and the receiver to order all the segments, that is actually applied across both of them.
So basically that's another TCP option or MPTCP option where it advertises, I've got this segment here, that's all the regular TCP stuff, but at the data level, this is where it lives in the byte stream.
And for example, this one's from 4,000 onwards and that one's 4,100 onwards. And I guess this kind of explains it a little bit more. So at the receiver end, basically it's gonna know that, well, bytes 4,000 to 4,099 is gonna be arriving on subflow A
and the next 300 bytes are gonna be on subflow B. So the receiver knows that. It's got those maps and it knows where to expect the next sequence of bytes from. And it uses that to reassemble it at the receiver side. So I'm assuming you have this working. Do you note any additional head-of-line blocking
from having different, if you think about having different speed connections, like I send something out this way and I send a bunch more down this way, I end up getting more head-of-line blocking going on, right? Yeah, yeah, so depending on how smart you are with scheduling packets, you may in fact send something and say, oh, I'm gonna send it on this subflow here
and then send a whole bunch of other data on a fast one and then you gotta like, oh, I gotta wait while it reassembles, while it waits for that to arrive and puts everything in order. Depending on how long your retransmit timer is, you do need an acknowledgement. I don't know if I said this, but you do need acknowledgements at the data level as well
so each subflow is acknowledging and doing its own thing but also at the data level. So you might retransmit on the faster path? Yeah, you can retransmit on the faster path assuming that the transmit timer kicks off or something like that. But yeah, again, when you're doing your packet selection, path selection, you can say, well, you can use RTTs
and say, well, this one's gonna take a while to send a segment, so perhaps I'll send everything here and I'll send this offset of bytes. I'll send them on that slower connection now. So right, in duplicating the data sequence space, we've also duplicated some of these: receive-next, send-next, and send-una.
It's purely cumulative, the acknowledgements, at the data level, but basically that works like it would in TCP, but it's just working at one sequence space removed.
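In sketch form, the mapping metadata being described ties a run of subflow sequence space to the 64-bit data-level space. The data-level numbers below are the ones from the slides; the subflow sequence values are invented, since each subflow's initial sequence number is random:

```c
#include <stdint.h>

struct dss_map {		/* illustrative, not the stack's struct */
	uint64_t dsn;		/* data-level sequence number */
	uint32_t sf_seq;	/* subflow-relative sequence number */
	uint16_t len;		/* bytes this mapping covers */
};

/* Receiver's view: data bytes 4000..4099 arrive on subflow A and the
 * next 300 bytes on subflow B; the maps say where each run lives in
 * the reassembled byte stream. */
static const struct dss_map map_a = { 4000, 1000, 100 };
static const struct dss_map map_b = { 4100, 7000, 300 };
```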
Not quite sure what I was smoking when I made this diagram, so it's a little bit confusing but basically what it's trying to illustrate, probably it doesn't actually work like this in the implementation but what it's trying to illustrate is that you've got a send buffer, you've got a bunch of different subflows digging into the same send buffer and you're maintaining at the data level,
you're maintaining a list of unacknowledged bytes here. Okay, and it's gonna hand down a bunch of segments to each of these subflows and then that subflow's gonna do its regular TCP thing and send data backwards and forwards and say a packet's lost or whatever. It's got its own list here where it's gonna say, okay, I need to retransmit that or whatever.
Assuming everything's in order at the sender here, you know, reassembly has occurred, it's gonna sort of push that up again into the multipath layer where we can now look at the data level sequence space and say, okay, do we need to reorder it now? Okay, and just a quick word on congestion control.
I think I alluded to it earlier. There is a coupled congestion control for multipath TCP RFC floating around. I probably should have put a reference in but basically, the idea behind that one is being fair to TCP. So if you're on a bottleneck link, you've got a single regular TCP connection on it
and you've got your two multipath flows. You've got two multipath subflows going through the same bottleneck. They're basically gonna divide up so that they're not kind of taking all this extra bandwidth away from standard TCP. It also does a bunch of traffic engineering
in terms of it'll move data onto less congested paths. And again, that's not part of the specification strictly. The UCL guys did it. Yeah, yeah, yeah, it's, yeah.
Congestion control is like the main challenge and that's hopefully what we will be looking at eventually when our platform is stable enough, yeah. Yeah.
Yeah. Okay, I thought you actually implemented, I thought you implemented that coupled one there. We hoped to not implement the CMT. Right. We didn't do the congestion control. In fact, previously, CMT is turned off by default because we didn't have a congestion control and that's the nice thing that Mark did
when he defined the coupled congestion control. Yeah, yeah. I think someone implemented that. I can't remember. Yeah, again, I'm probably not doing it justice. I'm talking more about implementation, but yeah. Congestion control's obviously where it's at. That's where we wanna do our research.
Our approach is more about combining loss-based and delay-based congestion control. Is that the stuff that was sponsored originally at the Hamilton Institute? Yes, the CAIA delay gradient. Yeah, yeah, yeah. So we're hoping to test some of that. I was just going to start that. Oh, right, right.
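For reference, here's a sketch of the linked-increases ("coupled") algorithm from RFC 6356 that the discussion refers to. As said above, this stack doesn't implement it; the subflows run standard congestion control for now, so this is purely illustrative:

```c
struct subflow {
	double	cwnd;	/* congestion window, bytes */
	double	rtt;	/* smoothed round-trip time, seconds */
	double	mss;	/* maximum segment size, bytes */
};

/* alpha = cwnd_total * max_i(cwnd_i/rtt_i^2) / (sum_i cwnd_i/rtt_i)^2 */
static double
lia_alpha(const struct subflow *sf, int n)
{
	double total = 0.0, best = 0.0, denom = 0.0;

	for (int i = 0; i < n; i++) {
		double r = sf[i].cwnd / (sf[i].rtt * sf[i].rtt);

		total += sf[i].cwnd;
		if (r > best)
			best = r;
		denom += sf[i].cwnd / sf[i].rtt;
	}
	return (total * best / (denom * denom));
}

/* Congestion-avoidance increase for subflow i on one ACK: capped so the
 * whole MPTCP connection is no more aggressive than a single TCP. */
static void
lia_on_ack(struct subflow *sf, int n, int i, double bytes_acked)
{
	double total = 0.0;

	for (int j = 0; j < n; j++)
		total += sf[j].cwnd;
	double inc = lia_alpha(sf, n) * bytes_acked * sf[i].mss / total;
	double cap = bytes_acked * sf[i].mss / sf[i].cwnd; /* plain TCP */
	sf[i].cwnd += (inc < cap) ? inc : cap;
}
```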
Yeah, yeah. Okay, so hopefully there's enough background for what's going on with our MPTCP. I don't wanna go over time. All right. So here's our implementation. Basically, we wanted to do something
that we can experiment with and hopefully other people can experiment with. To that effect, we wanna make it easy, have lots of hooks in there and let people play around, switch congestion controls, do whatever crazy thing that they think they wanna do. We wanna make it easy for that and to facilitate further research.
And of course, we're using it internally as well. And of course, we're a BSD. We prefer BSD in our research labs. So our BSD license is always nice. And I guess one important thing to note is that it's not really an optimized or fantastic implementation.
It's not really a goal of ours. Maybe down the line, someone who's really awesome can implement it. We've just managed to kind of complicate and change a whole lot of TCP. So that'll require a lot of nitpicking
before anything like this actually makes it through. Just a minute. We're all gonna die anyway, sure. Right, so these are kind of the questions that we were asking to begin with. Did we wanna do it as a kind of a module with some hooks or just as a shim layer?
How tight did we wanna couple it with the code in the kernel? I guess there's a way of reusing the existing TCP by creating a socket and then calling sockets from within that. So we're using that socket, which kind of allows you to bypass
some of having to play around with socket buffers. There's a whole new bunch. Well, I wouldn't say a whole bunch of data structures, but there are new data structures which are related to dealing with all this aggregate traffic that's coming in. New methods of walking through segments
and reordering them at different layers. And of course, we've got a bunch of different subflows dipping into the socket buffers now. So they're coming in on different threads and there's lock contention that's gonna happen. And we wanted to kind of minimize that to make sure that we're not kind of like locking the multipath control block
members all the time so that we're not waiting on threads all the time. So basically this is how we eventually implemented it. It's done as a shim. I think there was just way too much change that needed to occur for us to do it as a module.
So basically it's got its tendrils all the way through TCP and possibly hand up into the socket buffer as well. We've taken the existing transport control block and kind of redefined that as a subflow control block and added a multipath control block,
which kind of sits above those now and takes care of the multipath kind of side of things. We've kind of gotten rid of the idea of a separate reassembly and in order kind of segment list on the receive side. So basically whether or not a segment's in order or not
is sort of assembled into this new list of segments. And I've got here that data level reassembly has been deferred to the user context. It's kind of what we want to do, but hasn't been possible at the moment because of some unresolved bugs. I'll get into that in a moment.
And of course on the send side, we're using a shared send buffer. So we needed a way of mapping chunks of that send buffer to different subflows.
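Roughly, the control-block split just described looks like this; the names and fields are illustrative rather than the patch's real structures:

```c
#include <stdint.h>

struct sf_cb;				/* per-subflow: the old tcpcb role */

struct mp_cb {				/* one per MPTCP connection */
	uint64_t	ds_snd_nxt;	/* next data-level byte to send */
	uint64_t	ds_snd_una;	/* oldest unacked data-level byte */
	uint64_t	ds_rcv_nxt;	/* next data-level byte expected */
	struct sf_cb	*subflows[8];	/* arbitrary bound for the sketch */
	int		nsubflows;
};

struct sf_cb {
	struct mp_cb	*mpcb;		/* back-pointer to the connection */
	uint32_t	snd_nxt;	/* regular per-subflow TCP state... */
	uint32_t	rcv_nxt;	/* ...as in the classic tcpcb */
};
```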
Basically that means, so you're doing reassembly at the subflow level and then you need to reassemble at the data level. So at the moment that reassembly is occurring on whichever subflow receives the next in-order data level.
It's basically walking through the list at that point, reassembling everything and then calling, appending it to the buffer and calling sorwakeup() or whatever. Ideally we don't want to reassemble at that point. We want to call sorwakeup(), let the application do it. By user context, right? User thread, yeah, yeah. So in user space or in kernel space.
Yeah, so at the moment the reassembly's happening in kernel space. We don't want that to happen. We'd prefer to defer that. So you want to give the data up to the user with some sort of sequence indication and then have them reassemble it in the user space, not in the... Well, they're basically, so we don't need to give them, we basically just need to call the reassembly code.
It's got all the information that it needs in there. We don't actually have to pass anything out. Just let them know that there's data pending. You want it after the wakeup. Yeah, after the, yeah, yeah, sorry. In the context of, right, yeah, yeah.
Yeah, yeah, kernel space, yes, yes, sorry. Yeah, so basically, so it's not locking up all these subflows in kernel, yeah, yeah, yeah.
So basically, this is kind of how it looks. We've got that MP-CB kind of sitting above all these subflows here. This bit here is orange because basically, if you have a single subflow, that'll act just like regular old TCP.
And yeah, so we've created an MP-CB and it's just keeping track of any additional subflows that are made. Obviously, from the IP layer down to the interface, it's just all the same code that it was previously. Okay, so this is what I was kind of talking about
just a couple of slides ago in terms of needing to map data between the socket buffers and the subflows in a way that's not gonna cause problems with having to do too much locking. So basically, define a structure called dsmap. I guess it's more in play on the sender side
in terms of you can schedule a segment to be sent and you can say, okay, well, here's the mbuf that it's in. Here's the amount of data that it's gonna send. And basically, a subflow can take that as being its own little socket buffer
and not have to afterwards lock that entire structure. It's not gonna interfere with any other subflows, anything like that. But yeah, so it happens on receiving an ACK.
You go through TCP output. At that point, you're calling the scheduler, and the scheduler's checking the actual socket buffer on the behalf of the subflow and saying, okay, well, I'll give you this much. Here's your offset where you start, and here's how much data you've got. And then the subflow's gonna take that, and then it's gonna, as long as that's got valid data in it,
it's not gonna attempt to call back into the packet scheduler, and it's gonna exhaust that. Once that's done, then it'll call back in. Yeah.
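Here's a sketch of that handshake between subflow and scheduler; ds_map, cur_map and mp_sched_alloc_map() are invented names standing in for the real ones. The point is that the subflow drains its current map without touching shared state, and only the occasional refill takes the multipath-level lock:

```c
#include <stddef.h>
#include <stdint.h>

struct ds_map {
	uint32_t	off;	/* offset into the shared send buffer */
	uint32_t	len;	/* bytes this subflow may still send */
	uint64_t	dsn;	/* data-level sequence of the first byte */
};

struct subflow_state {
	struct ds_map	*cur_map;	/* private to this subflow */
};

struct ds_map	*mp_sched_alloc_map(struct subflow_state *);

/* Called from the subflow's output path on an ACK. */
static struct ds_map *
sf_current_map(struct subflow_state *sf)
{
	if (sf->cur_map != NULL && sf->cur_map->len > 0)
		return (sf->cur_map);	/* keep draining, lock-free */
	/* Map exhausted: the scheduler takes the MP-level lock once. */
	sf->cur_map = mp_sched_alloc_map(sf);
	return (sf->cur_map);		/* NULL means nothing to send */
}
```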
Yeah, it shouldn't really matter. It may transmit it, and the receiver will just disregard the duplicate. I guess the smarts probably can be put in to do that, but we haven't. Yeah.
Right. Yeah, and it's got outstanding data that needs to be sent.
Yeah. Yeah. Well, in that case, that subflow is just closed, and then because we're using, because there's a retransmit timer at the data level,
you won't get an acknowledgement of that data, and then say, okay, well, I'm gonna retransmit that block from another subflow. But that would look the same as if that subflow is just really, really slow, and it retransmits to it off, and we can try to send it someplace else. Yeah, yeah. The data will just arrive. It's really, really slow. It's never getting there.
Yeah, it's never gonna get there, yeah. So it's gonna say, well, I don't wanna allocate anything to this now, but there's always this kind of outstanding using kind of your ds-send-una. We've got this out. Yeah, yeah, it remains in there.
As long as there, so even if it's been ACKed, if it's referenced by other subflows, it still stays there. Well, it'll be removed, and it won't have that mapping anymore. At some point, we're gonna kill the subflow. Yeah, we'll kill the subflow, yeah. That will remove the data.
Depends on how much it's getting there. Yeah. You can get copies of that data. Yeah, yeah.
Yeah, it's not quite as complicated on the receive side. Basically, what it does is it takes the map that's advertised as part of the multipath protocol, and then it just creates a map based on that so that each subflow knows how much data it'll be expecting.
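In sketch form, with invented field names, that receive-side accounting amounts to counting the bytes that land inside the advertised mapping, then handing the whole run up once the map is complete:

```c
#include <stdint.h>

struct sf_rxmap {		/* hypothetical per-subflow receive map */
	uint32_t filled;	/* bytes of the advertised map seen */
	uint32_t len;		/* bytes the DSS mapping covers */
};

/* Returns nonzero when the mapping is complete and the run of segments
 * can be passed up to the data level in one go. */
static int
sf_rxmap_add(struct sf_rxmap *map, uint32_t new_bytes)
{
	map->filled += new_bytes;
	return (map->filled >= map->len);
}
```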
And basically, once it's filled up its map allocation, then it can pass that up a layer, and that's when the data can be sent. So. Is that what you do up there without continuity?
Sorry, in terms of like. Well, traditional TCP is kind of ordered as it is. Yeah, yeah. Anything that gives us the socket buffer at that point might give us the socket buffer that's ordered from the previous thing. So I'm wondering about the rate and moments at which things go from, oh, I might have some more continuity in the flow layer, up to the top layer.
Okay, well, well, if I understand what you're saying correctly, the subflow stops worrying about things once it knows it's contiguous as far as it's concerned and as far as the mappings that it's got is concerned. Then basically all that it does is that,
I'll show you how the list looks, but basically that data just stays in a list and it just sits there until such time as the correct data sequence number is received. Yeah, yeah, yeah. So I'm kind of getting to some of the shenanigans that are involved in the receive structures now.
So basically in this example here, we've got one sequence segment that's arrived in order and hasn't been delivered yet. So it's just kind of sitting over in the socket buffer. And we've got these two, two and three sequence numbers, three and four, sorry, and two is missing.
So we had this idea of, well, we've got the reassembly list and we've got stuff that's in order and can go straight up to the application. We're gonna wait until we get segment two and then that'll be in order and then the application can take all that stuff. So we've kind of taken that list now and basically made a list of the list heads,
which what that does is that now we're not so much concerned immediately whether or not we're receiving something in order. So we receive, say, subflow sequence number one. We go, okay, we've got one that's in order, but we're just gonna append it onto that list and it's just gonna sit there.
And then we get three and four and then they get put into the list. And then two comes along and say, okay, well, we'll just insert it in that hole. But nothing's actually done yet. It's not going into the socket buffer. It's not doing anything like that because now we're more worried about these DSN numbers.
So we've got, say, for example, in subflow one, we've got sequence 24 and 26, I don't know why, but DSN one and three. So that's not in order at the data level. Okay, and we've got DSN four, which kind of comes after that one. Once we get DSN two, then we can do reassembly,
which is kind of shown here. So these are all data sequence numbers and we can assume that everything's in order at the subflow sequence at this point. So we've got DS one, which is arriving. Okay, it's on subflow one. It's gonna be inserted on the bottom of the list here,
assuming that's all in order. And now we know, okay, that was our one that we were expecting to receive now. So basically we're gonna lock all of the lists, receive lists on each of the subflows. And we're gonna go through and reorder everything at the data level, and then we'll call sorwakeup(),
or append it to the socket buffer, call sorwakeup(). And the process can come and read the data out. And then we schedule obviously a data-level ACK at that point, because now we've received what we were expecting at the data level. This is what we were talking about before with the deferred reassembly.
So at the moment, this kind of happens as soon as that segment arrives. Ideally, we don't wanna reassemble it at that point. We just wanna wake up the application and then perform that reassembly then.
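Putting the receive path together, here's a hedged sketch of the in-kernel data-level reassembly just described. sorwakeup() and sbappend_locked() are real FreeBSD calls; everything prefixed mp_, and the mseg segment type, are invented for illustration:

```c
static void
mp_reass(struct mp_cb *mp)
{
	struct mseg *seg;

	mp_lock_all_subflow_lists(mp);	/* lock every receive list */
	/* Splice segments out in data-sequence order while contiguous. */
	while ((seg = mp_find_dsn(mp, mp->ds_rcv_nxt)) != NULL) {
		sbappend_locked(&mp->mp_so->so_rcv, seg->m);
		mp->ds_rcv_nxt += seg->len;
		mp_seg_free(seg);
	}
	mp_unlock_all_subflow_lists(mp);
	sorwakeup(mp->mp_so);		/* wake the reading process */
	mp_schedule_data_ack(mp);	/* cumulative data-level ACK */
}
```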
So window size is an interesting one, because we don't really have a receive buffer where we can say, oh, we've got this much space left. This is probably not the final way
that we're gonna do it, but at the moment, we just advertise a maximum window with whatever scaling factor, and rely purely on congestion control. All right, so basically it's an option.
It's another TCP option. And what it looks like, sorry, I didn't bring a diagram of it, but if you imagine you've got an option, basically that option kind of looks like a TCP ACK,
except it's carried in the option space. So you got a little bit of header information, whatever. Yeah, we can just throw it on any old packet. Yeah, yeah, any subflow, yeah. So basically anything that's outbound, you can insert it into the option space,
insert that data-level ACK, and it doesn't need to be tied to a particular subflow or anything like that. It'll come in, it'll parse the option, and say, oh, okay, this is acknowledging, you can send from data-level byte 50 from now on. I'm trying to find out these data-level segments,
flow-level segments, but by the way, what difference does that make? In terms of MSS? So the flow-level? No, basically, you mean the segment when it gets down to the actual- Well, when I, so I try to push some data down
to a particular flow, I guess I'm notionally in a segment-y thing, and it might break it up. I guess the bottom layer will align it as close to the MSS as possible. Yeah, yeah, we use the MSS and kind of fill out MSSs. So yeah, we've got a really simple packet scheduler
which uses MSS as kind of the dividing line between, we'll give it X number of segments. It's still not, say, tied to the Ethernet MSS. It can work with anything in particular, but yeah, MSS is kind of the unit. All right, well,
I kind of explained all this already, but basically, that does the reordering. Then it goes back up into the application. When you're in the subflow queue waiting for reordering, can the receiver renege on anything?
So I mean, in a classic SACK, you have the reneging problem, right, which basically, until I give you a cumulative ACK of that, I can always renege data that I've already seen. Or once it's ordered in the subflow queue, it's not renegeable, is that right? I would say I don't know the strict answer to that,
but yeah, I haven't done anything that allows that to happen. From the send buffer side, it's very relevant because once I know it's not renegeable, I can free it, and now I can add more to my send buffer, but if it's renegeable by the receiver, then I have to keep holding my send buffer,
and that causes something called send buffer blocking. Okay, no, I don't think, the send buffer doesn't actually clear anything until it receives a data level ACK, so. At the very top. Yeah, at the very top, so, yeah. So it's all renegeable, then? Yeah. So it's ultimate send buffer blocking, then?
Yeah. Really, you think they're actually ACKed, so they're sitting there holding some buffer for you?
Speaking of the send buffer, I guess we, most of us know what that looks like, but there it is. So basically, that's very simplified. We know what that looks like. This is kind of how we do it now. So we've got DS maps. So basically, you call your packet scheduler, and it's gonna look into the send buffer
and say, okay, well, I'm gonna map this much data onto subflow one, this much data onto subflow two, and I'm gonna return that map, and then the subflow can worry about exhausting its map before it bothers me again, okay? But of course, that requires some finicky accounting
because of things like maps not fully covering an mbuf, things like that, multiple subflows referencing the same bit of the send buffer, so we're kind of having to track now sb_cc,
which kind of becomes logically where things are happening in the session, but we've got sb_actual, which is kind of like, well, here's actually all the data that's sitting in the send buffer at the moment. We don't wanna drop things that are gonna kill an mbuf that's still being referenced by a different subflow.
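A sketch of that accounting split: sb_cc is the stock FreeBSD socket-buffer byte count, while sb_actual and the per-mbuf map reference count are stand-ins for whatever the patch really calls them.

```c
/* Logical count: where the session thinks it's up to; legacy checks
 * keep seeing this. */
#define	SB_LOGICAL(sb)	((sb)->sb_cc)

/* Actual count: bytes really held, including data still referenced by
 * other subflows' ds_maps. */
#define	SB_ACTUAL(sb)	((sb)->sb_actual)

/* Only free an mbuf once no subflow's ds_map still references it. */
#define	SB_MBUF_CAN_DROP(m)	((m)->m_map_refs == 0)
```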
So the packet scheduler, which I've kind of alluded to on a number of occasions, basically this is in the TCP output path, and any subflow that's gonna send that data, it's a bit like in regular TCP. If there's data to send, it'll call through into TCP output
and at that point is kind of where the multipath code kind of sneaks in and says, okay, well, we're calling a packet scheduler at this point. I'm gonna return you a map. This is what you think the actual send buffer is from now on. It can return a null, in which case the subflow thinks that there's nothing to send,
so it's like, oh, hey. Then in case it does return something, then it sends from there and it's not aware of anything else that's going on in the send buffer, and so this kind of really simplifies what's going on. It's driven by ACKs and obviously a write, if that occurs,
but for the most part, we're worrying about ACKs. So when an ACK comes in on a particular subflow, we're asking the question, do we have any data left to send? Yes. Do I have an existing map? No, I don't have a map. If I have a map, then I can just continue on normally. If I don't have a map,
well, I'm gonna call the scheduler at that point, and the scheduler's gonna return either nothing or it will return something, and then basically it operates as it generally would, and you send a segment. Okay, time is good. Okay, so like I said, it's a pretty broad overview.
I didn't have a great deal of time to prepare the slides and put in lots of code and cool stuff, and there's a lot of guff that kind of, it doesn't seem like there's much happening, I guess, from a high-level perspective, but there's really a lot of, like, TCP's been developed over a long period of time.
You change one thing, and the problems will just propagate through everywhere else. We ended up having to change basically anything that has a tcp_ prefix, we've changed to some extent, probably things that I can't remember at the moment, and we've added some new source files in there as well.
Oh, the socket, why did we change the socket? It was something to do with, I can't remember off the top of my head, but there was a lot of things going around in the socket buffers, and we added a macro to make that a bit more direct, but we also put in calls to do our deferred reassembly.
So the code's in there, but it's kind of not enabled at the moment because that's kind of where it's gonna be called from, assuming that it works. Other properties of the socket buffer, such as available data and what it looks like? Yeah, yeah, yeah, yeah. And we had to add in a bunch of macros
to, like, disguise sb_cc and sb_actual from wherever they're being checked, which is apparently a lot of places. You know, what else was there? Oh, now I can't remember. Oh yeah, and NFS was doing some weird things. Yeah.
Yeah. Yeah, and one thing about NFS, though, is if you're trying to change TCP and your test bed is running over NFS. Ooh, bad things here. At least if it doesn't boot,
then you know straight away that something's wrong. Okay, I know in the talk description it said that I was gonna talk about our research projects. We haven't really been able to do too much, obviously, because it's taken a little bit of time
to get the implementation working. But this is the kind of thing that we're looking at doing, I guess, over this year. Myself, I'll be looking at vehicle-to-infrastructure stuff, so 802.11p, and mobile data, and little onboard units that are running MPTCP.
Basically, for non-safety applications, so entertainment like streaming videos, playing games, and things like onboard car telemetry being uploaded in the background. The real stuff that we really wanted to get into but haven't had a chance to yet is the congestion control business.
Of course, we've got our delay-based congestion control that we really wanna use, but we also wanna take advantage of mod_cc and do things like per-subflow congestion control depending on the path, putting in different path cost metrics, things like that. Changing on-the-fly congestion control algorithms,
if you want, and packet scheduling is obviously just a big area of research because there's a lot of decisions to be made, and kind of, you know, and depending on what your application is, you wanna be, you know, doing,
allocating segments to subflows in a different way. Okay, well, these are my observations, mostly because I sort of came into it completely unaware of, you know, developing the kernel and all these sorts of things.
Not conferring with Lawrence earlier on kind of comes back to bite you in the bottom, but, you know, doing things and then unwinding things is a huge process. Any kind of architectural decision you make early on that doesn't work out will mean you're in the lab seven days a week,
20 hours a day, well, maybe not that much, but close enough. And there's a lot of accounting in TCP. There's a lot of, you know, off by one kind of errors that you come across, and, yeah. And things that tell you that they do something,
but they don't really do that, and you don't discover that until later on. And of course, there's this whole aspect of implementing something from an internet draft that's kind of a work in progress, and is very, I'm not sure if tightly coupled is the right word, but it is kind of in some parts
related to the reference implementation. So being that the people who wrote the Linux reference implementation kind of had influence over the draft, those kind of assumptions and things that kind of didn't occur to them kind of made us scratch our heads a little bit. We interpreted things slightly differently.
The draft would just suddenly change without warning, and things that we'd implemented were no longer needed, things like that. And of course, we're continually patching, so we've released a couple of patches. Implementation isn't really anything
above kind of alpha quality at the moment, but we're sort of slowly, every couple of weeks, putting in the extra features, making it more spec compliant. It's interoperable for X number of packets.
Well, it's a big X. Well, what we've done is, would have been, I imagine, the first. Yeah, yeah. We did have to do the first interop with another implementation.
Oh, wow, okay. We spent the first interop just trying to make sure we could even talk. Yeah, yeah. I mean, it's funny, things that happen unexpected in the handshake. You're like, oh, well, so now we need to fix that, or we need to add a sysctl so we can cheat, and when we want to interop, we can say, oh, interop mode.
But yeah, it's something that we work on, and it's getting better. And an API, I haven't mentioned it, but it does work with traditional TCP applications, but if something's MPTCP-aware, then it's also nice to have an API to take advantage of the extra bells and whistles.
Acknowledgements. Well, Cisco provided the funding, so props to them for keeping me in employment for a year, and of course, BSDCan. They left it late, but they did eventually invite me to speak, and hopefully it's been something that's educational for everyone,
and like I said, please come up and ask if you have specific questions. I'll be hanging around, not doing too much, so I'll be happy to talk if you have anything to say, and follow up these links, because they cover a lot of what I've skipped over, particularly the 'how hard can it be' paper,
which really talks about why a lot of the design decisions were made, and the compromises that had to be made, and things like that, and that's about it. Questions? So I assume that you've gone
into kind of a multihoming situation where you've supplied two different addresses to the other end, is that what it's like? Yeah, yeah. Like the same. Yeah. Do you have any facility for two sessions from the same address, forcing them out two different paths, like two different ISPs and so on, to the same destination address?
So take, sorry, so you've got two interfaces, and then. So you've got two interfaces going, or you have two interfaces which go out to networks, which eventually go out to two different ISPs. Right. So that would be more the path manager part of it. Right, so whether you can force that to occur. Do you have any way of two different sessions
with the same IP, to the same destination IP, but is there any way that a firewall or something can tell the difference between them and force them out other ways? You can do it only, and I don't, just because the list does not include route.c or radix.c. I had patches for SCTP that would do that.
And you had to change the routing and actually have an additional type of route so you can save it. Well, what I was thinking is whether or not you could specify a different FIB for the two different sessions. Same basic idea. You have to have changes in the routing infrastructure and it's gonna need you to say, I want to route this way, and you have two default routes,
one going to ISP, AT&T, one going to ISP Verizon. You want to be able to say, I've already got this route, give me another one for this address. And then that leads a different way. And you can do that, but they gotta change some things. No, I didn't do it. Yeah, yeah. So Randall's answer stands for that one.
That would be, but when I did it, there were no FIBs. So are you doing anything interacting with hardware offload? Well, we disabled hardware offload.
Yeah, basically, because we didn't want it chopping up our packets and things like that. Hardware offloads don't like options. Yeah, yeah. You'd be surprised at what nasty and weird things the hardware can do to the options for IP or TCP. Yeah, yeah. Basically, we've been disabling that for now
in case it does anything funky. One of the things I just wanted to kind of ask about at that point. Do you see other reasons for that as a problem?
Short answer, I'm not sure. I haven't even thought about that much. I guess it's a performance thing. If the locking was right, it could be a feature, right? If the locking is no good, it could be, yeah. Scalability across different software components. Sounds like it might be a feature. Yeah. But it might also be a problem. That's a bit like you could take one huge TCP send,
have it offloaded into segments, and just hand one of your segments off to it, yeah. There's more opportunity, but the question is, how big are your segments, I think? The difference. Not the, not the.
How big are the segments you write down to the subflows, in the mappings? Not the frames, but the subflows. Yeah, to the subflows. Subflows, chunks, whatever you call them. Oh, geez. At the moment, it's just really loose, more than an MSS. It's like, oh, how big is the socket buffer? There's stuff in there. We'll give you half of what's in the buffer. So an interesting set of questions.
The flowtable can be arranged for the subflows and it should be strategically placed so that you don't get terrible multi-processing types of effects. That is, keeping the table for subflows as a thread on the same core. That way, you're not randomly distributing
all the subflows across different cores and you're getting into that really hard. Yeah. I mean, I'm sorry, I just, I reckon, I mean, I'm interested in when you guys come, personally, for them, it's 50 gigabytes, it's still over a lot, and you'll see that there'll be a couple of hours. Yeah, yeah. I mean, they're quite, quite far along,
compared to ours, but yeah.