Extensions to FreeBSD Datacenter TCP for Incremental Deployment Support

Formal Metadata

Title: Extensions to FreeBSD Datacenter TCP for Incremental Deployment Support
Number of Parts: 41
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract: Datacenter TCP (DCTCP) achieves low latencies for short flows while maintaining high throughput for concurrent bulk transfers, but requires changes to both endpoints, which presents a deployment challenge. This presentation introduces extensions to DCTCP that enable one-sided deployment when peers implement standard TCP/ECN functionality. This makes DCTCP significantly easier to deploy incrementally. We also improve DCTCP in two-sided deployments by refining ECN processing and the calculation of the congestion estimate. A FreeBSD kernel implementation of these DCTCP improvements demonstrates better performance than the original DCTCP variant, and validates that incremental one-sided deployments see benefits similar to those previously only achievable in two-sided deployments.
Transcript: English(auto-generated)
If you use DCTCP, it takes 82.5 milliseconds, so three times faster than normal TCP. The third case is the BSD DCTCP feature: there is a situation where you use DCTCP but your destination doesn't use DCTCP. In that case, DCTCP is used one-sided. In that situation, the transmission time for the download is 89.4 milliseconds. So what we can say from this example is that using BSD DCTCP achieves faster data transfer times than normal TCP, not only in a fully deployed network but also in a partially deployed network. In this example, there is no packet loss.
Yeah, so this time indicates the queuing delay in the switch. I think many of you may not know what DCTCP is, so at first I'm going to introduce what DCTCP is, what benefit you can receive, and what network equipment is necessary for DCTCP. Then I'm going to introduce the BSD DCTCP features. At last, I'm going to show you how to configure DCTCP on FreeBSD.
So what is DCTCP? DCTCP stands for Data Center TCP. As the name indicates, DCTCP is a proposed TCP variant that addresses flow performance in a data center network. So what happens if we use normal TCP in a data center network? Imagine a situation where a link is shared by a short flow and a long flow. In this situation, the short flow loses to the long flow and gets a longer transmission time. This case happens in the data center: imagine the short flow is a database query and response, and the long flow is a bulk transfer for server migration.
DCTCP solves this problem, and DCTCP contributes these three points. First, DCTCP maintains low and predictable latency for short flows. Second, DCTCP tolerates bursts of traffic. Third, DCTCP maintains high throughput for long flows.
So DCTCP focuses on the data center network. Why focus on the data center network? What is the difference between a data center network and a normal network like the internet? This figure shows the difference; in this picture I show two perspectives on it. The first difference is the traffic pattern in the data center network. In the data center, servers communicate with servers inside the same network, so the majority of the traffic stays inside the network. Compared to that, on the internet, hosts communicate with other networks, like this. Another difference between the data center network and the internet is the priority criteria for applications. In the data center, they prioritize data transmission time: even a one-millisecond delay can lose customers, so they want short transmission times.
On the contrary, the internet focuses more on average throughput. So what we can say from this difference is, first, that data center operators can easily optimize the network equipment and operation for themselves. The second feature is the application requirement: customers want short data transmission delays. By using these features, DCTCP's approach is to leverage ECN.
What is ECN? ECN stands for Explicit Congestion Notification. ECN is a traditional scheme that works with active queue management, and it was proposed in the 1990s, more than ten years ago. It provides supporting information for TCP congestion control; it is not a congestion control algorithm by itself, but works together with TCP. ECN is motivated by letting hosts transfer data without packet losses. In order to use ECN, you need network equipment that supports ECN, so if you want to use ECN, please check the configuration of your layer 3 switches and routers. For servers, it is easy to support ECN because many operating systems have implemented it; you just turn ECN on, that's all.
So how does ECN work? Before I explain that, I'm going to quickly review how traditional TCP works. Look at the top figure: there are two servers, a sender and a receiver, and between them there is one switch. If the sender transmits many packets, the queue of the switch becomes full and packet loss occurs. The sending host learns of the packet loss from the receiver's ACKs, and then the sending host halves its window size.
If the servers and the network equipment use ECN, what happens is shown in the bottom picture. The switch has a threshold that indicates potential congestion. If the sender transmits many packets and the queue of the switch exceeds the threshold, then at that point the switch starts to mark ECN. The receiver sees the ECN mark and sends an ECN echo in the ACK. The sender now knows that ECN was marked at the switch, so it halves its window size, and in that way it avoids packet loss. That's the mechanism.
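As a rough illustration of this classic ECN reaction (this is background, not code from the talk): on an ECN echo the sender reacts as it would to a loss, halving the congestion window at most once per window of data. A minimal sketch in C, with hypothetical structure and function names:

```c
/* Classic (RFC 3168 style) ECN reaction on the sender: treat an ECN echo
 * like a loss and halve cwnd at most once per window. Illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct tcp_conn {
    uint32_t cwnd;          /* congestion window, in bytes */
    uint32_t mss;           /* maximum segment size */
    bool     cwnd_reduced;  /* already reduced in this window of data? */
};

/* Called when an ACK carrying the ECN-Echo (ECE) flag arrives. */
void on_ecn_echo(struct tcp_conn *c)
{
    if (!c->cwnd_reduced) {
        c->cwnd = c->cwnd / 2;
        if (c->cwnd < c->mss)
            c->cwnd = c->mss;        /* never go below one segment */
        c->cwnd_reduced = true;      /* CWR will be signalled to the peer */
    }
}

int main(void)
{
    struct tcp_conn c = { .cwnd = 20 * 1448, .mss = 1448, .cwnd_reduced = false };
    on_ecn_echo(&c);
    printf("cwnd after ECN echo: %u bytes\n", c.cwnd);
    return 0;
}
```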
DCTCP uses ECN differently: it uses ECN to estimate the extent of congestion precisely. What it does is shown in this picture. The switch configuration is the same as with traditional ECN: the switch has a threshold, and if the queue length of the switch exceeds the threshold, it starts to mark ECN. The difference between legacy ECN and DCTCP is the window control. DCTCP senders calculate the fraction of ECN-marked packets in the previous window. In this example, suppose the window size will be updated at this timing. In this case, they received two ECN-marked packets and six unmarked packets, so the fraction of ECN-marked packets is one fourth. The DCTCP sender reflects this information in its window control, like this.
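To make that window control concrete, here is a small sketch of the calculation described in the DCTCP paper and RFC 8257 (not the actual FreeBSD kernel code; the names and the floating-point form are illustrative): the sender keeps a running estimate alpha of the marked fraction, updated once per window, and cuts cwnd in proportion to alpha rather than always halving.

```c
/* DCTCP congestion estimate and window reduction, per the DCTCP paper /
 * RFC 8257. Illustrative sketch, not the FreeBSD implementation. */
#include <stdio.h>
#include <stdint.h>

struct dctcp_state {
    double   alpha;   /* running estimate of the marked fraction, 0..1 */
    double   g;       /* estimation gain, e.g. 1/16 */
    uint32_t cwnd;    /* congestion window, in segments */
};

/* Called once per window of data. */
void dctcp_window_update(struct dctcp_state *s,
                         uint32_t acked_total, uint32_t acked_marked)
{
    double f = acked_total ? (double)acked_marked / acked_total : 0.0;

    /* alpha <- (1 - g) * alpha + g * F */
    s->alpha = (1.0 - s->g) * s->alpha + s->g * f;

    if (acked_marked > 0) {
        /* cwnd <- cwnd * (1 - alpha / 2): a mild cut when few packets were
         * marked, close to a full halving when almost all were marked. */
        uint32_t next = (uint32_t)((double)s->cwnd * (1.0 - s->alpha / 2.0));
        s->cwnd = next > 1 ? next : 1;
    }
}

int main(void)
{
    struct dctcp_state s = { .alpha = 0.0, .g = 1.0 / 16.0, .cwnd = 8 };

    /* The example from the talk: 2 marked and 6 unmarked packets, F = 1/4. */
    dctcp_window_update(&s, 8, 2);
    printf("alpha = %.4f, cwnd = %u segments\n", s.alpha, s.cwnd);
    return 0;
}
```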
This is why DCTCP achieves faster transmission times than normal TCP, and without packet losses. So we did a simple experiment to verify DCTCP's performance.
This is the setup we use. There are four machines, and they run FreeBSD 10-CURRENT. They have two dual-core CPUs, 16 GB of memory, and four 1-gigabit Ethernet cards. For the switch, we use a Cisco Nexus 3548. This switch supports the ECN marking needed for DCTCP, and we set its marking threshold to 10 packets. In order to run the DCTCP and TCP flows, we use flowgrind as the traffic generator.
This is the topology we use: there are three senders and one receiver. The senders transmit packets to the receiver, and the destination is R1. The receiver has two interfaces, so we set two IP addresses on the receiver: for sender 1 we use the R1 IP address, and for senders 2 and 3 we use the other IP address. In the evaluation, we did two experiments: incast and bulk transfer.
In the incast experiment, we evaluate TCP and DCTCP performance for bursty transfers. To do this, we run 10 flows at the same time, change the data size to be transferred from 10 to 800 KB, and measure the average data transmission time over the 10 flows. The other scenario is bulk transfer. In this experiment, we evaluate TCP and DCTCP performance with a mix of short and long flows. What we do is start 10 short flows 500 milliseconds after two long flows start. Like in the incast scenario, we change the data size for the short flows, and we set a fixed size for the long flows, which is 40 megabytes. Then we measure the average data transmission time for the short flows and the long flows. This is the result for the incast scenario. The x-axis shows the data size the sending hosts transfer, and the y-axis shows the average data transmission time.
Each plot shows the average transmission time, and the error bars show the standard deviation. What happens here is that DCTCP is almost the same as normal TCP, but as the data size increases there is a five-millisecond difference between them, so DCTCP is slightly faster than TCP. So what happens if we mix the long flows and the short flows? Here you can see the big advantage of DCTCP. The top figure shows the average transmission time of the short flows, and the bottom figure shows the average transmission time of the long flows. As you can see, the difference between DCTCP and normal TCP is not so big for the long flows. On the contrary, for the short flows there is a significant difference. If we look at the result when the sending host transfers 10 kilobytes of data, TCP takes 33.7 milliseconds while DCTCP takes 1.6 milliseconds. This difference becomes larger as the data size increases: if the sending host transfers 800 kilobytes of data, DCTCP is three times faster than normal TCP.
Let me quickly review what DCTCP is. What is DCTCP? DCTCP is a proposed TCP variant that uses ECN for data center networks. The benefit you can receive is shown in our experiment: you can cut 170 milliseconds off the data transmission time for an 800-kilobyte transfer in mixed traffic of long and short flows. In order to use DCTCP, you need layer 3 switches, routers, and servers that support ECN. Next, I'm going to introduce what the BSD DCTCP features are.
I extended the original DCTCP to get better performance. What I did is, first, incremental deployment support; the other extension is initial window size calculation for performance tuning. So why did I work on incremental deployment support, and why is it important in data centers? Look at this picture: can you recognize the difference between the two figures? In the top figure, both servers use DCTCP. In the bottom figure, one server uses DCTCP and the other server uses standard TCP with ECN. We can recognize the difference between the two, but for the servers there is none, because both of them use ECN and, from the outside, the connection behaves like a TCP connection with ECN.
So the servers don't recognize the difference. So why do we need to consider this situation? In data centers, this situation happens. Here is an example: if you run application servers in the data center network, you can easily upgrade your kernel, because the applications don't depend on the kernel version, so you can use a new protocol like DCTCP. On the other hand, if you run appliance servers, they are limited to a particular kernel version because of driver support or something like that, so they cannot use DCTCP for a while. In this situation, one server can use DCTCP but the other cannot. In this topology, what I found is that in the latter case, one-sided DCTCP gets much longer transmission times than two-sided DCTCP.
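One likely reason, as background from RFC 3168 and RFC 8257 rather than from the talk itself: a DCTCP receiver echoes the CE mark of each individual segment, whereas a classic ECN receiver keeps setting ECE on every ACK until it sees CWR from the sender. A DCTCP sender counting ECE-marked ACKs from such a peer therefore overestimates the marked fraction and cuts its window far too hard. A rough sketch of the two receiver behaviors, with hypothetical names:

```c
/* Receiver-side ECN echo: classic ECN (RFC 3168) vs. DCTCP-style (RFC 8257).
 * Illustrative sketch only, not FreeBSD source. */
#include <stdio.h>
#include <stdbool.h>

struct rx_state {
    bool ece_latched;   /* classic ECN: ECE stays set until CWR is seen */
};

/* Classic ECN receiver: once a CE-marked segment arrives, every ACK
 * carries ECE until the sender responds with CWR. */
bool classic_ecn_ack_ece(struct rx_state *s, bool seg_ce, bool seg_cwr)
{
    if (seg_cwr)
        s->ece_latched = false;
    if (seg_ce)
        s->ece_latched = true;
    return s->ece_latched;
}

/* DCTCP-style receiver: ECE mirrors the CE mark of the segment being
 * acknowledged, so the sender can count marked vs. unmarked data. */
bool dctcp_ack_ece(bool seg_ce)
{
    return seg_ce;
}

int main(void)
{
    struct rx_state s = { .ece_latched = false };
    bool ce[4] = { true, false, false, false };   /* one marked, three clean */
    int marks_classic = 0, marks_dctcp = 0;

    for (int i = 0; i < 4; i++) {
        marks_classic += classic_ecn_ack_ece(&s, ce[i], false);
        marks_dctcp   += dctcp_ack_ece(ce[i]);
    }
    printf("ECE-marked ACKs: classic=%d of 4, dctcp=%d of 4\n",
           marks_classic, marks_dctcp);
    return 0;
}
```

With a latched ECE, the fraction a DCTCP sender computes can look close to 1 even when only a single packet was actually marked, which is the kind of situation the compatibility support has to handle.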
We solved this problem by supporting compatibility with ECN. It's too detailed for this talk, so please see our paper for the details. By supporting compatibility with ECN, BSD DCTCP reaches roughly 90% of the performance of two-sided DCTCP, so we minimized the performance penalty compared to the original DCTCP. The other feature of BSD DCTCP is the initial window size calculation.
For the initial window size calculation, there is a trade-off between latency and throughput, and in BSD DCTCP you can choose either of them by setting a parameter. If you set the slow start parameter to zero, you get higher throughput, but it is unfriendly to competing flows. If you set the slow start parameter to one, which is what the DCTCP draft recommends, the benefit you receive is shorter latency and friendliness to competing flows.
So how can you set up DCTCP on FreeBSD? This is the flow to set it up. First, you load the DCTCP module. Then, if needed, you check the available congestion control algorithms using this command. Then, in order to receive the benefit of DCTCP, you have to enable ECN, and then you set the congestion control algorithm to DCTCP, like this. BSD DCTCP includes the incremental deployment support by default, so no additional configuration is necessary for it. If you want to change the initial window size control, set this parameter like this.
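The slides show the actual commands; as a hedged sketch of the same steps, the FreeBSD DCTCP module can also be driven programmatically through kldload(2) and sysctlbyname(3). The module and sysctl names used below (cc_dctcp, net.inet.tcp.cc.available, net.inet.tcp.ecn.enable, net.inet.tcp.cc.algorithm, net.inet.tcp.cc.dctcp.slowstart) are my assumptions about the stock FreeBSD names, so check your system before relying on them; the program must run as root.

```c
/* Sketch of the DCTCP setup flow from the talk: load the module, check the
 * available algorithms, enable ECN, select DCTCP, and optionally set the
 * slow start knob. Names are assumptions; verify against your FreeBSD box. */
#include <sys/param.h>
#include <sys/sysctl.h>
#include <sys/linker.h>
#include <err.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char avail[256];
    size_t len = sizeof(avail);
    int on = 1;

    /* 1. Load the DCTCP congestion control module (cc_dctcp.ko). */
    if (kldload("cc_dctcp") < 0)
        warn("kldload cc_dctcp (perhaps already loaded)");

    /* 2. Check which congestion control algorithms are available. */
    if (sysctlbyname("net.inet.tcp.cc.available", avail, &len, NULL, 0) == 0)
        printf("available: %.*s\n", (int)len, avail);

    /* 3. Enable ECN; DCTCP depends on ECN marking. */
    if (sysctlbyname("net.inet.tcp.ecn.enable", NULL, NULL, &on, sizeof(on)) < 0)
        warn("enable ECN");

    /* 4. Make DCTCP the system-wide congestion control algorithm. */
    if (sysctlbyname("net.inet.tcp.cc.algorithm", NULL, NULL,
                     "dctcp", strlen("dctcp") + 1) < 0)
        warn("set net.inet.tcp.cc.algorithm=dctcp");

    /* 5. Optional: the slow start parameter discussed in the talk. */
    if (sysctlbyname("net.inet.tcp.cc.dctcp.slowstart", NULL, NULL,
                     &on, sizeof(on)) < 0)
        warn("set net.inet.tcp.cc.dctcp.slowstart");

    return 0;
}
```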
In my talk, I introduced what DCTCP is and what the BSD DCTCP features are. What we did is that BSD DCTCP minimizes the performance penalty when DCTCP is only partially deployed, so we can say that BSD DCTCP is more practical than other implementations. The other message is that we have a selectable parameter for performance tuning, so you can choose the performance benefit according to your application's requirements.
At last, I want to say thank you to these people. Thanks to Hiren, Lawrence, and the grant support, I could get DCTCP into FreeBSD. I also want to thank my advisor and the NetApp lab in Germany, because I could work on DCTCP thanks to their cooperation. And I want to thank the Cisco people, because they helped with the DCTCP configuration on their switch. And thank you to all the IETF people and others as well. So that's all. If you have any questions, yeah.
Could you just put the URLs back up that you had on the screen earlier, the URLs of the presentation? I'm sorry? Can you go back to the beginning where the URLs were? Okay. Ah, okay. Yeah. This one? Those ones, thanks. Okay. Just to confirm, you're not manipulating any of the queue fairness, right? The gains are simply because there's no loss on the sensitive short-lived connections? The gains are just because there's simply no loss on the short-lived connections? No packet loss on the short-lived connections, is that really where the entire gain comes from? Yeah. Meaning you're not manipulating anything, any of the queues or any of the fairness?
Yes. The performance data you showed, that was with slow start and all? Slow start equals zero? In this experiment, I set the initial window size to three, I think, yeah. And no slow start? Yeah, we have slow start. Bigger initial window? The slow start parameter here, is it one or zero? Ah, zero, yeah. Zero? Do you have any measurements with it set to one? Do they come out the same? In this case, yes. It's the same. Yeah, I'm not too worried about that case; I'm worried about that case. This one? This one, yeah. I think that strategy takes a longer time, but it's not so long, I think. You can check my master's thesis if you want.
Maybe there is a link to it on this slide, on the left. Yep. So I'm wondering: what if you have servers in your data center that are kind of bad citizens, that don't listen to ECN and are just sending standard TCP? How does DCTCP, which is very gentle because it does listen to ECN, compete against these other flows? Did you make measurements? So you mean...? I'm saying, imagine those two large flows in the test are just standard TCP, very aggressive. They will fill the queue, and they won't listen to ECN. Yeah. So in that case, DCTCP loses. You have to separate the two kinds of flows with different configurations in the switch.
That's the theory. Well, the assumption was that you control your data center, so you can make sure all your hosts behave. If you're in a mixed-tenancy sort of environment where you don't control the OS, then you either need to do flow isolation in the network gear, or something else.
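One host-side complement to isolating flows in the network (not something presented in the talk) is that FreeBSD's modular congestion control also allows per-connection selection: an application that knows a connection stays inside the data center and talks to an ECN/DCTCP-capable peer can opt just that socket into DCTCP via the TCP_CONGESTION socket option, leaving everything else on the system default. A minimal sketch, assuming the cc_dctcp module is loaded:

```c
/* Opt a single TCP connection into DCTCP with TCP_CONGESTION, leaving the
 * system-wide default untouched. Illustrative sketch for FreeBSD. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        err(1, "socket");

    /* Requires the cc_dctcp congestion control module to be loaded. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                   "dctcp", strlen("dctcp")) < 0)
        warn("TCP_CONGESTION dctcp");

    /* ... connect() to an in-datacenter, DCTCP-capable peer and use it ... */
    close(fd);
    return 0;
}
```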
If your infrastructure provider assumes that ECN is supported but it isn't actually used end to end, it doesn't work, and you cannot rely on it. This is more something for the infrastructure provider to run than for someone renting a machine in a data center. It's for you to control the whole path. Yes, but what if... I mean, this point is kind of valid, right? What if you don't? I mean, who's guaranteeing you that all the servers, all the VMs that are going to run in your data center, are all going to negotiate ECN? Yeah.
Okay, but I was just wondering, maybe there is a technique you would have found, in hosts or switches, where they sniff the ECN negotiation and see: okay, this is a flow that doesn't have ECN negotiated, so I'm going to automatically put it in a different category. I don't know if switches can even do that. Even switches that support ECN often don't do the marking quite correctly, so you've also got to be very careful to get hardware that does the right thing as well. So there are about 10 different variables in what we're discussing, and all of them can go wrong, so it's very complex in a multi-tenancy situation, trying to actually get it working.
I mean, Microsoft did this in their own data centers; obviously they controlled everything, so it was easy for them. Yeah, that's true. In a virtualized environment, like VMware, we have to configure the host servers as well, I suppose, but does the ECN signal make it through, given that the device the guest sees is virtual? So you mean on the virtual machine?
So you mean, did I test DCTCP on a virtual machine? You have a VM guest, a VM host, and the switch. The switch sends ECN marks to your host, so did you try to see whether the host would pass them through to the guest? Whatever stack terminates the TCP connection needs to have the support for this, not the host; the host doesn't care. Unless you do tunneling, and then you have other issues, right, and ECN tunneling is a whole other set of issues. I don't think we tested it in a VM, but it should work in a VM: the host should just pass it through to the VM guest. Yes, it's just a piece of traffic. But you can have congestion on the host server even if ECN works fine between the guest, the peer server, and the switch. So if you have a queue that doesn't support ECN in your path, then you have a problem, and if that queue is in the host, you have a problem, just as if the switch didn't support ECN. Okay, did you test that?
No. Okay. In our two tests, we assume that all switches support ECN. She only did one master's thesis, and she could only do so much; it won't all be worked out any time soon. I've been at this a long time, so.
Back to your comparison for the short flows. This one? Yeah. Do you have any feeling for why, with the short flows, you're getting a net improvement with DCTCP, and why they're converging in the bottom graph? It looks like it's approaching the performance of regular TCP. I think the main reason is queue occupancy in the switch. But it's... But the TCP you're competing against is also doing ECN. What? The regular TCP is also doing ECN. No, no. No ECN at all. No ECN. But they start 500 milliseconds before the short flows run, as I said. I run the long flows first, then after 500 milliseconds we start the short flows, so the long flows have already occupied the queue. So the loss would be quite a bit more detrimental to the short flow, where I guess the session is actually probably idle, waiting for that retransmission, compared with the large one. In that case, we don't have packet loss for... Before, it did take a long time; the loss was very detrimental to the short session. I'm sorry? Of course. Without ECN, the performance loss comes because the short flow is sending into a full queue already, and so it's pretty much guaranteed to see a loss at the beginning. I think that's what you mean by that. Whereas for the long session, a loss is something you can easily recover from; it's a drop in the bucket. Yeah, that makes sense. So that's all. Yeah.
Thank you.