Abusing SSH for ZFS and Profit
Formal Metadata

Title: Abusing SSH for ZFS and Profit
Number of Parts: 31
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/45269
Transcript: English (auto-generated)
00:05
got involved working on the installer and other stuff, and then became a member of the FreeBSD source committers, and then the bunch of silly people in the FreeBSD project decided to elect me to the core team. In my day job, I'm the architect of the ScaleEngine CDN, which does video streaming.
00:25
We're actually streaming this live, if people can actually see me. And I also recently co-authored two books, FreeBSD Mastery: ZFS and FreeBSD Mastery: Advanced ZFS, with Michael Lucas. They're for sale out in the hallway. If you want an autograph after, just find me.
00:40
And I host a podcast every week, bsdnow.tv. The crux of most episodes is an interview with a developer. I've interviewed a bunch of the people I see sitting around the room. We're on episode 190-something. It's recorded live every Wednesday and comes out usually Thursday or Friday.
01:01
Very reliably every week for the last 195 weeks in a row. And I use a lot of ZFS, and I have to move data around the world a lot, and I like to use SSH for that. So we use SSH for bulk data transfer, because especially using ZFS, now that we have resumable send, thank you to the people from Delphix,
01:24
when I have to send data from Toronto to Australia or Germany or whatever, the easiest way to coordinate the resumption of an interrupted connection is having a bi-directional channel like SSH. And it's just much easier to facilitate than trying to use something like Netcat and have it configured on both ends and try to pass
01:44
the resume token over some other channel or something. But we have four primary use cases for bulk data transfer over SSH. The first is just ZFS replication. We're trying to replicate between servers on a LAN at the racks in the data center,
02:01
but also we have a metro connection that goes from Toronto to my basement about 75 kilometers away, where all the backup servers are, so that if we need the backups, they can go in the back of the car directly out of the rack in my basement. And so now that our LAN connections went from 1 gig to 10 gig, and this metro connection is about
02:26
1 millisecond of latency or so at a gigabit, suddenly we needed better performance than we were able to get out of SSH by default. We were running into other bottlenecks and not saturating the whole 10 gig link between back-to-back servers
02:42
using SSH. And I knew ZFS could put out 10 gigabits, so why couldn't SSH handle it? The main thing we do is the video distribution, but another one of the things we do is the packages and ISO downloads for TrueOS, one of the desktop spins of FreeBSD.
03:04
And so for that, we have a ZFS data set for the ISOs. It's 200 and something gigs, and we ZFS replicate it to servers all over the world, including, you know, Germany or Melbourne, Australia. With a latency of 240 milliseconds, the bandwidth delay product gets very high.
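The bandwidth-delay product here is just bandwidth times round-trip time; as a quick worked example with the numbers above:

    BDP = bandwidth × RTT
        = 1 Gbit/s × 0.240 s = 240 Mbit ≈ 30 MB

so keeping a gigabit path to Australia full takes a socket buffer on the order of 30 megabytes.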
03:20
You need an awfully large socket buffer in order to actually get more than a couple of megabits between Toronto and Australia. Works very nice, does incrementals. We take a snapshot of a data set on a server every 15 minutes, and they throw files in there and they appear
03:46
on the mirrors that way. It's coordinated. The big reason they did this is we also do the packages for the downloads, and the entire package set, which could be a couple hundred gigs, needs to be atomically switched over so that, you know, the package metadata and the actual packages are
04:03
consistent at any one second. And so they use rsync with delayed update to upload all the files to us in a hidden directory and then atomically rename them all at once. And this has two advantages. First, we replicate it out in 15-minute chunks, so all the servers are getting the data
04:22
as it's being uploaded, and as soon as it's done, they can all flip it into place. The other one we looked at was having them ship us snapshots directly off their side, but they haven't set up that side yet. Or I haven't either, and Chris is in the room so I can blame him.
04:41
So, the one we have is ZFS replication over the LAN. This one is where we have servers that are pulling data from our master site over SSH. So we have a send case and a receive case in this particular one. Another one we have is our recording servers. So we have
05:02
servers around the world where we ingest the live video stream, like this one, and we record it on the server. For the ones where we do multi-bit rate, where we take in a high bit rate stream and transcode it down to lower bit rates for you, so you can watch it on your cell phone or whatever, or if you have a weaker internet connection.
05:21
Those ones, sadly, are Linux because we need it for the video drivers, but our video servers, the FreeBSD ones and the Linux ones, will collect these recordings and then they use an rsync job to move individual files when they're done recording back to our central ZFS storage servers. And so in this case, we're pushing
05:41
via rsync over SSH. In particular, some of those servers are Linux, and so it's much more difficult to get a consistent set of SSH patches on them, right? On FreeBSD, we can build OpenSSH with the HPN patches and push that package out to all our machines, but there's no pre-built package for that for CentOS, and so there's a bit more work.
06:06
And there's that. So in this case, we have servers all over that want to push data back to the master site, so that's a different use case. But with that one, if we come up with our own patches, we could use a modified version of SSH because they are our servers.
06:23
The fourth use case, though, is customers uploading videos to us. So for video on demand content, where they've gone and produced a video and want to distribute it now, they send it to us over SSH or SFTP or whatever. In that case, we have no control over the version of SSH that the end user is using.
06:43
So even if we come up with some really cool hacks for SSH to make it faster, our customers are not going to have that version of SSH. They're going to have the stock version, who knows how many versions old, on whatever machine they're using, you know, a ten-year-old copy of WinSCP on their Windows box or something, right?
07:03
And so we wanted to look at what we could do to the SSH server to be able to receive data faster from them without requiring any modifications to the client on their side. So the first thing we looked at was these set of patches called HPN or high-performance networking.
07:21
They're a set of patches that first started being developed in 2004 at, I think it's the Pittsburgh Supercomputing Center in the US. Back then, the default window size for SSH was 64 to 128 kilobytes. The idea there is, you know, in your text interactive terminal session, you don't want to buffer up a bunch of data.
07:43
You know, if you accidentally cat a 20 gigabyte file or something, you want the control-c you send to get there in a reasonable amount of time, and that data to stop coming at you over your dial-up connection. So they did some work. The problem is, with such a small buffer, because SSH doesn't let the TCP socket buffer do its auto-sizing, it forces a small window
08:08
to deal with these high-latency links and so on. So what they did with the HPN patch was optimistically, if both sides are HPN, it will grow that buffer using the TCP
08:24
tuning in the OS. So basically, have SSH get a little bit more out of the way and let the OS decide what the socket buffer should be. Or if only one side is HPN, they will at least use a larger buffer of two megabytes. That two megabyte limit actually came from an assert statement in
08:42
OpenSSH versions older than 3.8 that would crash if you tried to have a bigger buffer. But, as of 2007, when OpenSSH 4.7 came out, the default window size in SSH went up to two megabytes, and so a lot of the advantages of the HPN patch kind of were in upstream at that point, although not all of it.
09:06
So one of the things we found is that the HPN patches added a feature where, on the client side, you can optionally set the receive buffer size. So it calls setsockopt() and forces the buffer to a bigger size.
09:23
That only works in the case where the HPN-patched SSH version is on the receiving side of the connection. If you're pushing data, then setting your receive buffer bigger doesn't do you any good. But with this, for our servers that are receiving the mirrored data,
09:47
we set a larger socket buffer, and instead of relying on the OS auto-growing it up to size, we could specify a much bigger size and get to it right away and not have to ramp up the connection. Which made a big difference in,
10:02
you know, when you're replicating smaller amounts of data. You know, what Chris could upload from his house in 15 minutes would only take about two minutes for our servers to replicate at a gigabit. So having that take only two minutes instead of three, because we didn't have to wait for the socket buffer to grow up to
10:20
24 megabytes made a difference. And the biggest one though is the bandwidth delay product. It means that, you know, even if you have just 10 milliseconds of delay with a 4 megabyte socket buffer, which is double the default in pretty much every OS, it means the theoretical maximum bandwidth that you have available to you is 3,300 megabits,
10:44
even though you have a 10 gigabit link. Now if you send that up to 160, sorry, yes, so in my test setup where I created a virtual link that has 10 milliseconds of latency, but can still do 10 gigabits,
11:02
Netcat was able to get almost all the way to the bandwidth delay product of 3,300 megabits per second. But when we used SSH, we found it only ever got to about 160 megabits, because the socket buffer would never grow to 4 megabytes, but would stay at 128k, or 2 megabytes in this case.
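A virtual link like that can be approximated with ipfw and dummynet; a rough sketch, assuming FreeBSD and benchmark traffic on a placeholder port 5001:

    kldload dummynet
    # one pipe: 10 Gbit/s of bandwidth with 10 ms of added delay
    ipfw pipe 1 config bw 10000Mbit/s delay 10
    # push the benchmark traffic through the pipe
    ipfw add 100 pipe 1 tcp from any to any 5001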
11:25
And then we found with HPN, it could receive data quickly up to about 1,300 megabits. So not maximizing the bandwidth delay product, but at least getting a more acceptable performance number. But if you tried to push data, send, it was only getting 175 megabits.
11:45
But the reality is that, you know, the delay between Toronto and Melbourne, Australia is not 10 milliseconds. But even if you're just going to Europe at 100 milliseconds of latency, then with your 4 megabyte socket buffer, you're down to
12:01
335 megabits per second of throughput. So even though I'm paying for gigabit at both sides, unless I can get a bigger socket buffer and get SSH to actually use it, then the maximum bandwidth I could theoretically get is 335 megabits. What I would actually get would be a lot less than that. And
12:20
just stock SSH, because of another bug I'll get into later where it wasn't putting the right amount of pressure on the socket for the OS to grow the buffer, it meant that the socket buffer never actually got above about 128 or 192k. And so the actual performance I would get was between 9 and 14 megabits,
12:41
which is not a lot when you're hoping for at least 300. Not in ZFS, but there are tools like BBCP that are supposed to be able to do that. I've looked at it a little bit, but I
13:05
managed to fix these problems, so I don't have to worry about it. Right, so with HPN, it was able to do 180 megabits, which is half-ish of what the theoretical limit is, so that's not so bad.
13:23
If we crank the socket buffer up to 32 megabytes, the problem is stock SSH doesn't get any better because it's only using like 192k of the socket buffer anyway. But with HPN, we could actually saturate the gigabit we wanted, so yay! But, you know, that was still a lot less than what the bandwidth delay product says we could have gotten if we actually had the 10 gigabits.
13:49
So we found that manually setting the HPN TcpRcvBuf setting was giving us acceptable transfer speeds, because we were getting that socket buffer right away.
14:00
But when we looked into it and read the code and the comments, we figured out that, specifically at least on FreeBSD, the dynamic socket buffer growing feature that HPN had wasn't actually working. The condition that made it try to grow the SSH window size by 150% never actually triggered.
14:21
Its condition was: when the size of the socket buffer is greater than the amount of data we have pending in the SSH window, which is the amount of data that's been sent but not acknowledged yet, then expand the size of the buffer. The problem is that
14:42
unless you try to send more data, the socket buffer will not grow, and so it just never happened. It's interesting: on Linux, the result you get from getsockopt() on the socket buffer is different. It's what the maximum is, not what the current state is. And so it seems the HPN code
15:03
assumed everybody was using Linux. So anyway, we found that the code there that was supposed to make this all magically work was never actually growing it beyond maybe 256k because it just wasn't putting enough pressure to trigger the growing of the socket buffer by the OS's TCP stack.
15:23
So we looked into it, and there's a function called channel_check_window(), and it checks the size of the SSH window. Around version 4.7, a helpful OpenBSD developer added a feature where it sends an acknowledgment every time we've sent three packets.
15:47
And so the amount of data that's pending is never very high. And so because we're sliding the window forward constantly, this causes two things. A, you can end up sending twice as many packets back when you're actually receiving. So you're downloading data and you're getting it at, you know, hundreds of megabits a second,
16:07
but the little stream going back that's supposed to be just ACKs has twice as many packets per second of SSH control messages going back the other way, because you're sending an ACK every three
16:21
32k packets, SSH packets in this case. And so that behavior in particular was conflicting with FreeBSD's socket buffer sizing and causing it never to get very big.
16:42
So I added an extra check to that that says continue to do that if this is an interactive session because you want to have a very interactive low latency connection when you're trying to, you know, run commands. But if this is not an interactive session because I'm piping something into SSH, then
17:00
don't do that, and fall back to the traditional behavior of sending an ACK every time we've filled half of the entire window. And then suddenly things were a lot less terrible. So the patch that I've made, I have two different versions of this. The first version is a patch against the HPN patches
17:21
and does it in that framework. I've since made another version that's against stock SSH, so you don't have to have the HPN patches, and I think I can convince them to upstream it because it's only a couple of lines and it's not doing anything terrible. Whereas the HPN patch set has a reputation with the SSH people because it includes a
17:40
null encryption option and a bunch of other things that they're just not interested in. But anyway, with the change here, it means if it's a non-interactive session, then we only send back a window resize to the other side over the SSH protocol once half of the local window has been consumed. And at that point
18:01
we pull the size of the socket buffer from the OS and if the amount of outstanding data we have is close to that buffer, then we expand the size by 150% so that we can put more pressure on it, hopefully growing that socket buffer even more until we hit whatever tuning limit the OS has set
18:22
allowing us to maximize the available bandwidth delay product as controlled by the system configuration on what that maximum should be. So with this fix in place now SSH both send and receive got much more reasonable speeds
18:42
when we had the high bandwidth delay product. So when going to 100 or 200 300 milliseconds of latency because we were actually using all of the socket buffer that I was making available for 32 megabytes to try to overcome the bandwidth delay product, it was actually working now.
19:02
The change is restricted to non-interactive mode so that you don't end up having to wait through 32 megabytes of text data coming at you in your interactive session and making it all unresponsive and annoying. You know, SSHing from your phone, you don't want to have to wait for your phone to download 32 megabytes before the control-c takes effect.
19:24
So this is some experiments I did on a machine in the FreeBSD test cluster. These across the bottom are the different socket buffer sizes and then the performance of the various patched versions of SSH I played with and the blue line is the theoretical limit from the bandwidth delay product.
19:45
This graph is all at a 25 millisecond latency. There's a more detailed version of this in the paper I wrote for AsiaBSDCon, but I'll go over this a little bit more later. But in particular, the big orange bar you see is netcat. So this is no SSH.
20:04
And then if you look at the results we get: this is stock unmodified SSH, this is stock SSH with my three-line patch, the HPN results, which are about the same, and then finally
20:28
something like that. Anyway, so we found that the nice thing to do is, if you use the TcpRcvBuf option in HPN,
20:41
you can just skip all this auto-growing stuff and go right to whatever size you want. In particular, there are two different settings that control the socket buffer size in FreeBSD. There's one for the auto-tuning that sets the maximum that you want to auto-tune to, and then there's another one that controls what the maximum anybody can actually use is.
21:02
If those two values are different, it allows you to say, you know, for connections coming into my web server, I never want to allow the TCP socket buffer to grow beyond two megabytes. But for this one SSH session where I've explicitly done a setsockopt() to a big number, I want to allow a TCP socket buffer of 64 megabytes so that the stream to Australia is not slow.
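Concretely, that split looks something like this on FreeBSD; the values are only illustrative:

    # cap what TCP autotuning will grow a buffer to for ordinary traffic
    sysctl net.inet.tcp.recvbuf_max=2097152      # 2 MB
    # but leave the hard per-socket ceiling much higher, so an explicit
    # setsockopt() from SSH can still ask for a 64 MB buffer
    sysctl kern.ipc.maxsockbuf=75497472          # ~72 MB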
21:24
So, it avoids the situation where somebody could try to do some kind of denial-of-service attack against you and cause you to queue up, you know, a thousand connections, each with 64 megabytes of data in the socket buffer. So only on connections where you explicitly ask for it do you get the large socket buffer, but your auto-tuning for, you know,
21:44
regular web serving or something doesn't grow that high. So I found this option to be very useful, and it also means that the first five minutes or so of the connection are not going at low speed while it slowly builds up the socket buffer.
22:01
But I decided what I needed was to extend this further so it works in the push case, in the case where I'm the recording server, and I have this bunch of video, and I want to send it back to the storage server. So we extended it with a patch that adds a remote receive buffer option.
22:21
So the client says to the server, hey, if you wouldn't mind, could you set your receive buffer to 64 megabytes? And I added a config option on the SSH server side that allows you to set what the limit to that will be. And the OS won't let you set a value larger than what your tunable is set to anyway.
22:44
And you can use the Match options in the sshd config to say only this user can set a larger socket buffer, not just anybody. So you can use it for your own purposes, but not have other people using it against you or something.
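As a sketch of what that could look like in sshd_config, with a hypothetical option name standing in for whatever the patch actually calls it, and a placeholder user:

    # hypothetical option from the remote-receive-buffer patch
    Match User replicator
        RemoteReceiveBuf 67108864   # allow this user to request up to 64 MB

Everyone who doesn't match falls through to the default, smaller limit.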
23:04
So then I just have some tuning tips on how you can get better performance out of SSH. So for bulk transfers, it's desirable to avoid increasing the maximum size of the automatic tuning of the socket buffer. So you don't want to have every socket buffer be able to use 64 megabytes because, you know,
23:24
you don't want to queue up that much data for every person that's trying to connect to your web server, but you want specific things to be able to use a larger socket buffer. So you tune the maximum buffer size to a very large value, but you tune the auto-scaling to a much more reasonable number. So default connections will use the reasonable number, but
23:47
specific connections where you need it can be set to a much larger value. So those sysctls: you have net.inet.tcp.sendspace and net.inet.tcp.recvspace. Those control the initial size for the very first part of the connection.
24:04
Then you have net.inet.tcp.sendbuf_max and recvbuf_max, which are the maximums for the auto-scaling. And then you have sendbuf_inc and recvbuf_inc, which are the increments, how much it grows each round trip if the growth is warranted. Be careful setting these very large. It seems to actually make things worse,
24:25
because if you don't end up queueing enough data: in one of the benchmarks I set this to like 256k, but because the packets coming out of SSH are only 32k at a time, it never grew. It never tried to send a whole 256k at a time. So the socket buffer never grew off the bottom notch.
24:43
It just sat at the lowest value because we never put enough pressure to cause it to go up one notch. So in most of the benchmarks I did, I found that while it takes a little longer to get to the maximum speed, not setting this to larger than the default worked better. Although I think that's probably something that should be addressed in the autogrow code in FreeBSD rather than elsewhere.
25:09
There's also a sysctl to just disable the automatic buffer management and say, you know, unless somebody asks for a bigger buffer, all they get is the default 64k receive and 32k send buffer.
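Put together, the knobs described here look like this on FreeBSD; the values are only illustrative:

    net.inet.tcp.sendspace=65536        # initial send buffer
    net.inet.tcp.recvspace=65536        # initial receive buffer
    net.inet.tcp.sendbuf_max=16777216   # ceiling for send-side autoscaling
    net.inet.tcp.recvbuf_max=16777216   # ceiling for receive-side autoscaling
    net.inet.tcp.sendbuf_inc=32768      # growth step per round trip
    net.inet.tcp.recvbuf_inc=65536      # (be careful making these very large)
    net.inet.tcp.sendbuf_auto=1         # leave autoscaling enabled
    net.inet.tcp.recvbuf_auto=1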
25:24
But the last one is kern.ipc.maxsockbuf, which controls the maximum size of the socket buffer. That one's slightly confusing in that it's the amount of memory the socket buffer can consume, not the amount of
25:40
data that can be in the socket buffer. For every 2k segment, you also have 256 bytes of overhead for the mbuf. So if you want to allow a maximum socket buffer size of 64 megabytes, you have to set this value to at least 72 megabytes to account for the management overhead of the mbufs.
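The arithmetic for that works out as:

    64 MB × (2048 + 256) / 2048 = 72 MB
    kern.ipc.maxsockbuf=75497472    # 72 MB, leaving room for the mbuf headers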
26:03
And so, you know, if you set it to 64 megabytes and then ask SSH for a 64 megabyte socket buffer, you'll be told not enough buffer space available. That one confused me for a while. I'm like, I've set the maximum to 8 megabytes. Why can't I have 8 megabytes?
26:22
So one of the other things we did with the HPN patches, especially over the LAN, was using their option called the None cipher. So, you know, I'm just going over my LAN. I don't really need to encrypt all this data. I just like using SSH for setting up the commands on either side.
26:40
The way this actually works is the connection starts out encrypted so that your credentials and all the keys stuff still happens, and your username and password, if you send it, or your keys or whatever, still happen over an encrypted connection. But once it starts, once it passes the command it wants to run, and that command actually starts and you open the channel back and forth, it re-keys the connection to having no encryption.
27:04
It still has a MAC to verify that the data hasn't been modified, but it has no encryption. This is great for, you know, local LAN replication over ZFS, because you don't need to spend all the CPU time encrypting and decrypting.
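With the HPN patches on both ends, a LAN replication job using it looks roughly like this; the hostname and dataset are placeholders, and the server's sshd_config also needs NoneEnabled yes for the switch to be accepted:

    zfs send -i tank/data@prev tank/data@now | \
        ssh -oNoneEnabled=yes -oNoneSwitch=yes backuphost zfs receive tank/data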
27:20
So you can enable that, and it has a bunch of protection to make sure that you can't ever spawn a shell where you want to be typing commands back and forth and they'll be visible to other people, because, you know, you might type a password into the passwd command or something, and you don't want that to ever be unencrypted. So starting in 2011, we started using the HPN-patched version of
27:42
SSH that Brian has kept working for us all this time in the ports tree to accelerate our ZFS replication, especially over the LAN. But for things like the PC-BSD mirrors, that's not private data, so we're fine with doing no encryption over the internet in that case. And it made it possible over the LAN to saturate, you know, a 1 gigabit
28:05
LAN connection without using a lot of CPU. And the other performance improvements they had were pretty good as well. Enough for the LAN anyway. However, the HPN patch doesn't seem to help very much outside of, you know, manually requesting a larger receive window,
28:24
and it only works on the receiving side, like we mentioned before. The None cipher, it turns out in my testing, is actually slower than some of the newer modern ciphers, which confused me greatly for a while. Turns out that's because the None cipher still uses a MAC, and the MAC it uses is UMAC-64,
28:44
which I think is a Dan Bernstein thing, right? But I read the paper for it, and it was written in 2002 and benchmarked on like a Pentium 2, and it was like, it's super fast on a Pentium 2. It turns out it's actually not that fast.
29:01
And so when I was trying to test at 10 gigabits or more, it turned out to be a very strong bottleneck. So what do we do? Where do we go from here? So I borrowed two machines out of the FreeBSD test cluster, which is donated by a company called Sentex,
29:20
out in Kitchener, kind of a ways from here, but in Canada. So they have these E5-1650s, so that's a high-end server, but single socket. So you get six cores at 3.5 gigahertz. It's got 32 gigs of RAM. They have these really nice Chelsio 40 gigabit NICs connected back-to-back with no switch or anything in the way.
29:41
We installed FreeBSD 11 on them. I had the base version of SSH, the HPN-patched version of SSH from ports, and my various fixed versions of that. So the first set of measurements I did: the default cipher in SSH has been
30:03
ChaCha20-Poly1305 for a while. Because it doesn't benefit from things like AES-NI, because it's not AES, the most we could get out of it, even with a 3.5 gigahertz processor on each side, was 1,900 megabits.
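Each of those measurements just forces the cipher on the command line; for example, picking AES-GCM for a bulk transfer looks something like this, with placeholder host and dataset names:

    zfs send tank/data@snap | \
        ssh -c aes128-gcm@openssh.com backuphost zfs receive tank/data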
30:21
The reason SSH chose this is because they wanted a non-AES cipher to use for their SSH connections, and that makes sense for your regular interactive connections maybe, but when you're trying to do bulk transfer, you probably want to choose a different default. With AES-
30:40
CBC, depending on if you have 128 or 256, you can get between 2,500 and 3 gigabits per second. With AES-CTR, you can get up to almost 5 gigabits a second. With the None cipher, I got 5,800 megabits per second. But with AES-GCM, because this machine is new enough to have the AVX and carry-less multiply instructions to do the 128-bit
31:04
multiplication, and it has AES-NI, I could actually get eight and a half gigabits per second. Yes, so ChaCha20 does the Poly1305 for the MAC.
31:22
It's AEAD that way. All of the AES ones, we're using the UMAC-64 MAC, the same as the None cipher. But because AES-GCM and ChaCha20-Poly1305 are authenticated ciphers, they provide their own MAC that is used instead, and that's why AES-GCM ends up being faster, because
31:45
in one pass over the data, you get the encryption and the MAC together, and don't have to run the UMAC, which is the slow part. I did a benchmark of all the MACs, and UMAC-64 is the fastest of the available ones in SSH, even though it's still terribly slow. So by using AES-GCM,
32:04
you can actually get more speed than having no encryption, because you end up not having to do a second pass over the data to get a MAC. Whereas with Netcat, you can actually saturate pretty much all of the available 40 gigabits. Although not if you use a pipe, it turns out.
32:23
The original test I did with Netcat for my benchmarks was, you know, on one side of the Netcat was dd from /dev/zero with a one megabyte block size into Netcat, and on the other side was Netcat, and dd back out to /dev/null. And with that, the best throughput I could get was about 23 gigabits a second.
32:44
And I thought that was as fast as it would go. But it turns out, if you just do Netcat with a redirect from /dev/zero, and Netcat with a redirect to /dev/null on the other side, you can actually get all 40 gigabits. Which led me down a whole other rabbit hole, and Rod Grimes and Tycho Nightingale are working on a new page flipping thing for pipes to
33:04
make them faster. And hopefully we'll get faster pipes out of my confusing benchmarking results. Yes, so yeah, iperf can get the 40 gigabits too. They're about the same as the Netcat. If you can give it the socket buffer. Yeah.
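The difference between those two Netcat runs is just whether a pipe sits in the data path; a sketch, with port 5000 as a placeholder:

    # receiver
    nc -l 5000 > /dev/null
    # sender, through a pipe: the pipe buffer becomes the bottleneck
    dd if=/dev/zero bs=1m | nc receiver 5000
    # sender, plain redirect: no pipe in the path
    nc receiver 5000 < /dev/zero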
33:23
But yes, I tested with iperf first to make sure that everything was working. And I was a little confused why Netcat would be slower than iperf. I was like, it's not doing that much, but it turns out in FreeBSD with a pipe, the default buffer size for a pipe is 16K, and it
33:40
can grow based on the size of the recent writes up to 64K. But that's the limit, and the file was like last modified in 1996 or something. So another patch I worked on, that's going to be up for review soon, just makes it a tunable so that you can set it to like 256K or 1 megabyte. And all of a sudden now,
34:03
single-threaded through a pipe, instead of doing 5 gigabits a second, you can do 13 gigabits a second. That's like gigabytes per second. And you know that makes a big difference when you're trying to do this. So yes, these numbers end up being quite a bit different than the ones I presented in Tokyo, because I figured out my problem was actually that I was hitting a bottleneck in the pipe in the operating system.
34:28
So I was a little confused at first about why having no encryption was slower than having encryption. But it turns out, because AES-GCM is an authenticated cipher, you basically get a MAC for free as part of the encryption,
34:45
which means it doesn't use UMAC, and we don't have to read all of the data twice, once to encrypt and once to MAC it. So while the None cipher, you know, even on its best day was doing 6 gigabits, AES-GCM could do 9 gigabits. So I wondered,
35:05
well, over the LAN, I don't really need a MAC, right? ZFS is doing a checksum on every block anyway as part of the replication because with resumable replication each record in the ZFS send stream has its own checksum now instead of only one big checksum over the whole stream.
35:21
So I was like, how hard would it be to just make it not do a MAC? So I basically added OpenSSL's null MAC transform as one of the options, kept all the existing restrictions that the None cipher has, so you can't start a shell, you know, you can't get a TTY,
35:40
all the basic things that are already there, and added that as an option. And just like the None cipher, it prints out a big scary warning every time you use it, to make sure you know you're using it and all that stuff. Using the None MAC, with no encryption and no MAC, with the patched version of SSH I was able to get much closer to the results that Netcat was getting.
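Invoking it follows the same pattern as the None cipher; the option name here is only illustrative of the patch (later HPN patch sets spell it NoneMacEnabled):

    zfs send tank/data@snap | \
        ssh -oNoneEnabled=yes -oNoneSwitch=yes -oNoneMacEnabled=yes \
            backuphost zfs receive tank/data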
36:05
It was about 80% of the performance of Netcat pre-figuring out that pipes were my problem. Not quite as good anymore, but it turns out that AES-CTR can actually keep up if you don't make it do a MAC as well. So with this combination
36:25
AES-CBC with no MAC can actually get up to about 4.5 gigabits. Your AES-GCM numbers are about the same because you're still doing the MAC as part of the encryption. But AES-CTR with no MAC can actually get as high as 10 gigabits on this 3.5 gigahertz processor.
36:45
But doing None for both, SSH could actually shuffle 13 gigabits. Originally, Netcat topped out at 23, but with the pipe fix it can now do 36 gigabits. So now that we have the None cipher and the None MAC, we can actually still get faster than encrypted,
37:05
which is what we expected to happen. And we were very confused when that wasn't happening. For non-private data, which you've described as being a substantial amount of your transfers, where you don't care about
37:25
the authentication, you don't care about data modification, you don't care about encryption. Why are you using SSH instead of Netcat? Because I need to start the right ZFS command on the other side and because I might need to pass the resumption token to
37:45
resume the replication in the right spot. And because the script I used to coordinate all this uses SSH to go to the other side, get a list of the snapshots, compare that to the snapshots on this side, decide what to do, and it's just SSH was easier.
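A minimal sketch of that coordination, with placeholder host and dataset names; the receiving side has to use zfs receive -s so the resume state is saved:

    # after an interruption, ask the receiver for its resume token
    token=$(ssh backuphost zfs get -H -o value receive_resume_token tank/data)
    # restart the stream from where it left off
    zfs send -t "$token" | ssh backuphost zfs receive -s tank/data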
38:01
It was the path of least resistance compared to trying to coordinate a Netcat. Sorry, I lost my place. Any other questions while I find my slides? The numbers you presented with SSH and your new cipher, does that also include the pipe fix?
38:24
Yes. Originally I stopped investigating when I got almost to the Netcat number. I still need to go back and see if there's anything else I can do to get even more performance out of the null encryption option with the null MAC.
38:41
I have a flame graph coming up that shows that I think I've done everything I can. Right, so switching off the crypto, the slide viewer program on TrueOS is really laggy. It takes a very long time to render each slide.
39:07
The Lumina PDF viewer. Ken's here somewhere, I think. What's that? Okay. Yes, so it's his fault.
39:22
Anyway, there we go. Now we're caught up. Yes, so at this point I thought I had reached the limit of what tuning and small patches to SSH could do. There's a flame graph on the next slide that shows pretty much all the CPU time
39:40
being used by SSH is spent in memcpy, memset, and realloc. There's no time being spent on the encryption anymore, because we have no encryption and a null MAC. And so other than rearchitecting SSH to try to do fewer copies or bigger buffers, there's not much more that you'd squeeze out of SSH, but
40:03
13 gigabits is enough to saturate the 10 gigabits between my home servers, so I'm done. It's you guys' problem now. At this point, yes, there's not much point in going further at abusing SSH. You know, it's already well outside the scope of what SSH was meant to do.
40:20
If you really need to do 40 gigabits, you should probably just use netcat or something. Or, like Colin mentioned, BBCP, which can use multiple TCP flows to spread out your data and reconstitute it on the other side. So yeah, in this flame graph you can see we're running SSH, main, in the client loop, and you end up in memcpy, memcpy,
40:42
memcpy, memcpy, memset, and yeah, you spend all your time in libc, not in SSH, so I've done all I can do. So, you know, at some point you're stuck with how fast your operating system and your processor can copy memory around.
41:04
And it turns out that scales quite linearly with the frequency. So this is the None cipher, AES-GCM, the None MAC version, and netcat, at each of these; this is gigabits per second, and each of these lines is the processor speed, scaling from
41:25
1200, 1500, 2000, 2500, 3500, and 3500 plus turbo boost megahertz. So you can see the transfer speed you get
41:42
scales exactly as you'd expect with the CPU speed of the machine. No, the maximum window size you're allowed in TCP, by the protocol specification, is one gigabyte, so that
42:06
you can't wrap the number around and acknowledge bytes that haven't been sent yet or something. So specifically, because the sequence number is 32 bits, the TCP protocol specification says the window can't be greater than one gigabyte.
42:21
So with the ACKs you're getting, since the sequence number is only 32 bits, it has to be able to make sure it's not wrapping, and so... And do you think that's sufficient, one gigabyte, as a maximum? Do you have any thoughts about, in the future, maybe we need to have 64-bit TCP?
42:47
Yeah, hopefully we won't still be using IPv4 by then, right? I amused myself. A little bit, but
43:08
it's not as hard to change that in your benchmarks. It's just running, you know, dev.cpu.0.freq equals each one off the list of supported frequencies for that processor.
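A sweep like that can be scripted straight off the sysctl; a sketch, assuming FreeBSD's cpufreq support:

    # dev.cpu.0.freq_levels is a list of "frequency/power" pairs
    for f in $(sysctl -n dev.cpu.0.freq_levels | tr ' ' '\n' | cut -d/ -f1); do
        sysctl dev.cpu.0.freq=$f
        # ... run one transfer benchmark at this frequency ...
    done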
43:26
Yeah, that might be worth trying as well. Yeah, in the separate pipe benchmarks I did, I found that the sweet spot for the
43:43
buffer size for the pipe is like a little bit less than half of your L3 cache size. So that the page flipping that's there happens all inside the L3, and you don't blow it out.
44:13
So it adds speed to the crypto?
44:21
Right. But I think it's a good idea, especially for the pipe test, to consider that. In the tests I have so far, I've benchmarked everything that I have, from a Core 2 Duo through every different generation of Intel,
44:40
with a mix of laptops, desktops, and servers in there. Yeah, it mostly has to do with the L3 cache size, where the sweet spot for the pipe buffer size is. I think the same thing applies here. As long as your socket buffer size fits in your L3 cache, it probably helps. If we did that on one CPU, looking at just the socket buffer size,
45:02
there's probably a detriment to having your socket buffer be bigger than half your L3 cache. Yes, this was all... The first round of tests were parsing the output of dd, but when I got rid of dd, what I used is the ipfw count rules
45:24
over, you know, there's one NIC, all it's doing is having the delay, and we're counting the exact number of bytes that went from here to here in this window, then zero it and do the next test after sleeping for long enough for the TCP host cache to clear. I set that to expire very quickly.
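The counting side of that is ipfw's count action; a sketch, assuming the transfer runs over SSH on port 22:

    # count every byte of the benchmark flow without otherwise affecting it
    ipfw add 100 count tcp from any to any 22
    # after a run: read the byte counter, then reset it before the next test
    ipfw -a list 100
    ipfw zero 100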
45:42
Turns out if you disable the TCP host cache entirely, the socket buffer auto-sizing seems to not work. I need to talk to somebody from the transport group about that.
46:01
Possibly. All my testing was over a perfectly silent 40 gigabit back-to-back connection between two servers in a test lab. So, yeah. As a slightly selfish person, when I'm sending my data from Toronto to Melbourne,
46:24
I don't care if I'm causing other people problems. You know, I just want my data to get there in less than three days. Other questions?
46:40
Other ideas of what I should do next? Oh, yes. Yes. So the None MAC is my patch set, and the remote receive buffer and the fix for the dynamic window.
47:01
I built the None MAC one on top of the HPN patch because they already had the None cipher and the checks to make sure that you don't use it when you shouldn't. So I need to polish things up a little bit. On my GitHub I've caught all of these patches up to the recently released version of OpenSSH, because a lot of this work was done originally
47:22
around Christmastime, to be done for my paper for AsiaBSDCon in March. But the patches are on my GitHub, allanjude/openssh-portable. There's one called dynamic window, and that patch is against base SSH, and it gives you most of the gain if you just want to use AES-GCM
47:42
on a high-latency link, and it'll work. So that's a very small patch against stock SSH. All the other features, like the remote receive buffer and the None MAC and so on, are all enhancements to the HPN patches, and I've been working with those guys upstream
48:01
so hopefully those will be part of the HPN patch set. And I'm sorry, Brian, for the extra work that will cause you. So yeah, hopefully some of this will be available to everybody as part of stock SSH, and the more controversial features will be part of the HPN patch set
48:21
and you'll be able to just use them. The other one I randomly stumbled into is that in the HPN patch set they have a feature called multi-threaded AES, where they have a version of AES-CTR that can use more than one CPU at a time. Turns out, because that one predates AES-NI
48:43
and doesn't use AES-NI, it's actually slower than using one core with AES-NI. So I talked to them about it a little bit. In some of their tests the performance was actually about the same, but it turns out that's because they were testing on very low
49:02
frequency older Xeons. So like their Xeon was like 1.8 gigahertz and mine's a modern 3.5 gigahertz one. These are E5V3 so that's Haswell or Broadwell? One of the two. Haswell I think.
49:20
So with the original version of this I finished for Tokyo, my home servers, which are only 2.4 gigahertz E5s, because they're 2620s, so they have a higher core count but lower frequency, couldn't quite do 10 gigabits. But I think, now that I've learned that sticking some extra pipes in there doesn't help,
49:41
I'd probably get most of the 10 gigabits anyway out of my home lab, even though I have more than a gigahertz less power out of one core. Based on what I've heard, I just want to make sure I understand: you've now finally arrived at the point where your minor patches get upstreamed, and
50:05
the whole HPN patch thing can just fade into obscurity, as long as you're running on AES-NI equipped CPUs? Yes. In particular, this came up when Julian Elischer found that when we removed the HPN patches from
50:24
the version of SSH in FreeBSD base, it actually brought back some performance problems he was having, and so that's what drove me to make a version of this dynamic window patch that applies to just stock SSH, not only the HPN version.
50:40
So the advanced features for doing one-off really big transfers with no encryption or whatever will go into HPN, and you can get it from openssh-portable in ports with the HPN option. So you might actually be able to do fast SSH bulk data transfers on OpenBSD sometime this decade?
51:01
Well, that would require having a fast connection on OpenBSD. I don't know how good OpenBSD support for 10 gigabit NICs is. It works. Do they have 40 gigabit NICs? No.
51:20
ISA-L? Yes. No, but as a separate project I would really like the OpenCrypto framework to grab some of those assembly-optimized versions of the ciphers, like AES-GCM, and maybe eke a bit more performance out of that.
51:45
Yes. Are you doing that through the OpenCrypto framework or do you just completely go around that at Netflix?
52:02
So did you guys just add yasm as part of your toolchain? Okay. Yeah. So if I just import yasm into contrib, we can just have it?
52:22
Okay. That's a deal. Ed and I already decided to do it in Tokyo. Alright, thank you.