
SMB3 Overboard

Formal Metadata

Title
SMB3 Overboard
Subtitle
An Offload Engine for NASty Networks
Title of Series
Number of Parts
637
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Data Processing Units (DPUs) and SmartNICs--these things are getting a lot of attention, particularly in the network storage world. There's a bit of history behind them, particularly if you consider TCP Offload Engine (TOE) cards and iSCSI NICs, both of which have been around for quite a while. This latest wave, however, is more general-purpose by design and presents an opportunity for developers like us. DPUs are typically, but not necessarily, built around ARM cores. They are the engines that power the current cavalcade of SmartNICs, and are also being integrated into server-class mainboards. In any case, they are going to be a feature of the next generation of storage platforms, application servers, and cloud infrastructure systems. This talk centers around work to implement NAS Offload, particularly SMB3, for SmartNICs. We'll expand from there to discuss how NAS Offload can be leveraged for Software Defined NAS services, and will also consider what DPUs might mean for distributed object storage. The doors are wide open to the Open Source development community, and this is an opportunity we cannot afford to miss.
Transcript: English (auto-generated)
SMB3 Overboard, an offload engine for NASty networks. Hi, my name is Chris Hertel
and I'm building an SMB3 offload engine for SmartNICs. And I'll explain what that all means as we go along. First, some quick background info on me. I've been a Samba team member since 1998, but I actually started working with the SMB protocol way back in the 1980s. A few years after I joined the Samba team, I started a side project called jCIFS,
an SMB toolkit written in Java. jCIFS was a quiet success; you don't hear much about it, but you can still find it hidden in things like Android apps. It's under the LGPL, so it really shouldn't be a secret. While I was working on jCIFS, I gathered together the notes I'd collected
about the guts of the SMB protocol suite and hammered them into a book, which came out in 2003. It's an open source book and it's available online. On the strength of that book, I was asked by Microsoft to pull together a team and write Microsoft's official SMB/CIFS protocol specifications.
So there I was, an open source Samba team geek, digging through the Windows kernel source code to uncover and document long-buried SMB secrets. I'll just mention that the terms SMB and CIFS were fairly interchangeable once upon a time. I've got a slide later on that briefly explains the naming and how it has changed.
This is a quick butt-covery statement. I'm representing myself here. The opinions that I have, that is to say, which are mine, are mine. Also, I have to believe that some of you will know more about the material that I'm going to cover than I do. Also, I'm likely to gloss over some stuff you'd really rather I didn't.
In a live show, I would gauge the audience and adjust. I can't do that in a prerecorded segment, so ping me if you want to talk. I want to start with a quick tangent. I like to ride my bike a lot. I mean, I'm even willing to wear the goofy clothes.
So the map here, here it comes, will give you an idea of how much ground I've covered in the past few years. And yep, that's me on the right. I asked my wife to take a picture of her cool biker dude hubby, and she laughed, and she laughed, and then she snapped this photo of me with that look on my face. So I like cycling. I also like data.
At the intersection of these two, there is obviously ride data. There's GPS data like time, elevation, and location. My bike and I both also wear sensors to collect speed, cadence, and heart rate. The bike computer collects all of this and stores it on internal storage for transfer at the end of the ride.
I own a couple of Android-based cycling computers. Despite the open source underpinnings of these things, both are designed to deny me access to my own data unless I create an account with the manufacturer and upload my data to them. That is, I have to create a third-party online account to access my personal data despite the fact
that it's on an open source platform. Think about that for a moment. I complained to one of these companies, and I was told that it was all okay because they had a privacy policy. So how is this semi-relevant? Well, this is the same kind of problem that got me started working on Samba.
I was at a Big Ten university in the central networking support group. A lot of users on campus had Windows desktops, and that meant they were using SMB to access their file servers. Back then, if you were using a commercial SMB server, you needed a license for every seat, every connection, essentially, and the costs were high,
and the overhead of managing all of that was more than we could handle. So we used Samba, and I got tagged to become the local expert. The core problem from my perspective was this. Here we were at a public institution. Users were being charged license fees to access their own data on their own equipment.
A third party was effectively holding those user files for ransom. So let's talk about SmartNICs. A SmartNIC is a network interface card with brains. FPGAs are typically used to provide programmability,
but the current crop of SmartNICs are sporting powerful multi-core ASICs. I'm focused on that latter type. For those who may be interested, I'll mention that there is a talk on the schedule tomorrow that looks at the challenges of offloading to FPGA-based SmartNICs. Should be interesting.
SmartNICs are general-purpose network offload devices. They have multiple interfaces, meaning four Ethernet ports or something like that. They might have multiple physical layers. They could have Ethernet and InfiniBand and Fibre Channel and something else. They also are generally connected to the host over PCIe.
And the idea isn't new. If you're familiar with TCP offload engines or TOE cards, these have been around for a while. iSCSI adapters are essentially TOE cards with the iSCSI protocol stack on the card as well. From the host perspective, the card looks like a regular SCSI card connected to local drives.
SmartNICs are the same basic idea, except that they have a more powerful, more general and more versatile engine for running software. Some SmartNICs can even run Linux. The brains are a microprocessor with lots of IO capacity. That is a processor chip. These chips are called data processing units or DPUs.
They're basically ASICs. As an aside, there is at least one vendor out there using the term DPU for neural net deep learning processors. We'll ignore them. DPUs don't need to be on the network card. They can be on the motherboard. DPUs are typically RISC-based. ARM is very popular.
I know of one example of a MIPS-based DPU. And I was chatting with a colleague recently, and he was very hopeful we'd see RISC-V-based DPUs. I'm not sure what Intel is working on, but I'm sure they're working on something. And again, one key feature of a DPU is that it has lots of IO capacity. So let's add SMB to the mix.
SMB offload is not a new idea. One big-iron hardware vendor I worked with claimed to have built SMB/CIFS offload using FPGAs 10 years ago or more. I have no reason to doubt them. DPUs and SmartNICs just make SMB offload more practical.
And if we can offload SMB, we can also offload things like NFS, regular file systems, distributed storage. So here's what you need to know about SMB. Server Message Block, that's its full name, had a difficult upbringing. SMB was born in Florida, a product of the relationship between IBM and Microsoft.
It started out living with PC-DOS, then moved in with MS-DOS, and spent its formative years with OS/2. In case you're not familiar, these are all operating systems from the early days of personal computers. When IBM and Microsoft split, SMB found its forever home with Windows. In the mid 1990s, SMB went through
a bit of an identity crisis and changed its name to CIFS because the I stands for Internet: Common Internet File System. It was a marketing upgrade. For reasons unclear, however, the name got turned around. Instead of the new name for the new thing, CIFS wound up referring to the SMB protocol
as implemented in Windows NT4 and earlier. That was a nomenclature mishap from which it was difficult to recover. So the CIFS name was dropped. It is now only used in legacy documentation, like my book, long-lasting software, and third-party sales brochures. It's time to let it go.
The original SMB, which became CIFS, is now called SMB1. SMB2 was introduced with Windows Vista. SMB2 is a different protocol. It lacks all of the DOS and OS/2 baggage of SMB1. SMB3, however, is not a new protocol.
It's a dialect of SMB2. It was originally version 2.2. Another marketing upgrade and SMB 2.2 became SMB3, and it does have a number of enhanced features, including things like multi-channel and cluster failover and nifty things like that. SMB is still proprietary, still belongs to Microsoft,
but it is now well-documented and widely supported. Microsoft, in fact, is funding the development of the Linux SMB kernel client, the CIFS client. When you do mount -t cifs, Microsoft is behind that now.
So SMB offload. SMB1, CIFS, has been officially deprecated. It is disabled by default on newer versions of Windows. Offload code, therefore, should support SMB2 and SMB3. SMB2 and SMB3 messages may be signed, encrypted, compressed.
This can all be handled in offload. The basic marshaling and unmarshaling of the messages, the syntax layer, can be offloaded. A small amount of state information needs to be kept in the offload engine, and it has to be shared with the host above. That includes the encryption keys, which are provided by the upper layers,
and which capabilities and features are available, so that the card can handle the negotiation steps. State information is nicely spelled out, however, in the SMB2 and SMB3 documentation. IO operations can also be streamlined. How would this work in an initial implementation?
At the top of the SMB stack, there's the semantic layer. This deals with translations between Windows and POSIX behaviors, for example. Below that is something I call the fuzzy layer, probably a library or device driver interface. I'm not sure yet. SMB is also quite stateful, and this layer has to ensure that both the host
and the SmartNIC have access to correct and consistent state information. I call this layer fuzzy because it's not fully defined and is subject to change as I work. Collaborators, welcome. The syntactic layer, this is fairly stable, mostly concrete packing and parsing of messages. Things like handling byte order
and ensuring correct translation between wire and memory message field layouts. SMB3 also supports message signing, encryption, and compression, obvious candidates for offload. This design does not preclude building an entire SMB client and/or server on the SmartNIC. I just tend to approach problems in stages,
so I'm leaving the semantic layer on the host for now. Speaking of the semantic layer, here it is. This is where the serious thinking work gets done. Windows versus POSIX file system semantics, things like locking, identity, extended attributes, timestamps, file names, and much more
are all different between these systems. And there are some modern file storage models that don't conform to either of those two. So you have to adapt. There's the local file system interface, including synchronization between different access methods. There's metadata management, access controls, file attributes, et cetera.
For now, the semantic layer is not part of the offload engine. The first step is to build a sensible API between the semantic and syntactic layers so that they can be split between the host and the SmartNIC. This lets me leverage Samba code at the semantic layer, for example. In theory, multiple semantic layer implementations
could be swapped out or even run in parallel. And that brings us to the fuzzy layer. I need to shave it down to a sensible API. The layer needs to be able to manage and communicate the shared state, things like the set of negotiated features, encryption keys generated during authentication, active sessions, tree connects, which are mounts,
and open files. There must also be a way to manage the SMB engine on the SmartNIC. You have to be able to tell it what to do at some level. Okay, so what do I need to make the fuzzy layer happen? I need a rational, well-documented API.
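The shared state described above (negotiated features, keys generated during authentication, active sessions, tree connects, open files) could be modeled roughly as follows. This is only a sketch of the idea, not the project's API: the class names, field names, feature strings, and the `can_offload_sealing` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class SessionState:
    """Per-session state that both host and card would need (hypothetical)."""
    session_id: int
    signing_key: Optional[bytes] = None   # generated during authentication
    sealing_key: Optional[bytes] = None   # encryption key, also from auth
    tree_ids: Set[int] = field(default_factory=set)       # tree connects (mounts)
    open_file_ids: Set[int] = field(default_factory=set)  # open files


@dataclass
class SharedState:
    """Connection-wide state shared between the host and the offload engine."""
    dialect: Optional[int] = None                      # negotiated dialect revision
    features: Set[str] = field(default_factory=set)    # hypothetical feature labels
    sessions: Dict[int, SessionState] = field(default_factory=dict)

    def can_offload_sealing(self, session_id: int) -> bool:
        # The card can only encrypt for a session if encryption was
        # negotiated and the upper layers have handed down a key.
        session = self.sessions.get(session_id)
        return (
            session is not None
            and "encryption" in self.features
            and session.sealing_key is not None
        )
```

The point of the helper is that the card makes no decision on its own: it acts only on what the upper layers have negotiated and handed down.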
I need a stackable low level for adding new dialects and capabilities as they come out. This is not a stagnant protocol, it changes. I need some sort of interface layer, which could be a device driver or a library, a toolkit, or some combination of all those. And I need to be able to develop a community around this.
I need to have enough other people interested that I'm not the only one driving it forward. The SmartNIC and transport layers. This, of course, is the raison d'être for this effort. I need encryption, compression, message handling. I need to write all that up
and make sure it's running on the card. I need to be able to handle syntax errors in the messages. I need to support SMB3 multichannel. This is multiple streams coming in across multiple physical connections that are all combined into a single logical stream, something that SMB3 can do.
I need to support multiple transports, and I need to hide all those details from the upper levels of the stack. Virtual conferencing is kind of new to me. Normally, in a live presentation in a 40-minute time slot, I would be scrambling after 30 slides to finish up so that I could answer questions. I've done more than 30 slides,
and I've only taken up 15 minutes. So we're gonna dig deeper. Unless you're looking to join the Samba team, here's a bit more than you really wanted to know about SMB3. Keep in mind as we push through this that SMB3 is just SMB2 with features added.
An SMB3 implementation will support SMB2. There are four officially supported transports for SMB3. There's TCP, there's NBT, which is an older transport for SMB that's based on TCP and UDP. It's documented pretty well in my book and in RFCs 1001 and 1002.
There's RDMA, and there's QUIC, which is a newer protocol that provides secure and resilient transport over UDP. It was originally intended for web services, but Microsoft saw its potential for providing resilient SMB connections over Wi-Fi
and/or over WAN links. Multi-channel. Multi-channel is multiple network paths that can be used to transmit and receive SMB messages bound together in a single session. Each message is sent in its entirety over a single connection.
The response must be on the same connection. Multi-channel increases bandwidth and also provides resiliency for SMB3 connections. Transports can be mixed, meaning that a single logical client server SMB3 session may be carried over multiple transports and multiple logical paths.
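The multi-channel rules just described can be sketched in a few lines: each message goes out whole on one live channel, the response is expected on that same channel, and the session only stops when every channel is gone. This is a toy model under those assumptions. The class names, the channel labels, and the round-robin choice are illustrative, not how any particular implementation picks a channel.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Channel:
    name: str        # hypothetical label, e.g. "tcp0" or "rdma0"
    up: bool = True  # False once the cable has been pulled


@dataclass
class Session:
    channels: List[Channel] = field(default_factory=list)
    pending: Dict[int, str] = field(default_factory=dict)  # message id -> channel name
    _next: int = 0

    def send(self, message_id: int) -> str:
        """Send one whole message on a single live channel."""
        live = [c for c in self.channels if c.up]
        if not live:
            raise ConnectionError("no channels left, so the session stops")
        channel = live[self._next % len(live)]  # simple round-robin (an assumption)
        self._next += 1
        self.pending[message_id] = channel.name
        return channel.name

    def response_channel(self, message_id: int) -> str:
        """The response must arrive on the channel that carried the request."""
        return self.pending.pop(message_id)
```

Marking a channel as down and calling `send` again shows the failover behavior: traffic simply continues on whatever is still connected.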
So when SMB3 was first introduced at a conference in California several years ago now, it was introduced as SMB 2.2. They set up a small data center on the stage, and they had one person going around just unplugging
and replugging the connections, the wires, and SMB kept running. What was really interesting about this is that I know that they recompiled the Windows kernel the night before. This was all alpha code. And you know that alpha code generally generates blue screens and things like that. Not in this case, they ran for the entire presentation
without going down. So this was a really big deal. Again, multi-channel means that you can disconnect RDMA for example, and the same session which is also running on TCP will keep running because the TCP connection is still there. Disconnect the TCP and the RDMA will take over
if it's connected. If you disconnect all of them, okay, then it stops. But they were plugging and unplugging cables as they were doing the demonstration. It was really quite impressive. SMB messages may be signed, encrypted, or compressed.
I'm not sure, I probably should check, whether compressed messages may also be signed or encrypted. Signed messages of course include a security signature to detect tampering. Encrypted messages are encrypted. This is also called sealing. The encryption also provides verification just like signing does.
But it turns out that the encryption is actually faster than just signing the message. So full encryption is generally preferred. Compression was only recently added to the protocol. So there's still a lot to learn there. These three capabilities need to be negotiated at the start of the session in order to be used.
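To make the idea of a security signature concrete, here is a minimal sketch of signing and verifying a message. It assumes the older SMB 2.0.2 and 2.1 scheme, which as I read the MS-SMB2 specification is HMAC-SHA256 truncated to 16 bytes and computed with the signature field zeroed. The SMB3 dialects use AES-based signing, which is not in Python's standard library, so this is an illustration of the principle rather than SMB3 signing. The function names are hypothetical, and the header offsets are my reading of the spec.

```python
import hashlib
import hmac

HEADER_LENGTH = 64              # fixed SMB2 header size
FLAGS_OFFSET = 16               # 4-byte little-endian Flags field
SIGNATURE_OFFSET = 48           # 16-byte Signature field at the end of the header
SIGNATURE_LENGTH = 16
SMB2_FLAGS_SIGNED = 0x00000008


def _digest(buf: bytearray, key: bytes) -> bytes:
    # Hash the whole message with the signature field zeroed out.
    buf[SIGNATURE_OFFSET:SIGNATURE_OFFSET + SIGNATURE_LENGTH] = bytes(SIGNATURE_LENGTH)
    return hmac.new(key, bytes(buf), hashlib.sha256).digest()[:SIGNATURE_LENGTH]


def sign_smb2x(message: bytes, key: bytes) -> bytes:
    """Return a copy of the message with the signed flag set and signature filled in."""
    if len(message) < HEADER_LENGTH:
        raise ValueError("message shorter than an SMB2 header")
    buf = bytearray(message)
    flags = int.from_bytes(buf[FLAGS_OFFSET:FLAGS_OFFSET + 4], "little")
    buf[FLAGS_OFFSET:FLAGS_OFFSET + 4] = (flags | SMB2_FLAGS_SIGNED).to_bytes(4, "little")
    signature = _digest(buf, key)
    buf[SIGNATURE_OFFSET:SIGNATURE_OFFSET + SIGNATURE_LENGTH] = signature
    return bytes(buf)


def verify_smb2x(message: bytes, key: bytes) -> bool:
    """Recompute the signature and compare it with the one in the header."""
    if len(message) < HEADER_LENGTH:
        return False
    received = message[SIGNATURE_OFFSET:SIGNATURE_OFFSET + SIGNATURE_LENGTH]
    expected = _digest(bytearray(message), key)
    return hmac.compare_digest(received, expected)
```

Any change to the header or payload after signing makes verification fail, which is the tamper detection the signature exists to provide.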
So the first few messages, negotiation and session setup are raw, unencrypted, unsigned, uncompressed. Authentication is handled by these messages as well. The signing and sealing keys are generated during authentication. Network transport, multi-channel,
and all of the packet mangling for signing, sealing, and compression can all be offloaded to the SmartNIC and hidden from the upper layers of the stack. Okay, SMB command set. So this is my utilitarian breakdown of the SMB3 commands. It just helps me to set priorities
and keep track of which commands I've implemented and which I haven't. There are 19 commands in SMB3. Most of these have both a request and a response message associated with them. Some of them don't. Requests are sent by the client and servers send the responses. There is also one generic error response message type.
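For reference, the 19 commands can be written down as an enumeration, which also gives a simple way to track what has been implemented so far. The command codes follow the MS-SMB2 documentation as I recall it. The class name and the `not_yet_implemented` helper are hypothetical and not taken from the project.

```python
from enum import IntEnum


class Smb2Command(IntEnum):
    """SMB2/SMB3 command codes (the Command field of the header)."""
    NEGOTIATE = 0x0000
    SESSION_SETUP = 0x0001
    LOGOFF = 0x0002
    TREE_CONNECT = 0x0003
    TREE_DISCONNECT = 0x0004
    CREATE = 0x0005
    CLOSE = 0x0006
    FLUSH = 0x0007
    READ = 0x0008
    WRITE = 0x0009
    LOCK = 0x000A
    IOCTL = 0x000B
    CANCEL = 0x000C
    ECHO = 0x000D
    QUERY_DIRECTORY = 0x000E
    CHANGE_NOTIFY = 0x000F
    QUERY_INFO = 0x0010
    SET_INFO = 0x0011
    OPLOCK_BREAK = 0x0012


def not_yet_implemented(done):
    """Return the commands that are not in the given set of finished ones."""
    return [cmd for cmd in Smb2Command if cmd not in done]
```

Calling `not_yet_implemented({Smb2Command.NEGOTIATE, Smb2Command.ECHO})` lists the 17 commands still to do, in code order.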
The error response is normally very simple, but in special cases, additional data is tacked on to help the client make decisions about what to do next. The format of the additional data depends on the error code being returned. And by the way, a tree connect is the equivalent of a mount command.
A create is used to open a file. In POSIX, we can use open to open or create a file, and in Windows, we can use create to create or open a file. Backwards, only thing, same, the it's. Several of the commands have a very simple message format.
I call it the base message. Structure size is always four to indicate that the entire structure is four bytes long, and two of those bytes are reserved and unused. This makes it easy to implement support for nine of the message types. There's the logoff request response, the tree disconnect request and response,
the echo request and response, also the cancel request, lock response, and flush response. All of these are the same base message type. Because of the behavioral differences between POSIX and Windows systems, there's a lot to think about when implementing an SMB3 command.
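The base message described above is small enough to show in full. This sketch packs and parses the four-byte body: a two-byte structure size that is always four, followed by two reserved bytes. It assumes little-endian byte order on the wire, as SMB2 uses, and covers the body only, not the 64-byte header that precedes it. The function and constant names are made up for the example.

```python
import struct

BASE_FORMAT = "<HH"  # little-endian: StructureSize (2 bytes), Reserved (2 bytes)
BASE_SIZE = 4        # StructureSize is always 4 for the base message


def pack_base_message() -> bytes:
    """Build the body shared by the nine simple message types."""
    return struct.pack(BASE_FORMAT, BASE_SIZE, 0)


def parse_base_message(data: bytes) -> None:
    """Validate a base message body; raise ValueError on a syntax error."""
    if len(data) < BASE_SIZE:
        raise ValueError("base message is truncated")
    structure_size, _reserved = struct.unpack_from(BASE_FORMAT, data)
    if structure_size != BASE_SIZE:
        raise ValueError(f"unexpected StructureSize {structure_size}")
    # The reserved bytes are unused, so the sketch ignores their value.
```

Because all nine message types share this body, one pack function and one parse function cover them; only the command code in the header differs.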
Consider the echo. It has no payload and no repetition, even though SMB1 had both of these features. It is, however, only valid within an SMB session. So both negotiate and session setup have to be implemented first. You can't just test echo request
like you could with SMB1 without implementing the negotiate and the session setup. Also, you can't just reply to the echo from the SmartNIC. The echo is asking the question, is my session still active? The session must be active all the way to the semantic layer.
So you have to know the semantic layer is still running. Possibly the worst command to implement, both at the syntactic and the semantic layers, is the create. This thing is a kitchen sink of flags and modes and locking requirements and special conditions, and it supports a large set of what are known as create contexts.
These are additional blobs of data that ask the server to perform additional actions during the create. The create is, essentially, a mini protocol unto itself. The order of operations on the server side is also significant. And if something fails,
the server may need to back out of previous steps before replying. But from the client's point of view, the create needs to appear to be atomic. So time for a quick story. Years ago, late 1990s, I went to a conference on Unix Windows interoperability,
and there were a lot of important people there. And one of the important people was David Korn. He's probably best known as the author of the Korn shell, but he's done a lot of work in the Unix world over the years. And one of the things he was working on at the time was a POSIX implementation on top of Windows NT, so a POSIX emulation layer.
Now, he and his team did some analysis of the Windows create versus the POSIX open, trying to figure out how complex it was going to be to map one to the other. And what they determined was that the number of permutations for the POSIX open
was on the order of hundreds. And that's a lot. That's a lot of testing that has to be done to test all those permutations. But the number of permutations on the Windows create was in the millions. And this is before create contexts were added
to the SMB create call. So it gives you an idea of how complex it is to implement the create function on a POSIX platform, for example. I've said a lot of stuff so far. What does it all mean?
In theory, theory and practice are the same. In practice, they're not. This is one of my guiding principles in software development and in life. I can imagine how an SMB3 offload engine will run on a SmartNIC. That's the theory. In practice, the software comes first. I did some business with a startup just a few years ago.
The engineering team were very enthusiastic about the hardware they were designing, and it was a good design. But I asked them about the software they were going to run on that hardware, and they said they were working on it, back burner. When the hardware was ready, the software wasn't.
Potential customers weren't buying, so they refocused on software. But by the time the software was ready, the hardware was out of date. That startup didn't make it big. So what does an open source geek like me do when there's an itch to be scratched? Start a new project.
I named it Zambezi, because it needed a name, so I called it Zambezi. The library code is under the LGPL. Any program code is under AGPL. But see the notes on the project page about licensing. As is usual for me, the code is excessively well-documented. As you consider what we've covered today,
consider where else this code might find a useful home. Software-defined storage switches, proxy and cache servers, WAN accelerators, I worked with a few of those, it's interesting. Hypervisors, actually there are sub-protocols to SMB just for hypervisors.
How about remote access portals? So my near-term goals for the project, well, I wanna get it done. I wanna work with the SNIA, the Storage Networking Industry Association, to standardize the API once I've hammered it out. And I wanna fork a reference implementation
under the SNIA's license. I also wanna partner with others to implement this project on their SmartNICs. What companies are out there that are interested? Anybody doing a RISC-V SmartNIC? I also wanna find new and interesting use cases for this code.
So here are the links, my email address, and again, the project page. And that's the end.