
Asynchronous Directory Operations in CephFS


Formal Metadata

Title
Asynchronous Directory Operations in CephFS
Number of Parts
490
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Metadata-heavy workloads are often the bane of networked and clustered filesystems. Directory operations (create and unlink, in particular) usually involve making a synchronous request to a server on the network, which can be very slow. CephFS however has a novel mechanism for delegating the ability for clients to do certain operations locally. While that mechanism has mostly been used to delegate capabilities on normal files in the past, it's possible to extend this to cover certain types of directory operations as well. The talk will describe work that is being done to bring asynchronous directory operations to CephFS. It will cover the design and tradeoffs necessary to allow for asynchronous directory operations, discuss the server and client-side infrastructure being added to support it, and what performance gains we expect to gain from this.

Outline:
- overview of problem (why metadata operations are so slow on network filesystems) and proposed solution
- CephFS caps
- DIRUNLINK and DIRCREATE caps
- directory completeness and dentry revalidation
- asynchronous unlink
- inode number delegation
- asynchronous creates
- benchmarks
- what about error handling?
Transcript: English (auto-generated)
All right, welcome everybody. We have a CephFS talk, and we have two of the CephFS core contributors, Jeff and Patrick, so please welcome them and enjoy the talk.
Hi everybody, this is a talk about some work I've been working on for the last year or so. It's still quite leading edge, so be gentle. So first of all, oh yeah, sorry, let me apologize for this. Firefox, browse the web thing, yay Fedora. So anyway, just a little bit about us.
I'm a long-time kernel dev who has moved to doing work with Ceph recently, and I just recently took over maintainership of kcephfs. Zheng wanted to move on to doing more work in the MDS, and so I've taken over that part of it. And Patrick is also a contributor, or a lead person
on CephFS these days, and joined Red Hat in 2016, and he mostly shepherds the project along at this point. So anyway, what motivated this work is the realization
that anytime you're working with a network file system, NFS, Ceph, anything like that, metadata directory operations generally are pretty slow. If you're doing an open, an unlink, rename, anything like that, you almost always are doing a round trip, synchronous round trip to the server.
So someone will call into the kernel, we'll dispatch an RPC to the server, and then we have to wait for the reply to come in, and then finally we can return back to user land. Those are slow. So this affects a whole lot of different workloads. You're untarring files, rsync,
anytime you're removing a big directory tree, and also stuff like compiling software, basically anything you do that touches a file system is gonna be affected by that. So first of all, why are local file systems so much faster? Well, the obvious thing is that they don't have
a server to talk to, right? They don't have to make this long round trip. And then there's also some non-journal file systems that buffer their metadata mutations in memory. So these are stuff like ext2. Not so much in use these days. But in most journal file systems, the journal's pretty quick,
and so we don't tend to worry too much about the fact that we have to journal all this data in order to handle the crash recovery. So the consequences of that is that you can, they can batch out the writes to the journal, so you can do a whole bunch of, you can build basically a transaction
and then flush it out to the journal. But those operations are not guaranteed to be, especially when you're dealing with these non-journal file systems, the operations that you do are not guaranteed to be durable unless you have sync. So if you do a rename, unlink, anything like that,
it's possible that if the box crashes before that data hits the disk, you may, that operation may turn out never to have happened, even after you've returned back to user land. Now it turns out in most modern journal file systems, that's not such an issue. They almost all synchronously write to the journal
before they'll return to user land. But, you know, technically, you know, you're supposed to fsync in order to do that, in order to ensure that your operation is persisted on disk. I'm gonna let you take this part. So this is a, so, can you all hear me on this?
I don't know if the mic's working. No? It's loud? Speak up. Okay, I'll try to speak up. That's hard for me. All right, so, you know, just to give you all an introduction to CephFS, for those of you who don't know,
CephFS is a POSIX distributed file system. It's the oldest storage application that runs on Ceph. It was the original use case for Ceph back in around 2005. It's a cooperative file system with clients. In particular of note is that the clients
have direct access to the object storage devices. They're able to read and write all the file data blocks themselves. They don't have to go through any kind of metadata server. So the server that Jeff was talking about earlier is actually the metadata server. So that is the centralized services, there can be more than one,
that aggregate all the metadata mutations, journal them to RADOS and the metadata pool, and also serve to manage the cache between all the clients, making sure the clients are all consistent, and that the clients' caches are also coherent. So there's a capability mechanism that the MDS has
to give the clients rights to do things like read or write from a file, or keep track of what entries exist in a directory, and that's all cooperatively maintained by the clients and the MDSs. The clients are considered trusted in this Cephefes model,
so they're not going to misbehave in any way, because, namely, they do have direct access to the data pool, but they're also expected to maintain their caches coherently with the MDS. So Jeff's going to talk in particular in this talk, focusing on these RPCs at the top
between the client and the active metadata server. So how does the MDS manage all this, mediate between the different clients? Well, it has this mechanism that we, it's called the CAP subsystem, short for capabilities.
And basically, capabilities, if you're familiar with something like NFS or SMB, is very similar to like a delegation or Oplock, but they're more granular. In particular, they come in several different types of flavors, so we have a pin, auth, file, link, etc. And so they, a lot of that sounds pretty obvious.
Pin just ensures that the thing doesn't go away, and auth ensures that, or pin actually ensures that it doesn't float between MDSs, I believe, actually. So we ensure that that thing is pinned to a particular MDS while the operation is going on. Auth covers ownership and mode, file is a big cap,
I'll talk about that in a minute. Link is the link count, primarily. And then xattr covers xattrs. They pretty much all have a shared and exclusive variety, so we can hand out shared or exclusive caps to a client for them to buffer operations,
or cache operations. But the file caps are a little special. They have a whole bunch of other different bits. And if you see down here too, the way we express caps and track them in all the code is via bit mask. And so this part down here is more or less showing you sort of how the bits
are laid out for the different caps. The thing to notice about the file caps is that they are pretty extensive. So we have shared and exclusive, of course, but there's also cache, read-write, buffer. I believe that one is an append. And then there's a lazy IO, which is sort of a weirdo thing to allow it to not have to talk to the MDS so much.
But mostly here we're talking about directory operations. And so traditionally, the MDS has not really given out much in the way of caps on directories. So we will give out shared caps pretty much,
but exclusive caps, not so much. And then in order to try to speed up directory operations, what we want to do is start allowing the clients to do a bit more locally. And so to do that, we have extended,
or overloaded the file caps to have different meanings on directories. So in particular, we want to allow create and unlink. Those are the two that at least we're starting with. So basically, we're also going to have the MDS hand out exclusive caps.
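To make the bitmask idea concrete, here is a small, self-contained C sketch of how per-field cap bits can be laid out and checked, including a guess at how the create/unlink overloads might sit in the file-cap field. It is loosely modeled on the kernel's include/linux/ceph/ceph_fs.h, so treat the exact names and values as illustrative assumptions rather than the authoritative definitions:

    /* Sketch of CephFS-style capability bitmasks (illustrative only). */
    #include <stdio.h>

    /* generic per-field cap bits */
    #define CAP_GSHARED    1    /* client may cache/read this metadata */
    #define CAP_GEXCL      2    /* client may buffer changes (implies shared) */
    #define CAP_GCACHE     4    /* file: may cache reads */
    #define CAP_GRD        8    /* file: may read */
    #define CAP_GWR       16    /* file: may write */
    #define CAP_GBUFFER   32    /* file: may buffer writes */
    #define CAP_GWREXTEND 64    /* file: may extend (append to) the file */
    #define CAP_GLAZYIO  128    /* file: relaxed "lazy IO" consistency */

    /* each inode field gets its own shift within the cap word */
    #define CAP_SAUTH  2        /* ownership, mode */
    #define CAP_SLINK  4        /* link count */
    #define CAP_SXATTR 6        /* xattrs */
    #define CAP_SFILE  8        /* file data (the "big" one) */

    #define CAP_PIN    1        /* keep the inode pinned */

    #define CAP_FILE_SHARED (CAP_GSHARED << CAP_SFILE)   /* "Fs" */
    #define CAP_FILE_EXCL   (CAP_GEXCL   << CAP_SFILE)   /* "Fx" */

    /* On directories some file-cap bits are overloaded for async dirops;
     * the exact bit assignments below are an assumption for illustration. */
    #define CAP_DIR_CREATE  (CAP_GCACHE << CAP_SFILE)    /* may create asynchronously */
    #define CAP_DIR_UNLINK  (CAP_GRD    << CAP_SFILE)    /* may unlink asynchronously */

    int main(void)
    {
        unsigned int held = CAP_PIN | CAP_FILE_SHARED | CAP_FILE_EXCL | CAP_DIR_UNLINK;
        unsigned int want = CAP_FILE_EXCL | CAP_DIR_UNLINK;  /* async unlink needs these */

        printf("can consider async unlink: %s\n",
               (held & want) == want ? "yes" : "no");
        return 0;
    }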
So you'll notice too, we have sort of a shorthand notation here as well for how the caps work, or how the caps are expressed. So internally in the MDS, typically whenever we have to do a directory operation, the MDS has to go gather a bunch of locks to ensure that other MDSs don't come in
and try to do something. And so Zheng, sort of the lead developer on the MDS, has developed a new lock caching facility. So he can basically have an MDS gather locks for an operation on a directory,
and then cache those for later use if it needs to do another. So essentially what happens is we only hand these out though on the first synchronous create or unlink in a directory. So now let's talk a little bit about dentry caching.
So again, we don't want to always have to do a synchronous round trip to the server to do a lookup or something like that for a directory entry. And when I talk about a dentry, what I'm talking about is a path name component within the file system. So in order to do an asynchronous directory operation,
we need to reliably know that our cached information about the directory is correct. We can't go and fire off an unlink and then find out later that, oh, that file didn't actually exist. So that's not allowed. So we have two mechanisms for tracking dentries. We have dentry releases,
and they just come in positive or negative flavors. And then we can also hand out Fs, that is, shared file caps on a directory, or exclusive caps on a directory; exclusive implies shared. And so for the latter, if we just get the caps on a directory,
we don't actually know anything about the dentries that are in it. So we have to either have done a readdir, a full readdir on the directory, or we have to know what the state of all the dentries is in the directory. So for instance, if we create a new directory, we know it's empty and we can consider it complete. And this allows us, so we do this,
use this today actually, because this allows us to do lookups, even negative lookups on a directory without having to talk to the MDS. If we know that our information about the directory is complete and someone asks for some dentry that we know is not there, we don't have to talk to the MDS. We can just say, no, that doesn't exist.
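As a toy illustration of that, here's a little standalone C model; the structure and names are invented for the example and are not how the kernel client's dcache is actually organized, but it shows why a complete cached directory lets a lookup miss be answered as a definite negative without asking the MDS:

    /* Toy model of "directory completeness" for local negative lookups. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    struct cached_dir {
        bool complete;        /* we have seen every dentry (e.g. after a full readdir) */
        const char **names;   /* cached positive dentries */
        int n;
    };

    enum lookup_result { FOUND_LOCALLY, DEFINITE_NEGATIVE, MUST_ASK_MDS };

    static enum lookup_result lookup(const struct cached_dir *d, const char *name)
    {
        for (int i = 0; i < d->n; i++)
            if (strcmp(d->names[i], name) == 0)
                return FOUND_LOCALLY;
        /* a miss is only authoritative if our view of the directory is complete */
        return d->complete ? DEFINITE_NEGATIVE : MUST_ASK_MDS;
    }

    int main(void)
    {
        const char *names[] = { "a.txt", "b.txt" };
        struct cached_dir d = { .complete = true, .names = names, .n = 2 };

        printf("complete dir, lookup(c.txt) -> %d\n", lookup(&d, "c.txt"));   /* definite negative */
        d.complete = false;
        printf("incomplete dir, lookup(c.txt) -> %d\n", lookup(&d, "c.txt")); /* must ask the MDS */
        return 0;
    }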
So now let's talk about doing actual asynchronous operations. So we'll start with talking about what happens today. It's pretty typical, similar to NFS or SMB or anything like that. When you do a synchronous operation, someone calls unlink down into the kernel, for instance.
We do that synchronously. We dispatch a call to the MDS. We have to wait for the reply to come in, and when the reply comes in, then we can return back to user land. But this can be really slow, right? I mean, think about it, doing an rm -rf on a directory: we're gonna do a readdir, find out what all's in there, and then we go issue an unlink on each file. And each of those is a round trip, so that's very, very slow.
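In plain POSIX terms, that kind of removal boils down to the sketch below; nothing here is Ceph-specific, and it only handles a flat directory, but on a network filesystem each unlinkat() and the final rmdir() has traditionally been a synchronous round trip to the server:

    /* Flat-directory removal: readdir, one unlink per entry, then rmdir. */
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int remove_flat_dir(const char *path)
    {
        DIR *dir = opendir(path);
        if (!dir) { perror("opendir"); return -1; }

        int dfd = dirfd(dir);
        struct dirent *de;
        while ((de = readdir(dir)) != NULL) {
            if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
                continue;
            if (unlinkat(dfd, de->d_name, 0) != 0)   /* one round trip per file */
                perror("unlinkat");
        }
        closedir(dir);
        return rmdir(path);                          /* and one more for the directory */
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <dir>\n", argv[0]); return 1; }
        return remove_flat_dir(argv[1]) ? 1 : 0;
    }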
So in fact, here's a diagram that kinda shows the procedure for a synchronous unlink. So here we have, we did an open on the directory, right?
And that's gonna get information about the directory. We get capabilities for the directory. Let's say we've got exclusive caps. We do a readdir to fill it, so that we know what all the dentries are. And then we go back, and that gets a reply.
And then we do an unlink, and then that comes back. And then we do another unlink, and that comes back, and so on and so forth. And then finally, we can do an rmdir at the bottom. So if we're gonna do these asynchronously, we have a decision to make. Do we wanna wait to transmit them,
or do we want to just go ahead and fire them off as soon as we get them, right? And so it's natural to think about, when you're talking about asynchronous operations, it's natural to think about, like buffered IO in the kernel. In that case, we are writing to a cache, you know, page cache in the kernel, and then eventually we flush that out.
And so the deal with the kernel, with writes though, is that any time you do a write in the kernel, there's a pretty high probability that, you know, that there will be a follow-on write a little later that may also modify that data. So often it's advantageous to wait a while before you flush these things out.
Not so much on directories. You know, workloads that repeatedly create and unlink the same dentries are pretty rare. So, you know, at least at this point, we're operating under the assumption that there's not a lot to be gained by delaying them. And so as soon as someone calls this, we're just gonna go ahead and fire off the call,
and then we just won't wait on the reply. We may change this in the future. There are some workloads, things like Rsync, will often create a file, write to it, and then rename it into place. That may be more advantageous to do that.
Can I wait until the end? So, okay. That may be more advantageous to, so we may, you know, in the future consider doing this differently. So an asynchronous unlink. So how do we do this, right? So first of all, we have to get, you know,
exclusive caps on the directory, and we also need an unlink cap, which typically means that we have to have done a synchronous unlink in the directory first. We also need to know that the dentry is positive, right? So we have to know that the file exists. And then also we, there's also a concept in Ceph, it's sort of exclusive to Ceph,
that it has to be the primary dentry for the file. Ceph has a really strange way of tracking hard-link files, and so we exclude, basically you're excluding hard-link files from this, at least for the time being. So the idea here is that we're gonna fire off the unlink call to the MDS,
and then we're just gonna assume that it worked, right? And then not wait on the reply. So we can fire the thing off, we go ahead and delete the dentry inside the kernel, and then we return back to user land. You know, the upshot of that is that if we are doing a whole bunch of these, we are shoveling them all out in parallel, and that really can speed up, you know,
like removing a directory recursively. So here's our diagram again. Pretty much all the same up here, but down here you can see we're firing off lots of, you know, async unlink requests, and then they come back, and then we eventually, once we've, you know, once all the replies have come in,
we can go ahead and issue an rmdir request to remove the directory. So again, this is real bleeding-edge work, so these numbers may change in the future, but for now, you know, I just did some real basic testing
on a virtualized test environment on my home machine, and so I basically created 10,000 files in a directory. So, and I should say here, too, when we did start this work, we started with unlink because it's easier. Creates are, you know, quite a bit harder, and I'll go into why that is later,
but so here you can see, if we just remove all the files in that directory, if we have to wait and do them all synchronously, it took about 10 seconds on this box, but with asynchronous dirops, less than a second. The catch is that we have to wait
for all the replies to come in before we can issue the rmdir. So again, here, I'm just removing all the files in the directory. I didn't actually remove the directory itself. If you go and remove the directory itself, you'll find that it blocks for a while. You know, it's still faster than the synchronous case,
because we're issuing all these unlinks in parallel, but you do notice a delay on the rmdir part. Here's some more numbers. So these are histograms that I created with BPF. If you've not used BPF yet, you should.
It's awesome. But, so you can see here, this is the time spent in Ceph unlink. These are in jiffies, which is a millisecond. And so, here you can see synchronously that these are all quite slow, right? The fastest one is still 512,000 milliseconds.
Over here, we're down to 1,000, 1,000, 2,000. And you see down here, there's one outlier right here that's probably, and some of them are still outliers. I haven't gone to go figure out why some of them are not going as fast.
I think occasionally we get a situation where they go synchronous. And again, we'd have to do at least one synchronous remove in the directory too, or one synchronous unlink in the directory before we can do an async one. Jeff, I think those are actually microseconds.
So on the right side, I thought it was 1,000 microseconds. I did these in jiffies, so yeah. So I'll have to go back and look. Maybe I'm off by a factor of 1,000, but I don't know. Yeah, maybe you're right.
Okay, I stand corrected. I did do it in jiffies though. So we have some opportunities to improve this situation too. Right now we're doing a synchronous rmdir, but we may consider doing that asynchronously in the future.
Again too, what I find in certain cases is that, like I said, we have some outliers here, and I think what happens is occasionally we end up doing something synchronously, and then those operations get backed up behind the pile of async operations that are in flight.
So we probably may need to consider doing some throttling on this. And then also we can consider batching up the unlink operations as well. So that we can just, if we could batch a bunch of them up, fire them all off in a single call, that might be more efficient.
I'm not convinced on this. The lock caching thing that Zheng has put in seems to me that that's where most of the slowdown would be. So I'm suspecting that that may not be as useful, but we may experiment with it and find out. If there is a benefit, it's probably with the MDS journaling. We might expect to see the MDS can write the operations
to the journal more efficiently. That's a good point, yeah. Okay, now let's talk about async create. So if we do, so the requirements for an async create, we need DX and DC caps. Again, we've overloaded the file cache cap for this.
We also need a known negative dentry in this case, of course, because if there's already a file there, we can't do a create on top of it. And so we need either a negative dentry release on the thing, or we need, you know, we already have DS on the parent directory by virtue of the fact that we have DX, but we need completeness otherwise, if we don't have the release.
We also need a file layout. So like Patrick pointed out, file data goes through directly to the OSDs. The clients need to know where to write that data. And so the file layout is what tells them that. When we do that first synchronous create in a directory,
then we copy that file layout, because we know that any new file created in a directory will inherit the file layout from the parent. And then we also need a delegated inode number. Whenever you do a create, you know, we're creating an inode, we're creating a dentry, that inode, we need to know what that inode number is.
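Pulling those requirements together, here is a toy C sketch of the client-side decision. The struct and field names are made up for illustration; it is not the kernel client's real bookkeeping, just the checklist from above expressed as code:

    /* Conceptual checklist for whether a create can be done asynchronously. */
    #include <stdbool.h>
    #include <stdio.h>

    struct async_create_state {
        bool have_excl_and_create_caps;  /* exclusive caps plus the overloaded create cap */
        bool dentry_known_negative;      /* negative dentry release, or the dir is complete */
        bool have_file_layout;           /* copied at the first synchronous create */
        bool have_delegated_ino;         /* an inode number from the delegated range */
    };

    static bool can_async_create(const struct async_create_state *s)
    {
        return s->have_excl_and_create_caps &&
               s->dentry_known_negative &&
               s->have_file_layout &&
               s->have_delegated_ino;
    }

    int main(void)
    {
        struct async_create_state s = { true, true, true, false };
        printf("async create allowed: %s\n", can_async_create(&s) ? "yes" : "no");

        s.have_delegated_ino = true;   /* e.g. after the first synchronous create */
        printf("async create allowed: %s\n", can_async_create(&s) ? "yes" : "no");
        return 0;
    }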
I'll talk about that a little bit more here in a minute. So again, we just fire off the create call immediately. We set up the new inode, plug it into the dentry, and move on and return from the open. And we always assume that a newly created inode gets a full set of caps, because we know that,
typically, whenever you do a synchronous create on a new file, we get back a full set of caps on it. And we always set O_EXCL in the call, just because there should be no reason that we can't. You know, the file's not supposed to be there, so if the file turns out to be there, we want it to error out. We don't want to open the existing file and screw that up.
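The O_EXCL behavior being relied on there is just the standard POSIX one; a minimal sketch (the path is only an example):

    /* With O_CREAT|O_EXCL the create fails with EEXIST rather than silently
     * opening an existing file -- the safe behavior to insist on when the
     * create is being fired off asynchronously. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "newfile.txt";   /* example path */

        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) {
            if (errno == EEXIST)
                fprintf(stderr, "%s already exists, refusing to reuse it\n", path);
            else
                perror("open");
            return 1;
        }
        /* ... write to the brand-new file ... */
        close(fd);
        return 0;
    }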
So inode number delegation. We need to know in advance what the inode number's going to be whenever we create this new inode. That's in order to, A, hash it properly in the kernel so we can find it later, and also to allow for writes into the file
before the open reply comes back. So we hand out ranges of inode numbers in the create responses now. So whenever we do that first synchronous create, the MDS will shovel out a pile of inode numbers to the client, and then it can go and use those. We have a new tunable in the user land MDS for this; the MDS already preallocates inodes
for a particular client and attaches them to its session, and so what we're doing is just delegating a pile of those to the client for use. So this is tied currently to the MDS session, and I think we need some work in this area still: right now, if you lose the session, the inodes go away, and it's not clear to me how we're going to handle a case where we've already fired off a request and it didn't work. So error handling on this is all still a bit sketchy.
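As a rough illustration, consuming a delegated range on the client side amounts to something like the toy allocator below; the names and the range are made up, and the real client ties this to its MDS session rather than a bare struct:

    /* Toy model of a delegated inode-number range. */
    #include <stdint.h>
    #include <stdio.h>

    struct ino_range {
        uint64_t next;   /* next inode number we may assign locally */
        uint64_t end;    /* one past the last delegated number */
    };

    /* returns 0 and fills *ino, or -1 when the range is exhausted */
    static int alloc_delegated_ino(struct ino_range *r, uint64_t *ino)
    {
        if (r->next >= r->end)
            return -1;            /* out of delegated numbers: fall back to a sync create */
        *ino = r->next++;
        return 0;
    }

    int main(void)
    {
        struct ino_range r = { .next = 0x10000, .end = 0x10003 };  /* made-up range */
        uint64_t ino;

        while (alloc_delegated_ino(&r, &ino) == 0)
            printf("async create may use inode 0x%llx\n", (unsigned long long)ino);
        printf("range exhausted, next create goes synchronous\n");
        return 0;
    }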
So let's talk about performance. Again, we're creating 10,000 files in a directory here.
So I'm doing a very simple shell script loop to just write to all these files. So without async directory operations, takes about 11 seconds on this box. With it, about half, so a slight improvement. Again, histograms, again, if you see this with BPF,
you'll see like the nice bars, text bars outside, but I've chopped those off here just because I didn't have room on the slide. But here you can see these are all quite slow up here. Most of them are in this range right here, 512K to one million.
Over here with async dirops, we're down into the thousands. We do have some outliers here. Again, I need to go do some analysis to figure out why, but I think the situation there is that we get to a point where we run out of inode numbers. And so when that happens,
the client has to go synchronous. And some of these calls end up backed up behind some of the previous async calls, and then they take a long time to come back. So we probably have some work to do here, too, again with throttling. The all important kernel build test.
So I just, you know, took a little Linux tarball, made a directory, cd'd into it, untarred it, and built the thing. So here, just about five minutes to do the build. With async dirops, we shave about 50 seconds off. It's about a 20% improvement, not bad.
So again, opportunities to improve this would be, we could allow for in-place renames. You know, again, we may need to make an asynchronous rename. We could also batch creates as well. If we buffer them for a little while,
we don't necessarily have to fire them off immediately. And then we could also do other operations asynchronously; mkdir, symlink, stuff like that would be kind of nice to add, probably. And of course, error handling, which is the bugaboo for this whole thing. So if we return early, right, the error handling is where this is all iffy.
So if we return early from an unlink or an open, what would we do if these fail, right? I mean, that's the big question. For creates, we could have already closed the file by the time the reply comes in. I mean, if they're small files in particular, we could have opened them, written to them, closed them, and then all of a sudden the create reply comes in
and we find out it didn't work for some reason. I'm not sure which failures are permitted by the protocol. Patrick put that in. You wanna mention what that was about? Yeah, I think so. So, you know, obviously there's a lot of different types of failures
for an unlink or open that could happen. And part of the challenge with doing an asynchronous create or unlink is identifying which failures are entirely the responsibility of the client, such that it would not be permitted for the MDS to give you those failures as part of the asynchronous call,
versus failures that may occur at the MDS, for example, ENOSPC, which the kernel client would need to handle somehow. ENOSPC actually in practice for metadata mutations just doesn't happen in Ceph unless you have a catastrophically out-of-space Ceph cluster, in which case you have many other problems.
But there's probably not very many failures that the kernel client can actually handle itself, but we still need to go through all the different cases there and make sure we're not missing anything. That's basically, I think, what that bullet point is for. So again, when I started this work, I kinda hung the whole thing on this paragraph, right?
That comes out of the fsync man page, right? And basically it just says calling fsync does not necessarily ensure that the entry in the directory containing the file has also reached the disk. For that, an explicit fsync on the file descriptor for the directory is also needed. So the upshot being that when you create a file and write to it, you also need to fsync
on the directory too to ensure that the dentry actually made it to disk. In practice, on most file systems nowadays, you don't need to do that anymore. This was written, I believe, when ext2 was prevalent. Nowadays, almost all modern file systems journal the create before they return back to user land.
And so if the box crashes and comes back or whatever, the file will certainly exist. Just to add onto that, you still see the remnants of very cautious applications that are written to fsync on the database, namely SQLite, which actually has some very excellent documentation for the entire rigmarole process of synchronously writing
the database correctly, such that it'll survive any kind of crash or machine failure. So they do actually do the fsync on the directory file descriptor, but you'll see as many, especially kernel developers, it's actually fairly rare for applications
to use fsync correctly. So yeah, I mean, in most cases, fsyncing on the directory is usually pretty quick, because there's not much to be written back. But here, at least in this implementation too,
if you do an fsync on the directory, we will wait for all the async operations to come back. So we can use that as a barrier to ensure that things actually did hit the MDS and did the right thing.
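In plain POSIX terms, the "fsync the file, then fsync the directory" pattern from that man page quote, which with this work also doubles as the barrier just described, looks roughly like this (the paths are only examples):

    /* Create and write a file, then fsync the file and its parent directory. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("somedir/newfile", O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) { perror("open file"); return 1; }
        if (write(fd, "hello\n", 6) != 6) perror("write");
        if (fsync(fd) != 0) perror("fsync file");   /* file data and inode */
        close(fd);

        int dfd = open("somedir", O_RDONLY | O_DIRECTORY);
        if (dfd < 0) { perror("open dir"); return 1; }
        /* fsync on the directory covers the dentry; with async dirops it also
         * acts as a barrier for outstanding requests and reports their errors */
        if (fsync(dfd) != 0)
            perror("fsync dir");
        close(dfd);
        return 0;
    }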
So more about error handling. Currently, after a failed unlink, what we do is we mark the directory non-complete, because we don't know what the heck happened. We invalidate the dentry that was there and force the client to have to do a new lookup. And then we set a writeback error on the parent directory, so that if you do an fsync on it, you'll get an error back. After a failed create, we again invalidate the dentry.
We should also mark it noncomplete. We set a writeback error on the parent directory again, and we set a writeback error on the created inode. So this is an area where I'm still exploring what we should do. So one idea might be to propagate errors
all the way back up to the top of the mount, so that you could potentially open a high-level directory, call an fsync on that, and then find out, and then get an error back. And then if you see that error, then you know something failed down below. In the modern world, where we're doing a lot of stuff in containers, spinning up temporary containers to do
builds and stuff like that, that may be sufficient. If something fell over during the build, you can just throw that build away and start again. And we may need to consider new interfaces. Oh, another idea might be to use syncfs. syncfs in the kernel has pretty lousy error handling
right now. Basically, the only error you can reliably get back from syncfs is EBADF, if you pass it a bad file descriptor. But we could use that as well. And that's it. Questions?
Jeremy, you had a question earlier. Oh, sorry. So you kind of answered that later on, but I was wondering if the Ceph protocol had the equivalent of the asynchronous, the thing with SMB
where you would basically chain operations together. We have an asynchronous open with delete on close marked, and then a close. You chain the things together, and then you can issue those asynchronously or together in one RPC. And basically, the server then processes them in whatever order it wants, and you're just waiting for them to come back. Yeah, yeah. We don't do compounding.
OK. More's the pity. Yeah, compounding, I mean, that might be a future enhancement. Yeah, I would love to see that. It makes a great difference. Ceph wasn't designed with that from the get-go, and so it doesn't happen. I mean, especially if you're doing an open, a lot of operations, and a close, the ability to implicitly sequence all those and use the file handle implicitly from the open is very powerful.
Yeah, I'd love to have that, but we don't have it today. Any other questions? We're working on an open source backup software named Bareos. I'm sorry, can you speak up a little? Pardon? Oh, can you speak up a little? OK. We are making an open source backup software
named Bareos, and I hope this question isn't too off topic. GlusterFS has something named glusterfind, which can give us a list of files and directories changed since a certain point in time. Does CephFS have anything like that? Not that I'm aware of.
Anybody? Greg, you may know, or Patrick. OK, yeah. So CephFS has this concept of recursive statistics, which is like stats on an inode, except it's recursive in nature. So you can figure out hierarchically
how many files are under a directory tree by just looking at a specific extended stat attribute of the directory. One of those things you can look at is the version of the file. So you can actually see that, and that trickles up all the way to the root. So if a file has been changed recently,
you can look at the recursive statistics of the directory to see that something under it has changed and just keep going down and examining the files. It's not quite as efficient as the GlusterFS approach, where you can get the entire diff and see all the files. But the mechanism is there to actually do it yourself.
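Doing it yourself comes down to reading CephFS's virtual xattrs; a minimal C sketch, assuming the documented names ceph.dir.rfiles, ceph.dir.rbytes and ceph.dir.rctime (double-check them against the release you are running):

    /* Read CephFS recursive statistics from a directory's virtual xattrs. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    static void show(const char *path, const char *name)
    {
        char buf[128];
        ssize_t n = getxattr(path, name, buf, sizeof(buf) - 1);
        if (n < 0) { perror(name); return; }
        buf[n] = '\0';
        printf("%-18s %s\n", name, buf);
    }

    int main(int argc, char **argv)
    {
        const char *dir = argc > 1 ? argv[1] : ".";   /* a directory on a CephFS mount */

        show(dir, "ceph.dir.rfiles");   /* number of files under this subtree */
        show(dir, "ceph.dir.rbytes");   /* bytes under this subtree */
        show(dir, "ceph.dir.rctime");   /* most recent change time in the subtree */
        return 0;
    }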
Although in the future, we do want to make it simpler by building support like GlusterFS's into CephFS and making it a first-class citizen of the file system. So not quite, but you could do it yourself if you want. Any other questions? Jeremy again.
So for returning errors, I thought that was interesting. The only errors I can see, other than the remote disk died, are if someone does a rename of the directory out from under you, or if someone changes the permissions on the directory so that your operations would get access denied. I'm assuming that you would get a lease on the directory first, such that they would be blocked
or you would get a pending notification before that. That's how you would handle that? Right, right. That's what those caps are all about. They're kind of similar to an Oplock release, just on part of the metadata. But yes, that would not be allowed while we had exclusive caps on the directory. And you can't do this unless you do. Got it. OK, thanks.
Any other questions? So just to add on to that, we didn't really get into this in too much detail, but you heard Jeff several times say the first synchronous create or the first synchronous unlink. Our diagram was actually a bit wrong on that that we showed.
Not all of the unlink requests are completely asynchronous. The first one has to be synchronous. And that's just so that MDS can acquire all the necessary locks and also issue any necessary capabilities so that you can do any further unlinks yourself. So that first request is actually synchronous. And one of the reasons for that is you need to get the file layout for the create, which
lets you know how to actually stripe the data blocks across multiple objects for the new file. And then also ensures that because file layouts can also be set hierarchically on a subtree, that you know that no directory above you has had its file layout changed,
such that the file layout on a new file should be something else. So the client can safely move forward knowing that the file layout hasn't been changed out from under it by operations on a higher level directory. And all that's protected by the caps that the MDS is issuing.
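Relatedly, a client can inspect layouts through CephFS's virtual xattrs; a small sketch, assuming the documented names ceph.file.layout and ceph.dir.layout, where the directory one is only present when an explicit layout has been set on that directory:

    /* Print the layout xattrs of a file and of a directory on a CephFS mount. */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    static void show_layout(const char *path, const char *name)
    {
        char buf[256];
        ssize_t n = getxattr(path, name, buf, sizeof(buf) - 1);
        if (n < 0) {
            if (errno == ENODATA)
                printf("%s: no explicit %s (layout inherited from an ancestor)\n", path, name);
            else
                perror(path);
            return;
        }
        buf[n] = '\0';
        printf("%s: %s = %s\n", path, name, buf);
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <file> <directory>\n", argv[0]);
            return 1;
        }
        show_layout(argv[1], "ceph.file.layout");   /* the layout the file was created with */
        show_layout(argv[2], "ceph.dir.layout");    /* explicit layout on the directory, if any */
        return 0;
    }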
Any other questions? So this is all still experimental, but part of it's already in the code. So when will this be available? Will this be 16 or next release?
Like not the upcoming one, but the one after? Or is it like a multi-year effort? I think the userland part of it, the MDS part, is in. We had to get that part in before we could even figure out whether this was going to be worthwhile. And so we've got that part in mostly.
I think it is worthwhile. We're going to keep pursuing it. The kernel bits, I have prototype code, of course, that works, but it has some rough edges. It won't make this merge window. So 5.7 would be the absolute earliest in the kernel part, but probably later.
I have a feeling I'm going to probably want to do a cleanup of the syncfs code in the kernel, and then turn around and maybe plumb the error, have that, use that for the error handling, recommend that we use that for the error handling. And that probably will be a bit of an effort because the syncfs code needs some work.
Any other questions? Just to add on to the answer to that question. So the MDS bits are actually going to be in Octopus, which will be released in March. So all the groundwork is there. The StepFuse client probably won't have support
until the P release, which we're calling Pacific, so in 2021, that'll be available. But the kernel client, I think, which is getting most of our focus right now, should, hopefully, we'll have something merged into mainline within a few months, assuming everything's going as we expect.
Yeah, I think the unlink code probably can go in fairly soon. I almost merged it, I almost had to merge it for this coming merge window, but I decided to wait because I'm gonna have to rework some of the underlying code for the creates to be better. And I wasn't comfortable merging the unlink part
until I felt better about what that part would look like. So it's not quite there yet, maybe the latter part of this year, but my guess is that we'll probably have something in before too long. And this will probably be an optional feature too. Right now, I've got it set on a module parameter,
but we may eventually make it a mount option or something like that, so, okay. I think we're out of time now, so if you've got any questions, let me know. Thank you.