
FOSDEM 2009: Ext4


Formal Metadata

Title: FOSDEM 2009: Ext4
Number of Parts: 70
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
This presentation discusses the history of ext4, its features and advantages, and how best to use the ext4 filesystem. The latest generation of the ext2/ext3 family of filesystems is ext4, which recently left development status as of kernel 2.6.28. With extents, delayed allocation, multiblock allocation, persistent preallocation, and its other new features, it is substantially faster and more efficient than the ext3 filesystem.
Transcript: English (auto-generated)
First of all, I'd like to thank you all for coming, and I'd really like to thank the FOSDEM organizers. This is actually my first time presenting at FOSDEM, first time I've been at FOSDEM, and it's been a lot of fun.
I really enjoyed myself, so thank you all for coming, and thank the FOSDEM organizers for inviting me. Please bear with me, this is a presentation I had to create on the fly, because my primary laptop got stolen at the Brussels train station. I had this cool demo I was going to show, showing how quick FSCK was on a file system that I'd been using since July, except it got stolen.
WikiTravel has all of this stuff about things you have to be careful, pickpockets work in teams, they'll distract you, grab your laptop bag, I'm here to tell you, it's all true.
So, yeah, I learned that the hard way. Anyway, fortunately I happened to be carrying a backup netbook that I was planning on using for crash and burn testing. I was actually planning on getting some development work done over the weekend, which didn't happen, but that's okay. So why don't we get started, and first let me talk a little bit about some of the good things about the ext3 file system.
It's probably the most widely used file system in Linux, and it's code that therefore has been around for a long time. People trust it, it's been well shaken down.
Also very, very important, and this is probably one of those things that a number of file system efforts before didn't really pick up on. ext3 has an extremely diverse development community, which means we have developers from Red Hat, from ClusterFS, which was since purchased by Sun.
We have developers from SUSE, Red Hat, IBM, and so that actually is really, really useful because it means that, well, number one, you don't have to worry about what happens if one company decides that it's time to cut back on their kernel development budget.
But it's also important because if a distribution wants to support a file system, they need to feel comfortable that they have people who understand it well enough that they can actually help their customers if their customers have problems with it.
Historically, I can't speak for Red Hat, but until very recently Red Hat did not have an XFS engineer on staff, and I don't believe it was a coincidence that Red Hat didn't support XFS. It's only recently that Eric Sandeen, a former XFS developer who is now helping us out with ext4, joined Red Hat, and now they're going to be including XFS support, my understanding is at least in preview form in a RHEL update, and then they'll be supporting it fully in the future. But again, it points out the fact that if you don't have developers at a distribution, it's really not surprising the distribution is going to be really hesitant about supporting something as critical as a file system. Another example of that would be JFS, IBM's JFS, which is a very good file system, and at the time when it was introduced, there were ways in which it was in fact far better than ext3.
There was only one problem, which was almost the entire development team was at IBM. Red Hat and SUSE didn't have any engineers who were really familiar with JFS, and surprise, surprise, they were really hesitant in supporting it. And that's been a really big deal. So one of the things that I've told the BTRFS
folks, and a big supporter of BTRFS, I really believe it's going to be a great file system, although people who think it's going to be ready in the short term are probably a little bit too over enthusiastic. I'll talk a little bit more about that later. One of the things I told them is you've got to recruit people from across
the Linux industry if you want to be successful, because it's the realities of development. So there are also a couple of things that are not so good about ext3. A lot of silly limitations, perhaps the most stupid one is the fact that we can only have 32,000 subdirectories.
Actually, that should be 32,000, not 32,768. We have second resolution timestamps, which is a bit of an issue given that computers are fast enough now that you can compile a file in well under a second, which is sort of an issue if you're going to be using Make, which tracks dependencies based on timestamps.
And, you know, 16 terabytes is starting to actually be a real limitation. And perhaps the biggest problem with ext3 has been its performance limitations. Now, some of that has been deliberate. Ext3, we've always taken the position that we care a whole lot more about making sure the data is safe than it being fast.
Because if it's... people get really cranky when they lose data. That's probably the simplest way I can put it. You know, it's one thing to win the benchmark wars, but first of all, many, many workloads are not even file system bound.
Right? So if you have a super fast benchmarking result, but in real life you're actually really CPU bound, then it may not really matter, and then if you lose your source tree, people get cranky. So, we've historically been very, very conservative with ext3.
But, over time, that's started to become a real limitation, and so it was time to try to add new features and make ext3 into what I would call a modern file system. Now, this brings up an interesting philosophical question, which is, is ext4 really a new file system?
It's certainly a new file system subdirectory, so if you look in the kernel sources, under fs/ext4, you will see a complete copy of the source code. This is source code that was forked in 2.6.19, but it's important to remember that the ext in ext2, ext3, ext4 stands for extended.
And, in fact, ext4, as far as the file system format is concerned, is actually a collection of new features that can be individually enabled or disabled.
So, some of them are extents, some of them are huge_file, dir_nlink, and so on and so forth. And together, if you enable all these features, then you would get the full effect of ext4. And the ext4 file system driver in the Linux kernel supports all of these new file system features.
But, you can also run a standard ext3 file system and mount it as ext4, and it will work just fine. In fact, as of 2.6.29, which is the next stable kernel release, we're currently
at 2.6.29-rc3, you will actually be able to mount an ext2 file system, which is to say a file system without a journal, on the ext4 file system driver. And this was actually code that was contributed to us from Google that allowed us
to be able to mount a file system without a journal with the ext4 code base. Ext3 can't do that simply because, way back when Stephen Tweedie was developing ext3, we had forked the code base, and in order to make life simpler he had written the code such that the journal had to be enabled.
And in fact, if the journal was not present and you tried to mount it as ext3, ext3 would just refuse the mount. Google, as it turns out, wanted the advanced features of ext4, because extents and all the rest proved to have some really nice performance benefits for them.
But Google doesn't believe in journals, because Google has the theory that if the system ever crashes, you wipe the hard drive and you recover from the other two redundant backups. And if you're going to do that, and you don't have to worry about fsck'ing the drive, because if the system ever crashes, you just wipe the disk and copy from a backup,
then you don't actually need to recover from the journal, and they got a little bit of a performance boost by running without the journal. So in the latest 2.6.29, you'll actually be able to enable all of the ext4 features but disable the journal, because Google was interested in running in that fashion.
It also turns out you could run with all the features disabled, and mount a standard ext2 file system on ext4. You would still get some of the performance benefits that don't depend on the file system format, but it's more interesting just simply from a flexibility point of view.
So why did we even bother to actually fork the code? And again, this was just simply from a matter of development stability. ext3 has a huge user base, including Linus Torvalds and Andrew Morton, and they would
get cranky if their file systems got destroyed and their source trees were wiped out. And so we had a lot more opportunity to experiment if we just simply forked the code base, and we didn't have to worry that we might accidentally trash some important people's code. At the same time, it also allowed us to do all of our development in the mainline tree, which was also a big help.
So that's what we actually did. But from a theoretical point of view, it's not really a new file system, except that we added a whole lot of code to it. We started with ext3, we added new extensions to it. On the user space side, we use the same e2fsprogs to support ext2, ext3, and now ext4. You just simply have to have a new enough version of e2fsprogs. So is it a new file system? Is it not? It really depends on your point of view. It is definitely a new file system code base. I'm not even sure you could
call it a new implementation, it's just simply a more advanced version with new features. So pays your money, takes your choices. So what's new in ext4? There are a huge number of features that are new.
Probably the biggest one is extents, and then the changes to the block allocator, and I'll talk more about those in just a moment. Some of the other features that we have added are simply to address problems that I've already alluded to before. For example, we removed that really stupid 32,000 subdirectory limitation.
NFS v4 has a requirement for a 64-bit unique version ID that gets bumped whenever a file is changed in any way. And they need that specifically so they can do reliable caching.
I'm not an NFS v4 expert. I think they have some kludge that makes the caching less efficient if you don't have that feature. But the NFS v4 people really wanted it, so while we were in there we added the NFS v4 version ID.
We also, and this is in the category of stupid changes that were really easy to do once you're actually going to open up the code, now store the size of the file in units of the file system block size, as opposed to the POSIX-mandated 512-byte sector size, which is a mistake perpetrated by System V Unix. And that basically gave us a very painless way of expanding the maximum size of a file from 2 terabytes to 16 terabytes if you're using a 4K block size file system. And if you're on an Itanium system or a Power system and you use an even bigger block size, such as a 16K or 32K block size, you get another couple of powers of 2 out of that. And that was just something that we could do that was actually very, very easy.
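As a quick sanity check on those numbers, here is the arithmetic being described, as a small illustrative sketch assuming the 32-bit block-count field mentioned above (this is not code from the kernel or e2fsprogs):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* A 32-bit count of 512-byte units versus 4096-byte (4K) units. */
    uint64_t units = 1ULL << 32;

    printf("512-byte units : %llu TiB\n",
           (unsigned long long)((units * 512) >> 40));   /* 2 TiB  */
    printf("4K-block units : %llu TiB\n",
           (unsigned long long)((units * 4096) >> 40));  /* 16 TiB */
    return 0;
}
```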
We'll talk a little bit later about why we didn't actually change that to expand it even further. It's something we could do, it's just something we didn't do this time around. We added ATA TRIM support. This is support that will be showing up in some of the new big storage subsystems that do something called thin provisioning,
where you've got a large number of block devices that might not be completely full. And the file system, when you delete a file, can tell the block device, we're not using these blocks anymore so you can use them for something else.
It's also useful for solid state disks for the same reason: they can do a better job of wear-leveling if the file system can inform the solid state disk that these blocks are no longer in use. The support is in the file system. The low-level code to actually send the TRIM commands to the devices has not actually hit mainline yet. And my understanding is that it's because the people who are in charge of writing that bit of code don't have hardware that actually implements the feature yet, or something like that.
As far as I'm concerned, I have all the file system code hooked up, it's just not talking to any real devices yet. But it's something that I've been told is going to be a really big deal for solid state disks and for thin provisioning. So we have that in there already. Another thing that we've added is checksums in certain bits of the metadata, specifically the journal and the block group descriptors.
And that's allowed us to reliably put in the block group descriptors what part of the inode table is actually in use and what part of the inode table is not. And that's allowed us to speed up FSCK.
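For reference, a rough sketch of the per-block-group fields being described; the field names are from my reading of the ext4 layout, so treat this as illustrative rather than authoritative (on disk the fields are little-endian):

```c
#include <stdint.h>

/* Rough sketch of the relevant parts of an ext4 block group descriptor.
 * The real definition lives in the kernel's fs/ext4/ext4.h. */
struct bg_desc_sketch {
    uint16_t bg_flags;          /* e.g. "inode table not initialized" flag */
    uint16_t bg_itable_unused;  /* inodes at the end of the table never used,
                                   so fsck can skip scanning them */
    uint16_t bg_checksum;       /* checksum over the group descriptor */
};
```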
So there have been a lot of little tiny improvements that we sort of made while the patient was opened up for surgery, as it were. But probably the biggest one, as far as the ext4 is concerned, is extents and then related to extents, the block allocator. So let's dive into that.
How many of you are familiar with the indirect block map system that ext2 and ext3 use? How many people are familiar? Okay, some people are, some people might not be. So, quick review. Inside the inode for ext2 and ext3, there is room for 15 block pointers, 15 32-bit block numbers.
The first 12, 0 through 11 in the i_data array, are used to map direct blocks. And so if your file is less than or equal to 12 blocks long, and you're using a 4K block size file system, that's 48K.
The location of all of those blocks can be stored in the inode, and you don't have to do anything else. It's all there, and it's just mapped. So in this case here, in this example, the inode, the first 12 blocks are located at block numbers 200 through 211.
Now, if the file is any bigger than that, there's no more room for direct blocks inside the inode. So we allocate, in the second column to the right on the top, that gray box, an indirect block.
And we put a pointer, which is slot number 12 in i_data, that points to the indirect block. And there we have room, again if we're using a 4K block file system, for 1024 block numbers that can go in an indirect block,
and that will give you a range of blocks. Now, if it turns out that that's not enough room, we can insert a double indirect block, and that's the light blue. And so we have a pointer in slot 13 to a double indirect block, and the double indirect block points to up to 1024 indirect blocks, each of which contains another 1024 block pointers for the file. And finally, if that's not enough, we have room for a triple indirect block, and a triple indirect block has 1024 slots. Each one points at a double indirect block with 1024 slots, and each of those slots points at an indirect block, which then points to the file's data blocks. And 1024 times 1024 times 1024 times a 4K block size is a really big number. And so that's how indirect blocks work.
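To put rough numbers on that scheme, here is a back-of-the-envelope sketch assuming a 4K block size and 4-byte block pointers (not kernel code):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t blk  = 4096;           /* 4K file system block           */
    const uint64_t ptrs = blk / 4;        /* 1024 4-byte pointers per block  */

    uint64_t blocks = 12                  /* direct pointers in the inode    */
                    + ptrs                /* single indirect                 */
                    + ptrs * ptrs         /* double indirect                 */
                    + ptrs * ptrs * ptrs; /* triple indirect                 */

    /* Roughly 4 TiB of addressable file data with 4K blocks; ext3's actual
     * limit is lower (2 TiB) because of the block-count field mentioned
     * earlier. */
    printf("~%llu GiB addressable\n",
           (unsigned long long)((blocks * blk) >> 30));
    return 0;
}
```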
Now, it turns out this system is incredibly inefficient for really large files. If you're using ext3 and you ever have to delete a huge ISO image, and it really doesn't matter whether it's a CD or a DVD ISO image,
a DVD ISO image will take even longer, it will take a long time to delete. And the reason that it takes a long time to delete is that it has to read all of those indirect blocks. And then free the block pointers in the indirect blocks, and the double indirect block, and the triple indirect blocks. And that takes a very, very long time.
And it's especially inefficient when you consider that the file system is actually going to fairly great lengths to keep files contiguous. So most of the time, what you actually see in these indirect blocks are increasing sequences of block numbers.
200, 201, 202, 203, 204, 205. And that's not a very efficient way of storing that type of information. A much more efficient way is to use something called an extent. And an extent is just simply a way of saying, we're going to start at logical block 0,
and logical block 0 is going to be located at physical block 200, and that's going to continue for 1,000 blocks. So if we have 1,000 blocks free on disk, and we can allocate it continuously to the file, then I only need a very small amount of room, room for three integers, to encode what previously would have taken 4,000 bytes to store.
1,000 entries, one for each block, and instead, we just simply say that starting at block 0 and going on for 1,000 blocks, we're going to use the range starting at block 200.
So this is what the on disk extents format looks like. This code, the extents work, was actually contributed by ClusterFS, Andreas Dilger and company, and they actually used ext3 as the backend storage for their cluster file system, which they called Lustre.
And they needed to get better performance out of it, and so they actually enhanced their version of ext3 to have extents, and then they contributed that code back to us. Many, many thanks to Lustre.
And so, we sort of stuck with this particular format, and what this format effectively gives us is 32 bits of logical block numbers and 48 bits of physical block numbers.
And a lot of people ask us, why didn't you go to 64? And the answer was, well, this is what Lustre was using. We wanted to stay compatible with Lustre because, after all, they contributed a lot of really good code to us. There was thinking that, at some point, we would add support for an alternate data structure
that would, in fact, give us 64-bit logical, 64-bit physical block numbers. And basically, instead of a 12-byte structure, it would probably take a bit more than that. It would probably be a 16-byte structure. And we have a version field in the extent header where we could actually indicate
this is a new version of the extent header. And we may very well do that at some point in the future. It turns out that 48 bits is a very large number. You're already up to an exabyte with 48 bits of physical block numbering. And we wanted to get ext4 out there sooner rather than later, so we've sort of stuck with the very simple format.
Maybe at some point in the future, we'll expand it. But for many, many users, an exabyte is more than enough space for what they would actually want to use it for. So this is what the extent map actually looks like.
In the i_data field, we store a small header structure which indicates how deep the tree is and what version of the extent structure we have, and then pointers to the locations on disk.
Now, we can store up to three of those extent structures in the inode body directly. And as it turns out, and I'll show you some numbers in a moment, the vast majority of your files on a fairly standard file system will, in fact, fit in under three extents.
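For reference, this is roughly what those on-disk structures look like. The field names below are from my reading of the kernel's fs/ext4/ext4_extents.h, so treat the sketch as illustrative; on disk the fields are little-endian, and plain fixed-width types are used here to keep the example self-contained.

```c
#include <stdint.h>

/* Header at the start of i_data (and at the start of every extent tree block). */
struct ext4_extent_header_sketch {
    uint16_t eh_magic;       /* magic number identifying an extent node */
    uint16_t eh_entries;     /* number of valid entries that follow     */
    uint16_t eh_max;         /* capacity of this node                   */
    uint16_t eh_depth;       /* 0 = the entries are leaf extents        */
    uint32_t eh_generation;
};

/* A leaf entry: 12 bytes mapping a logical range to a physical range. */
struct ext4_extent_sketch {
    uint32_t ee_block;       /* first logical block covered             */
    uint16_t ee_len;         /* number of blocks covered                */
    uint16_t ee_start_hi;    /* high 16 bits of the physical block      */
    uint32_t ee_start_lo;    /* low 32 bits of the physical block       */
};

/* 32-bit logical plus (16 + 32) = 48-bit physical block numbers, as described. */
static inline uint64_t extent_physical_start(const struct ext4_extent_sketch *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}
```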
In fact, on the file system that got stolen on Friday, I'd been running ext4 since July. And I think I had done a full backup and restore once or twice during that period. But it had been my primary file system. I'd been using it for quite a while, 128 gigs.
It was my primary laptop. It had Ubuntu on it, Ubuntu Hardy on it. And I had all of one file. I'm sorry, that's not right.
I had several hundred thousand files on it. And I had maybe 700 or 800 files that spilled over to a single extent block. As in a B-tree that had a single leaf block, and that leaf block had room for 129 extents.
And then I had a single file that had an extent tree that was too deep, that had one index node and then a number of leaf nodes. But the vast majority of the files, 99% of the files, all of the extents, lived inside the inode
because they were under three extents. And in fact, something like 95% of them were encoded in a single extent. And that was because of changes in the block allocator.
So for the more complicated files, and as I mentioned, on my file system where I had been torturing the file system fairly badly, I was using it for normal use, I had exactly one file that looked like this.
Where I had the inode table, an index node, that index node could store up to 129 leaf nodes, and each leaf node could store up to 129 contiguous extents. And that one file was in fact a sparse ext3 extent image that I had been using for testing.
The short version is, I had to deliberately create files that were deep enough that I could actually exercise the extent tree code. Because in theory, this tree can grow to two, three, four levels deep. But in fact, you have to really torture the file system to get it to generate that.
Most of the time, you either have a single leaf node pointed to from the body of the inode, and the vast majority of the time, you just simply have one, two, maybe three extents in the inode itself, because the file is basically that contiguous on disk.
So we can handle sparse files, and we can handle files where the file system gets very badly fragmented. In practice, I've been noticing that ext4's anti-fragmentation algorithms are good enough that, at least for my general workload, it's fairly rare.
Now, I'm sure someone out here will have a workload that will prove me wrong. And that's okay, we have an online defragmenter that we're hoping to get done, but that's not in mainline yet. So, part of the reason why the code is so fragmentation resistant
is because of changes to the block allocator. And this is also code that was contributed by ClusterFS and Andreas Dilger. And the reason why they needed it is because extents work best if the files are contiguous. In fact, if your file system is really badly fragmented, so that you have a free block here,
and a free block there, and a free block here, it takes 12 bytes to encode an extent. So if you have lots of singleton free blocks on the file system, extents can actually be a less efficient way of encoding the file map data than just simply using a normal indirect block.
Now, in practice that doesn't happen, but one of the reasons why is because of the multi-block allocator. And the multi-block allocator, which again came from Lustre, gave us two things. Number one, it gave us delayed allocation, which means we don't actually allocate blocks for files until the very last minute, when either the application has explicitly requested the data to be flushed out to disk with an fsync call, or the page cleaner has decided that it's time to actually push blocks out to disk. And then the other part is the multi-block allocator itself, which, when it allocates blocks,
it allocates blocks based on how much data it needs to write. The previous block allocator allocated a single block at a time, and about the only thing it knew was the previous block was located at block N, and so the first block you would actually try to find would be block N plus one.
And if that wasn't there we would try block N plus two, and so it was actually fairly stupid. The multi-block allocator will know that we're going to be allocating a dozen blocks, or 200 blocks,
and it will actually search for enough free space for the requested amount of space that we actually need to allocate. And that's one of the reasons why most of the files on disk actually turn out to be contiguous on disk. And this is, by the way, responsible for most of ext4's performance improvements.
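A purely conceptual sketch of that difference, against a toy free-block bitmap; this is not ext4's mballoc code, just an illustration of one-block-at-a-time allocation versus asking for a whole run up front:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Old-style allocator: grab one block at a time, starting at the "goal"
 * (usually the block right after the previous one); first free block wins. */
long alloc_one_block(const bool *free_map, size_t nblocks, size_t goal)
{
    for (size_t i = goal; i < nblocks; i++)
        if (free_map[i])
            return (long)i;
    return -1;
}

/* Multi-block style: delayed allocation tells us the write needs `want`
 * blocks up front, so look for a contiguous free run that big and hand
 * back the whole range at once. */
long alloc_run(const bool *free_map, size_t nblocks, size_t want)
{
    size_t run = 0;
    for (size_t i = 0; i < nblocks; i++) {
        run = free_map[i] ? run + 1 : 0;
        if (run == want)
            return (long)(i - want + 1);   /* start of the run */
    }
    return -1;                             /* no run long enough */
}

int main(void)
{
    /* Toy free-block bitmap: a few scattered free blocks, then a long run. */
    bool free_map[16] = { 0,1,0,1,0,0,0,0, 1,1,1,1,1,1,0,0 };

    printf("one-at-a-time picks block %ld\n", alloc_one_block(free_map, 16, 0));
    printf("6-block write gets a run starting at %ld\n", alloc_run(free_map, 16, 6));
    return 0;
}
```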
Now there is one little tiny gotcha with the delayed allocation code, and that is that what many application writers had been used to was ext3's ordered mode semantics. And ext3's ordered mode semantics effectively said, before we do a journal commit,
we will make sure that any blocks that have been allocated on disk will in fact be written to disk before we slam the inode onto disk and do a commit, and this guarantees that you never get stale data.
Because stale data could be a security problem, because it's previously written data that possibly belonged to another user, and you might be exposing that if the system crashes, if the inode has been written to disk, but the data had not been written out to disk.
And so the default mode that most people used for ext3 was to force a journal commit every five seconds and use ordered mode semantics. Now in practice, what this meant was if you wrote a file and closed it,
within five seconds it was guaranteed to be on disk. And a lot of people just sort of assumed that that was normal. File systems that do delayed allocation will not actually do this. They will not actually force the data out to disk, because we haven't even allocated the data on disk.
And so what ext4 would do under these conditions is you might write a dot file, and some of these application programmers would rewrite a dot file without leaving a backup. So they would truncate, rewrite the data, and close it. And if the file had not actually been allocated, then there would be no data to actually push out to disk.
And if the data had not been pushed out to disk, we now have to wait for the page cleaner to decide that it's time to write dirty data back to disk. And the default for that is 30 seconds.
And the page cleaner doesn't actually write out all the dirty data at once, it will actually stage it out. So that data will start getting written out to disk after 30 seconds have gone by. And if you've dirtied a lot of data blocks, it might take another 30 seconds before everything has been written out, because it doesn't want to overload the system, so it actually does it in little tiny chunks spaced out over 5 second intervals.
If you turn on laptop mode to save your battery, that 30 seconds can get expanded to like 2 minutes. And now what you end up happening is it can be a good 2 to 5 minutes before data that had been written to disk
is actually, sorry, data that had actually been written by an application might take 2 to 5 minutes before it actually is written on to disk. So if your system crashes, you can actually lose more data. Now this is in fact all legit, right?
If you look at the POSIX specification, it essentially says unless you call fsync, all bets are off. And it was just simply that many application programmers were used to the behavior of ext3 and not bothering to call fsync.
Now, I say this worried a little bit that everyone will now use fsync a lot. And the reason is because in recent years, I have noticed a disturbing tendency by application writers,
some of you may be in this room, to generate hundreds and hundreds of dot files. Like if I look under .gnome and .kde, I see hundreds and hundreds of individual dot files that each contain huge amounts of data. Actually, they don't contain huge amounts of data, that's the problem. They each contain like 3 or 4 bytes of data, but there are hundreds and hundreds of files.
And if you call fsync on every single one, you will really pound your system with a hammer pretty badly, because it's going to force a lot of data to disk. And you'll be forcing a commit for every single one of these fsyncs. Probably the right answer is to use fdatasync, that's not going to be quite so painful,
and will actually give you most of the semantics. So if you guys are going to stick with using lots of these little individual dot files, each one containing a few bytes, yeah, you probably want to use fdatasync. Or you might want to consider using SQLite or some kind of proper database, because it's very clear what's going on here: people decided that the Windows registry was evil, so we'll be the anti-Windows, and instead you now have hundreds and hundreds of these little tiny files, which isn't such a bright idea either.
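For the dot-file case being described, the pattern that avoids both the zero-length-file problem and the cost of a full fsync is the classic write-to-a-temp-file, flush, then rename dance. A minimal sketch follows; the helper name and the error handling are mine, not from the talk:

```c
/* Rewrite a small config ("dot") file so that a crash leaves either the
 * old or the new contents, never a zero-length file. POSIX only; error
 * handling kept minimal for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int rewrite_dotfile(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }

    /* fdatasync() forces the data (and its block allocations) out to disk,
     * but skips the pure-metadata update that fsync() would also flush,
     * so it is somewhat cheaper. */
    if (fdatasync(fd) != 0) { close(fd); return -1; }
    close(fd);

    /* rename() is atomic, so readers see either the old or the new file. */
    return rename(tmp, path);
}
```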
But be that as it may, I know I cannot influence what application programmers choose to do. All us file system authors can really do is sort of adapt to what the application developer community actually throws at us. So we may end up actually trying to store data directly in the i_data array, and the reason why we didn't do this many years ago was that it used to be that nobody was insane enough to use lots and lots of little tiny files. I mean, I actually at one point took a look at it and saw there were actually very few of them, so we never bothered to store data in the i_data array. But these days it looks like there are a lot of app writers that have lots of files that are under 60 bytes,
and so maybe we have to revisit that decision. All that being said, delayed allocations just sort of expose that, because I've gotten one or two bug reports of the form, you know,
I was using ext4, the crappy NVIDIA driver crashed my system, and I had several hundred zero-length files in .gnome or .kde. I think I got one of each, so this is not a GNOME versus KDE thing. Both desktops seem to be doing that, and it's like, you know, my first thing was,
oh my god, how come they have so many of these little tiny files, and why are they rewriting them all the time, because that's got to be a performance hit right there. And I'm not sure I want to know the answer, but if someone from the GNOME and KDE environment, you know, communities want to tell me why, you're apparently constantly rewriting hundreds of these files in, you know, the user's home directory,
you know, I'll get myself a good stiff drink and then you can tell me. But, you know, it's one of the things we're looking at, one of the things that we may end up doing is have some heuristic where, if the file is small and we notice that it was in a truncate or removed,
we'll actually immediately map the files on close, which is less heavyweight than actually calling fsync. Eric Sandeen tells me XFS had to do something very similar. XFS apparently has this kludge where, if a file has ever been truncated,
it implies an fsync as soon as you try to close it, and that's because there are so many application writers that got kind of lazy about assuming that they could just simply do that, and then XFS's delayed allocation hit them. So, apparently this is not a new problem. So, that's one.
Another interesting feature that we have is something called persistent pre-allocation. This allows blocks to be assigned to files without having to initialize them first. The original use of this was for databases and streaming video files, where if you know that you're going to eventually fill a gigabyte on disk
because you're going to be recording an hour of video, and an hour of video compressed is about a gigabyte, you can tell the system, please pre-allocate a gigabyte on disk, and then the file system can allocate that space contiguously because you know exactly how big it is.
This can also be useful for packaging systems like rpm and dpkg. If you know how big the file is, the file system will be able to do a better job if you tell it, please pre-allocate me the space, because then it can pre-allocate exactly how much space it needs, and you can reduce fragmentation by a little bit if you can actually do that.
Another interesting use of this is for files that are grown via append. So, if a log file is constantly being appended to, a Unix mail spool file is constantly being appended to,
and if you know that that's happening, one of the things you can do is just simply pre-allocate space. If you know roughly how big the log file is going to get, you can pre-allocate the space, and then the log file will be contiguous on disk because, you know, you've pre-allocated it. Now, you can access this via the glibc posix_fallocate call, but the problem with posix_fallocate is twofold. Number one, if you happen to be on a file system that doesn't support preallocation, it will do it the old-fashioned way and just simply write blocks of zeros, which is very, very slow. And so there are some cases where, if fallocate doesn't exist, you would rather the call do nothing, and glibc's posix_fallocate doesn't do that. The other thing about posix_fallocate is that it always changes the i_size field, and so if you look at the file using ls -l, it will actually show that the file is a gigabyte after you've pre-allocated a gigabyte on disk. If you use the raw Linux system call, you can get a hard failure if the file system doesn't actually support fallocate. More importantly, you can let i_size remain at the original size. And now what you've done is you've pre-allocated the space on disk, but i_size still shows that the file is zero length or whatever the original size was, and now you can do tail -f. tail -f will do the right thing, and then as you append to the log file, the file will grow into the pre-allocated space, and i_size will grow along with it. And that can be a very nice feature, and what that basically means is we've been pounding on the glibc folks to actually expose the raw Linux system call, because it does a lot more than posix_fallocate.
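A small sketch of the distinction being drawn here, assuming a reasonably modern glibc that exposes the Linux fallocate() wrapper (at the time of this talk you would have had to go through syscall() directly); the file name and size are just placeholders:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("mail.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* posix_fallocate(fd, 0, 64 << 20) would also reserve the space, but it
     * bumps the file size and, on file systems without real preallocation,
     * silently falls back to writing 64 MiB of zeroes. */

    /* The raw Linux call with FALLOC_FL_KEEP_SIZE reserves the space but
     * leaves i_size alone, so ls -l and tail -f still see the real length,
     * and it fails hard (EOPNOTSUPP) instead of degrading. */
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 64 << 20) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}
```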
So let me talk a little bit about performance charts. There's an old line about, you know, lies, darn lies, and benchmarks. And so the first thing I'll tell people before they believe benchmarks is to ask, you know, are the benchmarks fair, are they repeatable, and do they fairly represent the workload that you're actually using, because a lot of times people will look at benchmarks
and say, this is the file system I want to use, look how great it is. And if you don't ask yourself whether or not that file system is even applicable to the kind of work that you do, remember what I said earlier, many workloads are not even disk-bound or file system-bound, kind of pointless.
One really good effort, you can find it at btrfs.boxacle.net. It's done by a guy named Steven Pratt, who is a member of IBM's performance team. And if people want an example of how to actually do good benchmarking, take a look at his site. He documents the hardware and software configurations
that are used, and he tests multiple configurations. And this is why this is important. So this is large file creates using a RAID file system. Red is ext3, green is ext4 dev. This is back in October. He has newer results, but I didn't have time
to update these particular charts. Blue is XFS, red or hot pink is JFS, and then the last three are different versions of BTRFS. And this is a very early version of BTRFS. And you can see with this one that ext3 in red
is kind of low, ext4 is a whole lot better, almost as good as XFS, but not quite. JFS is a little bit lower. And this one says, oh, okay, that's pretty good. We're almost as good as XFS. This is with 16 threads.
And with 16 threads, now you see that ext4 is still a whole lot better than ext3, but it's nowhere near where XFS is with 16 threads, and BTRFS is down there. Here's with 128 threads. 128 threads, now XFS is way down there.
Ext4 is way up there. So if I'm going to sell ext4 as the best file system ever, which chart do you think I'm going to use? Right? Right. And this is just simply what the large file creates. If we do large file random reads, this is with one thread.
That's 16 threads. There's 128 threads. So large file random reads, you can see that in some cases ext3 is actually better than ext4. I don't know why. I suspect it has to do with changes in the layout algorithms that we can still fix. So there's still some tuning work we may need to do.
Large file random writes, you can see, were way better than ext3. 16 threads, 128 threads. Here's sequential reads. And the main thing I want to get across here is these bars are fluctuating wildly, right? This is why benchmarks can be highly misleading.
If someone only shows you one chart, they're trying to sell you something. For some reason, on the mail server simulation workload, which is a mixed read-write workload that tries to simulate a mail server, ext4 does really well.
I can't tell you why, but it just happens to be really well. Except on 128 threads where the machine apparently crashed. And I can't tell you why either. This was also last October. This is now with a single disk, right? And one of the interesting things with single disk is BTRFS is now way better
than a number of the file systems as we go through the various benchmarks. And you can see here that on some of these benchmarks, BTRFS is actually doing very, very well on a single disk. Not doing so well on RAID. Again, this is last October. BTRFS's file format has not been finalized yet.
Certainly wasn't finalized as of October, and they were still tuning it. So again, these results are a little bit unfair. You can see here. And then here's the mail server simulation where ext4 apparently walks all over the competition.
But again, workloads matter, right? So I'm not gonna tell you that ext4 is better than all other file systems. On some workloads, it does pretty well. We still need to do some tuning work. But it's always useful to know that. Okay, this is actually something kinda interesting because we didn't actually plan for it.
But it turned out that a lot of the improvements that we did to improve general read-write performance also made a huge difference for fsck. And I think, looking at it, a lot of it has to do with the fact that we're doing far fewer indirect block reads compared to extent reads. And the uninitialized block groups mean that you don't have to scan the entire inode table if the inode table blocks aren't in use. So these are results from back in September, from the 128 gig file system on the laptop that got stolen.
And they were actually identical copies. I'd been using ext4 in production use for about two or three months at that point. And I just simply made a copy of everything on my file system onto a fresh ext3 file system. So ext3 actually had a benefit over ext4
because it was a fresh copy. It was totally defragged. Whereas ext4, I'd been using it for two or three months. And you can see that pass one of fsck on ext4 was 17 seconds. On ext3, it was 382 seconds. And take a look at the number of megabytes read. We went from over 2300 megabytes read down to 233.
And that's where a lot of speed up comes from. We're just simply needing to read fewer blocks on disk and we're having to do a lot less seeking. And we saved most of the time on pass one and pass two. Again, there's not that much difference
in the directory reads, and in fact, on this one here, ext3 took less time to read the directories, because the directories were contiguous since we'd done a fresh copy. But there were fewer reads for ext4 because there were no indirect blocks. So you can see there, the net is
you go from 424 seconds down to 63 seconds. The general rule of thumb that I've found is that a freshly formatted ext4 file system is somewhere between six and eight times faster to fsck.
So take your ext3 fsck time, divide it by seven, and that's roughly what it will be under ext4. So if you want to use ext4, you need e2fsprogs 1.41. I really recommend that you go to e2fsprogs 1.41.4 because we fixed a whole bunch of ext4 related bugs.
You need at least a 2.6.27 kernel or newer. I strongly recommend 2.6.28 and the ext4 stable branch. That stuff will hit the stable kernels fairly soon, it just hasn't yet. That was one of the things I was gonna work on before my laptop got stolen, but that's okay. And there is a 2.6.27 ext4 stable kernel. Again, both of these will be sent off to the stable kernel maintainers soon. And of course, you'll need a file system to mount. You can just simply use a completely unconverted ext3 file system, and the delayed allocation will help you.
So you will get somewhat better performance just simply taking a completely unconverted file system from ext3. Or you can enable features such as extents, the huge_file feature, dir_nlink, dir_isize, or sorry, that should be dir_index, actually, on a particular file system. If you enable uninit_bg or dir_index, you will have to force an fsck after you actually enable those feature flags. That will get you some of the performance of ext4, but you will only use extents for the newly created files.
The old files on the file system will still use the old indirect blocks. Or you can create a completely fresh ext4 file system and then do a dump restore, and you'll get the best performance from that. But it's up to you how you want to do things. If you just simply want to play around with ext4, you can just simply leave your file system unconverted.
One warning is that, at the moment, once you start converting to ext4, we don't have a good way of going back in time and unconverting. So if you want to get involved, there's an ext4 mailing list. There's the latest ext4 patch series.
I have a git tree, and I also have a patch directory. At this point, the git tree is probably the most up-to-date. We do have an ext4 wiki, which is at ext4.wiki.kernel.org. It still needs a lot of work. If someone would like to jump in, I would love some help. At the moment, it's actually a little embarrassing.
kernelnewbies.org's ext4 article is actually better than what we have on the wiki. So if somebody wants to help me improve the ext4 wiki, I'd really appreciate it. We do have a weekly conference call. If there's someone who's really interested in diving in deep, contact me about that. And we have an IRC channel.
And this is the ext4 development team. And I'm probably missing a couple of people, but these are people who've been working on it for the last couple of years, and they do a lot of hard work. I'm the guy who basically does QA and all the integration work, and then a lot of the user space utilities. So with that, I know I ran a bit over time,
so I don't know, maybe I have time for one or two questions, and then I'll be happy to stick around and answer some more questions. Yeah, in the middle there. Thank you. I was interested to know if there are any good solutions
for the syncing problem that you mentioned. I'm involved in laptop mode things, and the default is actually not two minutes, but ten minutes. We spend a lot of time getting applications to drop all their fsync calls,
just because any one of those will spin up your disk, and there's no way to get rid of them. I think the short version is fdatasync seems to be a good compromise for now. We are looking into ways of solving the fsync problem, but it's in really tricky code.
So we know about it. It's one of those things we'd love to fix, and it's on our hit list. So yeah, that's one of those little embarrassing bits that we really want to try to fix. Thank you. Any other? Maybe I'll take one more. Yeah.
In sort of a related question, for databases and also presumably for a lot of those other applications doing fsync, they don't necessarily need the sync to happen immediately.
They just have to know when it happens; until it has, they need to be able to tell that it hasn't happened yet, so they can avoid sending a commit confirmation or telling some other application. So is there any sort of non-blocking fsync? No? There isn't today. Why don't we talk?
I'd love to hear from database people on that one, but we should probably take that one offline. And I know I'm really running over, so maybe I'll be happy to stand in the hallway and take questions for people who are interested, but I don't want to get people late for their next talk. So thank you very much for your attention.