
The Evolution of Storage on Linux


Formal Metadata

Title
The Evolution of Storage on Linux
Part Number
5
Number of Parts
79
Author
Lenz Grimmer
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Linux and Open Source Software have always played a crucial role in data centers to provide storage in various ways. In this talk, Lenz will give an overview of how storage on Linux has evolved over the years, from local file systems to scalable file systems, logical volume managers and cluster file systems to today's modern file systems and distributed, parallel and fault-tolerant file systems.
Transcript: English (auto-generated)
I guess I can get started then. I'm going to give this talk in English since it's also being recorded and broadcast, and we might have visitors who are not capable of speaking German. Yeah, welcome.
FrOSCon 10, it's been a wild ride. I'm very happy to give the first talk in the first slot on the first day. My name is Lenz Grimmer. I've been with FrOSCon since almost the very beginnings. I think I missed two of the conferences in total. I'm always happy to be here, and I'm
glad to talk about a more historic topic this time. Maybe a little bit of background about myself. I currently work for a company named it-novum. They are located in Fulda, and they
do consulting and projects based on open source technologies, also combined with proprietary technology. So they claim they are doing business open source, and I'm in charge of a product they develop, which is called openATTIC. I'm going to talk about it a little bit later
if there's some time left. Evolution of storage on Linux. I thought that, well, 10 years of FrOSCon, it's time to go back in history and reminisce about things that have happened. So what I'd like to do is to give you a trip down memory lane.
That is, pun intended. The topic that I'm going to talk about is storage on Linux in particular. To give you an overview how the whole concept of storage on Linux has evolved over the years, I've been doing quite a lot of research
about this in the past few weeks. And when it comes to managing and providing storage for other systems, using Linux systems, your options are seemingly endless. If you just look at how many file systems are supported. Linux today is pretty much ubiquitous,
and trying to cover all the aspects in an hour is just, yeah, I would be hopelessly lost. So I tried to pick out a few themes, so to say. I'm starting with the history, local file systems, how this whole thing evolved,
then talking about more of the next generation of file systems that came afterwards. The most common services used to turn a Linux system into a file server, and then also try to give an outlook at today's landscape of file systems and storage, and where we're heading. Granted, it's a very broad topic.
I'm going to talk about lots of different technologies, and I will not be able to cover them all in much detail. And I've already been throwing out quite a lot of stuff. While I was doing my research, I found so many small technologies and projects that I found interesting, and I started making notes about them, and I ended up with hundreds of slides basically,
and had to just get a step back and think, okay, this is not going to work. So yeah, maybe at some point this should be turned into a workshop or something. But so far, it's more of a history and overview. Give you a broad picture.
Fortunately, most of these topics are very well covered by other talks and presentations. Also in this conference, and where appropriate, I'm going to refer to a talk that gives some more in-depth information about the technology. Yeah, so starting with local, and then going out into the wild world of distributed and cluster file systems.
Yeah, if you just look at local file systems, today the Linux kernel has 40-plus file systems that are just capable of storing data on a local disk in various forms. Just looking at what file systems are available for flash devices, for example, would easily fill a talk of its own.
So I try to pick up or pick out the most prominent ones, popular ones, the ones that are most widely used to give you an impression there. And I also try to focus on stuff above the device drivers. The whole topic of storage, of course,
also has a hardware component to it. A file system needs some means of technology to store data persistently somewhere. So we could also create a complete set of presentations about the block storage layer, how data is being stored, what device drivers, what technologies are available there. That's
out of scope, even though they are highly interesting. So we keep it on the software side, on the upper layers of the Linux kernel. And also, especially with the current technologies, yes, they run on Linux, but most of the time
they don't even need a kernel component anymore per se. Also, I wanted to do a few conclusions up front, things that I've kind of realized and learned while I'm assembling the information for this talk.
And the interesting one about this, especially when you look at file systems, you end up on SourceForge a lot of times, still nowadays. And this is something I found quite interesting. The peak of SourceForge was around 2000, I would say, 2002 maybe.
And many of the projects that are still part of the Linux kernel initially started or evolved being hosted on SourceForge. And from a historic viewpoint, if you look at it nowadays, and you go back and try to figure out how this whole thing evolved and what kind of history is involved,
SourceForge basically is a very important resource. Even though nowadays everybody says SourceForge is dead, we're not going to use it, we all move to GitHub. But in my opinion, it plays kind of a crucial role in this whole history of Linux. And it would be sad to see this resource go away because you would lose quite a big part
of history of Linux with it. So this is a risk that you will always have. And who will guarantee you that GitHub is going to be around in 10 or 20 years? So the good thing about open source and how it's being developed nowadays,
it's very easy to move and copy stuff around and host it somewhere else. But mailing list archives, backtracking history, all those things tend to get lost along the way. Which makes doing research for historical purposes quite hard. Also, an interesting observation,
which is more of an issue in previous times, especially when it comes to things like storage and file systems, new device drivers. The Linux distributions in these days played a pretty important role because they employed the developers.
They were driving the development and advancements of these technologies forward. And many times they didn't wait for Linus to actually incorporate the code in his kernel. So the distribution kernels back then, sometimes the amount of patches that were applied against the vanilla kernel were bigger than the actual kernel source code. And it was quite a maintenance burden.
I worked at SUSE from 98 to 2002. And Hubert Mantel, who was the main kernel maintainer for SUSE, he was yelling and screaming because you had many vendors that wanted to get their code into Linux, like for example IBM with their volume manager
or file system, and they were all nagging SUSE to incorporate their code into their kernel even before it was actually in mainline. And while SUSE was not the only company doing that, Red Hat was also patching the kernel quite heavily. So they ended up with a very unique
unicorn of a kernel that only they were shipping in this kind of form, which made it hard for the upstream developers to accept bug reports because, well, it was a distribution kernel. They had no easy way to reproduce the problems that occurred there. And also it put a lot of burden on the distributor because they had to maintain and support
these technologies by themselves. So they had to employ the developers, the support team. If problems were in the code, they first had to figure out, okay, is this a problem based or caused by our patch sets and modifications, or is it a genuine mainline Linux problem? So the kernel engineers at distributions
also weren't quite happy about that because they were basically swamped with problems and issues where they first had to figure out where they actually originated from. Oh, and by the way, how do I start this? I would be very interested if you have additional
insights or anecdotes or anything to add or to correct about the things that I'm talking about right away. I usually prefer doing a dialogue instead of just giving a monologue up front here and you just nod or fall asleep. Or you internally disagree because you know that I'm talking rubbish, but if there's something
that you know about the subject and I don't know much about most of these topics, so feel free to add your comments. And if you have additional insights to share, please do. I think this would be much more interesting for the talk itself. Okay, also an interesting observation is this guy here, Christoph Hellwig.
How do I put this? He left his mark basically on all levels of the storage stack. I've been trawling mailing lists. I've been looking at patches and release announcements. This name pops up quite frequently when it comes to file systems and the storage layer.
So Christoph, in my opinion, he deserves a Wikipedia entry just for the work that he's been doing. But assembling all the contributions that he made will be quite challenging since he has made such a broad contribution to the entire storage layer.
The good thing about it is he knows much more about these topics than I do and he's actually around, so don't miss his talk tomorrow at 2 p.m. He will be giving an overview about the inner workings of the storage layer and show you how these pieces fit together and how they work together,
things that I'm just going to gloss over very broadly. Also, I would like to thank Linux Weekly News for their continued work. I think Linux Weekly News is still, even though the website layout, well, it still looks the same as when they started basically, but the content they produce is also very invaluable.
And lots of the articles, I have been reading a lot of articles that were covering topics that I've been doing my research on. Kernelnewbies.org also does a very important job in keeping track of what's happening. And Thorsten Leemhuis, I think he's here this weekend
as well, from Heise. He does a great job of following the mailing lists and summarizing what's going on in kernel land. So a big thank you to him as well. And of course, well, Wikipedia also is a good resource when doing such a project or a presentation. I found a few things that I modified,
so I've been also updating some of the Wikipedia entries along the way. If I had more time, I had more things to contribute, but getting the talk done in time at some point can turn more important. Okay, history. The very early days. This is where I'm going to start.
Linux was developed, well, the start of the development was in 1991, where Linus Torvalds basically locked himself into his room for an entire summer. He got interested in the whole topic of operating systems
while learning about MINIX and the university. One thing he hated about MINIX was the terminal emulation. He needed a tool that allowed him to contact his university server. And the MINIX terminal emulation sucked so bad that he decided to write one on his own.
So he started, he bought a 386 PC, if I remember correctly, and started with a very simple scheduler that did nothing else than printing As and Bs. So he had two parallel threads that were doing independent tasks. In the next step, he started expanding those tasks so that one of the tasks was talking to the modem,
the other task took the output of the modem and printed it on the screen. So the very early beginnings of a terminal emulation. He did the development on MINIX itself. I think he spent a few months just modifying and updating MINIX to a point where it was a useful development environment because, well, MINIX was a research operating system,
not so much really for production work. So he installed the GCC compiler and all these things and got going. At some point, he realized that, well, my terminal emulation needs to store stuff somewhere. I need a persistent storage layer.
And well, the canonical choice was to create something that is MINIX compatible because the host OS that he was running was MINIX, and I think he just had one hard disk, so he had to figure out a way to store the data for his operating system on the same storage medium as the other operating system he was developing on.
It also helped that, well, since it was a research operating system, MINIX and its file system are well documented and easy to re-implement in a way. So this allowed him to have a terminal emulation program that saved files on a MINIX file system. And well, this is how it started basically.
It had a lot of limitations, 64 megabytes of maximum file system size, which was actually quite bearable for these times because well, in the early 90s, hard disks weren't that big, to be honest. But of course, these were limitations that at some point made it clear
that changes had to be made. So yes, started as a terminal emulation which eventually turned into an operating system. So one of the first things that were implemented
very early on was the VFS, the virtual file system switch because on the one hand, Linus, of course, wanted to keep the MINIX file system driver because he needed it for practical purposes, but he also wanted to make sure that there is an easy way to add another file system.
So the VFS layer was developed and Chris Provenzano provided the initial patches for that. Linus integrated and published them with 0.96 somewhere in 91, but he heavily rewrote it. I've been trying to look at those old kernel sources to figure out where exactly it was introduced.
I think 0.96 was the first where I could find a VFS.h file in the source code. So this is when the development started. Back then, Linus was basically dumping tarballs on an FTP site on an infrequent basis, and then published patches on top of that
before he made a new tarball a few weeks later. And the VFS basically is kind of an abstraction there in the kernel. So on the userland side, you have system calls like open, read, and so on, and the kernel takes those calls and figures out,
okay, which file system or which file does that call belong to? Where do I have to route it through? And this allowed you to create mount points using different file systems. So you could have your root file system on Minix and then mount /usr with an ext file system or whatever.
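To make that concrete, here is a minimal C sketch (purely illustrative, not code from the talk; the probed paths are just examples): userland issues the same open() and read() calls regardless of which file system backs a path, and statfs() reveals which one the VFS routed the call to.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/vfs.h>   /* statfs(), struct statfs */

    /* Same system calls for every file system; the VFS picks the driver. */
    static void probe(const char *path)
    {
        struct statfs sfs;
        if (statfs(path, &sfs) == 0)
            /* f_type is the file system's magic number, e.g. 0xEF53 for ext2/3/4 */
            printf("%-8s backed by a file system with magic 0x%lx\n",
                   path, (unsigned long)sfs.f_type);

        int fd = open(path, O_RDONLY);   /* identical call on Minix, ext, FAT, ... */
        if (fd >= 0) {
            char buf[64];
            ssize_t n = read(fd, buf, sizeof(buf));  /* routed to the right driver */
            printf("%-8s read returned %zd\n", path, n);
            close(fd);
        }
    }

    int main(void)
    {
        probe("/");      /* the root file system, whatever it happens to be */
        probe("/usr");   /* possibly a different file system entirely       */
        return 0;
    }

The point is simply that userland never has to care which of the mounted file systems it is actually talking to.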
And this really was the key component or a very first step in making Linux modular and more attractive for others to contribute. So the inclusion of VFS quickly resulted in a whole slew of file systems that were developed and included in the kernel.
And the very first file system to actually make use of the VFS was ext, the extended file system. Rémy Card was the guy who wrote it initially, and it removed two of the biggest limitations that Minix had, in particular that the file system size was increased to two gigabytes.
File names were all of a sudden 255 chars long, not just 14 chars, and it had a much more modern infrastructure. And yeah, just a bit more modern than what the Minix file system looked like. And Linus added it in 0.96c, April 92.
So Linux has already been around for a year. It wasn't perfect, but it was already a big step forward compared to what was there before. So one of the key issues that the file system had was that it was using linked lists to keep track of the blocks and inodes.
And the more you were using the file system, the more it started to fragment itself, and the performance just degraded horribly at some point. So it was clear that ext was just a stopgap measure before something much better came along. And it was the second extended file system, which also was implemented by Remi.
January 93, also very early in the early days of the Linux kernel. And this file system was basically designed from scratch with extensibility in mind. So all the on disk data structures
already left some room in the sense that you had free space in blocks and inodes that you could possibly repurpose later if you wanted to extend the file system. Maybe that's why they call it the extended file system, because it really was quite extensible.
Inspired by existing file systems, again, like the BSD fast file system, and it had all the things that you would expect from a file system, like the possibility to store timestamps with files. Different block sizes, you could turn files immutable,
which makes a file basically unmodifiable on disk. Quite a lot of modern concepts that you will still see in file system nowadays as well. And again, two terabytes, the size was again increased drastically.
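As a small aside on the immutable flag mentioned a moment ago: on current kernels it is toggled through the FS_IOC_GETFLAGS and FS_IOC_SETFLAGS ioctls, which is what chattr +i does under the hood. A hedged C sketch, to be run as root against a file on an ext* file system:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IMMUTABLE_FL */

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int flags;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) { perror("GETFLAGS"); return 1; }

        flags |= FS_IMMUTABLE_FL;                    /* mark the file immutable */
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) { perror("SETFLAGS"); return 1; }

        printf("%s is now immutable; writes and deletes will fail until the flag is cleared\n", argv[1]);
        close(fd);
        return 0;
    }

Clearing the bit again works the same way, just with FS_IMMUTABLE_FL masked out before the SETFLAGS call. But back to ext2's capacity.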
And yeah, even later, if you're using 8K blocks, you can store up to 32 terabytes in an ext2 file system. So it started at some point and had to live with the limitations that the Linux kernel around it gave it.
Let's say the VFS layer or the block device layer, but ext2 itself also further evolved over the years and gathered new features. And in fact, ext2 was an ideal test bed for things or features that were developed much later in the time,
like for example, POSIX access control lists, extended attributes, things like that. They were first added to the ext2 file system because it was well-designed, the developers knew the code, it was very mature and stable at some point, and that made it a perfect candidate
for starting with experimenting with new features that were needed in Linux. And it still has its purpose nowadays. Ext2, because it's very simple and robust, is still a good choice. For example, if you're using Linux on embedded devices,
it's a good candidate for flash storage because since it's not a journaling file system, there's not much churn on the disk itself. It doesn't perform a lot of writes to store data. Or well, let's say a slash boot partition that only contains the kernel and the initial RAM disk.
Slash boot is usually a file system where not much is really changing. Do you really need a big journaling file system for that? If space is tight, ext2 is still a good choice for that. As I said, the VFS layer allowed people
to add a lot of new file systems to Linux, and they started porting existing file systems that made them capable of accessing existing data on a wide variety of formats. I'm just picking FAT-MS-DOS for one example because, yep, DOS was probably the most common used operating system for people
that were developing Linux back then on their PCs. So they needed a way to access the file names there. And Werner Almesberger introduced the initial FAT support very early as well. It was later extended using VFAT, which has some more extended features and attributes.
They initially called it XMS-DOS, but had to rename it because of a trademark issue. But yeah, FAT-MS-DOS support was also included very early on. Even before that, based on my research,
they already included a collection of tools named the mTools, which basically is a userland tool that was capable of accessing FAT-formatted file systems. You could think of it like an FTP client. You start mTools, point it to a device, and you could then get and put individual files.
So it was not like you enter a directory and you see a directory structure, but you had a client that talked to the file system over the userland, so to say. And mTools was around even before Linux was developed, and it was very easily portable. And Jim Winstead included it on the very early versions of the Linux boot and root floppy disks that he maintained, which in my opinion
were the first Linux distributions because they were floppy disks that included not just a kernel, but also a shell and rudimentary tools to get started. Later on, NTFS came aboard.
Microsoft was also furthering the development of their file system, so Linux people were looking at a way on how to access NTFS file systems, and Martin von Löwis was the first one to take a stab at that. The problem is NTFS is a proprietary file system and had to be reverse engineered,
which is actually the case for some of the file systems included in Linux. The specs, specifications were not open, so people had to basically look at what the file system is doing on the block layer and reverse engineer what was going on there. So the NTFS driver that was included
in the Linux kernel, to my knowledge, never fully supported writing to NTFS file systems. But you were always able to mount and read data off it, but the kernel module, I think they started with experimental support and you had to load the module with a force option to enable the write support.
And you were still on your own because there was always a risk that writing data to an NTFS file system would just mangle everything and your data would be lost. Later on, the code was further developed under a project named NTFS TNG, which replaced the NTFS driver in the kernel.
And in parallel, there was a project going on named NTFS-3G, where they simply created an NTFS driver in userland, using the FUSE file system, which I'm also going to cover a bit later on, to talk with the block device. The open source project page on SourceForge nowadays
redirects to a company named Tuxera. According to their website, they have further developed their NTFS driver to now support other operating systems like Mac OS X, FreeBSD, Solaris, QNX, and so on. So it was probably a good idea to implement this
on the FUSE level because it was much more portable to these other platforms then. But FAT and NTFS are just two prominent examples of what people were doing based on the capabilities the VFS layer provided to them. Okay, the next big step or milestone in my opinion
happened when file systems became journaling file systems. Everything that we've talked about before with the exception of NTFS maybe are file systems that maintain a very special structure on disk.
Well, yeah, let me talk about this in this talk here or in this slide. File system check versus journaling. You probably, if you have been around with Linux for a while, you probably know the message: file system blah, blah, blah has gone too long without being checked, file system check forced.
And Murphy's Law mandates that this always happens when you have to reboot your file server and people are waiting for it to come back up. So ext2 had several ways or several criteria that forced a file system to be checked.
Well, the first, of course, is an unclean unmount. So if your system crashes, the system goes down. Ext2 had a dirty flag that wasn't cleared in that case. So it was complaining: you probably crashed, I think it's better if I check my consistency and internal structures. And off it went.
But also, depending on how many times you've mounted and unmounted the file system or how much time has been passed since the last time you mounted and unmounted the file system would also trigger a file system check. And administrators were yelling and screaming about that because, well, disk drives got bigger.
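Those two triggers, the mount count and the check interval, are just counters in the ext2 superblock, which sits 1024 bytes into the block device. A rough C sketch that reads them directly; the offsets follow the classic ext2 superblock layout, so treat it as illustrative rather than authoritative (tune2fs -l is the proper tool for this):

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <device>\n", argv[0]); return 1; }

        unsigned char sb[1024];
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0 || pread(fd, sb, sizeof(sb), 1024) != (ssize_t)sizeof(sb)) {
            perror("read superblock"); return 1;
        }

        /* little-endian fields at their classic ext2 offsets */
        unsigned magic     = sb[56] | sb[57] << 8;             /* 0xEF53 for ext2/3/4        */
        unsigned mnt_count = sb[52] | sb[53] << 8;             /* mounts since the last fsck */
        int      max_mnt   = (int16_t)(sb[54] | sb[55] << 8);  /* fsck forced beyond this    */
        uint32_t interval  = sb[68] | sb[69] << 8 | sb[70] << 16 | (uint32_t)sb[71] << 24;

        if (magic != 0xEF53) { fprintf(stderr, "not an ext* file system\n"); return 1; }
        printf("mounted %u times (max %d), check interval %u seconds\n",
               mnt_count, max_mnt, (unsigned)interval);
        close(fd);
        return 0;
    }

Once the mount count exceeds the maximum, or the interval since the last check has elapsed, e2fsck runs the dreaded full check at boot.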
Storage requirements were increasing. Ext file systems were spanning hundreds of megabytes, gigabytes. I think, again, this was in the 90s. But file system checks could take a very long time. And this was simply not an option if Linux ever wanted to become
an enterprise operating system that could compete with the other commercial UNIX systems back then. So this problem became more and more apparent as the size of disk drives increased and the amount of data stored on them increased analogously. Back then, there were several projects in parallel
that started to alleviate this problem or shortcoming of ext2. The most prominent ones are the ext3 file system, SGI's XFS, IBM's JFS, and the ReiserFS file system. And these are the ones that I'm going to talk about in brief in the next few slides.
Who of you has ever cursed about a file system check taking hours? Okay, so. Days, really. You know, it's going to finish at some point. You just don't know when. It's just nerve-wracking and very annoying.
And especially if it actually spotted an error and you were clueless about what was going on then because then it started attempting fixing things. The debug messages just got more confusing and it got really scary at that point. I've been in the situation myself way too many times.
Okay, the solution to this problem from a theoretical point of view is that you use a so-called journal. So instead of changing the data and the file system directly in place, so to say, you basically mimic what databases are doing in the sense that you define transactions
that have to happen in an atomic way so they either succeed or they fail. And you first write the intention to change something into a log file. And from there, the changes will be propagated to the actual file system on disk. So in the event of a crash,
the recovery basically just included, I'm in 15 minutes? Okay, you stole me 10 minutes. Okay, I need to hurry up. Anyway, so journaling was the solution to this problem because, well, you just read the journal and figure out what was the last transaction
that has either completed or failed and you could get the file system back into a consistent state in a much faster way. And again, there was a new layer created, the JBD journaling block device layer. Not all of the journaling file systems actually use it.
And the one that does is the ext3 file system. JBD was developed in parallel with the file system; it is a generic service that any file system could use, well, to become transactional and add journaling functionality. So the file system had to communicate with the JBD
in order to announce that a transaction is taking place, the data is going to be written out to disk and so on. And Stephen Tweedie developed this in parallel with the ext3 file system. JBD was also used by Oracle for OCFS2, and ext4 started by forking JBD into JBD2.
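The core trick is easy to sketch in a few lines of C. This is a toy illustration of write-ahead logging in general, not how JBD actually works: the intention is made durable in a log before the real data is touched, so crash recovery only has to replay or discard whole transactions.

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int append(int fd, const char *rec)
    {
        if (write(fd, rec, strlen(rec)) < 0) return -1;
        return fsync(fd);                 /* the record must hit the disk first */
    }

    int main(void)
    {
        int journal = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
        int data    = open("data.txt",    O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (journal < 0 || data < 0) { perror("open"); return 1; }

        /* 1. log the intention, durably */
        if (append(journal, "BEGIN txn 1: append 'hello' to data.txt\n") < 0) return 1;

        /* 2. apply the change to the "real" on-disk state */
        if (append(data, "hello\n") < 0) return 1;

        /* 3. mark the transaction complete; after a crash, anything logged
         *    between BEGIN and COMMIT is either replayed or thrown away */
        if (append(journal, "COMMIT txn 1\n") < 0) return 1;

        close(journal);
        close(data);
        return 0;
    }

A real journal logs block images or metadata updates rather than text lines and batches many operations into one transaction, but the ordering guarantee is the same.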
And here we have ext3. Basically, in a nutshell, they took the existing ext2 code, because it was mature, well-written, and they knew it well, and added journaling capabilities to it because this was, back then,
the most serious limitation that they observed. And the way they did it using the JBD block layer was actually quite smart. And since it didn't really change the on-disk structures, a migration from ext2 to ext3 was very easily doable because, well, you just mount it,
enable the journal, and off you go. And you could also disable the journal again in case that you had to mount it from an older system that was only capable of mounting ext2 file systems. It increased the file size to eight terabytes and added a few other new features. And we are in the 2000s already,
so development is going on there as well. IBM also realized that there's a need and a solution waiting. They had been working on journaling file systems for their AIX operating system
since the early 90s already, so they know this topic quite well. Back these days, IBM made a very big announcement that they are going to support the development of Linux with $1 billion. Sure, they wanted to make sure that their technologies were pushed into Linux and they remained relevant and dominant
and had a foothold in the door, so to say. But IBM contributed quite a lot of useful things and the JFS is probably one of the key contributions that I would recall. I've been working with Dave and Steve during my time at SUSE on making sure that SUSE Linux was always shipping
the latest version of JFS and so on. So they were very eager on making JFS on Linux a success. It was quite fun working with these guys. Yep, sizes were increased significantly again. And interestingly, the code base later then found its way
into AIX as JFS2 again. So they basically looked at the existing JFS implementation, decided it was crap because it didn't support SMP systems and was not very scalable or portable and just took the concepts and ideas and started from scratch again
with the support for multiple operating systems in mind. And AIX and later, oh, where is it? Oh, OS/2, I thought I had not mentioned it somewhere. Oh, it's Warp Server, yeah. So it was also capable of using that file system. So a Linux box using JFS could also mount
an OS/2 server's JFS file systems without problems, which was quite nice. Another contestant, ReiserFS. It was quite heavily supported by SUSE back then. I think SUSE was the only distributor really putting a lot of energy and resources
into ReiserFS. Hans Reiser, the original creator, didn't you show me 10 minutes already? Anyway, just let me know when I'm done. He was quite a controversial person and in the way he pushed ReiserFS
and his, well, would I call it stubbornness? I don't know. But he had lots of issues with the kernel community and there was a lot of friction in the development process going on. SUSE, for example, employed Chris Mason who developed the journaling part of ReiserFS
and they were the first ones to really claim that ReiserFS is stable and can be used. That's an observation I forgot to mention in the beginning. File systems on Linux have a really, really hard time because people read about somebody losing his data on the file systems and will never touch it.
Therefore, nobody tests it. Therefore, it doesn't get better. And it's a catch-22 and it's especially worse if it's about a file system that isn't even part of mainline Linux yet. So ReiserFS really had to go through quite a lot of iterations and lots of negotiations and discussions until it was finally included.
And ReiserFS 3.5 still is part of the kernel nowadays but its star is sinking, I would say. It is pretty mature but the development basically has stopped and they're just fixing issues that spring up because of the code base evolving or if there's really a serious bug.
I'm not aware that anybody is actively working on 3.5. However, something that really surprised me is that Reiser 4 is still under active development and they have separate patch sets for Linux kernel 4.1 already. They have a website and a mailing list. So these guys haven't given up on the whole project yet.
But will it ever be part of mainline Linux? Question mark, I don't know. But it's interesting to see that some development is still going on there. And ReiserFS was really the first journal file system that was actually included in Linux at some point. It beat all the other implementations
when it comes to the timing. It had a pretty modern structure that made it quite suitable especially for small files like a server's mail spool, a new server or a web server serving small pages and things like that. But it had issues in other areas. SGI XFS on the other hand came
from the opposite direction. Big files, high throughput, lots of concurrency. SGI was using that file system on their Irix Unix flavor for quite some time. Development started in the mid to early 90s. They announced it in 1999, but I think it took them a year to go through all the legal process
and talking to all the developers that were involved in the file system to get their part of the history and making sure that all the third party components they included were taken out. So the legal review process took about a year. And I was quite impressed reading about how many iterations they had to go through
and that they really persevered and pushed this forward until they came to a point where a Linux version was available. These guys were the employees from SGI working on it. A number of very high-profile figures from the Linux community started supporting them early on,
like Andi Kleen from SUSE or Jan Kara. And again, Christoph Hellwig is also I think a key contributor to XFS. And well, it made its way into the kernel in 2004 I think. And many of the distributions today default
to XFS as the key file system. And I think looking at Chris in the corner over there, I think it's still your most favorite file system as well for Linux deployments. I remember that we were recommending it for MySQL file systems. Sorry.
Okay. Okay. Okay. I actually posted the question but I'm going to repeat it for the audience. I was asking Chris about his experience and if he would still suggest XFS nowadays.
And yes, they were customer deployments using ext3 and they had timing and performance issues which switching to XFS basically solved. So it's very mature, it's robust. I think from all the journaling file systems that exist,
it has the biggest range of support and user base. Most of the developers work for Red Hat nowadays and it's under active development. So it's a safe choice. All right. File systems are nice but you need to put them somewhere, and in the early days, you just had physical disks and partitions
where you could put a file system on. /dev/hda, /dev/hda1. If you ran out of space, what did you do? You took a bigger disk, created a new file system, copied the data across, mounted it somewhere else, which was not very flexible
and well, again, was something that caused downtime which you didn't want to have in a server operating system. So the idea was, okay, how could we abstract that layer? And LVM, introduced in 1998 by Heinz Mauelshagen, was the solution to that, where he basically said a hard disk is just a physical volume somewhere underneath
and I'm going to put a layer on top of that that basically hides the underlying physical structure. I just create logical volumes on top and make them so flexible that it doesn't matter what kind of physical storage is underneath. And LVM then allowed me to create volumes, put a file system on top of that.
If space was getting tight, I just put another disk in my rack, added it to the physical volume, expanded my logical volume and we're good to go. So this was really golden. LVM really made storage management on Linux so much easier. And not just in a server environment but also on a home PC.
So if your home directory, if it's on a separate volume, suddenly becomes too small, it's nowadays much easier to increase the size. Even though some of the installers, I think the Ubuntu installer, still default to putting Ubuntu on physical partitions if you don't force it to use logical volumes,
which is kind of odd. Device mapper, yes, so another layer. A very cool layer actually because it allowed me to do lots of things and the second version of the logical volume manager used device mapper to get its job done. So the first version of LVM was, yeah, hard-coded.
It was just LVM between the file system and the disks. With LVM 2 and kernel 2.6, it was using the device mapper and device mapper really did the brunt of the work. Device mapper allows you to define virtual devices that you can stack on top of each other.
You have a pluggable infrastructure that allows you to modify the data while it travels from one block device to another one which gives you unlimited opportunities and ways to mess with the data while it travels across this layer. And yeah, DM multipath is one
of the interesting aspects here. Basically, it's a way where the storage layer can access the same storage device through different storage path like fiber channel controller or iSCSI or whatever. It's just aware, okay, you can take this route
or you can take this route but you will end up on the same device and the DM multipath knows about this and can either load balance between those two routes in order to distribute the load or if one of the connection fails, it will simply use the other one. So you get both scalability and high availability for your storage.
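Conceptually, a device-mapper target is little more than a table saying which underlying device and offset a given range of the virtual device maps to. Here is a toy C model of a dm-linear style mapping; it is purely illustrative, the device names and sector numbers are made up, and the real kernel interface looks nothing like this:

    #include <stdio.h>
    #include <stdint.h>

    struct segment {
        uint64_t start, len;        /* range in the virtual device, in sectors    */
        const char *backing;        /* underlying block device                    */
        uint64_t offset;            /* where that range begins on the backing dev */
    };

    /* one logical volume spliced together from two physical disks */
    static const struct segment table[] = {
        { 0,      409600, "/dev/sda2", 2048 },
        { 409600, 204800, "/dev/sdb1", 0    },
    };

    static void resolve(uint64_t sector)
    {
        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
            const struct segment *s = &table[i];
            if (sector >= s->start && sector < s->start + s->len) {
                printf("virtual sector %llu -> %s sector %llu\n",
                       (unsigned long long)sector, s->backing,
                       (unsigned long long)(s->offset + (sector - s->start)));
                return;
            }
        }
        printf("virtual sector %llu -> beyond the end of the device\n",
               (unsigned long long)sector);
    }

    int main(void)
    {
        resolve(100);       /* lands on the first disk  */
        resolve(500000);    /* lands on the second disk */
        return 0;
    }

Stacking simply means that the backing 'device' of one table can itself be another mapped device, which is what makes multipath, caching and LVM2 composable.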
Device mapper, LVM2, all these components initially were created at a company named Sistina which was then later acquired by Red Hat. And that's also an interesting topic that many of the key components like XFS,
device mapper and so on are primarily managed and developed by Red Hat nowadays. But it doesn't seem to be the case that this company is going down anytime soon. DM cache, something that was recently developed. Another nice trick, using device mapper, you create a separate flash drive as caching device.
So if your rotating disks are too slow, DM cache allows you to put a flash disk as a cache layer into the storage path, so to say. Also developed by Heinz Mauelshagen, who now works for Red Hat and is part of the team working on that. Yep, I briefly covered it.
LVM2 basically improved what LVM1 was already providing by adding more flexibility. Interesting things that happened here, for example, is thin provisioning that allows you to basically create disks that are bigger than your physical storage can actually provide. And it also had a cluster volume management option,
so if you have multiple machines sharing storage, CLVM is capable of talking to these as well and making sure that your data isn't being messed with. Another project by IBM that was quite noteworthy back then was the EVMS, the Enterprise Volume Management System,
where they basically took the concept of how they were managing storage on AIX and ported it to Linux. They initially also wanted to create kernel support for that, but the Linux developers thought that Device Mapper and LVM2 were a much saner choice.
Lots of politics involved, probably, and lots of lengthy discussions. The interesting thing here, or the noteworthy thing here, was that the IBM developers at that point then decided, okay, we're not getting our cake, so we are going to adapt EVMS to use. Am I really done?
Okay. What time is it? I don't have a clock here. 10 minutes? Okay. So let me go on for five more minutes and then we do a bit of Q&A. Yes, so Linux decided LVM2 was the better approach.
The IBM developers accepted this decision and modified EVMS to use the, yeah, the Device Mapper instead of using their own implementation. EVMS is interesting in the sense that it does more than just volume management.
It is capable, or it knows about file systems that are being put on top of a volume, and it was early on possible to grow and shrink file systems along with the underlying volumes. That's a capability that LVM2 got much later in the process, for example. But nowadays, EVMS has no relevance
and IBM basically stopped development in 2006 or so. Storage services, I'm going to run through this now. I'm sorry, I've deleted so much stuff and I'm still running out of time. That's embarrassing. NFS, we have been talking about local file systems
all the time, but in order to run a server, you need to share the data with the world and NFS was the very first implementation, the very first way of how to exchange data with other systems. NFS is both a client and a server part
and it has gone through several iterations. NFS 4.1 is currently the most active version and Linux developers are really on the forefront in making sure that NFS 4.1 in Linux is up to speed. There was a time where the Linux implementation of NFS was lagging behind.
This has changed dramatically. There's a lot of effort being put into making Linux one of the reference implementations. But the initial NFS implementation in Linux started very early on. Then of course, we have Samba. You all have heard of it. Take a file system on a Linux disk,
run Samba on top of it and share it with your Windows environment. This is basically how Samba started. Yes, later on, SMB was expanded to become the Common Internet File System, even though I doubt that anybody's really sharing disks
or CIFS shares over the public internet anyway. Microsoft has also been furthering the development and Linux was catching up with it as well. At some point, you had two kernel modules that allowed you to mount a share from a Windows server. The old SMBFS module and the more modern CIFS module. SMBFS is still mentioned in many how-tos
and documentations, but it shouldn't be used anymore and it has been dropped quite a while ago already. Which gets us to current file systems. Ext4, the latest version. Some still consider it unstable or not ready for production, interestingly. As I said, file systems are a very tricky subject.
Ext4 also started basically as a branch of ext3 that was then gradually extended and expanded and improved. Yeah, it added a few more modern concepts,
especially the new extent-based on-disk format allowed it to be much more scalable and faster. But it also was a step that made it incompatible with ext3 and ext2. So while it is capable of mounting ext3 file systems, you wouldn't make use of any of the benefits of ext4
if you don't really convert the file system. There was a mount flag that you can set, and any new data that would be stored on the former ext3 system would then use extents, which gives you a mixed on-disk format of both ext3 and ext4. At that point you can't go back basically.
But I think it's pretty solid. It's the default file system for some distributions, it's been around. It's another option besides XFS, how to say. Btrfs, yeah, the better file system. I've been giving talks about Btrfs a few years ago.
Basically it's not just a file system, it also includes RAID and volume management. It covers the entire stack. And if a file system is capable of talking to the disk directly, it can do a lot of very interesting things. Checksums, for example, and the whole RAID and volume management, all these things wouldn't be possible
without Btrfs being in full charge of the disk. So if you use Btrfs, don't put an LVM between it or DM RAID or whatever. Give it raw physical disks. This is the best approach. Cool thing about Btrfs, you can convert existing ext3/4 file systems on the fly.
And it would create a snapshot of the old ext file system that you can even go back to in case you wanted to go back. So it gives you a way to keep your existing data. You wouldn't have to copy it all around. Chris Mason is still busy working on it. Nowadays he works for Facebook. He started development at Oracle, then moved to Fusion-io together with some other guys,
and now Facebook is sponsoring his development. It is still ongoing and features are being added. Btrfs basically is the future local file system for Linux, in my opinion. Then we have ZFS. Sun open sourced Solaris as OpenSolaris in 2006 or something,
and part of OpenSolaris was the ZFS file system, which was revolutionary back then because, similar to what Btrfs is doing nowadays, it's not just a file system but it does volume management as well. It's very scalable, it provides a lot of robust features.
Even though it will never be part of the Linux kernel due to the licensing, it's licensed under the CDDL which is incompatible with the GPL. There is a very active community around the OpenZFS code base nowadays, and with ZFS on Linux, which you need to download and install separately,
you have a very solid storage solution that solves a lot of problems. Yeah, I'm sorry that I have to run through this so quickly but NBD was something that I found interesting. Have a remote server, share a block device, mount it somewhere else, but there's a talk about it tomorrow if you need to learn more.
DRBD allows you to create kind of a RAID over a network. You have a server here, a server there. Everything you write on the one side is being replicated to the other side so you have a block-identical copy of the data on a second system. DRBD usually works in synchronous mode
but it can also be switched into asynchronous mode if you have latency between the nodes and it supports catching up in case the connection goes down and it needs to replicate data that has been stored locally and you can change roles so it provides lots of high availability capabilities as well. Currently they are working on DRBD9,
the next incarnation, which will support more than two nodes. Like with ReiserFS, the DRBD developers like to get involved in lengthy discussions. Is that politically correct? Anyway, there's lots of discussions going on about the DRBD right now so they are having a hard time getting their code in.
Cluster file systems. So these are file systems that you can use to share a disk between two different systems. I ran out of time, it's a bummer but I'm going to give you, well you can get the PDF of the slides later on but let's keep five minutes for questions real quick.
Tell you what, I'm going to flip through the slides real quick. I would have talked about OCFS, GFS, requirements, why we need storage, GlusterFS, Ceph, Lustre, and yeah.
I apologize, that was way too much that I could handle but I got just too excited about all these different technologies and what you could do with them so bear with me. I hope it was still insightful in some way
and it gave you back fond memories of things that you hoped to have forgotten already. Any comment, question, anything that you want to leave before I leave, or did I overwhelm you with too much stuff? Okay, thanks, I hope you still enjoyed it and we'll take a look at the slides.