
How ZFS Snapshots Really Work


Formal Metadata

Title
How ZFS Snapshots Really Work
Subtitle
And why they perform well (usually)
Author
Matt Ahrens
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Transcript (English, auto-generated)
Hi everyone, I'm Matt Ahrens. Thanks for bearing with me, I have a little bit of a cold today. I'll try not to blow out your ears with my coughing. I'm here to talk to you about ZFS snapshots.
I work at Delphix. We do a lot of cool work with ZFS software development. And in the past, I worked at Sun Microsystems. I helped to create ZFS. And way, way, way back in the day, at the very beginning of ZFS, I helped to design and implement
ZFS snapshots. So, we'll start with the very basics. What are snapshots? You can take a snapshot of a file system, and it basically preserves an old copy of the data without actually having to copy everything. This is really useful if you make a mistake, like you delete some files accidentally: you can get them back from the snapshot.
It's also used for malware recovery, where somebody else deleted your files "accidentally". And it's also used in conjunction with zfs send and receive, to do replication of data to other systems. Who has used ZFS snapshots here? So, like, 90-plus percent, okay.
So this will mostly be review. How do you use snapshots? Well, there's a bunch of commands. To take a snapshot, you run zfs snapshot, and the snapshot is identified by the pool name, then the file system name, and then the name of the snapshot. You can take a whole bunch of snapshots at once with the -r flag.
You can destroy snapshots. You can, as I mentioned, use send and receive to send incremental changes. So in this case, we're sending where the old snap is the one that's already on the other system; we want to send the incremental differences between that one and the new snap, and then you can pipe that over SSH or some other network protocol to your other machine and receive it there with zfs receive.
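A rough sketch of the commands just described; the pool, file system, and snapshot names here are made up for illustration:

```
# Take a snapshot of one file system, or of a whole subtree at once with -r:
zfs snapshot tank/home@monday
zfs snapshot -r tank@monday

# Destroy a snapshot you no longer need:
zfs destroy tank/home@monday

# Send the incremental difference between two snapshots to another machine,
# assuming "oldsnap" already exists on the receiving side:
zfs send -i tank/home@oldsnap tank/home@newsnap | ssh otherhost zfs receive backup/home
```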
And you can find out information about snapshots with zfs get, to read properties of snapshots. So that's review for a lot of you folks. How do you really use snapshots? I know you can type those commands, but in practice, what we tend to do is say: okay, I know I can do that,
but what I want to do is set it up so that every hour I'm going to take a snapshot, and then I'll just have them there in case I need them. I know sometimes that happens.
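A minimal sketch of that hourly setup, as a bare crontab entry with made-up names; in practice many people use a tool like zfs-auto-snapshot, sanoid, or zfsnap instead:

```
# Hypothetical crontab line: every hour, take a recursive snapshot of tank/home,
# named by date and hour (percent signs must be escaped inside crontab).
0 * * * * /sbin/zfs snapshot -r tank/home@hourly-$(date +\%Y\%m\%d\%H)
```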
And then you wonder where all your space went. So in this talk, I'm going to be talking about where the space went and how you can figure that out. But you're going to have to bear with me a little bit at the beginning, because we need to understand how the snapshots actually work under the hood to fully appreciate how difficult it is to answer that question of where all your space went. It went to your snapshots. But if you want more detail than that, you've got to understand some more stuff.
So at a very high level, ZFS is a copy-on-write file system. That means that whenever we're writing to disk, we're selecting a new, previously unused place to write the data, with the exception of the root block here. So if we want to change some data blocks, like say these at the lower left,
we don't overwrite where they already are. We allocate new places on disk, which is represented by the green blocks. And at this point, nothing is pointing to those green blocks, so if we crash, we won't be able to find them. They aren't really part of the file system. So we have to write the indirect block that points to those green blocks.
So we allocate another new indirect block, and another new indirect block, all the way up the tree. And then finally we have this new tree formed by the green blocks plus these blue ones over here. We can overwrite the root block to point to that, and then we've atomically switched over to the new state of the world. So this is true in general with ZFS and copy-on-write file systems,
even in the absence of snapshots. But it makes implementing snapshots really easy. So if we want to create a snapshot, all we need to do is save the old root block here before we overwrite it with the new one, and make sure that we don't free any of these old blocks.
So, as you can see at the top there, essentially we have two parallel trees that reference some of the same blocks, but some unique blocks. So like the tree that's rooted at the green is the new file system, but the one rooted at the blue block at the top represents the old snapshot.
There's a couple of other things that we need to worry about with snapshots. So, we've taken the snapshot, nice and easy, because we're using copy-on-write. But when a block is removed, we have to figure out: can we free that block? If we didn't have any snapshots, then when you delete a file, all of its blocks are no longer needed.
The file system can go find them and clean them up and reuse those places on disk for something else. But if you have snapshots, we might not be able to do that. So, in ZFS, the way we do this is every block, we keep track of its birth time. The birth time is the time when we allocated and wrote that block. And we store that in the indirect block that points to it.
So, in this example, this orange block was born at time 37, and the block pointer that points to it says that it was born at time 37. So, whenever we're removing a block, we have this block pointer available to us. And we can figure out if any snapshots reference it by just checking: was this block born
after the most recent snapshot? If so, then no snapshots reference it. But if it was born before the most recent snapshot, then at least that most recent snapshot references it, so we can't free that block. Again, nice and easy, since we have the right data structures. But when we delete a snapshot,
how do we find the blocks to free? So, we need to find all the blocks that are unique to this snapshot, the ones that are only referenced by this snapshot that I'm deleting, and not by any other snapshots or file systems, and we need to free those blocks. That's a little tricky. So, we're going to be talking about that for the next few minutes. It's tricky, but it's worth it,
because the way that other file systems have done this is really, really expensive and makes using snapshots in practice impractical for a large number of snapshots. So, as I mentioned, the goal of snapshot deletion, we want to find the unique blocks, the ones that are only referenced by this snapshot. And we'll look at a few different algorithms,
keeping in mind what the goal is here. So, the optimal algorithm would run in time proportional to the number of blocks that we need to free. So, in other words, if we're deleting a snapshot and it's going to recover us like a terabyte of space, then it's okay that that takes longer than deleting a snapshot
that's going to recover only a gigabyte of space. But if we delete two snapshots that are each going to recover a terabyte of space, they should both take about the same amount of time to delete. And, ideally, the number of blocks that we read from disk should be much, much less than the number of blocks that we need to free.
So, one key thing in the way snapshots work that is going to really help us with this is that block lifetimes are contiguous. In other words, there's no afterlife. After a block is killed, meaning it's no longer referenced by the file system,
it can't become part of the file system again later on. So, you write a file, it becomes part of the file system; you take a snapshot, it's part of that snapshot; then you take some more snapshots; then you delete the file; then you take some more snapshots. While the file existed, that's when its blocks are referenced, and before that they aren't, and after that they aren't either.
So there's no way to, like, bring it back to life later on. The key thing about this is that it means that the blocks that are unique to a given snapshot are the ones that aren't referenced by the previous or the next snapshots. We don't have to look any further into the past or future to figure it out, right? So, the first way that we could delete snapshots,
I call the Turbo algorithm. The way that you could do this is: traverse the tree of blocks, and check the birth times to see if they're referenced by the previous snapshot, similar to what we talked about before for figuring out whether we can free a block that's being removed from the file system.
And the cool thing about this is that we don't have to examine children if they're not referenced by the previous snapshot either. So, to take this example here, but essentially, if we find a block that was
an indirect block that was created a long time ago, then we know that all of its children were also created a long time ago. So we know that we can't delete any of that stuff, and we can skip traversing down into that part of the tree. And then, so that takes care of whether the previous snapshot references the block,
and then we have to figure out if the next snapshot references the block. So one way to do that is to just kind of look at the next snapshot and go to the same location in the tree, the same file, the same offset in that file, and see: oh, is this block pointer the same one as the one that I have? If so, then I can't delete it. But if it's different, then I know that
the next snapshot doesn't reference it, and all the ones in the future don't reference it. So this takes advantage of another characteristic of snapshots, which is that if there's a block pointer at one place in the tree, then it'll always be at that place in the tree in the previous or next snapshots. It can't move around; you can't move a block pointer from one file to another file.
So this works, which is good. It's correct. But the disadvantage is that it can take a long time. It takes time proportional, unfortunately, to the number of blocks written since the previous snapshot. Those are the ones that we have to go check to see if they are present in the next snapshot.
And we might have to read a lot of blocks. Like, each indirect block might be pointing to only one data block that we care about, and then we have to read that, and then we have to read the corresponding indirect block in the next snapshot. So the number of reads might actually be like 2x the number of blocks that we're checking whether we can free. So the disadvantage is maybe you have to go read a million blocks,
but you actually are freeing nothing. This could happen really easily if the next snapshot is identical to this one. And you would actually have to read two blocks from disk in order to just free one. So you're going to end up doing a lot of random reads. And this is a big problem because it can actually become a bottleneck
for being able to ingest new data into your file system, because if you imagine a single rotating disk, you can write at about 200 megabytes per second, but you might only be able to free at less than 1 megabyte per second. So once you've filled it up once, you can only write to it at 1 megabyte per second,
even though you have a lot more capacity, because you can't free stuff any faster than that. Alright, so this algorithm, thankfully, was never implemented for ZFS.
So let's talk about the next algorithm, called the Bunny algorithm, or maybe the Hare algorithm. The key thing here is that we're going to add an additional data structure, called a deadlist, to keep track of blocks that are no longer referenced. This is an on-disk data structure, and every dataset, meaning every snapshot and every file system, has this deadlist.
It tells us, here are the blocks that were referenced by the previous snapshot, but they aren't referenced by this snapshot or file system. So it's the blocks that I killed. So if we look at this diagram, this is showing a timeline of snapshots. So snapshot 1, 2, 3, and then the file system.
And then these are showing lifetimes of blocks. So this red line is showing that blocks on snapshot 2's deadlist are the ones that were referenced, they were live, during snapshot 1. Maybe they were live at earlier snapshots, maybe not. And those are the ones on snapshot 2's deadlist.
And then similarly, the green lines are the same thing for the green snapshot. So how does this help us? When we want to delete a snapshot, for example the red one, the target snapshot, we can find the blocks that are referenced only by this snapshot by looking at the next snapshot's deadlist.
So those are these blue blocks. And the blue blocks include two types that we care about. The ones that were born between these two snapshots are the ones that are unique to this snapshot. So we just have to go through the next snapshot's deadlist, and it'll give us all the blocks.
And we can look at the birth times, which are stored in the block pointers in the deadlist, and see: is it in this range? If so, it's the first type, unique to this snapshot, and we can get rid of it. Otherwise, it's referenced by the previous snapshot, so we can't get rid of it. And then we combine the two deadlists. So let's take a look at how well this works.
It takes time proportional to the size of the next snapshot's deadlist, which is the number of blocks deleted before the next snapshot, which is kind of similar to the previous algorithm, but the big advantage here is that the deadlist is compact. So every block of the deadlist that we read in, it has 1024 block pointers inside of it that we can go evaluate.
So theoretically, it can be up to about 2,000 times faster than the previous algorithm, which is really good. But it can still take a long time to free nothing, because it might be that the deadlist contains a lot of entries, but they're all the other type of entries, ones that were born before the previous snapshot.
So that previous algorithm, the Hare algorithm, is actually what was implemented in ZFS at the beginning, and yeah, it could be slow. On to the Cheetah algorithm, which we implemented, I think, in around 2008 or 2009.
So the Cheetah algorithm is based on the previous one; it's an extension of that. And what we do is divide the deadlist into sub-lists based on birth times. Because, if you remember, when we delete snapshot 4, we want to find these blocks.
So the key thing here is, well, let's just kind of do that beforehand. As we're creating the deadlists, we're going to actually create several lists of blocks that are all part of snapshot 5's deadlist, but there's one list for blocks that were born in this range, one for blocks that were born in this range, one for this range, one for that range.
So we can separate them out based on when the blocks were born, and the separations are based on what previous snapshots exist. So now, when we want to delete a snapshot, we need to iterate over the sublists and find the ones where the sublist corresponds to this time range.
So in this case, it's just this one. We need to free everything that's on this sublist and keep all the other sublists. So we just need to merge those together and delete this one. So the next trick is that when we merge these lists together,
we can do it by reference. So we'll basically say: okay, we don't actually have to iterate over this whole list and copy all the block pointers over here; we just say this list includes this other list by reference. So now, in terms of performance, when we delete it,
it's going to take time proportional to the number of sublists because we have to append them to the other ones, and proportional to the number of blocks to free, which is, remember, the goal. So in practice, you can see this is going to give us much, much better performance, 1500 megabytes per second,
with these assumptions, compared to being able to write at about 200 megabytes per second. So finally, snapshot deletion is not your bottleneck. Yay! And typically, the number of sublists is much less than the number of blocks to free, so we're counting on that. We'll come back to that later.
But I didn't really tell you: where did all the space go? That wasn't very interesting yet. It's good to understand how snapshots and deadlists work, because we're going to come back to that. But where did the space go? I know, space is mostly empty, but your storage space is not.
Alright, so the first thing you might do, if you want to know how much space your snapshots are using, is type zfs list. And you might see that pool/fs, the file system, is using 1000 gigabytes,
but it's only referencing 700 gigabytes. So the refer, or referenced, property means: how much actual data can I get to by looking at files in this file system? And in this case, if you didn't have any snapshots,
and you didn't have anything fancy going on, then used and refer would be the same. But we do have fancy stuff going on, because we have snapshots. So the first thing you want to do is look at the usedbysnapshots property. This is going to tell us how much space is used by all the snapshots put together. In other words, how much space would be recovered
if I deleted all of this file system's snapshots? In this case, it's 300 gigabytes. And 300 gigabytes plus the 700 gigabytes that's referenced equals the 1000 gigabytes. That's great, but what if I want to know which snapshots are using up that 300 gigabytes?
So you type zfs list -t all. It'll list all your snapshots, and it'll list how much space they're using. They say they're each using one or two or three gigabytes. So I add up all the space used by the snapshots: that's seven gigabytes, which is not 300 gigabytes.
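On the command line, the sequence just described looks roughly like this, with made-up dataset names and the numbers from this example shown as comments:

```
# Overall usage: "used" is 1000G, but "refer" is only 700G.
zfs list pool/fs
# NAME     USED  AVAIL  REFER  MOUNTPOINT
# pool/fs  1000G  ...    700G  /pool/fs

# How much space would come back if all snapshots were deleted? 300G here.
zfs get usedbysnapshots pool/fs

# Per-snapshot "used" is only each snapshot's unique space; in this example
# it adds up to about 7G, nowhere near the 300G above.
zfs list -r -t all pool/fs
```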
So which is it? Seven gigabytes or 300 gigabytes? What about the other 293 gigabytes? Alright, so here's the big reveal: a snapshot's used space is actually just the space that's unique to it. I'll explain why this is, but I think it was probably a
mistake for us to display this information so prominently, because it's so misleading. In ZFS space accounting, usually the concept that we go back to is that there's a lot of space sharing,
so we have to have some principle of who gets charged for the space. And usually the idea is: if I were to delete this thing, how much space would I get back? Because that's very direct and practical. So a snapshot's used space is how much space you would get back if you deleted that snapshot.
And this is true. A snapshot's used space is the same as its unique space. Remember, the unique space is what we get back if we delete the snapshot. But that's not very useful, because snapshots have shared space.
So you can very easily have a situation like the one that we just diagrammed where all of your snapshots have very little unique space. Because they're sharing it with their neighbor. You can imagine if instead of taking a snapshot every hour, we just said every hour take two snapshots in a row. Then essentially you would always have zero space used by any one snapshot, but lots of space used by all of your snapshots together.
Because those two snapshots that were taken right back to back are going to have almost the same contents. And so if you delete one of them, you get almost no space back. But if you deleted both of them, then you get a bunch of space back.
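Here's a small, hedged shell experiment that shows the effect; the pool, dataset names, and sizes are made up, and the pool is assumed to be mounted in the default place:

```
zfs create tank/demo
dd if=/dev/urandom of=/tank/demo/blob bs=1M count=100   # ~100M of data
zfs snapshot tank/demo@a
zfs snapshot tank/demo@b    # back to back: @a and @b reference the same blocks
rm /tank/demo/blob          # now only the snapshots keep those blocks alive

zfs list -r -t all -o name,used,usedbysnapshots tank/demo
# Each snapshot's "used" is close to zero (nothing is unique to it),
# but usedbysnapshots on tank/demo is roughly the whole 100M.

zfs destroy tank/demo@a     # frees almost nothing...
zfs destroy tank/demo@b     # ...but destroying both gives the ~100M back
```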
So we come back to our snapshot timeline diagram. Here I'm showing not what's in the deadlists, but just the lifetimes of various types of blocks. The blocks that are counted in the used or unique space of a snapshot are the ones that are unique to that one snapshot. So a snapshot's unique, or used, space is just the blocks that were born after the previous snapshot and that died before the next snapshot.
So there's all this space that's shared between them. It can be shared between two, three, four, five, six, seven snapshots. And then there's also space that's not used by snapshots. So that's the 700 gig, which is still referenced by the file system and it might have been born at various points in time.
All while we were doing snapshots. Alright, so this kind of explains why the default stuff that we're showing you is not very useful.
So what can we do instead? Here's another way of thinking about space used by snapshots, which is to not think about space used by snapshots, but instead to think about the ingestion rate: when data was added to your file system. And we can see that with the written property.
So the written property tells us how much space was written since the previous snapshot. And then this applies to both file systems and snapshots. In this example, the file system and the most recent snapshot are identical. So there was no space written to the file system, no data written to the file system since the most recent snapshot.
But you can see here, it makes it more clear that, okay, well, most of the space was written before the first snapshot. So in other words, it seems like what happened was we populated the file system with a bunch of stuff, 894 gigs of stuff. Then I took the first snapshot. That was the bulk of my data. Then I wrote another 52 gigs of stuff.
I took the next snapshot, snapshot 2. Then I wrote another 51 gigs of stuff. Took the next snapshot, that was snapshot 3. And then the 3 gigs before snapshot 4, and then nothing after snapshot 4. So the cool thing about this is that you can actually reason about the sums of these. The sum of all the space written is equal to the file system's used space,
because the whole file system plus all the snapshots is all the space that was written through that whole timeline. So, you know, 0 plus 3 plus 51 plus 52 plus 894 all added up together gives us 1000 gigabytes, which is the space used by the file system. And remember, the other way we can break it down is that the file system's referenced
plus its usedbysnapshots gives the same 1000 gigabytes. These are talking about the same space, but they're two different ways of looking at it. And written doesn't tell us directly, like, hey, if you delete these two snapshots, then you're gonna get space back.
But you might see the kinds of patterns I talked about before. Like, if you're creating these pairs of snapshots, then you would see that pattern in the space written, where you see one snapshot with a bunch written and then 0, and then a bunch and then 0, and then a bunch and then 0. You'd see that kind of pattern in there, and that would help you to understand when space was added.
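You can see that ingestion history with the written property in a listing like this (same made-up dataset; the written numbers are the ones from the example, and the per-snapshot used values are just illustrative):

```
zfs list -r -t all -o name,used,referenced,written pool/fs
# NAME            USED  REFER  WRITTEN
# pool/fs        1000G   700G        0   <- nothing written since the last snapshot
# pool/fs@snap1     1G    ...      894G  <- the bulk of the data, ingested first
# pool/fs@snap2     2G    ...       52G
# pool/fs@snap3     3G    ...       51G
# pool/fs@snap4     1G    ...        3G
#
# The WRITTEN column sums to the file system's USED: 894+52+51+3+0 = 1000G,
# even though the snapshots' USED column adds up to only ~7G.
```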
[Audience question] I've got a quick question. Maybe I missed something here, but to me, intuitively, those written numbers should add up in total to 1007 gigabytes, not 1000 gigabytes. The only difference is just the unique space, right?
Yeah, so the written and the unique are like two different ways of looking at the space. They aren't things that add together or that are the same thing; it's a different conceptual model. And I think the key thing here is that the unique space is not a conceptual model that's going to help you understand your space,
or what action to take. Whereas the written can actually help you understand... it might not tell you directly that you should delete this set of seven snapshots to free up space, but it'll at least tell you something very real about the history of your data.
So if we go back to this diagram: how do we figure out the written space? Well, I'm not gonna tell you exactly how we compute it, but conceptually, the space that's written by snap 1 is the stuff that was born since the previous snapshot, which is down here.
So it's all these timelines that start right here, so the yellow, the orange, and the purple added together. For the red, again, all the timelines that, blocks whose lifetimes start in this range, so the red and the lighter purple there, and so on.
And all these may encompass some blocks that are still referenced by the file system, which is why we can have the space written by the first one is 894 gigs, which is way more than the space used by all the snapshots, because most of that space is still referenced by the file system.
[Audience question] And the referenced, is it changing, the refer? Yeah, so in this example, we're talking about, let's say, a database or a virtual disk image which is being overwritten, rather than something that's just being added to and added to. So the amount of space referenced is the same throughout this whole file system's history, but some blocks are being overwritten.
If you're dealing with something more basic like data ingestion, where basically you're never freeing anything, then it's a lot easier to understand what's going on, because basically you're going to sum up the written, and that's going to equal the used, and that's also going to equal the refer.
And also in that case, the usedbysnapshots, what you'd get back from deleting the snapshots, is going to be close to zero. So that's not as interesting of a use case in terms of understanding. I mean, the written and all that stuff will help you to understand when data was ingested. But in that case your snapshots aren't using a lot of space, so you don't have the question of where did my space go.
And this occurs for other workloads too, like home-directory kind of stuff, where you're creating and deleting a lot of files over time, and maybe you have a 700 gig quota on the space, so of course it's always 700 gigs, but you're creating and deleting files to stay under that quota.
Okay, so there's another cool thing that you can do with the written space, which is that there are these additional properties that kind of appear on the fly: written@, followed by the name of a previous snapshot. So you can ask not just how much space was written since the previous snapshot,
but how much space was written since some snapshot in the past that I specify. And conceptually the way this works is, let's say I want to ask: hey, snapshot 3, how much of your space was written since snapshot 1? I can get the property written@snapshot1 on snapshot 3, and it's going to find the space whose birth times are after snapshot 1,
and that are still referenced by snapshot 3. So it includes obviously this one and the light blue one, also this light purple one, but it doesn't include this space, because that's not actually part of snapshot 3. That was killed before snapshot 3.
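As commands, those questions look something like this (snapshot names made up):

```
# How much of the data referenced by snap3 was written after snap1?
zfs get written@snap1 pool/fs@snap3

# The plain "written" property is the same idea, measured against the
# immediately previous snapshot:
zfs get written pool/fs@snap3
```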
Calculating this is a little bit tricky, so I'm not going to go into too much detail there. The cool thing about this, in addition to being able to ask these kinds of questions, which may or may not occur to you to ask,
the same underlying mechanism is used for zfs send space estimation. So when you do a ZFS send of an incremental, and you're saying, hey, I want to send an incremental, the other side has snap1, and I'm sending snap3 (snap2, I don't care, maybe it's a temporary snapshot or whatever),
essentially the amount of data we're going to send is based on snap3's written@snap1, because the other side has snap1, and we need to know how much of the space in snap3 was written since snap1.
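That estimate is what a dry-run send prints; a sketch with the same made-up names:

```
# -n: dry run, don't actually send anything; -v: print the estimated stream size.
# The size reported for this incremental is essentially snap3's written@snap1,
# adjusted for stream headers, compression, and so on.
zfs send -n -v -i pool/fs@snap1 pool/fs@snap3
```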
So we do that and then just fiddle with it a little bit to take into account compression and headers and some of that stuff. But probably the reason most of you came to this talk (I'm not talking to you, Siri) is to understand shared snapshot space.
Well you came here to wonder, where did all my space go? And my answer was, it's all shared, don't worry about it. It's shared between the snapshots, it's all cool.
Well, if you want some more detail than that, here's one way to understand it. We can ask a what-if question: what if we were to delete some of the snapshots? Not all the snapshots, which is the usedbysnapshots property, and not just one of the snapshots,
which is that snapshot's used, or unique, property, but some of the snapshots. You can ask this question with zfs destroy -nv and then a list of snapshots. You can list the snapshots with commas separating each of the snapshot names, and you can also list a range of snapshots with begin%end, which means: what if I were to delete begin, end, and all the snapshots in between?
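For example (made-up names), to ask how much space would come back without actually destroying anything:

```
# -n: dry run, don't destroy anything; -v: report what would be destroyed and
# how much space would be reclaimed.
zfs destroy -nv pool/fs@snapA,snapB,snapC          # a comma-separated list
zfs destroy -nv pool/fs@hourly-0100%hourly-0500    # a range: begin%end, inclusive
```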
How would you actually put this into practice? So what we've done at Delphix is we categorized the snapshot space into different application-defined classes.
So for example, there's like, some snapshots were created because you set this policy that said I want to create a snapshot every hour. And some snapshots were created because the user explicitly requested a snapshot. And some snapshots were created for one of those reasons, but then actually we wanted to delete it, but we couldn't because you created a clone of it or something like that.
So we categorized the snapshots into the different categories, and then we can say, okay, I can at least tell you that the space used purely due to your retention policy of creating snapshots every hour and keeping them for a year is costing you x gigabytes.
Versus the space that's used by your clones, which presumably you want to keep, you have a reason to keep those around anyway, and you're not going to be deleting them no matter what. We can tell you how much space that is. So you can tell: okay, if I were to change my retention policy, what's the maximum space I could possibly get back?
So the idea here would be that you would say, okay, list all the snapshots in one class, and that will tell you how much space you would get back, and how much space that class of snapshots is using. But you still have some shared space
between the different classes at the end of the day. So there's still a caveat there, but this is better than the other tools. Okay, so now, how could we implement this? We've seen two corner cases already, which I mentioned: asking about one snapshot, or about all snapshots.
But in the general case, we only really care about the first example, where you're looking at a contiguous range. If you're looking at several disjoint ranges, then you can just calculate each of them separately and then add them all up, because the lifetimes of blocks are contiguous; there's no afterlife.
So it boils down to the question of: what if I were to delete this contiguous range of snapshots? In this case, we're saying from begin to end, these five snapshots, I want to get rid of them. So we have to find all the space that is referenced by any of these five snapshots,
but not any other snapshots. So we can think of it as the space that's unique to this one, this one, this one, this one, that's this line. The space that's shared by two, meaning here's space that's shared by two by two, these two, these two, these two, the space that's shared by three, which is the next section,
the space that's shared by four, and lastly, the space that's shared by all five of the snapshots that we want to delete. So if I arrange this a little bit differently, so this is the same lines, but just sorted out differently, hopefully this harkens back to a diagram from earlier in the talk about deadlists.
So, for example, the blue lines there are all examples of sublists from the blue snapshot's deadlist. Similarly, here are two sublists from the red snapshot's deadlist.
Now, of course, this earlier snapshot, the red one, might have other sublists that extend off the graph, but these are the ones that we care about. So we just need to find all these sublists and then add up the space that's in them, and then, awesome, we've done it, we've implemented this.
Although one little catch is that there are O(n squared) sublists that we need to add up. And in this case, what I mean by that is: here there are five snapshots that we are asking about deleting,
and the number of sublists here is proportional to five squared; it's about half of five squared. So, with five, it's not too bad. But what if you have like a hundred snapshots in a row,
and you want to know how much space are those? Well, here's the thing. n squared gets really big, really, really, really fast. So this graph is only showing n up to 100. I don't know if you can see at the very, very bottom, there's a line showing n, like y equals x, x equals y,
and another line showing n log n. They're like basically barely even above zero, and n squared is way, way, way up there. It goes so fast. So if you're talking about a hundred snapshots, it's like a lot of deadlists,
but what if you're talking about those 8,700 snapshots? Then it's a lot, lot, lot more. So, it's hard to get like a real visceral feel for how quickly n squared goes, but if you have these 8,700 snapshots, which is, remember, one per hour for a year,
then you'll have 75 million lists. Well, I mean, computers can do millions of things, right? I mean, we have CPUs do like billions of instructions per second or something. I mean, it's like no big deal. But if you imagine like each list, let's imagine that each list is one sector,
and the thing is, disks don't do millions of things per second. They store lots of stuff, but imagine that each one of these lists takes up just one sector on disk, four kilobytes. You're talking about every file system taking an additional almost 300 gigabytes just for this metadata about your snapshots,
just for one aspect of the metadata. And if you want to ask this question of, hey, what happens if I delete most of these 8,700 snapshots, we're going to have to read in all these deadlists. Let's say your disks are super duper fast and you can do 10,000 IOPS; it's going to take you two hours to read them all in. Don't try this on a spinning disk.
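The back-of-the-envelope arithmetic behind those numbers, assuming one snapshot per hour for a year and one 4 KiB sector per sublist:

```
n=8760                                # roughly one snapshot per hour for a year
echo $(( n * n ))                     # ~76.7 million sublists, on the order of n^2
echo $(( n * n * 4 / 1024 / 1024 ))   # at 4 KiB each: ~292 GiB of metadata
echo $(( n * n / 10000 / 60 ))        # at 10,000 IOPS: ~127 minutes just to read them
```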
I didn't do the math for the spinning-disk case, but it would probably overflow my calculator. Oh yeah, and the way this is implemented, it holds locks that prevent the TXG sync from proceeding. So after a couple of seconds, you're not going to be processing any more writes from user level.
Not great. And this is how it worked. But people didn't have a lot of snapshots then. But here's the key thing: nearly all these lists are empty.
When you're talking about thousands of snapshots, it's almost impossible to have all of them be actually in use and have all these sub-lists actually have a non-zero number of blocks in them. Because remember, we're breaking it down based on that birth time.
So like, you know, when there's a thousand snapshots back there, it's pretty unlikely that the data that was deleted in this time range was written spread out over a thousand different snapshots in the past. It tends to be like,
okay, I created this file, I wrote a bajillion blocks here, and then I deleted that file, so I deleted the bajillion blocks there. Okay, cool. Alright, so in 2012, I implemented this feature called empty bpobj. This is an on-disk feature, so there's a feature flag associated with it,
you may have seen that, to not store the empty sub-lists on disk. With that, we're able to answer that question about how much space the 8,000 snapshots are using in about a minute, when the deadlists are cached in the ARC.
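The feature flag is known as empty_bpobj in open-source ZFS; you can check whether a pool has it with something like this (pool name made up):

```
# "enabled" means the pool can use the feature; "active" means empty sub-lists
# are actually being omitted on disk.
zpool get feature@empty_bpobj tank
```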
And plus, obviously, we get to save all that space, right? So you don't have the 300 gigabytes of wasted space storing all your empty deadlists. The next change that we are making, that I have a review up for right now, is partial deadlist loading.
So the algorithms are still O(n squared); you still have entries for all of those sublists. But you can get big, big speed-ups by being smarter about how you process it. So basically, this change, partial deadlist loading,
is all about short-circuiting early on when we discover, oh, this is a deadlist, this is a sub-list that is empty. We're gonna shortcut our processing early on so that we save time, and that gives us a 5x speed-up,
so down to 12 seconds. And this is what we have been using at Delphix for several years, and it works pretty well. We haven't had any customer complaints about this since then.
So there's a review out for that against ZFS on Linux, and that's a good speed-up. The next thing that we want to do on top of that is caching these partially loaded deadlists. This gives us a huge extra win, because most of the time is spent going through those lists.
You're still... Like, there's still entries for all 75 million sub-lists, and we have to go process them, but computers can do millions of things pretty quickly. As long as we're not asking them to do very much for each of those millions of things.
Which is kind of the case for the partial deadlist stuff. But the cool thing about caching this partially loaded deadlist is that we get a huge extra speed-up, 70x, bringing this down to much less than a second. And that works pretty well. I mean, you still have the 12-second penalty
the first time that you ask this question, but if you keep it cached, then when you boot up you pay this tax once, and then you're fine. All these cases are still O(n squared). But with the partial deadlist loading, we're just making the constant factor smaller
by short-circuiting earlier. And then with the caching of the partially loaded deadlists, we're kind of cheating the n squared, because we're saying we're only going to actually keep in memory, and have to process, the deadlists that are populated. So whether that's n squared or not
depends on the access pattern. If you assume that 1% of all of the sublists are non-empty, then it's still O(n squared). But if you assume that at most a thousand of the sublists are non-empty, then it's actually constant time.
So that really depends on the workload a lot. Cool. So we'll have time for lots of questions. The takeaway, hopefully the takeaway from this talk, is that if you're confused by snapshot usage, you're not alone, but here are some tips that you can use.
So first, look at the usedbysnapshots property. That's going to tell you how much space is used by all of the snapshots together. Ignore the snapshots' used space. I know we put that right up in your face when you do zfs list -t all; maybe we should consider just getting rid of that, but ignore the snapshots' used space.
The written space can help you understand how the space grew over time. And then you can do these what-if kinds of experiments with zfs destroy -nv and a list of snapshots. So, one final announcement before I get to questions. OpenZFS organizes an annual conference
called the OpenZFS Developer Summit. We're going to have the seventh annual conference this year, November 4th and 5th, in San Francisco. Talk proposals are due August 19th, and we still have several sponsorship opportunities available,
so if your company is interested in being known as a sponsor of OpenZFS and helping to carry OpenZFS into the future, we have some new benefits in the sponsorship program this year, so talk to me or send me an email. And with that, I'll open it up to questions.
All right, everybody understood what I said perfectly? [Audience question] Have you uploaded your slides yet? I've not uploaded my slides yet; I was just finishing them this morning. But I will. All right, well, feel free to come
and shout at me afterwards. And if you're interested in ZFS, come to Allan Jude's talk. I think it's right after lunch. Is that right? Yeah, first come to the BoF at lunch, and then come to Allan Jude's talk, where he will be talking about
a kind of overview of ZFS and OpenZFS, FreeBSD ZFS and ZFS on Linux, how they're all related, and how we're all going to get along in the future. [Audience] Two forty-five. Two forty-five, all right. [Audience question] Speaking of getting along, why did you pick a name that the Canadians and Americans have to pronounce differently? I know.
Well, there were not a lot of letters left, and that was almost 20 years ago. We could just pick a canonical pronunciation for OpenZFS; I think we both pronounce that one the same. Thank you, everyone.