ceph: a gentle introduction
Formal Metadata
Title: ceph: a gentle introduction
Series: openSUSE Conference 2016 (63 parts)
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/54610 (DOI)
Transcript: English (auto-generated)
00:08
My name's Owen Synge. This is a short talk. Oh, thank you. I feel I'm the only person who's introduced my talk by dragging people in here. I hope you're not too disappointed to have been dragged in here, but I can't resist trying to be like a circus.
00:25
And we're going to look at a beautiful animal of the octopus family called Ceph. It's a storage system and it's been getting a lot of attention from a lot of people. It's really for big data centers, so it's a little bit hard to sell to individuals,
00:43
but it's much easier to sell to people who have lots of computers. But enough of the background. About me: this is really my third distributed storage project, and I've touched a few others in the meantime and in gaps between them. I've been working at SUSE for more than two years on Ceph. I'm a software maker and I
01:04
like to hear admins' complaints, and I like to fix them. And that's a little bit about me. And a little bit about the basics of what Ceph is. So, some people sit there and hate C++ and walk out. This is a C++ project. We make no excuses for it. It's
01:21
a Python project for some of the CLI stuff and some of the user interface stuff. I think it's a very elegant design. And compared to all the other storage systems that I've ever seen, it's the simplest. And I hope to convince you it's simple. Some of its really cool features: it's very friendly to sysadmins, it's got self-healing,
01:46
it's very, very scalable, it has very high IO, and it has no single point of failure. And that's my starting bit on what the nice bits of Ceph are. So, the first thing I'm going to do is talk about something really at the core of Ceph. And the best way to talk about
02:02
it is to compare it to other storage systems. In most storage systems you have the client (I've skipped the authentication bit and all of that). The client comes along and says, hey, where do I get my file? And you have some metadata server that says, hey, your file is over here. And the client says, okay, that's cool. My file is over
02:24
there. So, I'm going to speak to the data service, get the data and send it back to the client. That's pretty cool, really. Ceph's even cooler. We can calculate the data location on the client and on the server. And because we can calculate where the data
02:42
is, we don't need this single point of failure, or to replicate this area. And this is where I can jump to the next slide now and talk a little bit more about it: it's effectively a hash function. I assume everyone in the
03:04
room knows what a hash function is. Is there anyone who doesn't know what a hash function is? Great. I'll skip that then. So, let's talk about the nice topic of no single point of failure. We can actually have more than one node fail. We can define
03:24
failure domains. And this has an extra utility, because it means that we can migrate services to new nodes much more easily, since we have no single point of failure. So, time for the next slide. Here we have a little picture of what makes up a Ceph cluster.
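(An aside to make that calculate-rather-than-look-up idea concrete before the cluster picture: a toy Python sketch, not Ceph's real CRUSH code, with a made-up list of OSDs. The point is that any client or daemon that knows the same cluster map runs the same calculation and agrees on where an object lives, with no metadata server in the path.)

    import hashlib

    # Toy placement: everyone who knows the cluster map (here just a list of
    # OSD ids) can compute an object's location independently.
    OSDS = [0, 1, 2, 3, 4, 5]          # pretend cluster map

    def locate(object_name):
        """Deterministically map an object name to an OSD id."""
        digest = hashlib.sha256(object_name.encode()).hexdigest()
        return OSDS[int(digest, 16) % len(OSDS)]

    # A client and a storage daemon both running this get the same answer.
    for name in ("invoice-2016.pdf", "vm-disk-0042", "photo-001.jpg"):
        print(name, "-> osd." + str(locate(name)))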
03:46
Due to space and wanting things to be clear, I've put these things in a big box. This is the interface layer. Below this, in the abstraction layer, we have a series of services that provide state. And then we have a security layer, which allows you to bind these parts
04:05
together over a messaging system. Now I'm going to break that down a little bit further for you. The security layer is based on a shared-secret system. We have one key per service, and we have some capabilities on that security layer. All very boringly, reassuringly
04:24
normal. Then we have an interface layer. And this is really neat: it's stateless. That means you can rip one out, put another one in, and replace them. It means that you can put your reliability, say, with iSCSI, I believe, built into the
04:41
protocol. There's a failover back end. That's correct, I think. And that allows you to bring services up and down much more easily. So, then we get to the state layer. And this is where things start to get a little bit more complicated, but I'm still not delving into the complexities yet. It's made up of two services. The
05:04
mon service, which is basically a replicated database. It elects a leader using something very similar to Paxos. And so you typically want three, five, or seven of them so you can
05:21
come to a vote and decide which one is the leader. And you never get what we call in Ceph a split brain, where the cluster splits into two. That's one of those traditional problems that you get with distributed storage systems with no single point of failure.
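(A quick aside on the arithmetic behind three, five or seven monitors: a minimal sketch, not Ceph's election code. It only shows why requiring a strict majority of the monitor set means a network partition can never leave both halves thinking they are in charge.)

    def has_quorum(total_mons, reachable_mons):
        """A strict majority of the monitor set is needed to elect a leader."""
        return reachable_mons > total_mons // 2

    # With 5 monitors, a 3/2 split leaves exactly one side with quorum,
    # so two competing leaders (a split brain) cannot happen.
    for side in (3, 2):
        print("partition with", side, "of 5 mons -> quorum:", has_quorum(5, side))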
05:43
And then there's the OSD service, the object storage daemon, which is the data persistence layer. So, let me go into a little bit more detail on what the mon service does first. It stores the keys. It stores the CRUSH map, this little algorithm. And that algorithm also takes in the data of what OSDs are available,
06:04
and their status of in, out, up, and down. This is all part of the CRUSH map, but it's also stored by the mon service so that things can synchronize. It also votes for the master, as I said. And it keeps a revision of the latest status of the cluster,
06:21
so that we can always have a consensus on what the status of the cluster is. So, this is the OSDs, the object storage daemons. This is where all the data flows in and out of the Ceph cluster. An OSD connects to the mon service when it gets started, downloads the CRUSH map, and then it can calculate where the data should be.
06:43
This means that if data is missing, it can request the data. And if it's asked for data, it knows what data it has. And it can then scrub that data and say, hey, I've got an extra copy here, I should delete this. So, it's very important that it connects to the mon service and actually gets the members and gets the CRUSH map.
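(A client goes through the same dance, and the librados Python binding that ships with Ceph hides it: connect, learn the cluster map from the mons, then read and write objects straight to the OSDs. A hedged sketch follows; the config path and the pool name 'mypool' are assumptions for illustration, and the pool has to exist already.)

    import rados

    # Sketch only: paths and pool name are assumptions, adjust for your cluster.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()                    # talks to the mons, fetches the cluster map
    print("cluster fsid:", cluster.get_fsid())

    ioctx = cluster.open_ioctx('mypool') # an existing pool
    ioctx.write_full('hello-object', b'hello ceph')  # goes straight to the OSDs
    print(ioctx.read('hello-object'))
    ioctx.close()
    cluster.shutdown()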
07:03
It basically stores a little bit of metadata around its blobs, and its blobs are of variable size, but ideally they're quite small. So, that's the object storage daemon service. So, now I'm going to jump into another area of Ceph. Forgive me for jumping around,
07:20
but hopefully we can build up a picture of where we are. First, I started talking about the components that you see as a service. And now I'm going to start talking a little bit more about how these little blobs that we break our data down into get distributed around the cluster. So, that's this section. And now I did go around asking people if
07:43
they could play Kate Bush's Babooshka song, but I realized that the word for Russian dolls is different in different languages, and therefore it wouldn't really work with a predominantly German-speaking mixed audience. So, you missed the Kate Bush sound effects
08:00
at this point. But what we have, on a sort of data placement view of the world, getting increasingly more specific as we go down this slide, is the mon service, which is storing the map of the cluster. And then we have the object storage daemons, one per disk. And then we have this concept of pools, which, if you know logical
08:24
volume management, are a bit like either volume groups or logical volumes. They're places where you store your data. Groupings. And then we have placement groups. Think of these as the buckets in your hash algorithm. And then we have placement groups for placement
08:44
purposes, which abstract placement groups. They allow Ceph to do some sneaky tricks to avoid replicating data too frequently within the cluster. And so that's how your data goes from the output of a hash calculated by the client to being put in a location
09:04
that's referenceable in the cluster. So, I think I've just explained that all too fast. And I think these slides say a little bit more.
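(Here is a toy sketch of that layering, assuming a made-up map of three hosts with two OSDs each; real Ceph uses CRUSH, this only shows the shape: the object name hashes into one of the pool's placement groups, and the placement group maps deterministically onto OSDs spread across different hosts, the failure domain.)

    import hashlib

    CLUSTER = {"host-a": [0, 1], "host-b": [2, 3], "host-c": [4, 5]}  # made-up map
    PG_NUM = 64      # placement groups in the pool (the hash buckets)
    REPLICAS = 3     # pool property: keep three copies

    def pg_of(pool, object_name):
        h = hashlib.sha256((pool + "/" + object_name).encode()).hexdigest()
        return int(h, 16) % PG_NUM

    def osds_for_pg(pg):
        """Pick REPLICAS OSDs for a PG, at most one per host (the failure domain)."""
        ranked = sorted(CLUSTER, key=lambda host: hashlib.sha256(
            ("%d/%s" % (pg, host)).encode()).hexdigest())
        return [CLUSTER[host][pg % len(CLUSTER[host])] for host in ranked[:REPLICAS]]

    pg = pg_of("rbd", "vm-disk-0042")
    print("object vm-disk-0042 -> pg", pg, "-> osds", osds_for_pg(pg))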
09:21
So, we then have rules on top about how we place these buckets. We can add properties to this concept of a pool, this grouping of data, and say, hey, this grouping of data should be replicated three times. So, should I lose one disk, I have two other copies
09:42
on two other disks. And I want to make sure those two other copies are on a different machine. That way you have fault tolerance and can expect a whole machine to go down without losing any data or getting any downtime; you just keep your service running while Ceph sits there and recalculates, because it's now aware
10:03
that that OSD no longer exists, or that the machine no longer exists and its collection of OSDs no longer exists. The CRUSH map will then be changed by that, because it no longer contains those OSDs. And so all of the mons update their CRUSH map. The OSDs
10:25
download that CRUSH map update, and they know that they need to replicate that data. So, pools are where we put in that metadata about the replication rules. And there's also a concept called erasure coding that I'll come to later. So, placement groups,
10:44
as I said, they're sort of the buckets in a hash algorithm. And, oh, I think there's one interesting nasty bit here; this talk is about the nice bits, and I've mostly included
11:00
the nice bits of Ceph, but one thing you should be warned about is that with placement groups you can't shrink the number of buckets you've got in a hash without recalculating everything through the hash. So, we do have a usability problem with Ceph in that it's not really possible to shrink the number of buckets in the hash algorithm without
11:25
going through a lot of computation. So, that is one not-nice bit. It doesn't really fit in this talk, but I have to mention it to this audience, because they want to know the nasty bits as well. That one has to sneak in.
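(The cost is easy to see with a few lines of toy code, nothing Ceph-specific: with plain hash-modulo placement, changing the bucket count changes the bucket of a large fraction of the objects, so shrinking the PG count would mean rehashing and moving a lot of data.)

    import hashlib

    def pg_of(name, pg_num):
        return int(hashlib.sha256(name.encode()).hexdigest(), 16) % pg_num

    objects = ["object-%d" % i for i in range(10000)]
    moved = sum(1 for o in objects if pg_of(o, 256) != pg_of(o, 128))
    # Roughly half the objects land in a different bucket when halving the count.
    print("going from 256 to 128 buckets moves %.0f%% of objects"
          % (100.0 * moved / len(objects)))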
11:45
Placement groups for placement purposes are really just, as I said earlier, a way to avoid too much movement of data when something goes into or out of the Ceph cluster. So, the next nice topic, after we've gotten away from that nasty bit about not being able to shrink the number of buckets and only being able to split buckets easily, is something
12:04
really cool that can save you money if you have a Ceph cluster. And that's erasure coding. I'm pretty sure everyone has heard of RAID 10, RAID 1, RAID 0, RAID 5. But I'm going
12:20
to come to that just in a second and say: by default, Ceph doesn't do erasure coding. It does replication. We store all of our data three times, with some fault tolerance rules. This is really fast to recover, because we can just copy the data from one machine to another. It's annoying when it's six terabytes of data that you have to copy.
12:41
It does take time, but you can just copy it. And it's reasonably fast to write something to Ceph. Ceph is all about reliability, so to actually get an acknowledgment that you've written something to disk, you need to get an acknowledgment from three disks under this strategy. So, your write latency is hit by working
13:03
with Ceph. Now, erasure coding. Of the RAID classes, RAID 5 is probably the most common. I hope I've got the maths right later; I'm sure Lars will pick me up if I've got it wrong. Typically, on a five-disk RAID 5 system, you'll have a 3-2 ratio
13:27
in erasure coding. No? One parity disk. Oops. Okay. You're absolutely right. I should have put six, because no one should talk about five ever. Yeah. Sometimes we
13:44
have one disk failing during rebuild, which is really, really dangerous. If you have one disk failing and then another disk failing during rebuild, then your data is lost with RAID 5. That's why I say RAID 5 shouldn't exist, and that's why I made the error here.
14:02
Because you should never expect that only a single failure will happen when you're managing your data. Particularly, as Lars said earlier in the big summary talk, with distributed computing the only thing you can be sure of is that something's going to break. And Ceph allows you to have a variable number of parity disks. So, you can tune your cost,
14:26
how much capacity you wish to spend on parity, against your performance and your failure tolerance. So, here I gave an example of 3-2: five disks in total (three data, two parity) to store your data, and
14:40
two failures tolerated, giving a 40% overhead. Hoping I've not got the maths wrong again anywhere. And then I was speaking at a conference just recently. I don't speak at that many conferences, but I was speaking at a conference recently, and one site was going for extreme
15:02
efficiency while still having some fault tolerance. They were going for as cheap as they could possibly get, and they chose an 18-3 ratio. This is where they have 21 disks in total (18 data, 3 parity) and they tolerate three failures during normal operation, giving only about a 14.3% overhead for safety.
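(Those ratios are easy to check; the sketch below does the arithmetic for 3x replication, 3-2 and 18-3. "Overhead" here is the share of raw capacity spent on redundancy, m/(k+m), which is where the 40% and the roughly 14% figures come from.)

    def profile(name, data_chunks, coding_chunks):
        total = data_chunks + coding_chunks
        print("%-15s disks per write: %2d  failures tolerated: %d  "
              "overhead: %4.1f%%  usable: %4.1f%%" % (
                  name, total, coding_chunks,
                  100.0 * coding_chunks / total, 100.0 * data_chunks / total))

    profile("3x replication", 1, 2)   # one copy of the data plus two extra copies
    profile("EC 3-2", 3, 2)           # 2/5  = 40% overhead, 2 failures tolerated
    profile("EC 18-3", 18, 3)         # 3/21 = ~14.3% overhead, 3 failures tolerated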
15:26
But this is not without its costs. It's much more efficient in terms of data usage, but it increases your latency, because within the 18-3 setup you potentially need to write to or read from 21 disks. When you're writing, you need to write to 21 disks. So, that's a lot of disks that have got
15:41
to acknowledge your write. And recovery from failure gets much longer, because now I've got to sit there and read 21 disks when I'm recovering from any one of those failures in the system. And it's not just a matter of copying; there's some calculation there. So, recovery times increase considerably when you have erasure
16:05
coding enabled, and particularly when you enable it with such high numbers of disks in your setup. So, to overcome this, for both programmatic and performance reasons, Ceph provides the idea of cache tiers. Particularly for things like random IO, these are absolutely
16:28
essential. If you were just storing blobs into a Ceph cluster with erasure coding and retrieving blobs, then you might not need cache tiers, depending on your performance requirements. But if you're going to be doing random IO, you really need a cache layer
16:43
in front of that, which fortunately Ceph provides. So, I think I've given my talk a little bit too fast. I'm already on the take-home messages, but hopefully this will give us time for questions. And this was really only dipping your toe into what Ceph is. So,
17:03
I've got to say, Ceph's really nice. It's got a nice simple structure. It scales nicely, because we're doing the calculation on the client, and individually in every area, once we download that CRUSH map; we don't need to talk to a single bottlenecked service. We can have the client speak directly to the storage service. And then we've got
17:25
all these stateless services that are providing gateways to conventional protocols, which allow us to have these stateless levels. So, this makes deployment very easy. There are some
17:42
caveats. Because we've got so many machines, it actually gets better to use Ceph the more machines you put into it. And because of recovery, you can fill up your cluster more the more machines there are. So, I like to say it's better with more than seven servers. I think you can go down quite small; you can even just run it
18:04
on one machine. But my general feeling is it's better with more than seven servers. The no-single-point-of-failure design makes upgrades very, very nice. The CRUSH map is helping us scale. The erasure coding allows it to be very cheap. And that's really
18:21
the summary of my very basic introduction to Ceph. I'm sorry I've been so fast on my talk. I still have ten minutes for questions. So, give me hell.
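(As one concrete example of those gateway and interface services: the RBD block-device layer has a Python binding alongside the C and kernel clients. A hedged sketch, with the config path, pool name and image name as assumptions; the pool must already exist and be usable for RBD.)

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # path is an assumption
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                      # an RBD-capable pool

    rbd.RBD().create(ioctx, 'demo-image', 1024 * 1024 * 1024)  # 1 GiB virtual block device
    with rbd.Image(ioctx, 'demo-image') as image:
        image.write(b'hello block world', 0)   # write 17 bytes at offset 0
        print(image.read(0, 17))

    ioctx.close()
    cluster.shutdown()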
18:56
I'll ask my question while he's walking. So, you say that having the CRUSH map makes
19:04
upgrading easier. But isn't it a problem that if you move things to the client and then want to change the algorithm, like the CRUSH map algorithm, or make changes on that layer, you would also have to upgrade all of the clients in synchrony? So, that
19:22
would seem to make it a little bit less flexible on that layer. It's a little bit less flexible in that if you actually wanted to change the algorithm's calculating function, you do have to synchronize the clients with the server. That's correct. But mostly what you're doing is not changing that. And I'm sure that there's some versioning
19:44
in there. I don't know. I haven't checked thoroughly. But most of the time what you want to know is: where do I want to put my data now? And where I want to put my data now is decided with a synchronized algorithm
20:02
between the client and the server. And the critical element is to download the map of the cluster, so that then I can calculate where I should put the data. Does that answer your question? I'm saying I'm not absolutely sure, but you're absolutely right. I do have
20:23
to have the algorithm synchronized. If you say I'm right, I'm happy. A little bit related, but if you have a hash function, it must logically point you to one point. How do you find your replicas if you need to? The hash function points you to one point, but that's a placement group. And that
20:46
placement group is then described to be in multiple places within the system, I believe. But maybe Lars is going to correct me here. So the hash function actually points you to the primary OSD for a placement group,
21:03
and the OSD server then behind the scenes replicates. So the client only ever talks to the primary OSD which the hash function points it at. Okay, and if that fails? And if your primary OSD fails? Well, if your primary OSD fails, then you get a new map and it computes a new primary
21:24
OSD. Sorry, I stand corrected by Lars there. Anything else that I can possibly try and answer and maybe get wrong or right? Please join the queue to ask.
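(A toy continuation of the earlier placement sketches, not Ceph's CRUSH: when an OSD drops out of the map, every client and daemon recomputes the mapping from the new map and agrees on the new primary, and only the placement groups that involved the failed OSD get a different answer.)

    import hashlib

    def osds_for_pg(pg, osd_ids, replicas=3):
        """Rank the OSDs in the map for this PG; the first one is the primary."""
        ranked = sorted(osd_ids, key=lambda osd: hashlib.sha256(
            ("%d/%d" % (pg, osd)).encode()).hexdigest())
        return ranked[:replicas]

    before = [0, 1, 2, 3, 4, 5]   # old map
    after = [0, 1, 2, 4, 5]       # osd.3 has failed and been marked out

    for pg in range(8):
        old, new = osds_for_pg(pg, before), osds_for_pg(pg, after)
        if old != new:
            print("pg %d: %s -> %s (new primary: osd.%d)" % (pg, old, new, new[0]))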
21:42
So, you said that Ceph is predominantly for large enterprises, et cetera, which is fair enough. A lot of projects started off that way targeting enterprise spaces. Do you see Ceph becoming more prevalent in home use developer style scenarios?
22:06
Yes, no. I think the answer is not until hardware makes some fundamental changes. For me, it doesn't really make sense to have Ceph on a home system at the moment
22:22
because hardware tends to have multiple storage nodes connected to a big computer. I could see with the recent developments in disks, people putting the CPU next to the disk, and maybe when it becomes commodity to have systems like this, having seven network
22:45
attached disks would allow Ceph to be very practical in the home use, but it's quite a big scale for a simple MP3 collection. So if you've got those disks with integrated OSD functionality on them, how would you then
23:06
add the mons layer, et cetera, or is that all baked into the single drive? This is me speculating about technology that I've only seen at the end of presentations
23:21
and the beginning of presentations, but there's actual ARM computers being manufactured now that plug into the back of a SATA disk. They're about this sort of size, about this sort of thickness, and they're intended to be very, very cheap and work in servers at the moment, but I could see this technology coming to the consumer in the future.
23:43
At the moment, the technology is not a good match for consumer use, I'm afraid. With the new versions of Ceph, compared to previous versions where it was more object-based
24:05
storage, is the OLTP, or online transaction speed, going to increase? Very nice question to ask, because I just went to a talk which Sage, the project lead,
24:22
gave, and he informed us about the new OSD backend called BlueStore. We haven't baked it into our roadmap for release in SUSE's storage yet, but the initial performance benchmarks he showed us showed, for writes, a doubling in performance by no longer using XFS and
24:49
going directly to the disk layer. They're using RocksDB and their own format and no longer having an external journal and working around the file system as they described it,
25:05
but going directly to the raw block device. This doesn't produce advances in streaming reads, because file systems are very optimized for streaming reads, but we should see for
25:20
random reads and write loads, according to the graphs, I didn't do the benchmarks myself, a doubling in transaction performance for the same hardware. That's for each individual piece of hardware. Do remember that you as a user then have to get that acknowledged
25:41
by three nodes, so network latency issues may still not lead you to have a doubling in performance as an end user, but we can hope. So what does Ceph do for ensuring data integrity, like if a disk gets corrupted in one of the
26:05
storage nodes? I'm not an expert in the details of scrubbing, but there are periodic runs through all the placement groups to check their integrity. Maybe Lars has more details on that? No.
26:20
In the other talk about BlueStore, one of the big things they were putting into the system was hashes for all data stored, so that there will be less effort in checking for failures than in the current system. So they are improving this by improving the scrubbing
26:48
system and making it more dependent on hashes. I believe there is some hashing in there already, but it's not perfect yet; you can switch it on, but it's not great, but
27:00
with BlueStore you get better hashing. And of course you have multiple copies, so that you can recover from the other copies, or with erasure coding recover the chunks that make up the copy. Any more questions? Well, that's kind of good, because I think I've actually gone five minutes over now.
27:27
No? I'm on time. Well, I'm sorry it's not as bouncy a talk as last time, but thank you very much for attending. Sorry for my errors about RAID 5. I think I will have to call it a day then and say thank you very much for coming.