
Ceph Data Services in a Multi- and Hybrid Cloud World


Formal Metadata

Title
Ceph Data Services in a Multi- and Hybrid Cloud World
Alternative Title
Data services in a hybrid cloud world with Ceph: Making data as portable as your stateless microservices
Number of Parts
561
Author
Sage Weil (Red Hat)
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
IT organizations of the future (and present) are faced with managing infrastructure that spans multiple private data centers and multiple public clouds. Emerging tools and operational patterns like Kubernetes and microservices are easing the process of deploying applications across multiple environments, but the Achilles heel of such efforts remains that most applications require large quantities of state, either in databases, object stores, or file systems. Unlike stateless microservices, state is hard to move. Ceph is known for providing scale-out file, block, and object storage within a single data center, but it also includes a robust set of multi-cluster federation capabilities. This talk will cover how Ceph's underlying multi-site capabilities complement and enable true portability across cloud footprints, public and private, and how viewing Ceph from a multi-cloud perspective has fundamentally shifted our data services roadmap, especially for Ceph object storage.
Transcript: English(auto-generated)
Hello, hello. All right. Thank you. Good morning, everyone. Thank you for coming so early on
Friday. My name is Sage Weil. I work at Red Hat on the Ceph project. I've been working on Ceph for almost 15 years now, and today I'm going to talk about some of the multi-cluster, multi-cloud, hybrid cloud capabilities in the Ceph storage system. So just briefly, what I'm going to go
through, I'm going to give the briefest of introductions about Ceph. Just for a little bit of background, I'm going to talk about what I mean by data services and what the business problems are that I'm looking at and trying to solve, and then we'll just go through block storage, file storage, and then object storage. We'll talk a little bit about edge scenarios and then try to wrap it up, talk a little bit about the future, and
build a coherent picture. So Ceph is a unified storage platform that provides an object storage API, a block storage service, and a distributed file system. It's all built upon Rados, which is
the reliable, autonomic distributed object store, the part that replicates your data, places it across lots of different nodes, handles failures and recovery, and provides availability, and all that good stuff. And then all these three services are built on top of that underlying storage infrastructure. Ceph is an open source upstream project, obviously. We release
every nine months. Every release is named after a species of cephalopod. Nautilus is the next release. It's coming out later this month at the end of February, and the one after that is going to be Octopus about nine months later. So the features and things I'm going to be talking about today are mostly a mix of what exists in Nautilus
and previously, and I'm also going to be talking about things that we're planning and we're going to be building after that. So for the Ceph project, we like to talk about having four sort of key priorities. The first of those is usability and management. Ceph has grown up over many years, built by system administrators, for system administrators, and as a result has
developed a reputation for being hard to use and complicated and difficult and so on. And so one of our key focuses over the past two years has really been to simplify the system so that people can understand it without being a Ceph expert. So a lot of effort goes into that. If you want to hear more about some of the things we're doing,
you should attend the software dev room. There are a whole slew of Ceph talks this afternoon. Performance is also key. When we built Ceph, everything was hard drives, and so the software models you choose and so on are a little bit different. Today, new hardware is moving to high performance flash, and so there's a lot of work going into optimizing Ceph so that it will
go much, much faster. And then the last two, we're making sure that Ceph will work and live well in a container environment, in a container world, so making sure that Ceph works well with Kubernetes is a big priority. And then the last one, which I'm mainly going to be focusing on today, is making Ceph work in a multi-cloud and hybrid cloud world, and I'll talk
a bit about why in a second. So first I want to give a little bit of background about what I mean by data services and why we are here today. So the first point is that the future is cloudy. So IT organizations today tend to have multiple private data centers and infrastructure spread across lots of data centers. They also tend to consume multiple
public clouds and have different applications and different data sets in all these different places. And it's only getting worse. Even the infrastructure that's on-premise is starting to use private cloud services, and so it's really just a cluster of different clouds, some of them private, some of them public. And application developers are increasingly
building their applications by consuming sort of self-service resources, whether it's compute or storage. And if it's not already true today, the next generation of applications that people are building are going to be built specifically on top of these services. So it's
going to be a cloud native world. And for all the talk about stateless microservices and container orchestration and so on, they're great, but in reality, in the real world, applications have lots of state, lots of data that they depend on and store, and state is hard to move. And it's hard to move it in a safe and coherent way so that
you don't disrupt your application. As anybody who's had to do some sort of data migration by hand has experienced, this is a tricky problem. So when we talk about data services, we're talking about sort of three key areas. The first is data placement and portability. So where I have a data set, where should I put it?
Can I, once I store it in one particular place, can I move it to somewhere else? Maybe it starts out on-premise and I want to move it to the cloud later or the other way around. And can I do that in a seamless way without interrupting the applications that are consuming that data? So the second piece of this is introspection. I'm a large organization. I have infrastructure spread across multiple private and public clouds. I'm storing 20
petabytes of data. What is it? Who stored it? What type of data is it? Can I delete it? How much is it costing me? Those sorts of things. And then sort of the third area is around policy-driven management of that data. So can I automate the process of deleting old data
after sort of for two years for compliance reasons, that sort of thing? Can I ensure that this particular data set stays in the EU so that it complies with certain laws and legal restrictions? Can I optimize where this particular data set is placed based on the cost and the performance that I need over time? And can I build automation and policy
around all of those processes so that I don't have to do that manually? So data services tends to sort of encompass these three key areas. For the most part, I'm going to be focusing on the first area around where do I put my data and can I move it around for the purposes
of this talk. And it's also important to realize that data services are about more than just the data and just the storage system, because the reality is that there's an application consuming that data, modifying it or presenting it to the world or doing something useful with it. And so if you ever move, if you're going to move the data to another data center or to another cluster or
on or off-premise, you also have to move the application that's consuming it and you have to make sure that that migration is done in coordination with whatever the process is around moving that application to make sure it all works. And so for this reason, we believe that container platforms are key because if you are going to do that sort of migration,
you have to have this relationship between the application migration and the storage migration and coordinate that whole process. And so you're going to hear about Kubernetes off and on throughout this talk and that's why because this is sort of the first opportunity where we have a coherent stack that's managing both your applications and your storage and some
possibility of automating this process. So we're going to break this down into sort of five data use scenarios or sort of underlying business problems that organizations are trying to solve with their data. So the first is simply multi-tier. You have different kinds of data,
they have different performance and storage capacity requirements, and you want to have different tiers of storage where you want to put your data and it might vary over time. So it might be that when you first write it, it needs to be high performance. As it ages, you can move it to a slower tier for archive for some period of time and then delete it. So having different tiers of storage is key. The second is mobility, the ability to move an application
and its data between sites, presumably with minimal or no downtime or availability interruption. And this might be that you're taking an entire site and moving it or it might be that you're moving just sort of a piece of one application perhaps in a site and moving it to another. The third is disaster recovery. So if you have your infrastructure spread across multiple
clouds or multiple data centers, you want to be able to tolerate an entire site going down and then re-instantiate the data in another site for business continuity so you don't go out of business. And often in this case, you want sort of a point-in-time consistent view of that
data when you restart your whole infrastructure, so that it's as if you crashed and you can pick up where you left off. Maybe you can tolerate the loss of the last few transactions or a few seconds, or maybe you need a coherent copy. The fourth case is what I'll call stretch, which is a similar situation where you want
to tolerate a site outage but you want to do it without any loss of availability. So in the disaster recovery case, you lose a data center, you panic and you restart everything in a second data center and maybe you lost a little bit but you didn't go out of business. In the stretch case, you don't even notice. Your application doesn't go down.
Then the final scenario is edge. So you might have a bunch of central data centers and then you have a whole bunch of satellite data centers or mobile sites or something like that. They're all generating data or consuming data and you want all this to be managed and coherent. So these are the five business cases that we'll be referencing throughout the talk.
Ideally, we want to solve all these problems in order to have the nirvana of storage. And finally, sort of the last preliminary point, a lot of this is going to come down to whether you're doing synchronous replication or whether you're doing asynchronous replication. So in a synchronous replication model, your application does a write.
It's written to all of the replicas of that data and only after they're all sort of persisted does the application say, okay, I'm done with my write. In an asynchronous model, you issue a write. It might write to some of the copies and the application says it's done and then in the background later over time, they'll make copies to additional replicas.
So in synchronous replication, you tend to have a higher write latency because you're waiting for all the replicas to write and complete. But you have a consistent model because all the replicas are always in sync, presumably, and your application can go on.
And usually when we talk about synchronous replication, we're talking about a single Ceph cluster because internally to the Ceph cluster, we're doing synchronous replication. And in the async case, your application might write and then another cluster might have an asynchronous copy. If you have a failure of the first cluster and you have to pick up where you left off, you might have stale data.
So whether you're doing synchronous replication or asynchronous replication depends on what the needs of your application are and latency and so on. All right, so we're going to talk about block storage and then file and then object. So when we talk about block storage, we're talking about a virtual disk device.
So a virtual disk is generally used exclusively by an application because usually you're layering a local file system on top. It assumes it's the only one consuming that disk and writing to it. So block devices have to provide strong consistency. They're usually performance sensitive for reads and writes. You usually have snapshot features around them.
And these days, they're usually self-provisioning. So you're usually asking Cinder for a block device or you're asking Kubernetes for a persistent volume. Or maybe you're emailing your IT admin and asking for a LUN, but hopefully not the latter. So relatively simple model. What can we do? Well, today and actually from the beginning,
RBD has always supported multiple tiers. So within a single Ceph cluster, you can create different rados pools of storage that are mapped to different storage devices or maybe a specific set of devices in a particular rack or data center. And you can store an RBD image in one of those pools. So you might have a pool that's high performance backed by SSDs.
You might have another pool that's lower performance with hard drives. Maybe it's even erasure-coded. So you get these multiple tiers of storage depending on the quality of service that you want. So that's great. We sort of cover the multi-tier case. In Nautilus, we also have a new feature called RBD live migration that lets you move an in-use image. So for example, if you have a VM that's running,
consuming a block device, you can do a live migration of that RBD image from one tier of storage to another without interrupting or restarting the VM. So you've always been able to live migrate the VM to another machine consuming the same storage, but you haven't been able to move the storage to a new performance tier while the VM is still running. So this adds that capability.
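To make that concrete, here's a hedged sketch of what that workflow can look like when driven through the rbd command line from Python. The pool names ('rbd-ssd', 'rbd-hdd') and the image name are made up for illustration, and the migration subcommands shown are the Nautilus-era workflow.

    import subprocess

    def rbd(*args):
        # Thin wrapper around the rbd CLI; raises if a command fails.
        subprocess.run(['rbd', *args], check=True)

    # Create the image in the fast, SSD-backed pool to start with.
    rbd('create', '--size', '10G', 'rbd-ssd/vm-disk-1')

    # Later, migrate the image to the slower, HDD-backed pool.
    rbd('migration', 'prepare', 'rbd-ssd/vm-disk-1', 'rbd-hdd/vm-disk-1')
    rbd('migration', 'execute', 'rbd-hdd/vm-disk-1')   # copies blocks in the background
    rbd('migration', 'commit', 'rbd-hdd/vm-disk-1')    # finalize once the copy completes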
And that's new in the Nautilus release that's out this month. So that sort of gives you multi-tier and it gives you some mobility within the same data center or within the same Ceph cluster at least. So what happens if you want to stretch across sites? So you can also stretch a Ceph cluster across multiple data centers, you know, two or three data centers.
And in that case, you simply deploy your Ceph daemons across two sites. You have some big fat link between them. You set up your crush rules so that your replicas are placed appropriately. So you have some replicas in one data center and some replicas in another. Make sure you have enough monitors.
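As a rough sketch of the kind of CRUSH setup being described, assuming your CRUSH map already has datacenter buckets under the default root and that a pool called 'rbd-stretch' exists (both names are illustrative):

    import subprocess

    def ceph(*args):
        subprocess.run(['ceph', *args], check=True)

    # A replicated rule that places each replica in a distinct datacenter bucket.
    ceph('osd', 'crush', 'rule', 'create-replicated',
         'stretch_rule', 'default', 'datacenter')

    # Point the pool at the stretch rule and keep three copies.
    ceph('osd', 'pool', 'set', 'rbd-stretch', 'crush_rule', 'stretch_rule')
    ceph('osd', 'pool', 'set', 'rbd-stretch', 'size', '3')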
And lots of people do this. It works. Your application can now move between data centers because the storage is available in both. You have sort of the single Ceph storage available in both sites. The data doesn't move because it's sort of already in both places. But it's important to remember that the performance is going to be different
because you have a wide area network link between the two sites. Ceph is doing synchronous replication, which means every write has to cross that link. So this may or may not fit the use case. But it does solve sort of the disaster recovery use case where if you lose one data center, assuming you've set up your crush rules properly, your cluster will continue to operate and be accessible
on the other site. And you can also sort of combine these things. So you might have a single Ceph cluster across multiple sites. You have one rados pool that's stretched across them with a particular crush rule. But then you also have rados pools that are confined to each data center. So you have all tiers of service. It might be that some VMs only need local storage. They don't need that multi-site redundancy.
Others do. And so depending on what tier, what type of service you need, you can give that to your applications. I mean, if you combine all of these things, you can even leverage this capability for migration. So you might have an application that starts out using the local rados pool in one data center. You live migrate that to the stretch pool.
Then maybe you even live migrate the virtual machine to another data center or restart it, whatever it is. And then you can live migrate it to the rados pool in the other data center. So if you combine all of these things using a stretch cluster, you can sort of get the full breadth of mobility and disaster recovery and edge sites and so on. So it sounds pretty good.
But the thing to keep in mind is, doing stretch clusters is a little bit sketchy. It isn't necessarily what you want to do. So the first thing to keep in mind is that the network latency between those sites is critical. So you need low latency for performance because, within a rados pool, we're doing synchronous replication. You also need to keep in mind that you need to have high bandwidth
because, if you have a failure situation, then you're going to have a lot of data flowing across that link in order to do rebuilds and recovery. And that might be more than you expect. And you need to be able to sustain that while you're still doing your normal application reads and writes. It's also important to keep in mind that it's relatively inflexible sort of building these big stretch Ceph clusters. So it might work for two or three data centers.
It's not clear that you want to build a stretch cluster across 20 data centers. Nor is it even likely that you're going to have 20 data centers with low latency links that are close enough together that you would even want to do this in the first place. And you can't take two existing Ceph clusters that you already had and decide later that you want to join them together
so you can have this capability. You can't take two Ceph clusters and stitch them together later. You have to sort of build it and pre-plan ahead of time in order to do this sort of thing. And finally, deploying a stretch cluster means you have a high degree of coupling. You have all of the storage infrastructure across multiple data centers
and multiple failure domains that has a single point of failure of a single instance of Ceph, a single piece of software that's responsible. So if something goes wrong with Ceph, if there's a bug or something, it's going to affect all your sites and not just one of your sites. So you definitely want to use caution if you're taking a stretch cluster approach. So what can we do instead?
So RBD also has an async mirroring capability. So instead, you would have an existing Ceph cluster with a normal rados pool, a normal RBD image doing synchronous replication within that site. And then on top of that, we layer an asynchronous mirroring capability that lets you mirror that image
to a second Ceph cluster or another pool in another site. And the way it does this is on the primary image, we maintain a write journal. So we have a log of the sequence of writes that are happening to that device. And then there are mirror daemons that just sort of send all those writes across to the other cluster. And so you end up with a disaster recovery
backup copy of the image in another site. There's a little bit of performance overhead on the primary because we're maintaining this write journal. In reality, people mitigate this by putting that write journal in a separate SSD high-performance pool. So if you weren't already doing that, it might actually be a performance win. It sort of depends, but you'll have more infrastructure to support it.
Honestly, I actually can't remember if we implemented this or not, but at least in principle, you could configure in a delay for that replication. So if you wanted, like, a five-minute old copy in case you accidentally fat finger and destroy your primary copy, you could configure that as well. But this RBD mirroring feature has been supported since Luminous. And since then, we've been improving the tooling
and automation around it. And then if you have a failure on your primary site, you lose your cluster A, the whole site goes down, goes off the network, or stuff explodes, or something like that. Then you have your backup image in the secondary site. That's point-in-time consistent because it's ordering the mirroring of writes across. The image in the second site
appears as though there's a crash or something and you're just rebooting. So it's a point-in-time consistent copy. It's asynchronous, so you might lose the last few writes. But depending on what your latency and bandwidth limitations are on that link, presumably, hopefully, it's not that much. Maybe a few seconds. And then you can restart your application in the secondary site. And the whole lifecycle
is considered here. So if the primary site comes back up, then there's all the code to sort of resynchronize. It might roll back the writes that got lost and then resynchronize with the new writes that happened in site B. You can flip the master back between the two sites. All the tooling and so on around that exists. So it's sort of a complete, robust solution.
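A hedged sketch of how that gets wired up with the rbd CLI, assuming rbd-mirror daemons are running on both clusters and the peer cluster's config and keyring are installed locally; the pool, image, and peer names are illustrative:

    import subprocess

    def rbd(*args):
        subprocess.run(['rbd', *args], check=True)

    # Mirror selected images in this pool rather than the whole pool.
    rbd('mirror', 'pool', 'enable', 'rbd-ssd', 'image')

    # Register the remote cluster as a replication peer of the local pool.
    rbd('mirror', 'pool', 'peer', 'add', 'rbd-ssd', 'client.mirror@site-b')

    # Journaling is what feeds the ordered write log to the rbd-mirror daemons.
    rbd('feature', 'enable', 'rbd-ssd/vm-disk-1', 'journaling')
    rbd('mirror', 'image', 'enable', 'rbd-ssd/vm-disk-1')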
So the first primary consumer of this is OpenStack Cinder. Over the last several releases, there's been an addition of RBD replication drivers, investment in a lot of the tooling that deploys it and configures it over Ocata, Queens, and Rocky. But it's sort of an
incomplete, not entirely satisfying solution. So there's a lot of tooling around setting this up and configuring it. And sort of the key thing is that even if you have this all set up in OpenStack and you lose an entire OpenStack site, your storage will come back on the secondary site, but it doesn't set up all your Nova attachments.
It doesn't actually restart your VMs. And I think this really highlights the limitations of how much you can do in the storage or infrastructure layer without sort of the cooperation of the layers above it. It's hard for an infrastructure layer that's just providing like VMs and storage to know what to do with your storage, even if you implement
all these fancy capabilities of doing migration or disaster recovery. And what you really need is sort of a complete picture of what your infrastructure looks like, how you define your applications and how you set up your entire application stack, which is really what you get with Kubernetes and why we're excited about it. So more on that later.
So that's what we can do with block. So we can do lots of stuff within a single cluster and we have this RBD mirroring capability across multiple clusters. Ceph also has CephFS, a complete distributed file system. So CephFS is awesome. It's been stable since Kraken. We've had multiple metadata servers
for several releases now. We've had snapshots supported since Mimic, which was the previous release. CephFS supports multiple rados pools. So within rados, again, you can have different pools backed by SSDs or hard disks or in different racks and rows, data centers, all that stuff. And in CephFS, you can map a particular subdirectory to any rados pool so that new files created in that directory
get stored in that location. CephFS is fast and scalable. It's got quotas, volumes, sub-volumes, all this good stuff. And you can provision it with Manila or with Kubernetes persistent volume drivers. So CephFS is great. It's a little bit different than RBD
in that an RBD client talks directly to all the OSDs that are storing the data. So you have sort of a direct data path between the RBD client, either the kernel that's mapping the device or the virtual machine that's consuming the virtual disk. In CephFS, you also have a direct data path for file data. But when you're interacting with the file system namespace, directories, doing readdir, create, rename,
open, all that stuff, you're talking to a metadata server. So you have a second set of metadata servers that manage the file system namespace, coordinate access to all the files, make sure that the clients are cooperating and doing the right thing. So a slightly different model. So the first question, can we take CephFS and stretch it across multiple data centers
the same way we do RBD? And the answer is yes. You can do it. But it has all the same limitations, right? So if you stretch it across data centers, you have to be careful about the latency, the failure domains, all that stuff. So the same issues apply. So use caution. And in addition to that, in RBD, while you had a direct data path to the OSDs,
in CephFS, you're also talking to the metadata servers. So you have this additional concern: if the metadata server that's active is in the other data center, then all your metadata access is going to go across the link. And for file workloads in particular, metadata performance tends to be very important,
at least for general purpose file workloads; maybe for your application it's not such a concern. But you need to pay attention to where your metadata servers are placed. So again, stretch, you can do it. Maybe it's not the panacea that you hoped it was. So what can we do beyond this? That's what CephFS will do today.
What are the things that we could do with CephFS to improve the multi-cluster capabilities? We're gonna talk about a couple different options. So the first thing that we could do, and that we're talking about doing, is something called snap mirroring. And the basic idea here is that we already have a robust snapshot capability in CephFS. You can take any directory in the file system and create a snapshot on it,
any subdirectory at any time. You don't have to pre-plan your volumes or anything. And those snapshots provide a very fine-grained point-in-time consistency. And additionally, CephFS has something called rstats, which are recursive metadata on any subdirectory in the system. And one of those attributes is called rctime,
which is essentially the most recent modification of anything beneath this particular point in the file system hierarchy. And if you combine these things with something like rsync, something that's just copying files across, then you can build a system where you create a snapshot periodically, say every 10 minutes of part of your data set,
and then use that recursive ctime to identify efficiently what the changes have been in that subdirectory, and then quickly just copy them over to a second cluster, and then create the same snapshot when you're done on the second cluster. And so you could set up a schedule where every 10 minutes you take a snapshot, mirror all the data across, and then create the same snapshot, and you have this sort of disaster recovery type situation.
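As a toy sketch of that idea in Python: snapshot a CephFS subtree, use the recursive ctime to skip subtrees that haven't changed since the last pass, rsync the rest out of the snapshot, and repeat on a schedule. The mount paths, snapshot naming, and rsync destination are all illustrative; the ceph.dir.rctime xattr is the real mechanism being described, and real tooling would also want the rstat flush mentioned below.

    import os
    import subprocess
    import time

    SRC = '/mnt/cephfs/projects'                 # CephFS mount on the primary
    DST = 'mirror-host:/mnt/cephfs/projects'     # CephFS mount on the second cluster

    def rctime(path):
        # CephFS exposes recursive stats as virtual xattrs on every directory.
        return float(os.getxattr(path, 'ceph.dir.rctime').decode().rstrip('\x00'))

    def mirror_pass(last_pass):
        snap = 'mirror-%d' % int(time.time())
        os.mkdir(os.path.join(SRC, '.snap', snap))   # mkdir under .snap takes the snapshot
        for entry in os.scandir(SRC):
            if entry.is_dir() and rctime(entry.path) <= last_pass:
                continue                              # nothing under here changed; skip it
            subprocess.run(['rsync', '-a', '--delete',
                            os.path.join(SRC, '.snap', snap, entry.name),
                            DST + '/'], check=True)
        # Real tooling would now create the matching snapshot on the destination.
        return time.time()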
This matches pretty closely features that exist in many existing enterprise products with similar names. And it's not that hard to do. So sort of the main missing piece is that we need a flush operation so that we know
that the rstat value can be trusted before we start synchronizing stuff, because rstats are normally lazily updated. There's actually a pull request in flight to do that. And we need to modify rsync so that it can use that rctime to efficiently identify changed files before it synchronizes them. And then you want to build some scripting and tooling around automating the process. But this is something that you could do and sort of it would match existing enterprise features
that people are used to. But it raises the question of, do we really need that strong point in time consistency in order to create a disaster recovery copy? Well, the easy answer is yes, because you have no idea what's running on top of your file system. It might be a database. It might be something else. And applications consuming the POSIX interface might require that consistency.
The reality is that actually that's not always the case. And in fact, many other distributed file systems have geo-replication features that don't provide strong consistency, and people seem to be totally happy with them. So maybe they provide a consistent copy of a particular file,
but they don't provide coherency in ordering between files that are updated. And it turns out that a lot of applications that are using file storage and would benefit from having a geo-replication disaster recovery type capability don't actually need that strong consistency. Maybe they're just storing images and they don't care if they store two images in a particular order and they only got the second one on the resulting site.
And often, maybe you're just casually using the file system and you don't care. So if we decide that we don't need that strong consistency, there are other things that we can do. A simple idea would be to have an update log that is generated by each of the metadata servers
of all the files that are changed and feed that to a bunch of workers that can just copy those files across to a remote system. And this has some benefits over the snap mirroring model because you get sort of more timely updates as soon as you modify a file, it will get copied over to the other side. It should be able to scale pretty well.
The limitation, of course, is that it doesn't give you point-in-time consistency across multiple files in the file system. So it only is suitable for certain applications. And there are some implementation challenges around ordering, making sure that you actually do the mirroring properly, but nothing that's terribly difficult. So we have options.
So at this point, I'm going to take a quick aside and talk a little bit about what it means to sort of satisfy the migration use case. So consider you have an application in storage in one data center and you want to move it to another. How would you actually go about doing this? And sort of independent of file or block storage or object storage, what's the basic process? So the simplest model of migrating data
is just that you have an application and data in site A, you stop it, you copy all the data, and then you start it up again in site B. That works fine. That's what we've all done in the past when we've had to move things around between sites or servers. Your application doesn't have to be modified. It's the exclusive user of the storage. But there's a long service disruption
because you actually have to stop it while you're doing the migration. So certainly not very satisfying. You can improve on the situation a bit if you sort of pre-stage the data. So while your application is still running in site A, you might make a full copy of the data, but it's sort of getting a little bit stale. So it's like 99.9% up to date. Then you stop the application and do a final pass on the data to make sure you get those last few changes
and then start it up in site B. So this is an improvement. You shorten the availability gap while you're shutting down the application. And in fact, with the RBD async mirroring, you might be able to do this in a pretty efficient way because you could set up RBD,
asynchronously mirroring to site B, and let it build up its copy. It's got basically everything, but it's a few seconds old. Pause the application, switch the RBD master to the other site, and then start up your application. So it's a bit better. But you still have this availability gap. What would be even better is if you could start up
the application in the second site while it's still running in the first site, and then you can sort of seamlessly switch over traffic from one instance of the application to the other. That would be great, but it needs this ability to sort of temporarily be active-active in both sites. So if you can do that, then it's awesome, right? You have no loss of availability.
You have concurrent access to the same data. If there's any performance degradation, it's only during that sort of interim period where you're replicating across both sites. I mean, in fact, that's really just a special case of being able to do this active-active replication in general, right? That's sort of the key, hard piece of this process.
And if you can do that, you can have highly available access to your data in both sites. And it's awesome. But it brings up sort of key questions, right? Can your application tolerate concurrent access to the same data? How are you replicating the data across those two copies? Are you doing it synchronously or asynchronously?
And what's the consistency model, right? If you have those two instances, they're both trying to modify the data, what happens if they collide or can they collide? So these are sort of key questions you have to understand about your application behavior in order to do this sort of thing. So coming back to CephFS,
again, the panacea would be, and the thing that everybody asks for and wants when they hear about a new distributed file system is, can I just have my file system replicated in multiple data centers and have it active and available in all data centers and everybody can access it whenever they want and they get good performance and everything is consistent and everything just works. That's what people ask for.
And you always sort of shake your head and say, maybe. And the reality is that we don't have a general purpose bi-directional or multi-directional file replication protocol. And the reason is it's really hard to do that with POSIX, right? The file system interface in Unix and Linux is extremely complicated and there are a million opportunities for conflicts.
There's conflicts on file data, right? If you're modifying the file in multiple sites, one of them maybe truncates the file, the other one overwrites part of the file, and then you try to like asynchronously mirror those. What do you do and how do you reconcile those changes? And there are possibilities for conflict on things like rename, right? If you have two directories and two different sites
rename the directory into the other one, it's an impossible conflict to resolve. You have to pick one or the other, and then the other end is confused or whatever. So in a general sense, doing asynchronous replication with POSIX leads to these conflicts and there's no easy way to resolve them. But the reality is that you would never actually do this
unless the applications are specifically designed to cooperate in the storage, right? You don't necessarily have arbitrary users doing any operation in situations where you're trying to have them consuming the shared storage. You wouldn't design your application that way because it would immediately break.
So if you have an application that's using storage that's designed to cooperate in the file layer, such that they carefully avoid those types of conflicts. You know, a good example of this is the Maildir storage scheme that mail servers use, right? They have a particular scheme where they write new messages and rename them between directories
that is able to tolerate conflicts when they're using an NFS server. Then that's great. So, you know, the other classic example would be that you have content stores, you're just writing new files, you're reading existing files, you're not renaming them, you're not doing anything like that.
And as long as you're doing sort of a simplified set, a subset of the operations that you can do in POSIX, then you're fine, right? You can use a shared file system, your applications aren't gonna conflict, you can scale out your application heads. But I think the thing to realize is as soon as you start to simplify the set of POSIX operations that you use, it starts to sound less like file storage
and more like object storage. And so it begs the question of maybe that application shouldn't really be using the file layer to get this sort of magic multi-directional replication. Maybe you should just be rewriting your application to use object directly. So let's talk about object storage and a little bit about why that might make sense. So why is object storage so great?
It's based on HTTP, which means it operates well with web caches and CDNs and so on, lots of existing libraries. Object APIs give you atomic object replacement, which means that if you put a 10 gigabyte image file to replace an old one, only when you're done and you've sort of completely written the file,
or the new object, does it atomically replace the previous version, so you don't have sort of weird edge conditions there. And the fact that you can't sort of overwrite the middle of an existing object means that the implementations are much simpler. The consistency model is much simpler. You either have the old one or the new one, and they're completely different and so on.
The namespace is flat, so it's very easy to scale out when you're actually designing the backend architecture. And there's no rename. And so it's just, it's a simplified storage model that lends itself very easily to simplified consistency models and replication models and implementations. And it's sufficiently powerful that you can do a lot.
And I would argue also that there's gonna be a lot of object storage in our future. So I'm not gonna say that file's gonna go away. In fact, I'm saying it definitely won't go away. File is critical. We have a huge existing set of applications that consume file. And the file interface is really useful. There's a lot of stuff that the POSIX API gives you
that's genuinely useful to applications. It's just not for everything. Block isn't going away either. It's sort of the backbone of all the virtual machine and container infrastructures. It's how things boot up. But most of the data that we store in the future, you know, the zettabytes and zettabytes of data that the world is producing, isn't gonna go in a block device or a file system.
It's gonna land in object storage. You know, all those cat pictures, surveillance videos, medical imaging, video telemetry, all that stuff, it's gonna be stored in objects. And the next generation of applications that people are designing are gonna be storing all of their bulk content in object storage as well. So it behooves us to pay attention to how we're gonna solve these more complicated problems around multiple sites and consistency
in the object storage world. So the Rados Gateway, which is Ceph's S3 API, has a rich set of federation and replication multi-cluster capabilities today. So the way that RGW views the world of federation is through zones and zone groups and namespaces.
So a zone is a set of rados pools that are storing RGW data, or S3 data. So it's a collection of rados pools in one cluster and a set of RGW daemons that are serving up that data. So usually what you probably think of as a Rados Gateway deployment today is just one zone. A zone group is just a collection of several zones
that may be spread across a single cluster or multiple clusters that have some sort of replication relationship. And you can do active-passive, you can do active-active, so you can put data in either zone and it'll replicate in both directions. And then a namespace is a collection of zones that have the same set of buckets and users defined.
So you can think about Amazon S3 as having a single global namespace. If you create an S3 user, it exists everywhere in the world. A bucket exists in one site. But if you try to read a particular bucket, it's always gonna send you to the right zone or the right region to read your data. So it's a global namespace. So with the Rados Gateway, you have the same concept, but you can create multiple namespaces if you want
for different organizations or teams or whatever it is. And then one of the key points with Rados Gateway federation is that the failover between sites and zones is driven externally, at least currently. So the idea is that it's very easy within a Ceph cluster to tell when something fails and what to do about it because you're sort of within a single data center.
If you have multiple data centers and a data center goes down, it's harder to sort of have that external view of who's down and who's up and where you want to move the master. So the Rados Gateway doesn't try to solve that problem itself; it assumes that something above it, whether it's a human operator or some other automation, is gonna decide how to deal with that.
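For a feel of the mechanics, here's roughly what creating one zone group with a master zone looks like today with radosgw-admin, driven from Python. The realm, zone group, and zone names and the endpoint URL are illustrative; a second cluster would then pull the realm and create its own zone in the same zone group to start replicating.

    import subprocess

    def rgw_admin(*args):
        subprocess.run(['radosgw-admin', *args], check=True)

    rgw_admin('realm', 'create', '--rgw-realm=fosdem', '--default')
    rgw_admin('zonegroup', 'create', '--rgw-zonegroup=eu', '--master', '--default',
              '--endpoints=http://rgw.site-a.example.com:8080')
    rgw_admin('zone', 'create', '--rgw-zonegroup=eu', '--rgw-zone=eu-site-a',
              '--master', '--default',
              '--endpoints=http://rgw.site-a.example.com:8080')
    # Commit the new period so the gateways pick up the configuration.
    rgw_admin('period', 'update', '--commit')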
So just to view this visually, you might have three different Ceph clusters. Each of them has two zones. Some of them might be grouped into replication groups, or sorry, zone groups. So XA and XB might be replicating all the same data. Other zones might be standalone, they're not replicating with anybody else. They're just sort of independent collections of buckets and data.
Other ones might be replicating and so on. And I guess that sort of the key limitation with this architecture today is that the initial set of use cases that we built this for were for operators who are operating clusters and dealing with disaster recovery use cases. So the granularity of replication in RGW today is on a per zone basis.
So you say this entire cluster of buckets is gonna be replicated to an entire other cluster of buckets. And so if my whole data center goes down, then my object storage service is still up. The way that it's internally implemented, all the data structures and internal behaviors are written such that we can also do it on a per bucket basis, but all the tooling and automation and so on
is still done on a per zone basis. So we have some additional work to do so that you can say I want this single bucket to be replicated to another site. And this other one not to be. So that's one of the initial gaps. But in the current state today, it sort of solves the disaster recovery use cases and it solves the stretch cases where you want to have the same data set
across multiple data centers. You can access the data, read and write it in both locations, and data will replicate in both directions. And sort of going back to that active-active case I was talking about a minute ago with CephFS, RGW also has an NFS export capability that was originally built with the purpose
of just sort of copying data into the object store or copying it out so you can sort of transition applications on the object. But if you wanted to, you could export both sites via NFS and read and write to both of them and you'd have this bi-directional replication. As long as you sort of follow the rules because POSIX is a rich API, Object is a very simple API,
and so you can't do everything, right? You can't run an arbitrary application on top of the NFS export. You have to only write complete files and read files and do no renames and things like that. But you could do it. But you'd be better off just making your application directly consume the object storage because that's really what it's doing, right? It's using the storage as a bunch of assets
that are immutable, use an API that matches that, and you'll be happier. The way that the Rados Gateway federation is designed is pluggable, so sort of the usual behavior is that a zone is replicating data from another zone or other zones in its own group.
But you can plug in different behavior. So back in Luminous, we added one for Elasticsearch. So you could have a zone that, instead of copying all the data and metadata, would just take all the metadata and it would dump it all in Elasticsearch and then present a query API. So you could do searches over object names and sizes and owners and so on.
You can define indexes over custom metadata that you put on the objects, so you can do searches. So that's existed for a while. In Mimic, which was the previous Ceph release, we added a Cloud Sync plugin, which allowed you to have a zone that would essentially take data from the other zones
and it would push it all into S3, either by taking all the buckets and putting them all into S3 or mapping them all into a single S3 bucket or maybe a subset of buckets. And it would do some best effort to try to rewrite the ACLs so that they would still be usable in the S3 environment. So that's just replicating out to S3 with Cloud Sync. And in Nautilus, we're adding two new ones.
One is an archive zone, where you basically define a zone that enables bucket versioning. So we preserve all copies of an object, past copies, and it'll just copy everything and it'll create an archive of sort of all the data that you've ever stored over all time for archive backup purposes. And then the other new one is kind of exciting. It's a pub-sub zone.
So it essentially builds an index of all the updates that have happened and then presents a set of APIs so that you can subscribe to events. So they did a great demo of this in a talk for KubeCon in Seattle, where when you do a put into an object, it generates an event that then gets fed to Knative,
which is a serverless framework for Kubernetes and triggers a lambda function to go do something. And so you can trigger events through the object store using the pub-sub stuff. There's also work to feed this into Kafka and to AMQ also. So there are a couple different models you can use it. But that's also new in Nautilus.
And the way we're thinking about this is not just in terms of Ceph clusters that are backed by bare metal sitting in your own data centers. So for example, with the Cloud Sync module, the idea is that you're gonna have data stored in lots of different sites locally and also in public clouds.
And when you're using public cloud storage, you would presumably have a sort of a thin Ceph footprint in that cloud, on EC2 or whatever, that's just managing sort of a gateway role and storing some internal state about what is replicating locally and so on, what its role is in the federated mesh. Because the reasoning here is that
you're never gonna compete with actual S3 by building an S3 service on top of EC2 instances. If you can, it's because Amazon totally screwed up their pricing. They're optimized for efficiency. And so what you really wanna do is if you do have a footprint in a public cloud service, then you wanna leverage their storage services
that they're operating efficiently and optimizing for price and so on. And you just really need that gateway in order to access. So when we view the Redis Gateway Federation, you're gonna have thick sites that are on premise, that are actually storing lots of data, and then you're gonna have thin sites that are just gateways to an external storage service that you're making use of.
RGW can also address most of the tiering use cases. So today, you can have multiple rados pools within a Ceph cluster that are mapped to different storage types. And when you put an object, you can choose which tier that object is gonna go to, or you can set a policy on the bucket
so that new objects that are written on the bucket will go to a particular tier. And currently, there's a limited ability to set some policy to automatically expire objects. In Nautilus, there's a new feature in RGW that implements the Amazon S3, I believe it's called the life cycle, bucket life cycle,
object life cycle, I always mess up the term. But it's basically establishing a policy on the bucket that automatically adjusts the tiering. So maybe your objects land in one tier initially, over time they're moved to another tier, and then eventually they're expired. So that's new in Nautilus, and it will do the tiering between different rados pools.
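A hedged example of the kind of bucket lifecycle policy being described, expressed through the standard S3 API with boto3 pointed at an RGW endpoint. The endpoint, credentials, bucket name, and the 'COLD' storage class are all illustrative; the storage class has to be one your zone actually defines.

    import boto3

    s3 = boto3.client('s3',
                      endpoint_url='http://rgw.site-a.example.com:8080',
                      aws_access_key_id='ACCESS_KEY',
                      aws_secret_access_key='SECRET_KEY')

    s3.put_bucket_lifecycle_configuration(
        Bucket='telemetry',
        LifecycleConfiguration={'Rules': [{
            'ID': 'age-out-telemetry',
            'Filter': {'Prefix': ''},
            'Status': 'Enabled',
            # After 30 days, move objects to a slower storage class...
            'Transitions': [{'Days': 30, 'StorageClass': 'COLD'}],
            # ...and after a year, expire them entirely.
            'Expiration': {'Days': 365},
        }]})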
In the future, we'd like to extend this, right? So that when you're tiering, you can tier not just within the same cluster to different rados pools, but you can tier to other Ceph clusters and to external object stores. So that you could have something initially land in your local cluster, and then you could push it out to S3, maybe it pushes out to Glacier, something like that.
That's the plan. So lots to do. So trying to sort of take all of this information I've thrown at you around the Rados Gateway, as far as what we have today and where we're going. So today, we view the Rados Gateway as a gateway to a single rados cluster. So as the name implies, and we sort of have
a bunch of geo replication features tacked on, but it's mostly a gateway to a cluster, is the way we think about it. In the future, what we'd like to do is shift to a mode of thinking where the Rados Gateway is really a gateway to sort of a whole mesh, a whole collection of different sites. And so when you're talking to the gateway, maybe the data that you're interacting with is local, maybe it's remote, maybe we're proxying your requests,
maybe we're placing it somewhere else. But really, it's a federated topology mesh of different sites. Today, when you talk to the Rados Gateway, if your data is not stored locally, it'll issue a redirect that bounces you to the right gateway for the other site where your data is located, similar to what you get with AWS S3.
And there are some tricks people have done with dynamic DNS to sort of route you to the right site. In the future, we might do that redirect, or we might actually do a proxy, right? So it might be that you have a site where your application is always talking to one gateway, and it's just actually proxying you to remote sites where your data is.
So you don't really have to think about where your data is placed and where it's been migrated to, and we're sort of seamlessly moving it around behind the scenes or applying some policy so you can get your data. And finally, today, RGW replicates at a per-zone granularity to sort of address those disaster recovery use cases, but in the future, we wanna make that more flexible so that you can, on a per-bucket basis,
decide where a bucket is placed, whether it's replicated across multiple sites or whether you're migrating it between sites so that you can take one application using a bucket and move it to somewhere else and move the data with it without having to move the whole site. Okay, so sort of the last set of scenarios
I wanna think about is stuff at the edge. I'm almost out of time. So, actually, I'm gonna skip the edge sites because it's not that interesting. So why do I keep going on about Kubernetes? I think the key thing to remember is that regardless of what the storage can do, you need to move the applications with it, right? So truly giving you application mobility
is a partnership between moving the application that's consuming the storage and moving the storage itself, which is why we're so interested in Kubernetes and the Rook project, which runs the Ceph cluster in Kubernetes in a sane way. And that sort of falls into the persistent volume category, which is file and block storage that are attached to your containers,
and increasingly object storage, where we're building the tools to sort of automatically provision buckets in Kubernetes and attach them to applications in the same way you do block and file storage. So lots of stuff going on there, and that's sort of where we see the future. So summarizing, bringing it all together,
data services are really about this mobility, introspection, and policy. Ceph already has sort of a lot of key features that give you some of this, but also there are some clear gaps, right? We haven't sort of solved all the problems yet. I think we're in good shape on the block side, on the object side we have a lot of things solved, but sort of there's a lot of work to do, and on the file side there's even more to do.
Our key efforts are really driven around defining what the use cases are in the Kubernetes environment, around RWO and RWX persistent volumes, dynamic bucket provisioning and so on, and sort of addressing the features in the storage system that we need to make all that stuff work, integrating it with Kubernetes.
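As a flavor of what that self-service provisioning looks like from the application side, here's a small sketch using the Kubernetes Python client; the storage class name 'rook-ceph-block' is an assumption about what a Rook-managed Ceph deployment might expose.

    from kubernetes import client, config

    config.load_kube_config()

    # Ask Kubernetes for a 20 GiB RWO volume backed by a Ceph storage class.
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name='app-data'),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=['ReadWriteOnce'],
            storage_class_name='rook-ceph-block',
            resources=client.V1ResourceRequirements(requests={'storage': '20Gi'}),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace='default', body=pvc)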
That's really the way we're approaching it. And that mostly comes down to extending RGW, and also we're planning and designing what we want to do for CephFS, what multi-cluster features we want to add in the future. So sort of the bottom line, if you talked to me two years ago when I talked about what Ceph is, it's software defined storage, it's a unified platform with file, block and object,
hardware agnostic, really thinking about Ceph in terms of a single instance in a data center. And as we sort of move into the future, increasingly we're thinking about Ceph in the context of you have lots of clusters across lots of sites, and you really need a platform that lets you manage your data across all of them and migrate data and so forth. So we're increasingly thinking about these multi-cluster,
multi-cloud use cases.