
What's new in Ceph Nautilus


Formal Metadata

Title
What's new in Ceph Nautilus
Subtitle
project status update and preview of the coming release
Title of Series
Number of Parts
561
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Project status update, and preview of what is new in the Ceph Nautilus release, due out this month. Management dashboard, unified orchestration CLI and GUI across kubernetes and bare metal environments, device failure prediction, PG num autoscaling, memory autoscaling, live RBD image migration, and more.
Transcript: English (auto-generated)
Okay, welcome everybody. Please sit down.
I hope everybody's ready to drink from the firehose. Give it up for Sage. Hopefully not a firehose. Hi everyone, I'm Sage. I'm going to talk about what's coming up in Ceph Nautilus, which is the next release due out at the end of February,
if all goes well. Ceph, as probably most of you know, is a unified storage platform providing object, block, and file storage within the same cluster. We do upstream Ceph releases every nine months these days, so we're just about at the Nautilus release here in February.
The next release after this is going to be called Octopus, and will be due out in November. And we're doing a strategy now where you can upgrade every release or every other release. So you'll be able to go Luminous to Nautilus or Mimic to Octopus, but not Luminous all the way to Octopus. So it gives you an 18-month jump if you do sort of the maximum.
Those are the upgrades that we'll be testing and supporting. Okay, so at a high level, yes. There's no mic for the room. Sorry. No PA, sorry. It's for the video, yes. All right, so we have four priorities that we're focusing on
as far as upstream is concerned. One is usability and management. Ceph has developed a reputation over the years of being hard to understand and hard to use. And so a lot of effort has been going in over the past two years to address that to make Ceph simpler and easier to manage. And a lot of the things I talk about will be around usability.
Performance is important. Everybody's moving to all flash. We have to go faster. The new flash NVMe cards are ridiculous. And so there's a lot of work in performance optimization. And there's a whole project called Crimson that's essentially rewriting the OSD to go much faster.
Work around container ecosystems, so mainly supporting Kubernetes and CSI and so on. And then a lot of features around multi-cluster capabilities for multi-cloud and hybrid cloud. So those are sort of the four main themes that we focus on. All right, so talk a bit about the ease of use thing.
So the biggest thing that's happening for Ceph ease of use is the dashboard. So there was an early prototype of the dashboard in Luminous. That was sort of a proof of concept. In Mimic, we brought in the openATTIC project. That's much better. And then we continue to be expanding that with Nautilus.
So lots of new features there. So we finally, for the first time, have community convergence across multiple vendors and lots of users on a single management dashboard for Ceph. And it's actually built into Ceph. So it's part of the Ceph Manager dashboard. It's there when you install Ceph. By default, you just have to turn it on and set up an SSL key and so on.
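As a rough sketch, enabling it looks something like this (the exact dashboard account commands have shifted a bit between releases, so treat the names here as illustrative):

    # Turn on the dashboard module in ceph-mgr and give it a certificate and an admin user.
    ceph mgr module enable dashboard
    ceph dashboard create-self-signed-cert               # or import your own SSL cert and key
    ceph dashboard set-login-credentials admin s3cr3t     # placeholder username/password
    ceph mgr services                                     # prints the URL the dashboard is serving on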
And then currently, most of the stuff in the dashboard is around monitoring and management of your RGWs and your buckets and your RBDs and so on, managing Ceph itself. And we're sort of growing into adding more features around managing the cluster, so you'll be able to add nodes and replace OSDs and so on. It's still a work in progress, but there's been lots of progress. Lenz is going to be talking about this later today,
so you'll get a great demo of that. One of the new exciting pieces in Nautilus that's just being added is the orchestrator abstraction, which I like to call the orchestrator sandwich. It's essentially an abstract API in the manager that lets Ceph call out to whatever tool is
being used to deploy Ceph. So there are four plugins that are being worked on currently. One is Rook, which is the operator for Kubernetes that deploys and manages Ceph. There's Ansible; there's DeepSea, which is the Salt-based one that SUSE works on; and there's an SSH one that's going to be a sort of bare-bones, trivial thing. The idea here is that Ceph can call out
to the orchestrator and ask it to go provision an OSD on a device, and then the orchestrator will do whatever it needs to do to run the right commands on the right host to do that work. And this is the piece that's going to allow us to have a generic CLI interface or dashboard interface to do these sort of lifecycle management functions.
So things like fetching node inventory, creating or destroying daemons, blinking device LEDs, which has been sort of a frequently requested feature. And there will be sort of a single CLI that will be the same for Ceph, regardless of what tool you use to deploy it. So that's the goal. And eventually the dashboard will be able to do all this stuff too.
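To give a feel for it, the generic CLI looks roughly like this; the exact command names were still settling while Nautilus was being finalized, so this is an approximation:

    # Ask whichever orchestrator backend is active (rook, ansible, deepsea, ssh) about the cluster.
    ceph orchestrator status
    ceph orchestrator device ls                      # node and device inventory
    ceph orchestrator service ls                     # which daemons are running where
    ceph orchestrator osd create host1:/dev/sdb      # provision an OSD on a device (placeholder host/device)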
So Nautilus includes the framework, and we have sort of in-progress implementations of these four. Exactly how far each of those implementations will get and what features they'll have in Nautilus remains to be seen. It depends on how much gets done over the next month. But this is all new functionality, and so a lot of it is going to get backported to Nautilus later as well. So we're very excited about sort of finally
doing the last mile so we'll be able to manage the full Ceph cluster from the dashboard and CLI. One of the big new RADOS features in Nautilus is PG autoscaling. So picking a pg_num for a RADOS pool has always been sort of a black art. It's sort of the hardest thing to explain to users
and have them understand and pick a number that isn't just wrong and going to be problematic. It was always confusing. The documentation was hard. So in Nautilus, there are sort of two key things. Previously, if you picked a pg_num, you could always increase it. So you could increase the level of sharding within a pool, but you couldn't decrease it. So in Nautilus, you can now decrease it, and it will actually merge PGs into fewer.
So if you pick a number that's too big, you can fix your mistake. So that's the big key capability. And then on top of that, we have a module in the manager that will autotune the pg_num for you. So it basically looks at all the RADOS pools in your cluster, looks at how much data they're storing and how many total PGs you have,
and it figures out that, oh, this one should have fewer and this one should have more. And it can either issue a health warning telling you that your PGs are too big or too small so you can fix it, or you can just flip the switch and it'll automatically do it for you in the background. And there's an example here, which you might not be able to read, of the command that will just sort of tell you the status. So it'll tell you what pools you have,
how much data they have. They might be empty because you just created the cluster, but as an administrator, you can tell the plugin how much data you expect to store there, so it can pick an appropriate pg_num. Or you can tell it the fraction of the cluster that you expect it to be, like this pool will be 20%, and that will feed into the same algorithm. And then it compares how many PGs you currently have
versus how many it thinks you should have. And if it's off by more than a factor of three, then it'll basically either do the action or tell you to make the change. And it can either warn, do a health warning, or it can just do it for you. So hopefully this is gonna be able to let most users just not think about this.
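A minimal sketch of what using it looks like (the pool names are just examples):

    # Enable the autoscaler and turn it on for a pool ("warn" instead of "on" only issues health warnings).
    ceph mgr module enable pg_autoscaler
    ceph osd pool set foo pg_autoscale_mode on
    ceph osd pool autoscale-status                   # current vs. suggested pg_num per pool

    # Hint how big you expect a pool to get so pg_num can be sized up front.
    ceph osd pool set foo target_size_ratio 0.2      # expect ~20% of the cluster
    ceph osd pool set bar target_size_bytes 107374182400   # or an absolute size in bytes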
In sort of the worst case scenario, where you deploy every pool with one PG to begin with, and then the pool fills up over time to be a petabyte, it turns out that if you sort of asymptotically look at the total amount of data movement, you basically end up writing everything twice. You'll end up writing the data once, and then it'll move one more time before it ends up in its final location.
So sort of the worst case scenario isn't too bad, but if you tell the cluster what's gonna happen, then it'll do better, obviously. All right, device health metrics are sort of another key thing that we've taught Ceph about. So the OSDs and monitors now look at what the underlying physical storage device is
that they're consuming. Underneath all the layers of LVM and DM and whatever else, they look at the device model and serial number, and they report all that back up to the manager, and the manager maintains this. So there's a set of ceph device commands that will list your physical devices and what daemons are using them. And then there is a set of capabilities in the OSD
and monitor to scrape all your smart health metrics and store them in RADOS, and there's a module in the manager that can operate in two modes. One of them has a pretrained model that will try to predict the device life expectancy based on the smart metrics, and that's the local mode,
and you can see that reflected here. And then there's also a cloud mode, contributed by a company called ProphetStor, that will call out to their SaaS service and either use their free prediction service, which is pretty accurate, or their paid service, which is super accurate, to figure out what the life expectancy is. And then based on that information, Ceph can either just issue health warnings
saying that this device is going to fail soon, or you have a lot of devices that are going to fail all at the same time and that's going to be a problem, or you can flip the switch and it will automatically mark those devices out so that your cluster will automatically replicate to new replicas before the device fails so you don't have that loss of redundancy.
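A few of the relevant commands, as a hedged sketch (the device ID and hostname below are made up):

    # See which physical devices back which daemons.
    ceph device ls
    ceph device ls-by-host node1

    # Dump the SMART data that has been scraped into RADOS for one device.
    ceph device get-health-metrics ST4000NM0033_Z1Z8XXXX

    # Pick the failure prediction mode: "local" uses the built-in model, "cloud" calls the SaaS service.
    ceph config set global device_failure_prediction_mode local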
So we're pretty excited about this. Even just having this visibility into what devices are attached to what daemons is nice. And this will be tied into the LED blinking, once we have the orchestrators knowing how to blink LEDs, which I think DeepSea is going to be the first one to do, but the others will follow. Then you'll be able to tell it to blink this device's LED, or you'll be able to tell it to blink this OSD's devices,
one or more of them, to give you that sort of full closing-the-loop experience. And then one of the last sort of nice bits and pieces that we added to the manager is what we're calling crash reports. So previously if a Ceph daemon crashed, you would get a splat and a log file
on some host somewhere and then systemd would restart it, and you probably wouldn't even notice. Now whenever a daemon crashes, it writes a small record to /var/lib/ceph/crash with the process ID, the timestamp, and a stack trace. And then those are regularly scraped and reported to the manager, so the manager has a database of all the crashes
that have happened recently in your cluster. And so you can do ceph crash ls to list those crashes, and you can get info about them. And then additionally there is, this is actually new in Mimic, a telemetry module in the Ceph Manager that is opt-in. Obviously you have to turn it on. But it will regularly phone home
just very basic information about your cluster to Ceph developers, saying this cluster exists, it has this many OSDs, it's running this version, it's this big, these are the services I'm using, so that we have a sense of what people are deploying. So in Nautilus, that telemetry module also has the ability to phone home these crash reports. So if you opt in, which I encourage everyone to do,
but you obviously don't have to do, then we'll find out what crashes are actually being seen by real users in the wild on what versions and what those stack traces are so we can tell what's broken, what's broken the most, and how we should prioritize our work and so on. So we're really looking forward to this actually getting deployed and being used in the wild.
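As a sketch of that workflow:

    # Inspect recent daemon crashes collected by the manager.
    ceph crash ls
    ceph crash info <crash-id>      # stack trace, version, and timestamp for one crash

    # Opt in to phoning home anonymized cluster info (and, in Nautilus, crash reports).
    ceph mgr module enable telemetry
    ceph telemetry show             # preview exactly what would be sent
    ceph telemetry on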
So those are some of the things we're doing around management and usability. There are other things happening in RADOS that are a little bit more under the covers but are exciting. So Messenger 2 is a new specification and implementation of the on-wire protocol for Ceph. There was a talk yesterday that Ricardo did
talking about this in a bit more detail. If you caught that, there's also a recording. The main sort of user-visible feature that this is gonna deliver is for the first time we'll finally have encryption on the wire so all of the traffic that Ceph is doing between daemons and between clients can be encrypted if you turn that on. But there's a bunch of other stuff too.
There's improved feature negotiation, the encapsulation of authentication protocols is cleaner and better, sort of paving the way for us to add Kerberos support, hopefully by Octopus. And there's also all the infrastructure to do dual stack, so that all your Ceph daemons can be using both IPv4 and IPv6 and clients of either kind will be able to connect.
Today you can do either v4 or v6, only one or the other. This will let you do both, which some people will probably like. We're also moving to our official IANA-assigned port number for the monitors, 3300, which has been sort of a long time coming.
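As a rough idea of what turning on the over-the-wire encryption might look like, here is a sketch; the mode option names are my reading of the Nautilus-era config and worth double-checking against the docs:

    # Ask msgr2 connections to use the "secure" (encrypted) mode instead of just crc integrity checks.
    ceph config set global ms_cluster_mode secure    # traffic between OSDs, monitors, and managers
    ceph config set global ms_service_mode secure    # what daemons will accept from clients
    ceph config set global ms_client_mode secure     # what clients will request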
Messenger 2 is actually the last thing that we're trying to finish up for the Nautilus release before we get it out the door, so hopefully we'll get that sorted soon. A few other things, odds and ends, on the RADOS side. In the past, it was hard to predict how much memory an OSD daemon would consume and to control that by adjusting cache sizes. There's a new setting now called osd_memory_target.
That's just one setting, one number. You say, I want my OSD to use three gigabytes, and then the OSD will figure out all the stuff it has to do internally to fit within that memory envelope. Whether it's adjusting the size of the BlueStore cache or whatever else, internally it's monitoring its RSS size and running sort of a dynamic controller to make sure that happens.
So that simplifies configuration management considerably. There's also a set of commands around NUMA. So the OSDs are looking at what NUMA node their network interface controller is attached to. And they also, if it's an SSD or NVMe, they'll look at what NUMA node their SSD is attached to,
and it reports that to the monitor. And there's a new command, ceph osd numa-status, that tells you each OSD and what NUMA node it's on. And then there's a single setting you can set that will pin an OSD to a particular NUMA node. And you can see that also in that output. So this simplifies the process, for these high performance nodes,
of pinning OSDs to certain nodes. Hopefully, if you have a balanced machine, the network and the flash are on the same NUMA node. You can just pin the OSD there, and you'll get better performance. Previously, this was a super tedious process. You had to do your own bash scripts to sort of manage all that. It was gross, but much simpler now.
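For example, something like this (the OSD id and NUMA node numbers are placeholders):

    # One knob for total OSD memory; the daemon sizes its internal caches to stay under it.
    ceph config set osd osd_memory_target 3221225472   # ~3 GiB per OSD

    # See which NUMA node each OSD's network card and storage device hang off of.
    ceph osd numa-status

    # Pin a specific OSD to a NUMA node.
    ceph config set osd.0 osd_numa_node 0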
In Mimic, we added centralized configuration management. So instead of having a ceph.conf file that was spread across a zillion nodes, it's all stored in the monitor. That's obviously still in Nautilus, and we've improved it, particularly around the manager modules. So you can set settings for manager modules, and you won't have to restart the daemon.
And there's better help reporting for the descriptions of the options and what the legal values are and so on. So a lot of that just got cleaned up. It works much better in Nautilus. There's also a new command that will spit out a sort of minimal ceph.conf that you can put on a new node that has just enough information to contact the monitors,
and then everything else you can stuff in the monitor configuration database. So there's lots of cleanup around the configuration. There's also a new manager module called the progress module that attempts to essentially establish some state around long-running events. So for example, if an OSD fails and there's a recovery operation that has to happen,
it'll establish a progress event for that recovery, and it'll actually tell you you're 50% recovered or 70% recovered, with an ETA and so on. So there's a new command, ceph progress, where you can see those events. Eventually we're going to actually incorporate that into ceph -s. That didn't quite happen for Nautilus, but hopefully by the next release.
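A quick sketch of how the config and progress pieces fit together (the option shown is just an example):

    # Keep options in the monitors' config database instead of scattered ceph.conf files.
    ceph config set osd osd_max_backfills 2
    ceph config get osd.3 osd_max_backfills

    # Generate a minimal ceph.conf for a new node, with just enough to find the monitors.
    ceph config generate-minimal-conf > /etc/ceph/ceph.conf

    # Watch long-running background work like recovery.
    ceph progress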
And this is a little thing, but it's annoyed a lot of people. When you have misplaced data that the cluster is simply rebalancing and moving somewhere else, but the data isn't degraded, that's no longer a health warning. It shouldn't make your pager go off. You can enjoy your weekend. So it's a little thing. There's a setting you can use to adjust this, but by default you won't get a health warning
when we're just moving data around, which I'm sure some of you will appreciate. On the BlueStore side, so BlueStore was introduced in Luminous. It's the new default. It's good. People are happy. In Nautilus, there are a few new things. There's a new implementation of the allocator that's used internally.
So it's a bitmap in memory instead of a red-black tree, essentially. So it's faster. It has predictable memory usage that's independent of fragmentation levels. And there are certain behaviors of the old allocator that could lead to a sort of pathological fragmentation that this one doesn't have. So it's just better. So that's there.
You won't notice for the most part. It's just the in-memory allocator, so you just restart the OSD and you'll get the new allocator behavior. So nothing changes on disk. There's also more intelligent cache management. It turns out that one of the harder bits in BlueStore was managing the size of the RocksDB cache for metadata
versus the cache that BlueStore implements for all the data and onodes, because there are certain things in RocksDB that are really important to cache, like the indexes for all the SST files. And so BlueStore is now smart enough to make sure that that higher priority stuff is being cached, depending on how much memory you have and how big your store is and all the other stuff.
So it's monitoring all that internally and making sure that it's doing the right thing. So that's much, much better and you generally just get better performance. We also modified the on-disk format. So new OSDs will get this, and you can do an fsck repair to convert old ones. But basically we are tracking more fine-grained utilization metrics on a per-pool basis.
So, for example, if a particular pool has compression enabled, we're tracking the amount of user data stored versus the compressed size afterwards, and also the amount of data stored for metadata internal to BlueStore, for omap metadata, or for object data.
Just a much more granular understanding of where your disk usage is going. And that all bubbles up in ceph df and the detail view, so you can actually see what's going on. And lots of performance improvements, just optimizations here and there.
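For example, the per-pool accounting surfaces in ceph df, and the allocator can be selected explicitly if you want to be sure which one you're on (a hedged sketch; bitmap should already be the default on new Nautilus OSDs):

    # Per-pool stored vs. used accounting, including compression savings.
    ceph df detail

    # Explicitly select the bitmap allocator; takes effect when the OSD restarts.
    ceph config set osd bluestore_allocator bitmap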
A few other RADOS things. In Luminous, we introduced a device class concept. So you could tag OSDs as either an HDD or an SSD or any other type, actually, and you could easily write CRUSH rules that would only target one class or another. But prior to that, you sort of had to manually craft a CRUSH map to do this, and you ended up with this manually maintained hierarchy,
like a shadow hierarchy with different nodes. It was sort of tedious, but lots of clusters did this prior to Luminous. And so there's now a feature in crushtool to basically take that old-style manually managed hierarchy and convert it to the new style without reshuffling all your data. So certain large installations will appreciate that.
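Very roughly, the conversion is an offline crushtool run; the exact flags are worth double-checking against the Nautilus docs, so treat this as a sketch:

    # Export the CRUSH map, reclassify the legacy per-device-type trees into device classes,
    # verify that no data would move, and inject the result back.
    ceph osd getcrushmap -o original.map
    crushtool -i original.map --reclassify \
        --reclassify-root default hdd \
        --reclassify-bucket %-ssd ssd default \
        -o adjusted.map
    crushtool -i original.map --compare adjusted.map    # mappings should be unchanged
    ceph osd setcrushmap -i adjusted.map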
There's also a limit on the PG log length. There were sort of a few corner cases where the memory utilization of the OSD could grow in strange recovery situations. That's fixed, so there's now a hard cap on the amount of memory that we use for recovery metadata.
And in one of the nice examples of academic research translating into the open source project, there's a new erasure code style called CLAY code, coupled layered something, I can't remember exactly. It's basically a more optimal balance of IO and recovery bandwidth. So if you have an eight plus three code,
but you only lose one shard, then you have to read less data in order to do the recovery than if you had lost three of the additional parity shards or three of the nodes or whatever. So that's new, and it's in there. It's marked experimental because it's a new erasure code, and you want to be kind of paranoid about that stuff, but you should try it out; it should be pretty good.
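Trying it out is a matter of creating an erasure code profile with the clay plugin, something along these lines (the k/m/d values and pool name are just an example):

    # d controls how many surviving shards are read during repair (k <= d <= k+m-1).
    ceph osd erasure-code-profile set claytest \
        plugin=clay k=4 m=2 d=5 crush-failure-domain=host
    ceph osd pool create ecpool 64 64 erasure claytest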
On the RADOS gateway side, so this is the S3 gateway, we have a few new things, mostly around the federation stuff. The first one, and I think the most exciting, is a new pub-sub federation capability.
So you can create a zone that's essentially generating events when things happen, like putting an object into a bucket or creating a bucket or deleting a bucket. It'll generate an event stream that you can subscribe to. So there's a polling interface where you can just ask it for events. I think this is the interface that was used
for a demo with Knative, which is a Kubernetes function-as-a-service serverless thing. There was a talk about this at KubeCon in Seattle a couple months back. And then we're also working on a sort of push model where we push the event stream into AMQP or into Kafka, which some of our enterprise customers want.
So this is exciting. You can sort of glue the RADOS gateway to serverless, so when you do a put of an object, it'll trigger a function-as-a-service event that'll go do something to that image. It's pretty neat. There's also an archive zone addition. So you can define a zone within your RGW federation that just gets basically a complete copy of everything
in the other zones, and it turns on object versioning so you'll have all versions of all objects. It's a relatively simple thing, but something that some people ask for just for compliance or backup or whatever else. There is an S3 API that implements lifecycle policy, I believe is the correct name for it,
essentially saying that when you put an object, it initially goes into one tier, one set of RADOS pools or whatever, and over time it'll get migrated automatically to another tier, and then at some other point in time it might get expired. There's a whole specification around what policies you can define, based on S3. So we're implementing that now, and it can control the tiering within a particular Ceph cluster among RADOS pools.
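Because this follows the standard S3 lifecycle API, configuring it looks like any other S3 client call; here's a hedged example using the AWS CLI against an RGW endpoint (bucket, endpoint, and storage class name are placeholders):

    # lifecycle.json: move objects to a colder storage class after 30 days, expire them after a year:
    #   {"Rules": [{"ID": "tier-then-expire", "Status": "Enabled", "Filter": {"Prefix": ""},
    #               "Transitions": [{"Days": 30, "StorageClass": "COLD"}],
    #               "Expiration": {"Days": 365}}]}
    aws --endpoint-url http://rgw.example.com:8080 s3api put-bucket-lifecycle-configuration \
        --bucket mybucket --lifecycle-configuration file://lifecycle.json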
So you've always been able to specify different tiers for RGW buckets or individual objects, maybe an erasure-coded RADOS pool or a replicated one or whatever, and this lets you automate that lifecycle management across those tiers. And then on the performance side, a long time ago we used Apache and FastCGI
to talk to RGW. Then we used civetweb, which is sort of an embedded, simple web server. That's what we currently use by default now. In Nautilus, we've switched to a new web front-end called Beast, which is part of the Boost project. It uses Boost ASIO and a more asynchronous programming model. It scales better, it's faster.
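Switching frontends is just an RGW option, roughly like this (the daemon name and port are placeholders):

    # Tell this radosgw instance to serve with the beast frontend instead of civetweb,
    # then restart the daemon for it to take effect.
    ceph config set client.rgw.myrgw rgw_frontends "beast port=8080"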
So that's new in Nautilus. And then there's also some new features around STS, which is sort of the security authentication framework integration. I'll be honest, it's all a blur to me because there's so many standards, and I can never keep track of which ones they've implemented, but there's something new in STS in Nautilus that you can go ask the RGW folks about,
and they'll be happy to tell you. All right. On to RBD, the block device. The first big thing on RBD is live image migration. So you've always been able to have different RADOS pools with different performance characteristics or placement or whatever and have images mapped to those pools. With live migration, you can have an in-use RBD image
that's mapped and currently in use by a VM, actively reading and writing data, and you can move it between performance tiers, between RADOS pools, while it's being used, and everything just works. So that's new. That's good. Excuse me.
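The flow, as I understand it, is a three-step prepare/execute/commit (pool and image names are placeholders):

    # Start migrating an image from one pool to another, faster one.
    rbd migration prepare slowpool/myimage fastpool/myimage
    # Copy the blocks in the background while clients keep using the image.
    rbd migration execute fastpool/myimage
    # Finalize once the copy is done (or "rbd migration abort" to back out).
    rbd migration commit fastpool/myimage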
Another new feature that we're pretty excited about is RBD top. This has sort of been an oft-asked-for feature. So there's a bunch of RADOS infrastructure now where the manager daemon will essentially tell all the OSDs to start sampling their request streams, either all of them or restricted by a pool
or by an object prefix and report that information back to the manager so we have sort of a cluster-wide view of what IO is happening. And then there's an RBD CLI that uses that service to basically subscribe to IO for RBD images and tell you what the most active images are, how much IO they're doing, that sort of thing.
Excuse me. I'll be honest. I haven't actually used this feature yet, so I don't know exactly what it looks like, but I keep asking about it, and they keep telling me it's there, so I'm really excited. Mostly I'm excited actually about the RADOS infrastructure, so we have a cluster-wide view, not just for RBD, of everything, so you can tell what the top clients are and who's using up all your bandwidth.
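From what I understand, it surfaces through the rbd CLI roughly like this, hedged accordingly since, as I said, I haven't run it myself:

    # Top-like view of the most active RBD images in a pool.
    rbd perf image iotop --pool rbd
    # Per-image IO statistics, refreshed periodically.
    rbd perf image iostat --pool rbd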
All right. A few other things on RBD. RBD Mirror is the daemon that does the asynchronous replication feature across clusters for RBD. It's been there since Luminous, but it's always been kind of tedious
to set up and configure. Excuse me. So a lot of work has gone into just simplifying rbd-mirror daemon management, so that you can have N of them, and they'll scale out, and also the configuration for connecting to remote clusters is now stored in RADOS,
so it's just simpler to use. Namespace support is new. So RADOS has long supported namespaces, this idea that within a pool, you can sort of carve it up into different security domains, and you can restrict clients' access to a particular sub-namespace of that pool and then have that security enforced.
RBD now supports that, so you can have RBD users that are locked into a namespace. They only see their images, and they're sort of locked in that way with CephX. So that's all supported through the RBD CLI, a complete user experience. That's been asked for for a while. Some things with the config overrides in RBD.
So you've always been able to specify overrides on the configuration for the client on a per-image basis. So, for example, whether the caching is enabled for this particular RBD image without having to go configure it on the client, you can now do that on RBD pools, and the overall experience is simpler. It's the same CLI interface to control it all,
just cleaning it all up to make it simpler. And apparently now we're maintaining timestamps on images, so when you do an rbd ls with detail, you can see information about when images were created and last used.
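A hedged sketch of the namespace and pool-level config pieces (names, sizes, and caps are illustrative):

    # Carve a namespace out of a pool and confine a client to it with CephX caps.
    rbd namespace create --pool rbd --namespace project1
    ceph auth get-or-create client.project1 \
        mon 'profile rbd' osd 'profile rbd pool=rbd namespace=project1'
    rbd create --size 10G rbd/project1/disk0

    # Per-pool (instead of per-image) client config overrides.
    rbd config pool set rbd rbd_cache false
    rbd config image set rbd/project1/disk0 rbd_cache true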
All right, CephFS. So CephFS is awesome. The first big thing that happened is that the multi-FS volume support is going to be stable now in Nautilus. It's actually been there since Luminous, but it was marked as experimental just because we didn't have a lot of testing around it. But Patrick's been cleaning up a lot of the code here
and documentation and edge cases and whatever, and so we're pretty sure it's going to be stable in Nautilus. This means that you can create multiple CephFS file systems within the same Ceph cluster, and each of those file systems will have an independent set of MDS servers. So they're totally independent but sharing the same RADOS cluster. We also now have a first-class concept of subvolumes.
So previously with Manila, we wrote this library called ceph_volume_client, confusingly named, that basically just creates subdirectories within CephFS and sets a quota on them. It sets up CephX capabilities so you can only access that subdirectory,
and then it passes those to Manila as a share. So it's sort of a lightweight volume concept with quota. That sort of concept has been brought directly into the manager now, so you have a full CLI management experience for creating volumes, which are complete CephFS file systems, and these subvolumes, which are just lightweight directories within that file system.
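The CLI for that looks roughly like this (volume and subvolume names and the size are placeholders):

    # A "volume" is a whole CephFS file system with its own set of MDS daemons.
    ceph fs volume create tank

    # A "subvolume" is a lightweight directory inside it, with a quota.
    ceph fs subvolume create tank share1 --size 10737418240
    ceph fs subvolume getpath tank share1    # the path to hand to Manila, CSI, or a client mount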
And the goal now is that everything is going to be consuming the same abstraction. So Manila will use it now. Kubernetes RWX volumes with the new Ceph CSI driver are going to use the same subvolume concept, and it'll be the same one that you access via the CLI and view on the dashboard. So we're happy about just sort of simplifying
and unifying the experience across the board. On top of that, one of the big new things for CephFS is managed NFS gateways. So this is clustered NFS Ganesha. You've been able to run NFS Ganesha on top of CephFS forever, and lots of people do it. What's really new is that Jeff Layton spent a ton of time making the scale-out aspect of that
and the NFS grace periods work correctly, so that if you had a single NFS server fail, the grace period would be enforced across all NFS servers in a coherent way, so that all the delegations and lock state could be reclaimed. All the corner cases around the recovery that people can easily ignore or not even realize are problems
are all sort of managed correctly. And so the NFS Ganesha daemons are now storing their configuration in RADOS. They have an object in RADOS that manages the grace period. They're sort of doing this in a coherent way. And now those Ganesha daemons are also managed by Rook, at least when you're using Kubernetes. So the Ceph manager, using that new orchestrator API,
will call out to Rook to create the NFS daemons, however many you want, and then they'll all be working together to do the grace period. So it's sort of a complete, coherent experience without a lot of tedious manual configuration needed. So we're excited about that. A couple other things on CephFS. There was an Outreachy intern project
that created a CephFS shell. This is just a sort of wrapper around libcephfs, a CLI that you can run and do things like mkdir and ls, and you can write scripting around it. This is really mostly convenient for scripting around CephFS. So for example, if you wanted to set a quota on a directory, previously you would have to go mount CephFS, do a setxattr, and then unmount it.
Now you can just do a CLI command to the shell to sort of do it all for you in a quick, lightweight sort of way. And then there's just been a lot of work around bug fixes and performance fixes. Most of this is around situations where you have nodes with lots and lots of memory and really big Ceph metadata servers with huge caches.
There were a lot of sort of robustness and stability issues that we fixed with those types of environments. Okay, the container ecosystem. So there's lots of stuff happening with containers, especially around Kubernetes. In the Ceph project, when we think about Kubernetes, it's really in two sorts of ways. The first way is that we want to provide storage
to Kubernetes. We want to be the storage underneath all your containers, because any scale-out infrastructure is gonna need scale-out storage to go with it. And then the other is running Ceph clusters inside of Kubernetes, which might seem weird, but actually has some advantages, because it simplifies and hides the OS dependencies.
You have finer control over upgrades. And you can automate and schedule a lot of the annoying, tedious parts of managing a Ceph cluster with Kubernetes that you couldn't do before. So we have OSDs and monitors that have state. They tend to be tied to storage devices. But then you have all these other daemons, the managers, the MDS servers, the RADOS gateways, and so on,
that can run kind of anywhere. And not having to think about it and just letting the container platform decide where to run them, based on how much memory and network and CPU and so on is available, simplifies things. And so in this view, you're really thinking of Kubernetes as sort of a distributed operating system. Instead of thinking about individual nodes and managing them, you just say, I have a whole cluster. Go run this stuff, and it'll figure out where to put them.
So that's sort of how we think of things. Rook is a newish project in Kubernetes. It's a storage orchestrator. And as a community, we're essentially all in on Rook as the preferred or primary focus of how we run Ceph in Kubernetes.
You can sort of do it manually with Helm charts and so on, but Rook automates a lot of the process, and it's great. And so we're focusing all our efforts on just making Rook work really well to run Ceph. And it does intelligent management of the Ceph daemons. So it'll make sure that if you're adding monitors or removing monitors, it doesn't break quorum. It knows how to place them.
It knows how to map OSDs to devices. It understands Ceph well enough to sort of do all the right things when you're automatically managing it. It also gives you a Kubernetes style of specifying what your cluster is and how it should be deployed. So there's a CRD where you just tell Kubernetes, you should have a Ceph cluster that looks like this,
and the Rook operator will go and do it. And you interact with it in sort of the Kubernetes way, using kubectl and YAML and all that stuff. And there's a talk later about Rook, yes. So make sure you don't miss that. But yeah, we're very excited about Rook. It's great.
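To make that concrete, a minimal CephCluster manifest looks something like the sketch below; the field names are approximate for the Rook v0.9/v1.0 CRD, so check them against the Rook docs for your version:

    # cluster.yaml -- a minimal CephCluster spec, roughly:
    #   apiVersion: ceph.rook.io/v1
    #   kind: CephCluster
    #   metadata:
    #     name: rook-ceph
    #     namespace: rook-ceph
    #   spec:
    #     cephVersion:
    #       image: ceph/ceph:v14.2       # a Nautilus container image
    #     dataDirHostPath: /var/lib/rook
    #     mon:
    #       count: 3
    #     storage:
    #       useAllNodes: true
    #       useAllDevices: true
    kubectl apply -f cluster.yaml
    kubectl -n rook-ceph get pods    # watch the operator bring up mons, mgr, and OSDs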
But not everyone wants Kubernetes, and not everyone wants Rook necessarily. And so we're also thinking about sort of the old-school style of deploying things. So with the new orchestrator abstraction, we have all these plugins, right? We have the Rook one, the DeepSea/Salt one, Ansible. And there's also an SSH implementation that's just sort of a minimal bare-bones implementation. The idea there is that you just give the Ceph manager
essentially a root SSH key that it can use to talk to your nodes, and it can do all the basics. And that'll sort of capture the rest of us and drag us into the world where Ceph can manage itself. Separately, ceph-ansible recently learned how to run all the Ceph daemons in containers
instead of just running them the normal way with systemd. And it turns out all it really does is create a systemd unit file for each daemon that does a docker run and runs a container, right? So there's no real magic there. But the advantage is that your daemon is in a container. It can be upgraded independently and so on. And so the plan is to teach the SSH orchestrator how to do that too,
so that the sort of default bare-bones implementation of the orchestrator API that Ceph will include will be able to run Ceph in containers. And the advantages here are that installation is easier, right? All this logic around how do we install on Debian versus CentOS, what apt repos do we need, all this stuff that was just super gross and annoying before is simplified,
because all you have to do is say use this container repo and use this particular container version. And you can now upgrade daemons independently, one at a time. It's actually pretty nice. So I was pretty skeptical about containers a couple of years ago, like what's the value of containers to something that's low-level like Ceph. But it really does simplify some of these operations
once you sort of do it right and you figure out how to not lose your logs and so on. So anyway, that's coming. Just a couple of things on the community side. There's a new Ceph Foundation that we launched back in November through the Linux Foundation. We're very excited about it. Essentially it's a way to get all these corporate participants
to pool money together that we can spend on the community. So don't tell anyone, but it's really just a way to get money to spend. But we have 31 founding member organizations, which is pretty exciting. Three more members have joined since launch. We have a couple other conversations going too. So we're excited about that. We also had our first Cephalocon conference in Beijing
last March, which was awesome. It was four tracks over two days. Like a thousand people showed up. It was bigger than I thought. It was really fun. So that happened, and we're doing a new Cephalocon in Barcelona in a couple of months. So mark your calendars. This is going to be the two days right before KubeCon. It's in Barcelona, which is great.
The CFP just closed, so you missed the boat on submitting a talk if you haven't already. But early bird registration is open, so I encourage you, if you're a Ceph fan, to come. And there's also a reduced hobbyist rate too. So if you don't have an employer and you're just self-funding or whatever, you can apply for the reduced rate to attend.
That's it. I'm happy to take any questions about Ceph Nautilus or anything else. Yes?
So the question is around our nine-month cycle and not being aligned with distribution cycles. Yes? The timing? Yeah.
So the question is can we do three instead of two. The challenge is just that there's a lot of investment and overhead into testing and maintaining the ability to do those upgrades. It means that we have to keep around old behavior for that much longer, and it gets harder to maintain and so on. And so two is sort of a compromise that we settled at. And so I'm not super excited about extending it to three,
especially if two is enough to get you from point A to point B, even if it's not as convenient. That said, people are always complaining about the release cycle. There's talk of expanding it to a year. There's talk of shortening it again. And so it's not that it's gonna be that way forever. So if you have a specific request or whatever on that, I would email the list, and we'll bring it up.
Thanks. Mm-hmm.
The benefit is...? Yeah, so the question is around the NUMA configuration stuff, and can I quantify how much it improves? It totally depends on your hardware. So there have been cases where we've had people doing high-performance reference architectures, and they fixed the NUMA stuff, and they've gotten,
I think, like 30, 40% or more. But on other systems, it probably will make no difference. It totally depends on what the balance of the device attachments are across the two sockets on your system. It turns out there are actually very few machines that are actually balanced. Most of them just hang everything off one or they hang all the storage off one node
and all the network off the other or something. That's what I have at home. So your mileage will vary. But Quanta apparently has a balanced NUMA node that has half the SSDs and network on both. It's one of the few that you can get.
Then we'll have what? The question is how realistic is it that we'll have Rook and Kubernetes for Nautilus? So the Rook stuff is sort of
a little bit decoupled from the Nautilus release. So Rook is in the 0.9 release. Lots of people use Rook today and even have it in production. We're working towards a 1.0 release. I'm holding out for the 1.0 release before I sort of grab the megaphone, because there are a couple things with the way that OSDs are deployed that I want to clean up a bit.
But that's independent of Nautilus. So 1.0 will use the Nautilus release, but Nautilus supports Rook today. The only real tie-in on the Ceph side is the Orchestrator plug-in and it sort of calling back out to Rook to tell it to do stuff. And that's only if you're using the new CLI or eventually the dashboard to do it, which is sort of brand new anyway, so you don't really need it.
Hopefully that answers the question.