Managing and Monitoring Ceph with the Ceph Dashboard
Formal Metadata
Title | Managing and Monitoring Ceph with the Ceph Dashboard
Number of parts | 94
License | CC Attribution 4.0 International: You may use, modify, reproduce, distribute and make the work or its content publicly available in unchanged or modified form for any legal purpose, as long as you credit the author/rights holder in the manner specified by them.
Identifiers | 10.5446/45627 (DOI)
Language | English
Transcript: English (automatically generated)
00:07
Morning everyone. Thanks for joining me in this session today. I hope you enjoyed yesterday's FrOSCon. I certainly did. Especially the social event; having nice weather out there is always a good start.
00:20
My name is Lenz. As you can probably tell from the green of the slide deck, I work for SUSE. Actually this is my second time working for SUSE, which is a funny story. If you want to hear more about it, feel free to get in touch with me about this. The topic of this talk is basically kind of a continuation of previous talks that I've given at FrOSCon before.
00:48
Myself and the team that I'm working with, we are currently involved in the upstream Ceph storage project. We are working on a web-based tool that we call the Ceph Dashboard, which allows you to configure, manage and monitor a Ceph cluster.
01:04
Previously we had been working on a project called openATTIC. That was also a tool to manage Ceph, but it was maintained out of tree as a separate open source project. So I'm going to start with a bit of the history of where we're coming from and then give you an update on where the project is currently heading and what it looks like.
01:25
If the demo gods are kind with me, I hope I can show at least a small demo. I know it's running on a dev environment on my laptop, so it's not really that exciting, but at least it should give you an initial impression. I would like to start with a question. Who of you has never heard of Ceph and has no idea what this is about?
01:45
You? Okay. So maybe I will do a very broad overview to get you an impression. So Ceph's job is storing data. It's a number of services that you can install on ideally a huge number of nodes in your cluster, so it's definitely not a single node system.
02:07
And it has some clever ways of making sure that the data is evenly distributed among the nodes in that cluster. It ensures availability, redundancy and all of these things by using some very elaborate algorithms
02:22
that also make sure that Ceph scales very effectively across a large number of nodes. It does so by delegating the logic that determines where the data is placed, which many clustered storage systems usually use a directory service for, to the clients themselves.
02:42
So the clients have a mathematical way, I would say; it's called the CRUSH map, and it allows the client to compute which node in the cluster it needs to talk to to store the data. That makes it very scalable and also very fault tolerant. One of the key aspects of Ceph is that it wants to preserve the consistency of the data, and it actually does so by sacrificing availability.
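To make the placement idea concrete, here is a schematic Python sketch of client-side placement. It is not Ceph's actual algorithm (Ceph uses a stable object-to-placement-group hash plus the CRUSH function over the cluster map); the function names and the hash used here are illustrative only.

    import hashlib

    def object_to_pg(object_name: str, pg_num: int) -> int:
        # Stand-in for Ceph's stable object -> placement-group hash.
        digest = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        return digest % pg_num

    def crush_like_placement(pg: int, osds: list, replicas: int = 3) -> list:
        # Stand-in for CRUSH: deterministically pick `replicas` distinct OSDs,
        # so every client computes the same answer without a directory service.
        return [osds[(pg + i) % len(osds)] for i in range(replicas)]

    osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
    pg = object_to_pg("my-object", pg_num=128)
    print(pg, crush_like_placement(pg, osds))

The real CRUSH function additionally respects the failure-domain hierarchy (hosts, racks, data centers) that comes up later in the talk.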
03:10
So if Ceph comes into a state where it can no longer ensure that your data can be properly and securely stored, your client may actually block. So that's something to be aware of.
03:21
It's not eventually consistent in the way that it just takes your data, tells you yeah it's all fine and then deals with it. The OSDs, which are the storage daemons, write out the data and then give the client the okay that the data is securely stored.
03:42
So Ceph basically just takes care of distributing objects on all of these nodes and it then has a number of services that support more common protocols to make that data accessible. By default it ships a Ceph block device, RBD, which you basically can map on a Linux box as an iSCSI target
04:06
or a regular block device that just looks and feels like a disk, but the data is distributed and stored in your cluster. Another way of storing the data would be the CephFS file system, where you simply mount a file system similar to an NFS share and put data into it.
04:25
It's a POSIX file system and in the background the CephFS data is again distributed and stored in the Ceph cluster. Last but not least we have a more traditional object storage interface which is the so-called Rados Gateway, RGW.
04:42
It supports both the S3 and the Swift protocol from the OpenStack project. So if you have an application that is capable of talking this protocol, you can also use Ceph as your storage backend for your data. Yeah, that should be sufficient to give you an overview.
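Since RGW speaks the standard S3 protocol, any S3 client library works against it. A minimal sketch with boto3; the endpoint URL and the access/secret keys are placeholders for an RGW user you would create yourself (for example via radosgw-admin or, as shown later, via the dashboard).

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:8000",   # placeholder RGW endpoint
        aws_access_key_id="ACCESS_KEY",               # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

    s3.create_bucket(Bucket="demo-bucket")
    s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello from Ceph")
    listing = s3.list_objects_v2(Bucket="demo-bucket")
    print([obj["Key"] for obj in listing.get("Contents", [])])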
05:01
If you have more specific questions or want to learn more about Ceph, Ceph.io is the website where you can learn more about this. This is a regular open source project. It originated in a company called Inktank that was later acquired by Red Hat. Nowadays several companies including SUSE, Red Hat, Canonical and others are
05:24
working on that as well as a growing community of users and developers. Ceph usually does a new major release of the software every nine to twelve months. The release cadence is somewhat changing at the moment.
05:40
We are now looking into doing yearly releases again instead of every nine months. And what I'm going to do is talk a bit about the evolution of this dashboard in the past three Ceph releases.
06:01
And usually each Ceph release has a code name which relates to an octopus. The Ceph developers like octopuses in all forms and so they are usually using names of certain kinds of octopuses to name them.
06:20
And the first version of Ceph that shipped with the dashboard was Luminous. That was August 2017, so it's already two years ago. But that was the first time the Ceph project itself included a web-based interface.
06:40
Back then it really was just a dashboard. You could only obtain read-only information to get a quick impression of how your cluster is doing. I have a screenshot here of how that looked. So it was read-only. It showed you the health status. You could take a look at the log files underneath it, the cluster log and the audit log.
07:05
You can see a list of nodes and the storage daemons. What else? So it was really a very simple page that you could bring up somewhere on a screen in the data center to get a quick glance on your Ceph cluster without going to the command line and running commands on top of that.
07:26
Before that, Ceph was, like many infrastructure projects of this kind, something that you always needed to use the command line to work with. And Ceph wasn't really famous for being very command line friendly.
07:40
Just based on the complexity that it's a distributed system, there certainly is a somewhat steep learning curve in getting started with it. So that dashboard was really the first incarnation where users now had a web page that they can look at and get an easy impression.
08:00
After Luminous was released, there was kind of a growing demand for adding more features to that because users liked having a web-based interface. So pull requests were created in master for the next coming release. Lots of features were requested. But it was also somewhat clear that the existing dashboard wasn't really up for the task based on how it was developed.
08:31
So that was the point at which the team working on the openATTIC project asked the Ceph upstream community whether they would be interested in collaborating on that.
08:45
Because with the openATTIC project, well, we had been working on our standalone tool for several years already and had added quite a number of Ceph-related management features. So we had somewhat of an idea of how this could work. We proposed to simply take our work and the experience that we had gained into a new dashboard.
09:09
So we were not planning on taking openATTIC and just porting it upstream. Because of the way openATTIC was designed from an architecture perspective, that wouldn't have been easily doable anyway.
09:23
And we then came to an agreement to jointly start enhancing the existing dashboard and making it more feature rich and more scalable in a way as well. So that's when we started. That was about one and a half years ago now.
09:43
Early in 2018 the work started. We first made kind of a proof of concept prototype that was taking the original dashboard, the V1 version, but converted it to a new backend and frontend infrastructure and architecture.
10:02
Just to show what would be possible. But that proof of concept was never really merged into the master branch. So I think the branch is still around but we just did it for demo purposes to see what is possible. And to kind of demo what we are capable of doing with this new framework.
10:23
And then we started going ahead. Dashboard V2 was started, and it was first available in its first incarnation in the Mimic release, which was published in June 2018 if I remember correctly. One of the premises of getting there, one of the requirements I would rather say, was
10:47
that we wanted to be able to have all the functionality that the old V1 dashboard provided. We also collected all of the pull requests that had been submitted against the old dashboard and added this or ported it to the new architecture of the dashboard V2 as we back then called it.
11:06
And for a time both the old version and the version under development were in the same git repository and you could switch between them during the development cycle. But shortly before Mimic was released we made a hard switch and moved the code around, so the new dashboard V2 became the new default in a way.
11:28
And again I have a screenshot. So it still somewhat resembles V1, but if you are or have been familiar with openATTIC you can also see that we have taken some inspiration for the UI from there.
11:43
So it's somewhat of a blend between those tools. But it was really something that we started mostly from scratch. So the backend architecture changed as well; we moved to CherryPy, so it's Python-based. We created our own RESTful API, our own controllers and everything.
12:06
The Web UI was designed from scratch using Angular, TypeScript and Bootstrap, so very popular and common web frameworks. And as I said, we somewhat derived the look and feel from the openATTIC UI there.
12:24
Since we were also adding management functionality that was not just display or visualising things we also made sure that access control is provided. So we had to add a very, well back then it was a very simple access control system.
12:42
There was just a single user account, for which you could freely define both the username and password that you needed to enter in order to access the dashboard. So that was about it. At least there was some level of protection to prevent unauthorised users from making changes to your Ceph cluster.
13:01
Because we were now exchanging this kind of security relevant information we also needed to add encryption. The previous dashboard didn't have support for SSL for example, it was just plain HTTP. And yep, the milestone was that all of the features from dashboard v1 were actually included.
13:25
But in addition to that we had already started porting over functionality that openATTIC provided. The things we started with were the block device management, everything related to RBD block devices, as well as the S3-compatible object store management, RGW.
13:44
So you were able to create new users, obtain their access keys, take a look at their buckets, and just that general management functionality was added. Something that was also added in Ceph Mimic was a new way of how the Ceph cluster is supposed to be configured.
14:05
So in the early days there was a file called ceph.conf where you set all the config options that Ceph needed to adhere to. And that file then needed to be distributed across all nodes of your cluster using Salt, Ansible, whatever.
14:23
But it was very important for Ceph that the config file was the same on all of your nodes which was a bit cumbersome. So with that version they moved the majority of the configuration, these key value settings, into Ceph itself.
14:40
So the mons, the monitor nodes that kind of keep track of how the cluster is doing, have their own way of sharing information and data. They also include a database, a key-value store basically. And that one was extended to also store all these various config settings.
15:00
And the dashboard in that version provided a read-only view into the config settings. So you could query them, take a look at their default values, see if they had been modified from their defaults and so on. Yeah, so that was Mimic. And that's when we also noticed quite an uptake in activity.
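The same configuration data the dashboard's config view displays can also be read programmatically through the monitors. A small sketch using the librados Python bindings; the JSON field names in the output ("section", "name", "value") are what recent releases return and should be verified against your version.

    import json
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # Equivalent of `ceph config dump --format json`: the key/value settings
    # stored in the monitors' configuration database.
    ret, out, errs = cluster.mon_command(
        json.dumps({"prefix": "config dump", "format": "json"}), b"")
    for opt in json.loads(out)[:10]:
        print(opt.get("section"), opt.get("name"), opt.get("value"))

    cluster.shutdown()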
15:25
So something that we, as the openATTIC standalone project, had always been somewhat suffering from was that we were out of tree. Any change within Ceph was something that we always had to kind of chase after. So we had to be very careful if the upstream Ceph project made any changes that potentially broke our project.
15:48
And then we had to make adjustments and create new versions. So it was hard to stay in sync. And being part of the upstream project and integrated in the test suite, being in the same Git repo, made this job much easier.
16:02
Because we had our own set of tests that were run, and based on this it was much easier to catch any regressions and make sure that if the developers made changes somewhere else in the Ceph code base, the dashboard could also easily be fixed and improved along these lines.
16:24
So yeah, around that time we had already somewhat grown our developer community. So after the team at SUSE started that work, pretty shortly afterwards engineers from Red Hat also joined that project and started contributing.
16:44
We had a meeting where we met in person to discuss roadmaps and our joint goals. So we identified the features that we wanted to work on for the next Ceph release. For the team at SUSE, one of the goals that we had was that we wanted to basically reach feature parity with everything that openATTIC provided.
17:06
Since we didn't want to be stuck in a mode of developing two tools in parallel for an infinite future, we really had the goal of making sure that the dashboard could at some point replace openATTIC without regressions. Which we didn't reach in the Mimic release. There was still a lot to be done.
17:27
Also the folks from Red Hat had some of their ideas and requirements so they started assigning engineers to these features. And we then really collaborated very closely over the course of the last year for what has now been released as the Nautilus release just recently.
17:47
And that's also the version that I'm going to go over in the demo. But first I would like to give you a quick update or kind of a laundry list of the features that we've implemented in the meanwhile.
18:01
So Nautilus was released earlier this year in March. And the dashboard was really mentioned as one of the key highlights of that release. But of course there were lots of other internal changes and enhancements. But lately the Ceph project has been putting a lot of focus on enhancing the manageability and usability of the cluster as a whole.
18:27
Making sure that command line utilities are more consistent and making the output more meaningful, more readable, things like that. So generally in addition to of course enhancing performance, adding new features with the
18:43
dashboard, usability, manageability were some of the topics that we were really keen on improving. So what did we add for Nautilus? Yeah, the initial version of the dashboard with the single user, single password was just, well, not adequate.
19:01
So we now added a way to add multiple users and also be able to give them specific roles. So you could create a user, for example, that only manages RBD or you could create a user that is read-only. So they can log in and they can see everything but they are not about to make any changes like deleting RBDs, adding pools or things like that.
19:24
So it's very flexible. We have a predefined set of roles and you're free to add new roles, based on, well, a very simple system similar to the UNIX concept; you can assign, I think, read, write and delete permissions.
19:43
I can show you the metrics later on. So that makes it much more flexible and hopefully also more suitable for larger environments where you may have multiple Ceph administrators managing a cluster.
20:01
Also, that's a requirement especially in larger organizations. They usually don't want to maintain multiple pockets of user information. So most of them have an Active Directory, an LDAP or whatever. And the dashboard provides you a way to use an identity provider that supports the SAML v2 protocol.
20:26
So you can configure the dashboard to basically offload the authentication part of the username and password to an external IDP. But unfortunately, you still have to create the user account locally because of the roles and the permissions that the user has.
20:42
But using the SSO, it's now possible for you to, for example, disable users centrally and make sure that other requirements like two-factor authentication or things like that are being implemented. Auditing was added.
21:01
Since the dashboard allows you to make changes to your cluster, it may be a good idea to keep track of who has been doing what and when. And since the dashboard has both the web UI and a backend that provides a REST API, we took an approach where the backend keeps track of this auditing information
21:22
in a similar way to an Apache access log, where you get an entry for each API request, where it originates with the user credentials and what particular action has been performed. By the way, if you have any questions about what I'm saying, just raise your arm
21:41
and I'm happy to address it right away, just to make it a bit more interactive. If there are more detailed questions, of course, we have time at the end. But usually, if something comes to your mind, just raise your arm. New landing page.
22:01
So one of the things that still looked like the old dashboard was the page that you saw when you immediately logged in, that overall status view of your whole cluster. That needed some improvement, so with Nautilus we completely overhauled that page to make sure that it's a bit more logically arranged and really gives you the key metrics and health information of your Ceph cluster at a glance.
22:30
The idea behind it is that this is a page that you can put on a screen in your data center and then you take a look at it and, well, see how your cluster is doing.
22:41
And it's also a bit more live, so we added a lot of graphs that are updated on a very frequent basis, so it's continuously updating the cluster status. Internationalization was added. That was one of the requirements that SUSE had, since as a European company we are addressing a market with quite a lot of
23:10
different languages and we added support for, well, yeah, the first thing that we had to do was actually make the dashboard codebase translatable. And the next step was creating a platform and infrastructure that allowed the community to add languages,
23:27
but we also had a dedicated team of translators working on an initial batch of languages, which by now includes German, Portuguese, Chinese, Japanese, French, Brazilian Portuguese, and I think four or five more languages.
23:47
So it's a list that's continuously growing. And in fact, the day after we had set up the translation platform, Chinese was complete. I was completely in awe about that. We hadn't even properly announced it, but some member of the Ceph community was following
24:04
the development and just went ahead and added all the Chinese strings in record time. That was really amazing. So this is something that we update continuously; every time a new language is more or less complete, we will add it to the codebase.
24:24
And the REST API that is part of the backend received some love, so we have now added an OpenAPI specification based on Swagger. You can use your web browser to basically browse all API endpoints.
24:42
It's self-documenting in a way, so if you add comments to the REST API controllers in the Python backend code, they will also be rendered in this web view of the REST API. Which hopefully makes it easier for developers to not just use the dashboard frontend to perform tasks, but
25:03
just talk to the backend if they want to, for example, create a new pool, obtain certain information or whatever. The intention behind this is that at some point, the dashboard's REST API becomes the de facto Ceph management API.
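As a rough illustration of talking to that REST API directly, here is a hedged sketch using the requests library: authenticate, then query the cluster health. The host, port and admin credentials are placeholders, and the exact endpoint paths should be taken from the dashboard's built-in Swagger/OpenAPI page for your release; TLS verification is disabled here only because the demo uses a self-signed certificate.

    import requests

    BASE = "https://ceph-mgr.example.com:8443"   # wherever the active mgr runs

    # Log in and obtain a bearer token (endpoint name as documented in the
    # dashboard's OpenAPI page; verify against your release).
    token = requests.post(f"{BASE}/api/auth",
                          json={"username": "admin", "password": "admin"},
                          verify=False).json()["token"]

    headers = {"Authorization": f"Bearer {token}"}
    health = requests.get(f"{BASE}/api/health/minimal", headers=headers, verify=False)
    print(health.status_code, health.json())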
25:25
So, new features. OSD management was very high on our list, so we added some functionality to work with your storage daemons. These are kind of the workhorses of a Ceph cluster. You need to keep a special eye on how they are doing and how they are performing, so we now have a dedicated page for that.
25:46
You can manage OSDs in the sense that you can mark them as down or out if you need to do maintenance on them. You can set various OSD-specific settings. We've added something that is called recovery profiles, which basically gives you an easy way to select whether your Ceph cluster should
26:08
be focusing or increasing its priority on serving client load, or if it should spend more time in rebalancing data, for example. That's something that's important.
26:21
If an OSD was taken out or if you've added new OSDs, sometimes these kinds of changes result in lots of internal data movement. So the OSDs talk with each other to share information and data, and you have an easy way of prioritising how much time they spend on that synchronisation work versus serving client load.
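The recovery profiles in the dashboard essentially turn a handful of OSD options up or down. A hedged sketch of doing the same through librados; the option names (osd_max_backfills, osd_recovery_max_active, osd_recovery_sleep) and the "who"/"name"/"value" parameters of `config set` reflect Nautilus-era behaviour and should be checked against your cluster.

    import json
    import rados

    def config_set(cluster, who, name, value):
        # Equivalent of `ceph config set <who> <name> <value>`.
        cmd = json.dumps({"prefix": "config set", "who": who,
                          "name": name, "value": str(value)})
        ret, out, errs = cluster.mon_command(cmd, b"")
        assert ret == 0, errs

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    # Roughly what a "favour client I/O" profile does: throttle recovery work.
    config_set(cluster, "osd", "osd_max_backfills", 1)
    config_set(cluster, "osd", "osd_recovery_max_active", 1)
    config_set(cluster, "osd", "osd_recovery_sleep", 0.1)
    cluster.shutdown()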
26:46
Also, one of the things that we've taken over from the openATTIC project: we use Prometheus and Grafana for capturing and visualising metrics and performance data of your entire Ceph cluster.
27:01
And we basically embed those Grafana dashboards within the Ceph dashboard in various places. There was a separate project called cephmetrics that just provided the Grafana dashboards, but no further integration, and we worked with those developers on updating and converting the Grafana dashboards so they are easy to embed within the Ceph dashboard.
27:31
The config settings viewer that I've spoken about earlier has now been converted into a config settings editor, which makes it a bit more useful, of course, because now you can not just look at the various config settings, but actually change them at runtime.
27:46
So basically any feature, any tweak that you want to apply can be done through the dashboard. So pool management was added. Pools are basically the layer between the object storage daemons and your client application.
28:07
So a pool basically identifies, well as the name implies, a pool of where your data is stored, it could have certain availability and performance characteristics. So managing those pools is quite essential.
28:21
So that has been added. Erasure code profile management, that's something that is maybe a bit too specific to explain in detail, but it basically determines how the data within a pool is being replicated among the OSDs. RBD mirroring configuration has been made easier now.
28:43
Ceph allows you to replicate block devices to another Ceph cluster in an asynchronous fashion using RBD mirroring. Setting this up on the command line can be complicated. We now have a web UI that guides you through the process of setting this up.
29:01
And I already spoke about the Grafana dashboards. So in most places of the Ceph dashboard we usually have a Grafana dashboard that gives you kind of an accumulated overview of all of your nodes. Like for example on the OSD page you have one Grafana dashboard that gives you an overall overview of all OSDs.
29:26
Plus for each OSD a more detailed Grafana dashboard with metrics specific to that particular OSD. Okay, I need to speak up. CRUSH map viewer. So as I said, the CRUSH map is this kind of algorithm that determines how data in your Ceph cluster is being distributed.
29:46
And this is just a viewer so it gives you a graphic representation of kind of the hierarchy that you have identified. So Ceph has a notion of setting up availability areas I would call them.
30:05
So for example you could tell Ceph that it should ensure that the data is available on two racks or on two nodes or in two data centers or what have you. And visualizing that is now possible within the dashboard as well.
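The hierarchy the CRUSH map viewer renders is the same tree that `ceph osd tree` reports. A small sketch reading it as JSON via librados; the field names ("nodes", "type", "name", "children") are an assumption based on recent releases, so inspect the raw output on your own cluster.

    import json
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    ret, out, errs = cluster.mon_command(
        json.dumps({"prefix": "osd tree", "format": "json"}), b"")
    tree = json.loads(out)
    for node in tree.get("nodes", []):
        # Buckets (root, host, rack, ...) have children; OSDs are the leaves.
        print(node.get("type"), node.get("name"), node.get("children", []))

    cluster.shutdown()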
30:22
One of the ways a CephFS file system can be made accessible to non-Linux clients is using NFS. There is an independent project called NFS Ganesha, a user-level NFS implementation that has quite a number of different backends, and one of them is CephFS.
30:40
So the dashboard can now configure NFS shares based on both CephFS and also on S3 buckets. So even though the performance obviously suffers, it is possible to create an S3 bucket in the Rados Gateway and put an NFS share on top of it.
31:01
Just thinking about all the various layers that data has to go through you can imagine that this is not very fast. But it is sometimes an interesting way if you need to let's say do a bulk import of data. Let's say you have an application that used to store its data in a file system but you want to migrate it to an object store.
31:23
So being able to use regular Unix commands to copy that data over into an NFS share that has an S3 bucket underneath would be an interesting import option. iSCSI target management was added. So again this is something to make Ceph more accessible to non native Linux clients.
31:45
Usually if you are having a Linux client talking to your Ceph cluster it would use RBD block devices natively to access block data. But we now can put iSCSI targets on top and then you can use basically any client that speaks the iSCSI protocol to access your RBD images.
32:08
And automating and configuring those targets is now possible through the dashboard as well. We added support for managing QoS, as they call it, but it is more rate limiting, I would say.
32:25
So basically without this feature a single RBD client can easily hog lots of resources of your Ceph cluster without giving the other clients the same amount of speed. So you are now able to set up IO limits like the number of IOPS or the amount of bandwidth that a client can consume.
32:48
And that can be done either on a pool level or even down to an individual RBD level if you want to. We also added alert management from Prometheus.
33:00
So basically in addition to the metrics that we already collect, we have now added a number of default Prometheus alerts. So if you are running Prometheus and it is monitoring your Ceph cluster, it can now also trigger alerts on certain criteria that you define, based on the defaults that we ship.
33:21
And the dashboard is now capable of visualising those alerts within the dashboard. So you can see them directly without looking into your email or whatever notification system you have set up. Ceph Manager module management. That was a feature that was added a few versions ago already.
33:44
So the Ceph Manager is basically a process that takes over some of the administrative tasks. So contrary to the monitors that kind of maintain health and state the Ceph Manager takes care of collecting information.
34:04
It is sending out instructions to the various nodes in a way and it has a plug-in system. The Ceph dashboard itself is a manager plug-in. But there are quite a lot of other additional modules that have been created in the meanwhile. Since they are Python modules it is very easy to create a Ceph Manager module.
34:24
And we have a growing number of these. So to make them a bit easier to enable, disable and configure, we added support in the dashboard for doing that. And with that I think I can hop into a quick live demo.
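Enabling or disabling a manager module is a one-line monitor command underneath the new dashboard page. A hedged librados sketch; the "module" parameter name and the "enabled_modules" field are assumptions based on recent releases, so double-check with `ceph mgr module ls` on your version.

    import json
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # List currently enabled manager modules (`ceph mgr module ls`).
    ret, out, errs = cluster.mon_command(
        json.dumps({"prefix": "mgr module ls", "format": "json"}), b"")
    print(json.loads(out).get("enabled_modules"))

    # Enable another module (`ceph mgr module enable telemetry`).
    cluster.mon_command(json.dumps({"prefix": "mgr module enable",
                                    "module": "telemetry"}), b"")
    cluster.shutdown()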
34:43
And if there are any questions. As I said, just jump in and I hope we maintain enough time in the end to look into this. I am going to quickly change my screen setup because I... Oh, this way might work. Let me just switch to the browser.
35:03
I am moving it over to the other screen. Can you see that? Okay. Let me increase the font size a bit.
35:20
So now that is the start page. Maybe it's also worthwhile mentioning that from the start we made sure that the dashboard is very easy to customize and to change the branding. Because, well, both Red Hat and SUSE for example have their own products that are Ceph distributions.
35:41
And if they want to add their own look and feel, we wanted to make it easy for them to use their company-specific schemes. For example, changing the logos and those kind of things. So that is easily doable. This is the upstream branding on how it ships. As you can see here, you can choose among a growing number of languages that you want to use the dashboard in.
36:05
I am going to stick to English just for the sake of this demo. And in this development environment there is a default admin-admin user. And no, I don't want to save that.
36:22
Thank you. So this is the start page. I am going to reduce the font size a bit so you can just see it in one glance. That is what we call the landing page in which you can see how your Ceph cluster is doing.
36:40
So you see the overall health status, how your monitors are doing, which are the ones that take care that your cluster availability is proper. Of course the OSDs, how many you have. As I said, this is a laptop demo environment, so five OSDs is really the bare minimum for having Ceph up and running.
37:01
So you wouldn't be starting something like that in production. We have just one manager daemon running. If you want the dashboard to be highly available, you would be thinking about creating more than one manager daemon. So many environments have set up usually three managers.
37:22
And at any time only one of them is the active manager. And this is where the dashboard would be running. If there would be a failover to another manager, you would just be redirected through the web browser to the currently active instance. Object gateways, hosts.
37:41
Well, it's just a single host, so not very exciting. It's also an idle cluster, so there's not much IO activity going on, since this is a very measly laptop and I don't have lots of disk space. I made that mistake in a previous conference where I had some demo workload going on
38:03
where I was just reading and writing data to just get the widgets running. And then during the demo it ran out of disk space and everything fell apart. So I'm not trying that again. So this is really, hopefully, most of the relevant information at a glance.
38:21
At some specific states you can dig deeper. So let's go to the hosts for example. I could either click on hosts here or I would go to cluster, hosts, and I'm going to increase the font size again a bit. So if you would have more than one host, that table would be longer.
38:42
Each table has a search form where you can easily and quickly drill down, matching on specific parameters. And in here you see the host and what services are running on it, what version. Right now, well, that's the development version. And from here I could then see more specific information, performance metrics.
39:02
I'm going to show these in a different screen. And overall performance would be an example for one of the Grafana dashboards that we have in which you get an aggregated view of the performance metrics of all hosts in your cluster. So that's something that Prometheus takes care of.
39:22
So that's hosts. How are we doing time-wise? I need to make sure. The mons are another key component. So here you can see how your mons are doing, how many sessions they are running, how many of them are in quorum, which means that they are in agreement
39:43
about the current health state of the cluster. That page still very much resembles the original dashboard version 1. We haven't really looked into how to enhance or improve this, but we are seeing an uptake in the number of dashboard users,
40:01
so I'm hopeful that they will provide us with feedback on how to enhance and improve those pages, and if some information is missing. Quickly moving on, OSD. So this is one of the pages that you likely will be spending more time on. Again, you can quickly filter that list by various criteria.
40:23
Maybe if you just want to take a look at this one on a single host, you can see the status, usage. Those graphs here usually show IO activities and so on. And you have a number of things that you can configure here.
40:42
Cluster-wide configuration, for example, gives you a way. Let's say we don't want Ceph to add an OSD back into the cluster once it has been marked as out, just to show you that this is actually working. As you can tell, Ceph immediately shows a yellow health indicator, so it's not happy anymore.
41:05
Let me go back. Okay, I have a health warning here. And why is that the case? Because I just enabled the noin flag, which can become a problem because even though the OSD may be alive again,
41:21
we're prohibiting it from rejoining the cluster. So you may run into a degraded mode at some point. At this point, it's just a warning, nothing to worry about. I'm just doing it to play around with it some more. The recovery priority I spoke about. So basically, here you have a way to quickly make changes
41:43
to the various settings that define the priority and how recovery activity has been prioritized over serving client load. And we have a number of default profiles that you can select from, but you can, of course, customize them if you have more specific needs
42:03
according to the values that you have determined for your environment. So let me put it into low. Now I'm going to be nasty and I'm going to tell that OSD to go down. And yes, I'm sure about that.
42:25
And after a short while, it should also be seen as down here, waiting. That's the demo effect.
42:44
Interesting, that's maybe me doing something wrong because it's live, or we have found a bug. In any case, let me try something else.
43:05
Let's mark this out, that should work. There we go. So now, going back to the status page, it should also tell me that one of my OSDs is out and no longer available.
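For reference, what the demo just clicked through corresponds to a couple of monitor commands. A hedged sketch via librados; the parameter names ("key" for osd set, "ids" for osd out) are assumptions based on the CLI help and should be verified on your release.

    import json
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    def mon(cmd):
        ret, out, errs = cluster.mon_command(json.dumps(cmd), b"")
        assert ret == 0, errs
        return out

    mon({"prefix": "osd set", "key": "noin"})           # ceph osd set noin
    mon({"prefix": "osd out", "ids": ["3"]})            # ceph osd out 3  (example id)
    print(mon({"prefix": "health", "format": "json"}))  # should now show a warning

    cluster.shutdown()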
43:22
At this point, Ceph should start shuffling data around. As you can see here, recovery throughput is now showing some activity. You can also see here in this placement group status that there's some activity going on. Placement groups are kind of a way of how Ceph groups objects together
43:43
and how it makes sure that they're evenly distributed among the OSDs. So with this OSD being down, it still has redundant copies of the data elsewhere. Now it's attempting to re-establish the redundancy level that I've defined.
44:00
So now it's moving data around to make that happen. And if you go, for example, to the Ceph pools list that we have here, you can now see which of your pools are being impacted by this and in what state the various placement groups of that particular pool are.
44:20
I have to speed up a bit, so I don't think I can go through each of those tabs. The CRUSH map might be interesting. Well, as everything is running on a single node and it's a very small cluster, that tree is not very big, but you can collapse and expand various nodes here. And depending on your CRUSH map, you may see a more detailed hierarchy here.
44:44
Clicking on one of these OSDs would then give you some more information. Thank you. Quickly going on images. If you agree, I'm going to spend maybe five more minutes on the demo
45:01
and then we move to a 10-minute Q&A just to give you some more of an impression. Let me quickly create a pool so you can see this. So one of the things that if you want to, for example, store RBD images in SEF, you would create a pool usually named RBD. You could choose how the data should be distributed.
45:23
It could either be what is called erasure coding. I'm using replication just to copy the data. Let's make it a little smaller. We start with eight placement groups and we store three copies
45:40
and the application of this pool is RBD. Thank you. I could enable compression, I could create quotas and so forth. I'm not going to mess with this at this point and will just go with the defaults. OK, so now my RBD pool is being created and there we go.
46:02
And now I can actually go into my block device list and create an image. So that's basically the process of creating a block device that you can then attach from another Linux client to store data on it.
46:25
Again, there are lots more things that you can play around with here. Due to lack of time, I'm just going to gloss over these. Feel free to test them and toy around with them by yourself. So there we go. The RBD image has been created.
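Roughly the same two steps from the demo (create a pool for RBD, then create an image in it) can also be scripted with the rados and rbd Python bindings. A sketch; the pool and image names and the size are arbitrary, and the parameter names of the "osd pool application enable" command ("pool", "app") are an assumption to verify against your release.

    import json
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    if "rbd" not in cluster.list_pools():
        cluster.create_pool("rbd")

    # Tag the pool for RBD (`ceph osd pool application enable rbd rbd`).
    cluster.mon_command(json.dumps({"prefix": "osd pool application enable",
                                    "pool": "rbd", "app": "rbd"}), b"")

    ioctx = cluster.open_ioctx("rbd")
    rbd.RBD().create(ioctx, "demo-image", 4 * 1024**3)   # 4 GiB image
    print(rbd.RBD().list(ioctx))
    ioctx.close()
    cluster.shutdown()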
46:42
I didn't enable NFS so that's something I can't show. I also didn't enable mirroring and iSCSI in this demo environment. Ceph file systems may be something quickly worthwhile looking into. So in this setup there's one CephFS file system and if it would be mounted by clients
47:02
I would see the activity of these clients and how much I/O they are performing, for example. RGW is the object gateway. Here I could see... That's bizarre.
47:23
Congratulations, you have found a bug. Yes, so object gateway refuses to work. That's interesting. Let me just reload the page and see if that helps. This is a snapshot from master two days ago. So it's not in a release version. So bugs can occur and they do.
47:46
Let's see if that helps. Nope, it doesn't. So in that case... Oh, there we go. Yeah, all right. That's the point where the demo falls apart. I'm not going to tempt the demo gods any longer.
48:01
I hope this gives you at least a brief and quick impression of what we have accomplished so far. I'm going to switch back to the demo, to the presentation real quick for final words. And even that refuses to work. Brilliant. Ah, there we go.
48:22
Okay. So even though I have not been doing much IO or anything, it seems like my laptop is now very busy dealing with Ceph in the background and not giving my presentation any more priority.
48:41
Okay, let's stop at this point before it gets more miserable. Roadmap. Well, Ceph Octopus is underway already, so we are now working on more new features. There will be a focus on more user management-related things, like users being able to refresh
49:02
or have a password expiry, something that comes to mind at this moment. So several things related to user management and security, something we are going to add. The whole Rados object gateway management functionality will be enhanced. So these are some of the topics that we are spending some time on.
49:21
I have pointers to our, quote, roadmap in the presentation that I will be sharing. So even though it currently does not cooperate with me any more, you will be able to take a look at the slides offline later on. And with that, I think we have time, 10 minutes, for questions. Sorry that this broke down a bit at the end.
49:42
That's what happens when you do live demos. Yes, thank you, Lenz, for the presentation. Are there questions? Yes, there's a question.
50:03
You mentioned the Grafana dashboards in the web page. Is it possible to import them to your own dashboard so you have it all in one place, or do you have to use the monitor? So the way that we are doing it
50:21
is that we have a stock set of Grafana dashboards that are part of the Ceph distribution and are made specific in their design so they can be embedded by the Ceph dashboard. But since this is a standalone Grafana instance, you can actually contact this Grafana instance directly. You can add more dashboards if you want to.
50:41
But in that case, they wouldn't be visible within the Ceph dashboard unless you explicitly modified those dashboards that we're embedding. That is possible as well. But by default, we just have a set, and those are visible. But you are free to add more if you want to.
51:02
Yes, thank you. Are there more questions? Okay. It doesn't seem so. There's one. Ah, there's one. Oh, sorry. Give an example of where it's being used and some war stories or something you like about it.
51:22
So illuminate the background a little bit. Could you repeat the first part of the question? War stories about what? Need not be war stories, but where it is being used from practical, from actual... Why a dashboard at all? No, no, yes.
51:40
I think among your customers, it's already used somewhere. Give some examples of where it's beneficial. Even though the dashboard was introduced in Mimic, with the Nautilus release it's really the first version where it will be productized in downstream products. So both SUSE and Red Hat are...
52:01
Well, SUSE has just released their downstream product based on Nautilus. Red Hat is still working on it. But we are now just starting to roll this out to SUSE customers, speaking as a SUSE representative. But there are a number of reports on the Ceph users mailing list of people using it, looking into it. The people working on Rook, making sure that Ceph runs in containers,
52:24
are looking into this and enabling it by default. There's a growing uptake, but for me, so far I don't really have any real insight into how users are using it in their production environments.
52:41
Ceph users are a bit more conservative when it comes to updating to the latest versions. I see there's a growing increase in the number of reports and feedback on the dashboard, but we are still at an early stage. So I don't really have any big war stories to share. But of course, I'm very happy to hear about people using it
53:01
and what they like about it, what they dislike about it, what they're missing. At this point, it's still an incarnation of what we think would be the right things to do and to visualize. And we are a small team. Yes, we are of course influenced by the product management teams of the companies behind it
53:21
so they have their ideas. But this is an upstream community project and the users are the ones that should tell us what they would like to see and what they like or what they dislike. We do get our fair share of bug reports so there definitely is usage going on. But I personally would love to get more direct feedback
53:42
on how the usability is, how the user experience is, how we can enhance this. We are aware that the dashboard is pretty raw at certain levels. So if you take a look at some of the dialogues that we have, for example creating a pool or something like that, that is still pretty much an adaptation of the command line
54:00
in the sense that you need to know all these values and parameters and you need to be familiar with what you're entering in here. So at some point, where I would like to get to is being more workflow-oriented and basically starting from a task, not being too Ceph-specific on how to reach that goal,
54:22
but asking the user what would you actually like to achieve and then guiding him or her through those various steps. Yes. Another question? Or questions answered. Or questions or comments. Out of curiosity, who of you has actually looked and touched that dashboard?
54:46
OK. Fair enough. Thank you. Cool. That's good to hear. Well, feel free to grab me. I'll still be around until late in the afternoon today. I'm happy to have conversations about this.