SQL on Ceph
Formal Metadata

Title: SQL on Ceph
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/61649
Transcript: English (auto-generated)
00:13
Okay. Welcome everyone. Starting with the next session.
00:22
So yeah, try to find a place. Feel free to use the reserved spaces if nobody comes. Let's welcome Patrick and his talk on SQL on Ceph. All right. Hey everybody. I'm Patrick Donnelly.
00:41
I know I've got Red Hat slides up, but I'm actually part of the storage group at Red Hat, which got moved over to IBM as of this year. So technically, I'm at IBM now, if that matters for anybody who wants to ask me questions. Today, I'm going to be talking about SQL on Ceph. It's a small project that started about two years ago.
01:03
It was actually a COVID project for me, while I was dealing with a newborn baby and had some time on my hands. But anyway, this is kind of an overview of what we're going to talk about.
01:20
Yeah, go ahead. Yeah, I do. But we don't have a PA, unfortunately. Oh, yeah. I just have to speak up. I'm naturally soft-spoken. So if you can't hear me in the back, just wave your hand and I'll try to speak up more, okay? I'm okay right now? All right.
01:42
Where was I? So, just quickly canvassing the audience: who's used Ceph before? Oh, wow. Okay. Who's used SQLite before? Fewer people. That's interesting. Okay. But not much fewer. All right. I'm going to quickly talk about Ceph and what it is.
02:07
I won't spend too much time on it, given the time I have available in my talk. I'll give you a brief introduction to Rados for anyone who's not familiar with it. Then I'm going to talk a little bit about SQLite, then some typical storage patterns
02:21
we use for storing data on Rados. I'll give you an introduction to this new library, libcephsqlite. Then I'm going to talk about how we use it today within Ceph, just to show that this library is not something that nobody is using.
02:42
Although I am interested to hear if anyone's using it in the community. Then I'll go through a brief interactive tutorial using the library, and I'll end the talk with a retrospective and some future work. So what's Ceph? Ceph is
03:01
an ecosystem for distributed object storage. So it's composed of numerous projects centered around managing a large storage cluster. The underpinning of Ceph is Rados, which I'll talk about on the next slide, but it's basically a distributed object store.
03:20
Most people don't use Rados directly. What they use instead are the storage abstractions we built on top of Rados, which provide the more popular storage mechanisms people are familiar with, including CephFS, which gives you file storage as a distributed file system; RGW, providing the S3 object storage gateway,
03:44
and RBD, which gives you block device storage on top of Rados. Ceph has evolved a lot recently to become more user-friendly. If you had a poor experience with Ceph in the past, I encourage you to give it another shot.
04:01
The dev team has dedicated a lot of time recently to improving the user experience, and also to taking the hassle of managing your storage cluster out of the experience. There are things like device health monitoring.
04:21
We now have a very mature dashboard for interacting with the Ceph cluster, and cluster management itself is now largely done through cephadm. As you saw in the previous talk, you can start up a Ceph cluster with a single command and then start adding hosts to it.
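As a rough sketch of that workflow (the IP addresses and host name here are made up):

```
# Bootstrap a new single-host cluster; cephadm deploys the first monitor and manager.
$ cephadm bootstrap --mon-ip 192.168.1.10

# Add another host to the cluster through the orchestrator.
$ ceph orch host add host2 192.168.1.11
```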
04:42
It's never been simpler. So, what's Rados? Rados is a collection of object storage daemons (OSDs) that run on physical disks. They can be hard disks, SSDs, or NVMe drives. On top of these object storage devices,
05:02
we have this concept of pools, which lets you set various administrative policies regarding what kind of hardware the pool should use and how the data should be replicated. Clients of Rados talk directly to the primary object storage device for a given object, and you can look up which object storage device an object belongs to in
05:24
constant time using an algorithm called CRUSH. You don't need to use that directly; that's just under the covers. Then, as the name suggests, it's a reliable autonomic distributed object store: the cluster self-heals,
05:43
it's autonomic, and the replication is done automatically. You don't have to worry about how any of that works. Sorry. So what's an object? The object storage device is composed of a number of objects,
06:01
and that is the logical unit you have when you're storing things in Rados. An object is composed of three different parts; you can use one or all of them. There's the data blob, which is analogous to a regular file: you put data into it, you get data out of it. You have key-value xattrs.
06:22
This is an older technology that was used in the early days with CephFS for storing certain information about files, which is typically very small data. It's not usually used anymore except in some parts of CephFS. Now, the key-value store that's used most often in CephFS,
06:41
and also RGW, is OMAP, which is a much more general-purpose key-value store used today. So this is how you interact with Rados: through these objects.
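As a rough illustration of those three parts, here is what poking at a single object with the rados CLI might look like (the pool, object, key, and file names are made up):

```
# Write the object's data blob, analogous to writing a file.
$ rados -p mypool put myobject ./payload.bin

# Set an extended attribute (xattr) on the object.
$ rados -p mypool setxattr myobject myattr somevalue

# Set and then list OMAP key-value pairs on the same object.
$ rados -p mypool setomapval myobject mykey myvalue
$ rados -p mypool listomapvals myobject
```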
07:02
Now, it's not that simple to take a number of objects, distribute them all over the cluster, and try to build something with them, because you've got consistency issues that you have to deal with. You've got to manage how you're going to stripe the data across all these objects, which is why we have the more popular abstractions I talked about, CephFS, RBD, and RGW; that's how you typically interact with Rados. So what I'm going to talk about today is a SQLite library
07:24
that operates alongside these other three storage abstractions. It gives you something on top of librados, so you can now also run SQL on Ceph. So how do you typically do application storage on Rados?
07:41
Well, we have various bindings you can use to talk to Rados. We have the typical C/C++ bindings, which are part of the broader project and also used within Ceph. We also have a Python interface, which is used for manipulating the objects, and that's
08:01
somewhat used in the broader community for various projects, but within Ceph we also use it for some of the newer Ceph Manager modules, which I'll talk about more later. And again, it's not that simple to stripe data across objects, which is why we have these other abstractions. One of the more notable exceptions is libradosstriper,
08:21
which is one of the ways you can create a file concept on top of objects, where you open and close, read and write, and sync to a number of objects, and it looks like a regular file. That was developed by some folks at CERN, and in terms of use, I think it has mostly stayed confined to that space.
08:53
Even though we do have these other storage abstractions, it's still useful to talk to Rados directly because sometimes you want to do something
09:01
that is not dependent on these other storage abstractions, which, in the case of Ceph's internals, may not actually be available. That's why a number of Ceph Manager plugins (the Ceph Manager has a number of Python modules) talk directly to Rados.
09:21
So this was something I wanted to address because it was a little bit awkward, and I'll talk about that more. So a quick overview of SQLite. For those who've never used it before, it's a user application library that acts as a SQL engine
09:41
and lets you store a SQL database as a regular file, usually two files: a journal and the database file itself. Depending on how you use it, the journal is transient and may come and go. It's widely recognized as one of the most widely deployed database engines in the world.
10:01
It's very popular. They estimate on their website that there are billions of SQLite databases in use; it's at least tens of billions at this point, because it's in every Android phone. So it was an easy choice to make. It's a very simple library, and bindings exist for numerous languages.
10:21
Of particular interest to me was Python. Actually, extending SQLite is fairly simple. They have this VFS (virtual file system) concept, which lets you swap in different virtual file systems as needed.
10:41
The basic one is the Unix VFS. That's what comes with SQLite by default, and it's very intuitive: it just passes open, read, write, and close off to the local file system for execution. So libcephsqlite is
11:02
a SQLite VFS library. It lets you put a SQLite database in Rados. It's composed of two parts, libcephsqlite and SimpleRADOSStriper. I'll talk about SimpleRADOSStriper on the next slide.
11:20
The use of this library does not require any application modification, and that's kind of the killer feature here: you can just set some environment variables and modify the database URI, and you automatically start storing your database in Ceph. The journal objects and the database objects
11:41
are striped across the OSDs; you don't need to do anything differently. SimpleRADOSStriper is based loosely on the libradosstriper developed at CERN. The main reason I didn't end up using CERN's library was that it had some locking behavior that was not really desirable for a highly asynchronous use case.
12:03
I didn't want to modify their library out from under their feet, so I just wrote a simple version. It provides the primitives that SQLite needs, open, read, write, close, sync, and all the writes are done asynchronously.
12:23
Then the sync call that comes from SQLite actually flushes them all out. The stripes are stored across Rados as objects named after the database, foo.db, with the block number within the database appended, and so on. Using libcephsqlite, again, is very easy.
12:41
You just have to load the VFS library. This is done with the SQLite command .load libcephsqlite.so, and then you just provide a URI for the database. This is the pool ID or pool name, the namespace within that pool,
13:00
which is optional, and then the database name, with the VFS specified as ceph. That's it; it just works. You may have to set some environment variables if you're using the sqlite3 binary, to tell it which Ceph cluster to use or which Ceph config to read, things like that, but none of that is very obtrusive.
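As a minimal sketch of what that looks like in the sqlite3 shell (the pool, namespace, and database names here are placeholders):

```
sqlite> .load libcephsqlite.so
sqlite> .open file:///mypool:mynamespace/mydb.db?vfs=ceph
sqlite> CREATE TABLE t (x INTEGER);
sqlite> INSERT INTO t VALUES (1);
```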
13:21
Within the Ceph Manager: the Ceph Manager is one of the newer daemons in Ceph that takes care of certain details of managing your Ceph cluster and tries to provide easier interfaces. Of particular interest to us
13:41
is one that handles health metrics coming from the OSDs, giving the Ceph Manager information about the SMART data associated with the disks so it can anticipate disk failures; again, Ceph trying to reduce the management burden of storage clusters. It's also a portal to higher-level commands,
14:01
like managing volumes within Ceph, that is, the subvolume concept used by OpenStack or the Kubernetes CSI. Within the Ceph Manager daemon, what I observed was that there were several modules that were just storing data in the OMAP key-value store
14:20
of a particular object, and it turned out this doesn't scale very well. We know it won't scale well, because if you have more than 10,000 key value pairs in a single object, the performance starts to degrade. In fact, you'll start getting cluster warnings that there's objects with too many key value pairs. It was also pretty awkward
14:40
in terms of how it was being used, and just by how we were managing the data, it was a perfect match for a SQL database, except it was not very easy to put a SQL database on Ceph at the time. In fact, Jan here worked on SnapSchedule,
15:04
which is a module for creating and maintaining snapshots in CephFS and handling retention policies. That actually used a SQLite database that was flushed to Rados objects and then loaded,
15:20
in anticipation of the project that I'm working on now, and that's all been updated to use the libcephsqlite library. In terms of how it actually looks within the Ceph Manager, on the left we have a schema. It's fairly simple.
15:42
It just creates a table with the device ID as the primary key, and then another table for device health metrics with the time we got the metrics, the device ID associated with that metric, and the raw SMART text. Then they actually put the device metrics into the database.
16:00
It's as simple as this within the manager. I've taken a few unnecessary keywords out of the SQL for space. You create the device ID, which just calls another SQL statement to insert into that table, and then actually execute the SQL statement with the epoch, device ID, and data.
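As a hedged sketch of what that schema and insert look like (the table and column names here are illustrative, not the exact devicehealth module code):

```
sqlite> CREATE TABLE Device (devid TEXT PRIMARY KEY);
sqlite> CREATE TABLE DeviceHealthMetrics (time DATETIME, devid TEXT REFERENCES Device (devid), raw_smart TEXT);
sqlite> -- inserting a metric then looks roughly like:
sqlite> INSERT OR IGNORE INTO Device (devid) VALUES ('ata_ExampleDisk_SN123');
sqlite> INSERT INTO DeviceHealthMetrics (time, devid, raw_smart) VALUES (datetime('now'), 'ata_ExampleDisk_SN123', '{"smart_text": "..."}');
```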
16:22
It's that simple, and now that's persisted in Rados. Here's a quick series of GIFs I've created showing libcephsqlite in action. Here we're running the ceph status command, just showing us the state of the cluster. We have two pools right now, the .mgr pool and an "a" pool that I'm creating for this demo.
16:44
Here I'm purging A just to show that there's nothing in it. It removed one object, and here I'm just listing all the objects within this pool. There's none because I just purged it, so that's just a starter. Then here we're actually going to run
17:01
some libcephsqlite. To do that, again, I mentioned there were some environment variables. If I'm using the sqlite3 command directly, I have to set some environment variables so the library knows what to do. Here, because this is a dev cluster, I have to tell it to use the library path associated with my build.
17:20
I specify which Ceph config to load and which keyring, associated with the admin user that I'm going to specify here. I was also going to add some logging, but ended up not doing that just to save space. Here I'm actually running the sqlite3 command and loading the libcephsqlite library.
17:44
That's one of the first commands that sqlite3 is going to run. Here I'm opening a database in pool "a", namespace "b" within that pool, with the database name a.db and the VFS set to ceph.
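Reconstructed roughly, the setup and invocation look something like this (the paths are illustrative, and the exact environment variables used in the demo may differ; CEPH_CONF and CEPH_ARGS are the usual librados knobs):

```
$ export LD_LIBRARY_PATH=/path/to/ceph/build/lib:$LD_LIBRARY_PATH
$ export CEPH_CONF=/path/to/ceph.conf
$ export CEPH_ARGS='--id admin --keyring /path/to/client.admin.keyring'
$ sqlite3
sqlite> .load libcephsqlite.so
sqlite> .open file:///a:b/a.db?vfs=ceph
```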
18:00
Now I'm in sqlite. Here I create a simple table with an integer column. There's the schema, exactly what I wrote. Then we're going to insert into the table one value, and then dump it. It's now in rados.
18:22
Now just to confirm that, I'm going to run the rados command on the pool A, list all the objects in the pool. You can see in namespace B, I have this a.db. I'm going to use this striper command. If this database were composed of many objects,
18:42
you can use the striper command to actually pull the database out. You can see here I've done that. It's an 8K database. It's small because there's just one table with one value. I loaded that locally. I pulled it out of rados. Sorry, the gif loops. I pulled it out of rados,
19:01
and then I now have the database as a local file. I ran sqlite on that local file database and just dumped it to confirm that it actually wrote the data out to rados correctly, and I can pull it out of rados and verify that it actually worked.
19:21
Here's another demo, just rerunning the same sqlite3 command I had earlier. Sorry, this is going to be a big paste, but I'm creating a table. This is just a bit of SQLite magic to basically generate a long series of values; I'm going to insert about 100,000 integers into the table.
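The usual SQLite trick for generating a series like that is a recursive common table expression; a sketch (the exact statement in the demo may differ):

```
sqlite> CREATE TABLE t (x INTEGER);
sqlite> WITH RECURSIVE cnt(x) AS (
   ...>   SELECT 1 UNION ALL SELECT x + 1 FROM cnt WHERE x < 100000
   ...> )
   ...> INSERT INTO t SELECT x FROM cnt;
```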
19:41
Then we see how many objects the database is composed of now; there are four objects making up the database. For time reasons, I'm not going to go through the performance notes, but they're on the slide if you want to look at them later.
20:01
Just as a retrospective for Quincy, where the library was first used live: it's being used in two manager modules right now, the device health module and the snap schedule module. It's been fairly successful. We had a few minor hiccups that weren't really related to the library.
20:25
As for some future work, I want to add support for concurrent readers; that's not yet possible. Right now all readers and writers obtain an exclusive lock when accessing the database. There's no technical reason why we can't add concurrent reader support.
20:42
And then I also want to look at improving read performance with read-ahead, because right now every read call in libcephsqlite is synchronous. That's the end of my talk. Thank you. Do we have any time for questions? No? Okay.
21:02
Sorry, no time for questions this session. You have to find Patrick and ask him.