Distributed Storage in the Cloud

Formal Metadata

Title
Distributed Storage in the Cloud
Title of Series
Number of Parts
542
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
The cloud brought many innovations; one of them is inexpensive, scalable and sometimes secure distributed storage options. In this presentation we will talk about the distributed storage options modern clouds offer, ranging from elastic block devices and object storage to sophisticated transactional data stores. We will discuss the benefits and new architecture options such distributed storage systems enable, as well as the challenges and pitfalls you need to be aware of.
Transcript: English(auto-generated)
OK. Well, some of you may have been here for my first presentation. This one is going to be different, much more technology-focused, if you will. We will talk about distributed storage in the cloud.
My goal in this presentation is to give you a very general overview of the options which exist. I am not an expert, right? And probably something I'm going to say is even going to be wrong.
So if it is, then say, like, "This is fucking wrong, Peter," you know? Then I can fix my slides for the next time I give this talk. So don't be shy. Be engaged, and that's going to be more fun for all of us. So, the thing I would start with is something we discussed already.
I believe there are different ways you can approach the cloud. One is where you really lock in with a cloud provider. The other is where you use one of the open source solutions out there.
As I said in my previous presentation, I would imagine that is how the cloud was originally taken. I won't spend too much time on this because I already had a presentation on it, and also because we don't have too much time.
Now, one thing people often ask me about is open source, which I think is what this conference is about.
If you think about open source from the business standpoint, we see a lot of different companies which promote themselves as open source, or somewhere around open source. But how do you know if it is for real?
Of course, you can look at the open source license and so on, and that is all good stuff. Another way is to ask yourself, or maybe even a company representative, some questions about how things look.
One: can you deploy that solution or product on your own without incurring any additional cost? The software may be open source and the source available, but maybe the binaries are provided only to people who have a commercial subscription.
In that case, it is technically open source, but on the practical side there are problems. In particular, I have seen some open source projects which essentially withhold details about the build process, so it's not easy to do that yourself.
Another question I always like to look at is the choice of vendors if you need any help. For many companies, just saying "hey, we're going to go it ourselves" is not going to work, or they want to hire somebody.
And some of the licenses around open source have restrictions like: you know what, you cannot provide consulting services around this software, or something like that.
And I think the third very valuable thing about open source is to see whether you can improve the software for your purpose. If something in an open source project doesn't really fit, can you contribute to it?
And again, that, I think, is another very interesting property of open source software where there may be different shades. Some open source vendors may be more or less open to that kind of thing.
Now, with that open source public service announcement done, I will touch briefly on Kubernetes. I think I spoke about it, and the previous speaker spoke about it as well. It is a fantastic API, and it is something we are going to focus on here.
And why do I mention Kubernetes here? As we are going to talk about open source storage in the cloud, I will focus a lot on what choices you have in the Kubernetes environment, because if you're speaking about the cloud, about modern large-scale applications, a lot of that is now being built around Kubernetes.
OK, now, storage in the cloud: what does that really correspond to? Well, there are a lot of different storage types.
Here is the list we can consider, which ranges from simple things such as node-local storage all the way to databases. I define storage in a very general way here: if you need to store data somewhere, well, that is storage.
Some of these, like node-local storage, are relatively simple. They are direct replacements for the file systems we have had in our operating systems for a long time. Others, such as databases, can be very complicated.
It is not just "a database". Databases differ by data model, query language, various internal design decisions, and so on. Even if you look just at the data model, these are some of the most common data models you would see.
What is interesting is that over the last maybe 10 years we have seen an explosion of different special-purpose databases, versus the situation before, where I think relational databases absolutely dominated the ecosystem.
What is also interesting is that databases now are not focused on just a single data model: many databases are able to support multiple data models, which I think is a big trend, and potentially even speak multiple protocols.
Here are some examples. If you look at ClickHouse, which is an analytical database, it is able to talk its own ClickHouse protocol as well as the PostgreSQL and MySQL protocols. So the idea is: whatever programming language and libraries you already use, you can just connect to us and run your queries. Fantastic idea, right?
Or the time series database VictoriaMetrics also implements things like the InfluxDB and Graphite APIs for data ingest. Again, I think, very smart.
We also see some frameworks which do conversion and translation. For example, the FerretDB project allows you to use a PostgreSQL back end with a MongoDB front end. Or Amazon recently released Babelfish, which turns your PostgreSQL into a Microsoft SQL Server compatible database. So a lot of interesting integration is going on these days.
If you look at databases, you also see a lot of difference in purpose and design: operational versus analytical, how it is used, how it is internally structured, and so on.
The reason I am listing all this is that in a complicated environment it is very unlikely you will be limited to only one database.
Of course, as the previous speaker mentioned, you probably don't want to have 50, because that is way too much complexity, and you want to be very mindful about how you introduce them to your environment, but it's probably going to be more than one these days.
Now, besides storage, we are also speaking about distributed storage. Why is that important? Well, if you think about it, it is all about redundancy, performance and scale.
I mean, if I just have storage which is not distributed, which is really one device only, I will be limited in all of those. I think this is even more important in the cloud.
Before the cloud, we would often have one very powerful, very redundant server, maybe with hot-swap RAID and redundant power supplies, and we expected that beast never to go down.
Well, that is not how we operate in the cloud anymore. We assume any component in the cloud is going to die, and they actually do die more frequently. If you look at stats like the mean time between failures for VMs, compared to what you could get with some beast from the past, it is going to be different. That means we need things distributed, at least from a high availability standpoint.
OK, with that, let's look a little bit at the storage types, as promised. First are the commodity storage types, and this connects to the previous talk I did. These commodity storage types are pretty much the same in every cloud. There are minor differences, but they are, I would say, commodity building blocks.
They have a relatively simple interface, and usually it is relatively easy to migrate. So the lock-in, the word we don't like on this track, is going to be relatively low with them.
One is node-local storage, which I mentioned. Pretty much every major cloud, and even your second-tier clouds, typically offer some kind of local storage. It can vary in terms of the performance it offers and so on, but what it gives you is pretty much the same. And that is fantastic.
But that is where I would focus on performance, because that is where surprises can await you. Say this cloud vendor and that one both have local storage: one of them implements it as very fast NVMe flash storage, the other as something not so fast. That can make a very big difference for your application.
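One way to check the performance point above for yourself is to time small synchronous writes on each candidate volume. The following is a minimal sketch, not a benchmark from the talk: the function name `measure_fsync_writes`, the block size and the write count are all assumptions chosen for illustration, and a dedicated tool such as fio is the better choice for real measurements.

```python
import os
import tempfile
import time


def measure_fsync_writes(directory, block_size=4096, count=200):
    """Write `count` blocks of `block_size` bytes into a temporary file in
    `directory`, calling fsync after every block.

    Returns a tuple (writes per second, megabytes per second).
    """
    payload = os.urandom(block_size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            # Force the block to the device, not just the OS page cache.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    # Guard against a zero reading from a coarse timer.
    elapsed = max(elapsed, 1e-9)
    return count / elapsed, (count * block_size) / elapsed / 1e6


if __name__ == "__main__":
    # Point this at a directory on the volume you want to compare.
    ops, mbps = measure_fsync_writes(tempfile.gettempdir())
    print(f"{ops:.0f} fsync'd writes/s, {mbps:.2f} MB/s")
```

Running the same script against a directory on each storage option gives a rough first impression of how far apart two "local storage" offerings can be for write-heavy workloads.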
The second most common one is network block storage. That's typically how we store data in the cloud so it can survive the death of an instance. Amazon's would be EBS, and all the other clouds have something similar.
We also have some additional solutions here coming from proprietary vendors, which provide you some additional features.
And there are actually quite a lot of different solutions if you want to roll out block storage with open source. I think this is very cool, and it shows how things are evolving in the open source space. We have had this block storage idea for a long time, so a lot of projects have evolved and we have a lot of choices.
The next type of storage in the cloud is file storage, where you can mount something locally, not as a block device but as a file system. In many cases, that would be an NFS- or SMB-compatible file system, or both. Again, all the clouds support one or another of these file systems.
There are a number of major proprietary vendors which support solutions here. And again, in open source there are also solutions. You can see there are some connections: many open source projects just say, hey, we are focusing on storage, and they may provide different interfaces, which makes sense.
The next one is the object store. I think that is a very important component which appeared with the cloud; it is, interestingly, the new commodity storage. If you think about the age before the cloud, we always had the local file system, and we had network servers with remote file systems for a very long time.
But we didn't really have anything like S3, at least not in common use. It has appeared and is used a lot these days as a building block for many applications, because it's actually very cool. It's kind of bottomless.
You can access it over HTTP directly, so you don't have to pass the data through your application all the time. It's very scalable, and so on. Even many databases these days, both proprietary and open source, are now starting to be built using an object store as a back end instead of a conventional file system.
What I think is interesting here is that there are also a lot of object store vendors.
So it's not just Amazon, or even the major clouds, anymore. Here you can see two types of commercial vendors. Our usual suspects, NetApp and Portworx, have solutions for S3 compatibility.
But we also have services like Wasabi or Backblaze, which offer S3-compatible services that you can use as a less costly replacement for, or as a supplement to, your main cloud.
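In practice, "S3-compatible" usually means the same client code can talk to another provider once it is given a different endpoint. The sketch below assumes the third-party boto3 library and its `endpoint_url` and `region_name` client parameters; the helper names and the backup endpoint URL are hypothetical, and credentials are assumed to come from the environment.

```python
def s3_client_kwargs(endpoint_url=None, region_name="us-east-1"):
    """Build keyword arguments for boto3.client("s3", ...).

    With endpoint_url=None the client talks to Amazon S3 itself;
    any other value points it at an S3-compatible service.
    """
    kwargs = {"region_name": region_name}
    if endpoint_url is not None:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


def make_s3_client(endpoint_url=None, region_name="us-east-1"):
    """Create an S3 client for Amazon S3 or an S3-compatible endpoint."""
    # boto3 is third-party; it is imported lazily so that the pure
    # configuration helper above can be used without it installed.
    import boto3

    return boto3.client("s3", **s3_client_kwargs(endpoint_url, region_name))


# Primary store: Amazon S3 with default endpoint resolution.
PRIMARY = s3_client_kwargs()

# Secondary store: a hypothetical S3-compatible endpoint used for backups.
BACKUP = s3_client_kwargs(endpoint_url="https://s3.backup.example.com")
```

The only difference between the two configurations is the endpoint, which is why moving or duplicating data between S3-compatible stores tends to involve little application change.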
For example, you might say: you know what, I have my stuff in Amazon, but I want to make sure I also have a backup somewhere else as well. There are a number of vendors out there.
And if you want to run the storage locally, there are now also a number of solutions. I specifically wanted to flag MinIO here because I think they have been the most successful at providing an S3-compatible interface for the private cloud these days.
OK. Now let's look at databases and data stores. The interesting thing about databases and data stores is that they are unlike the previous storage types, which are relatively commoditized, have relatively simple interfaces and are relatively simple to replace. If you store data in S3 and now want to store it with MinIO, well, guess what: you have a different endpoint, maybe some little configuration differences, but that is not a big deal.
Databases are very, very different. Even so-called similar offerings often end up being very distant, because of all the complexity which exists in the database space. So that is, I think, where using an open source solution is especially important.
So let's look at some of the databases. First, what I would call queues, streams, data pipelines, whatever we want to call them.
That is increasingly a very important component of modern data-driven architectures. We often want to say: data comes in, and it flows to a number of consumers, maybe being processed along the way.
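The idea of data flowing to several independent consumers can be illustrated with a toy append-only log. This is an in-memory sketch only: the `TopicLog` class and its methods are invented for illustration and are not the API of Kafka or any other product mentioned in the talk.

```python
class TopicLog:
    """A toy append-only log with one read offset per consumer."""

    def __init__(self):
        self._records = []   # every published record, in arrival order
        self._offsets = {}   # consumer name -> index of next record to read

    def publish(self, record):
        """Append a record and return its position in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next batch for `consumer` and advance only its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


if __name__ == "__main__":
    log = TopicLog()
    for event in ["order-created", "order-paid", "order-shipped"]:
        log.publish(event)

    # Each consumer reads the full stream at its own pace.
    print(log.poll("billing", max_records=2))
    print(log.poll("audit"))
    print(log.poll("billing"))
```

Because records are never removed when one consumer reads them, adding another consumer does not affect the existing ones, which is the property that makes this kind of component useful as shared plumbing.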
It's kind of your data plumbing. It's not a conventional database, but it's very important. What I think is interesting here is that there are actually a lot of options. Look at Amazon AWS, and they probably have even more services than this.
They have a huge number of solutions here. Some of that is because they started first, maybe implemented something, and then an open source solution came to exist. And in general, Amazon has a huge number of different services these days.
I think it's more than 200. If you look at the proprietary solutions here, you can see Kafka being, I think, the most common solution these days for building your plumbing.
Additionally, we see this technology Redpanda coming up, which says: hey, we provide you something which is Kafka-compatible (remember, I mentioned earlier that these days people often build compatibility with existing protocols), but faster, simpler, yada, yada.
I put them on the proprietary side, specifically Redpanda, because they are one of those companies which started as open source and later changed the license to something not quite open source.
We do have a lot of solutions in open source. It's good to point out that Kafka itself is an Apache open source project. Confluent has commercial offerings built on Kafka, but Kafka itself is open source, as are many other solutions in this space.
What I think is interesting about queues is that there are often solutions specific to a given programming language ecosystem, so you will find that Golang people have their own choices compared to Java people, and so on.
If you look at relational databases in the cloud, we have a lot of choices, ranging from wrapped and extended open source databases to proprietary databases available in the cloud. If you want Oracle or Microsoft SQL Server, typically that is also available on most clouds.
There are also a lot of proprietary solutions here. In many cases you will find those either coming from your proprietary vendor, or you see a lot of companies these days which provide a proprietary management service around open source databases.
For example, you'll find Aiven here, which on the one hand provides management services for a lot of open source databases, but I still put them down as a proprietary vendor, because what if you ask, "Is there an open source version of your fancy GUI, so that instead of paying you I can take it and run it in my own data center?"
Well, the answer would be no. So the solution includes open source components as the core database, but as a whole it is not open source. And that applies to many vendors these days.
Now, if you look at open source, there are actually a lot of databases available, both from the old guard, like MySQL, Postgres and MariaDB, and from the new kids on the block, like Yugabyte and TiDB. Percona also provides its own versions of MySQL and Postgres.
But typically these require more manual work to deploy compared to a database as a service, which is in the proprietary space.
Here are some choices in the analytical space. That is, I think, one of the big decisions for relational databases, because building a database optimized for a transactional workload and one optimized for an analytical workload is quite different. They are designed very differently internally.
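One common internal difference, offered here as an illustration rather than something stated in the talk, is row-oriented versus column-oriented data layout. The sketch below uses plain Python lists and dictionaries; the sample data and function names are assumptions made up for the example.

```python
# Row-oriented layout: each record is stored together.
# This suits transactional access, which reads or writes whole records.
ORDERS = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "alice", "amount": 7.5},
]


def to_columns(rows):
    """Convert a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def fetch_order(rows, order_id):
    """Transactional-style access: return one complete record, or None."""
    for row in rows:
        if row["order_id"] == order_id:
            return row
    return None


def total_amount(columns):
    """Analytical-style access: scan a single column and aggregate it."""
    return sum(columns["amount"])


if __name__ == "__main__":
    columns = to_columns(ORDERS)
    print(fetch_order(ORDERS, 2))
    print(total_amount(columns))
```

In the column layout an aggregate touches only the values it needs, while in the row layout a single-record lookup finds everything about that record in one place. Real engines add indexing, compression and much more on top, so this shows only the basic trade-off.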
So there are typically different choices out there. There is a little bit of overlap these days: some databases position themselves as HTAP, hybrid transactional/analytical databases, but typically a database is good for one thing or the other.
Here are some proprietary relational analytical databases. You can see a number of very common solutions here, and then there are a number of open source solutions as well.
I think what is very interesting is that, from the analytical standpoint, they also each have a distinct focus, because there is a very large range of needs.
For example, some of the databases mentioned here, like Presto and Trino, say: hey, we let you take your data from all the different data sources and join and query it directly, wherever it is. That's a very valid use case.
Something like ClickHouse focuses on real-time analytics: if you want to insert data and have it available for a query the next second, that's what they focus on. Or TiDB, as I mentioned, the HTAP database. OK.
I have a sign to speed up. So, the other class of databases which is quite important is the document store. For many simple applications and some new developers, you just say: you know what, SQL, relational databases, yada, yada, too complicated. We just want to stuff our JavaScript objects directly into the database and work with them natively, not spread them across a normalized schema in a relational database.
All of the major cloud vendors offer their proprietary solution in this space, and we also have a number of proprietary solutions from other vendors. I would say MongoDB and Couchbase are probably the most popular in this regard, and they come in both cloud and enterprise versions.
Now, if you look at open source, that is where I have to say both open source and source-available, because, frankly, the most popular document database is MongoDB, which a few years back ditched its open source license, so it is not an open source solution anymore.
If you're looking for something open source and compatible, there is an early-stage open source project for MongoDB compatibility, FerretDB, which provides a MongoDB interface on top of Postgres.
I mentioned it earlier. One thing I would point out here is that relational databases have recently been getting much better as document stores, specifically in JSON support, whether you take MySQL, Postgres or even SQLite. All of them are usable too.
So in some cases, when you say, "I want a document store, but I don't completely hate relational databases," that can also be a choice. Key-value stores are another important model.
Here I think it's interesting that they really fall into different buckets. One is: we are using it for caching. It's in memory, transient; if you lose it, we don't care, but we want it to be fast. There are a number of solutions here.
For proprietary non-cloud solutions, I think Redis is the main leader; we have both Redis Enterprise and Redis Cloud. If you look at open source key-value stores, these are the solutions, in my opinion.
We also have key-value, or I would say key-value plus plus, because some of those solutions have a much more powerful language than plain key-value. That would be DynamoDB, Cosmos DB and Bigtable in the cloud space; Redis Cloud and the enterprise versions of open source solutions are what exist here.
And here are some examples of open source solutions which are key-value stores, and again key-value plus plus. Especially Aerospike, mentioned here, does much more than a key-value store. Cassandra as well, but I would say they don't position themselves as being as powerful as document stores.
So yes, we also have time series databases.
That is another class I wanted to cover here. Again, you can see solutions from cloud vendors and from proprietary vendors, and probably the most interesting here are the open source ones. It is also interesting that the time series database is a relatively new technology, which has a lot more choices these days.
Before I finish, I would also mention Percona's role in all of this and what we are trying to do. What we try to do is push the boundaries of what is possible specifically with open source databases. If you want something which is totally open source, our focus is on MySQL, MongoDB and Postgres.
I mentioned MongoDB is not open source anymore. That is not our choice but the upstream's choice, and we keep as much of our tooling as possible open source even for MongoDB; what we build around it is 100% open source software.
Our distributions for MySQL, MongoDB and Postgres generally include a lot of features that enterprise companies need, like auditing, authentication, whatever, but they are completely open source. We focus both on conventional, old-style deployments on Linux, and we also have operators for Kubernetes.
I think we have some of the more advanced ones out there, and all that stuff, again, besides MongoDB, is open source. We don't have any proprietary solutions.
Plus, we have Percona Monitoring and Management, which we position as a single tool where you can monitor and manage databases.
You can get something similar to a database-as-a-service experience with a Kubernetes back end, and again, that is all 100% open source, which you can play with if you choose.
So, to finish up with storage in the cloud: as you have probably seen while I was going through this, and I see some of you falling asleep and some of you rolling your eyes, which is a totally appropriate reaction, there's a lot of shit out there. There are lots of options.
So the important thing to know is that there is no one-size-fits-all. You can look at what fits your job, what your applications need.
But the one last, most important takeaway I want you to leave with is that, however we slice and dice it, in all those areas there is a choice of more than one open source solution available in every single class of storage you may need in the cloud. So, that's all I had.
We have a little time for questions. Next question.
Hello? So, my question is about an interesting tool which used to exist, and Percona used to have it in its MySQL packages: HandlerSocket. Yes. I think it was discontinued, and I don't think it supports MySQL starting from 5.7. So, is there any movement in the direction of supporting this kind of tool, which enables you to access your relational database both ways, in the traditional SQL way and in a highly available, highly-
Well, that's right. So, the question is about the HandlerSocket interface for MySQL. Yes, there was this interface; gradually it mostly fell out of use, and we stopped supporting it at Percona as well.
There are a couple of replacements which I think generally cover most of the use cases of what HandlerSocket did. One is that MySQL supports the memcached protocol, so if you are looking for a key-value store, memcached compatibility is out there.
And then there is also something called the Document Store, which is a MongoDB-like protocol that allows you to store documents, like JSON documents, in MySQL. That is the other choice. So I think between those two, most of the HandlerSocket use cases are covered as well.
Okay, thank you, Peter.