5 kinds of NoSQL
Formal Metadata
Number of Parts: 49
License: CC Attribution 4.0 International. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/51734 (DOI)
FrOSCon 2020 Cloud-Edition, talk 46 of 49
Transcript: English (auto-generated)
00:07
OK, maybe I can start, at least introduce myself a little bit. My name is Henrik. I live in Finland, where I currently am also doing this talk close to Helsinki
00:22
in a small town called Järvenpää, which in English means the end of the lake. So it's, of course, one of the many lakes that we have in Finland. I've last spoken at FrOSCon, I think, maybe eight years ago or so. So it's really nice to do it again after a while.
00:44
I currently work at DataStax, where we develop the Cassandra database. And I previously worked many years at MongoDB, and before that with both MySQL and MariaDB. So my professional interest has been very much
01:04
targeting open source databases, and that's why I wanted to do this talk for you, to look at the different kinds of NoSQL databases that we have. So let's start.
01:23
Oh, which screen has focus? Now it works. All right. By the way, I'm also, in my free time, I maintain this project called Impress.js, which I used to do this presentation.
01:43
So it's a browser-based presentation framework. All right. So I don't know if it's still the case. Earlier in my career, when I started speaking about MongoDB and selling MongoDB,
02:01
you would often have this discussion that, should I use a relational database, or should I use NoSQL? And maybe some of you might still think that these are the two options you have. And of course, there are lots of databases to choose from. So then if you start going through it,
02:22
I think there are at least 200 or 300. In fact, the list is so long that the browser runs out of memory in this animation. So the point of this talk is to zoom in a little bit
02:40
on the NoSQL side, though, where it turns out that NoSQL is not like one type of database that you can choose to use. So if we look into this category, there are five different categories. Or actually, there used to be five different categories when I first did this talk.
03:02
But now, much thanks to the success of Elasticsearch, for example, I've come to the conclusion that, actually, we should consider search as its own category, because it's like quite a big industry already. And it has its own use case and characteristics.
03:26
So then if you can see the NoSQL landscape broken into these six or so categories, then maybe it's easier to reason that, oh, should I
03:41
use a key value database, or should I use a document database, or maybe a wide column database, and so on. So I want to spend the next hour going through the highlights of each of these categories, which are some of the databases in each category,
04:02
and what would you typically do with them, what are the typical use cases, and so on. So hopefully, this will help you. The next time you need to choose a database for some project, you will know where to look at when you know what your application needs.
04:22
So let's start with the key value one, which is the most simple, also in functionality. And really, this is where the NoSQL movement started, was with memcached. So of course, if we go like 15 years back,
04:44
most of the internet, it's quite amazing to think now, most of the internet was running with MySQL and PHP, Apache, Linux, so it was a LAMP stack. But to make MySQL faster, somebody realized that it's good to have a cache
05:00
between the database and the PHP. So this is how memcached was born, and it's a simple key value cache. So you just store objects in memory and get them with an ID. So this made websites faster back in the day. But of course, it wasn't the database, it was just the cache.
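The pattern he describes, a cache sitting between the database and the application, is the classic cache-aside lookup. A minimal sketch, using plain dicts to stand in for memcached and MySQL (the key names and data are illustrative):

```python
# Cache-aside: check the cache first, fall back to the database,
# then populate the cache so the next request is fast.
cache = {}                                                  # stands in for memcached
database = {"user:1": {"name": "Henrik", "country": "FI"}}  # stands in for MySQL

def get_user(key):
    value = cache.get(key)      # 1. try the cache
    if value is None:
        value = database[key]   # 2. cache miss: query the database
        cache[key] = value      # 3. store the result for next time
    return value

get_user("user:1")  # first call hits the database
get_user("user:1")  # second call is served from the cache
```

The cache stores the already-assembled object under a single key, which is exactly the speedup discussed later in the talk.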
05:23
So if we look at this category today, Redis is the clear leader. It's still very much in-memory focused, but has some persistence built in. So you can use it for some use cases, at least, as a database.
05:46
So key value databases, I think, embodied the primary ethos of the NoSQL movement, which was, when we were using old relational databases,
06:02
Say, look, we need much more scale. We need them to be faster. And actually, they can be simpler. Like, we are willing to compromise in functionality if we can just get more speed, more scale, up to, well, maybe then hundreds of gigabytes. Today, in my work, I see customers using hundreds
06:24
of terabytes or even up to petabytes for the database backend powering certain web services. So, yeah, so the key value category takes this to the extreme, because it's, of course, extremely simple.
06:42
And typically, these solutions give a good performance. But, yeah, and when you think about the use case where we have, let's say, a relational database as the backend storage and then a cache in front,
07:01
which you can also use Redis for today, part of the speed is not that Redis is faster than MySQL, although it is. But part of the speed comes from the data structure. So in a relational database, of course, your data is stored in a normalized form. So it means the data that you need to show for a single web
07:24
page, for example, is physically stored in many different tables, many different physical locations on the disk. But typically, what you store in a cache is actually the serialized object that you want to show for that web page.
07:41
So in many cases, a single web page or a single REST response might be something you can store in a single key in your key value database. And this already, even if the cache and your relational
08:03
data are equally fast, the fact that you store data in a different format in the key value database already makes it a lot faster. This is the case also for the next categories, wide column and document database.
08:20
They typically all end up being faster than a relational database. And it's not because they are necessarily a better database in an apples-to-apples comparison. It's because the data structure is more beneficial for this kind of fast retrieval.
08:40
OK, but other reasons why key value databases end up being a fast choice. So of course, simplicity is typically good for the code. The code can be smaller, and you can focus more on optimizing it when you have less functionality. But the other thing, of course, is that these databases
09:02
are designed to store everything in RAM. So of course, it's going to be faster than disk, even in the age of SSD. OK, I already talked about the fact that you store denormalized data. It's actually a big part of it. And in a key value database, because they
09:21
don't support range queries, so that is like a greater than or less than type of query. When you only select individual keys, you can use a hash index, which is a faster index structure than a B-tree, for example. And then the same for sharding.
09:41
When you have keys that you can hash, sharding becomes quite simple. So this was early on where you could find good scale-out solutions, let's say 10, 11 years ago, when NoSQL databases started spreading.
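The hash-based sharding he mentions can be sketched in a few lines. This is an illustrative routine only (the node list and simple modulo scheme are assumptions; real systems usually use consistent hashing so that adding a node doesn't remap every key):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # assumed 3-node cluster

def shard_for(key: str) -> str:
    # Hash the key to an integer, then map it onto a node.
    # A stable hash (not Python's randomized hash()) keeps the
    # mapping identical across clients and restarts.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every client computes the same node for the same key,
# so a lookup goes straight to the right server.
shard_for("session:42")
```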
10:01
So what does it look like? This is an example from Redis. So yeah, you have set and get commands, and there is a key. In this case, first I set the key name, and then I set the key age. And then there is a value, which is in quotation marks. And what is interesting here is that also numbers
10:22
are in quotation marks. Redis, yeah, so this is the simple case. So then on the client side, you would have to convert 43 to a number. So in a way, these are just like blobs, and you could have anything inside the quotation marks.
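The on-screen commands described above would look roughly like this in redis-cli (the values are reconstructed from the description; everything is stored as a quoted string, even the number):

```
SET name "Henrik"
SET age "43"
GET age
"43"
```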
10:43
But there is more. So let's see, what do you use key value stores for? So yeah, of course, the original use case has been caching. There are also use cases which are kind of caches,
11:01
but where your Redis, for example, or other caching solution is the primary data store. So the data that you put in Redis when you use it as a session cache isn't necessarily stored in a more durable way in some other database, like a relational database.
11:24
So a session cache is a good example. Depends maybe a little bit on the type of site you are in, but let's say a gaming site or something. It might be sufficient to have this kind of solution where when I log into the site, my session key and all
11:49
the data related to my session, like where I am, are stored there. For example, yeah, these kind of video recorder applications or streaming applications like Netflix,
12:03
they typically want to remember the position in some video that you were watching. So that if you take a break or if the power goes out, you can log back in and you can
12:21
continue watching from the position where you stopped. So this is an example of some data that you kind of want to store. You want to keep the session state, but it's not terrible if the data is lost. So for this kind of use case, these kind of in-memory
12:43
databases, pure in-memory. So of course, in the case of Redis, you actually can persist this, but some of these, like memcached, can't. It might be sufficient because there is a small risk of losing this state. So OK, in the video example, for example,
13:01
it just means I have to start the video from the beginning and find the place where I was. But it's not like I lost money or lost some important data. So another case to use these kind of databases is various kinds of in-memory computing, maybe machine learning.
13:20
Yeah, different kind of recommendation engines might also fall into this category where the personalized profile or personalized marketing that is generated can be stored in an in-memory database. If it's lost, we can generate it again
13:43
from the source data that was used to do this recommendation for you. And one use case that you could use these kind of databases for is also queuing. Depends a little bit on your requirements there. But of course, it could provide a good speed.
14:05
So one more thing, especially about Redis, is that it actually supports more data types than just quoted string that I showed in the example.
14:20
So you still fetch these objects by key. But the value actually can have some more complex data type, such as lists, sets, or maps, and even streams in the newest version of Redis.
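A few of those richer value types, as Redis commands (the key names and values are illustrative):

```
LPUSH recent:pages "/home" "/pricing"    # a list value
SADD tags:post42 "nosql" "redis"         # a set value
HSET user:1 name "Henrik" age 43         # a map (hash) value
XADD events * action "login" user "1"    # a stream entry (Redis 5+)
```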
14:41
So this is a good example where even if we have this kind of category that says it's key value, products tend to maybe evolve and push the boundaries of this. OK, let's go to the next category, which is wide column databases. And I would say this category was created by Google's Bigtable.
15:07
But the most popular open source wide column database is Cassandra, so the one I currently work with. And to some extent, DynamoDB, from a user point of view,
15:22
has similar semantics as this. I'm actually not sure if we know publicly very well what is the internal implementation of DynamoDB currently. So what does a wide column database do? It actually looks a lot like a relational database
15:43
when you first look at it. Because your data is in tables, and the tables have rows and columns. So this means this has more structure than the key value database, because each column could then also have different data types, such as string, or an integer,
16:02
or a decimal, and so on. Of course, also, like in Cassandra's case, for example, you again can have maps and lists, and even your own user defined types. So all of this sounds like a relational database. But actually, in a wide column database,
16:22
all data access happens through primary key. So well, in the kind of classic case, that's the requirement. So in that sense, it's actually similar to a key value database. You need to use the primary key to get your data.
16:42
Then the data you get back actually is in rows and columns. The primary key can be composite. So in Cassandra, you can separately have a column or a few columns that is the partition key. And this partition key is then required for fast queries,
17:02
because if you have a large Cassandra cluster, the partition key is the one that tells you which server is this data going to be found on. So if you didn't use a partition key, which Cassandra does allow, it means you'd have to send the query to all nodes in your cluster
17:23
and scan all the records in that cluster. And this would, of course, be quite inefficient. So the point of a wide column database is not to do that. In addition to the partition key, you can have composite primary keys.
17:42
So you add more columns that are used as clustering keys. So this could be used, say, if my partition key is Henrik, like my name. I could find all users whose name is Henrik.
18:01
And then with clustering key, I could order them by age or something. So within the partition, you can still do these kind of operations, like querying on multiple fields or sorting or others. So it's a bit of a hybrid somewhere
18:21
between a key value database when it comes to the scale out functionality, but has some elements familiar from relational databases. But it's definitely not a relational database, just to be clear. Even if we look at an example,
18:42
this is a Cassandra example. The query language is called Cassandra query language, so CQL. So almost like SQL, but not quite. So you create a table. Columns have types.
19:01
There is an insert statement and a select statement. What is interesting about Cassandra is that actually insert and update are both possible, but they do exactly the same thing. So this is because of the eventual consistency.
19:23
So let's say you do both an insert and an update, one after the other. You cannot assume that these arrive at the data nodes in that order. So it could also happen that the update arrives first
19:42
at some node, and then the insert. And this is why the internal implementation is such that insert and update essentially do the same thing. It's just a write. An upsert, actually, is the name that we often use.
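A minimal CQL sketch of what he describes (the table, column names, and values are illustrative; the inner parentheses mark the partition key, with `age` as a clustering key):

```sql
CREATE TABLE users (
    name text,
    age int,
    email text,
    PRIMARY KEY ((name), age)   -- partition key: name, clustering key: age
);

-- INSERT and UPDATE are both just writes (upserts) in Cassandra:
INSERT INTO users (name, age, email)
VALUES ('Henrik', 43, 'henrik@example.com');

-- Fast query: the partition key tells the driver which node holds the data
SELECT * FROM users WHERE name = 'Henrik';
```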
20:03
So what use cases are these databases used for? From my career, I have never seen such large clusters as I see with Cassandra users today. So I mean, I have maybe seen some.
20:23
But in Cassandra, this seems to be very common, that clusters are 100 terabytes and beyond. Some really big companies might have petabytes in their Cassandra clusters. Also interesting for Cassandra in particular,
20:41
the storage engine is write optimized. So typically, you might use this for applications that do quite a lot of writes. Again, like storing session state, if you want it to be more durable than in an in-memory database, you could use Cassandra and other similar.
21:07
Also, actually, this use case. And by the way, Netflix does use Cassandra for this purpose. So storing the position in a video that you are watching
21:25
is actually quite heavy because you need to store it again and again. As you are watching the video, depending on the granularity you want, you might want to store the position each second. I don't think they do that, but at least a couple of times a minute, you want to store a bookmark.
21:43
So this requires definitely a write optimized database. And the other interesting feature that these databases have is that they are, well, at least Cassandra and DynamoDB have been, based on the Dynamo protocol.
22:03
Sorry, this is not true for DynamoDB anymore. But Cassandra definitely still uses Dynamo protocol for high availability, which, again, is a write optimized replication protocol. So even if one server crashes, because it's
22:25
a multi-master protocol, so even if one server crashes, there is no short break when you cannot write to the database. So you can always write data somewhere. And this is important.
22:40
Well, the protocol was invented at Amazon, where the use case has been the shopping basket. So for them, it was important that if I'm looking at the Amazon website and I wanted to buy a book and I click on the Add to Basket,
23:00
then this must succeed because this is the point where I decide that I want to spend money. So it was a very high value write operation to the database. And it must not be lost. And there must not be like a one second or five second period when the database is doing some kind of failover. So you cannot write to it because even in five seconds,
23:21
they would already lose a lot of money. So yeah, so this is an area I've personally been very interested about. I have written in my blog about the Dynamo protocol, but let's go forward.
23:40
So again, pushing the boundaries of what this category definition in the classic sense has been. So Cassandra 4.0, which is now available as beta, actually has invested in a new secondary index type, which
24:01
should be more useful. So also, prior versions of Cassandra and DynamoDB have had something like a secondary index. But then when you read the documentation, it says that you should only use it for low cardinality data and for columns that are not frequently updated, and so on.
24:23
So once you finish reading this documentation, you come to the conclusion that maybe I'll just use the primary key, which kind of was the idea in the first place. But now, this is a really exciting development, I think, in the Cassandra world. In 4.0, more useful secondary indexes.
24:43
And actually, at DataStax, we have, again, a different kind of secondary index implementation, which we have contributed to the Apache Foundation to be included in Cassandra, but it's not going to be in 4.0 yet. So this is an exciting topic to keep an eye on,
25:04
and it will certainly broaden the use cases that you can use Cassandra for. So let me talk about document databases next. And of course, this is an area where I have a lot of experience as well,
25:21
because I worked many years at MongoDB. And MongoDB is kind of the leader of this category, or not kind of, they are very clearly the leader in this category. But just as another example, I wanted to mention MarkLogic, which is a closed source database.
25:40
So why am I even mentioning it? Probably many of you haven't even heard about it. An interesting thing about MarkLogic is that it actually uses XML as its storage format, while MongoDB uses JSON. So both of these are document databases. In terms of features, they do very much the same thing.
26:05
And they just use a different syntax: one uses JSON, one uses XML. But the semantics and user experience are very similar. So the key selling point here is the flexible schema.
26:22
So, although it's possible, typically you wouldn't specify a schema like you do in relational databases, or like you did in the wide column databases with the CREATE TABLE statement. So you can just start inserting data into the database,
26:43
and different records can have different structures. So each of them is a JSON object, and they can even be wildly different. But of course, typically, it makes sense that your application stores data that were at least they are somewhat similar to each other.
27:03
And the logic here is that JSON, for example, let's focus on the case of MongoDB. JSON, of course, in itself embeds the structure. So if we look at an example, I can insert here a record with a JSON object,
27:25
and I can already see that there are fields, first name, last name, and age. And two of them are strings, and one is a number. And all of this you can just see from the JSON syntax. In fact, JSON is kind of better than XML in this case.
27:43
In XML, unless I specify a schema, I wouldn't know whether the number 42 here is an integer or a string. But in JSON, there is a difference. MongoDB also adds some types, like date,
28:01
which don't exist in standard JSON. But yeah, it still follows the same flexible schema model, so the date is encoded in the value here, just like an integer and string are different.
28:22
So you don't need to specify a schema up front. So this can be good or bad, but definitely those who like this enjoy the flexibility that they can just start coding and iteratively evolve their database rather than needing to specify all of their columns up front.
28:41
So the last point about document databases is that they actually allow creation of secondary indexes as well. So here I have created an index, last name and first name. So this means I could efficiently then query on last name,
29:00
even if it's not my primary key. In fact, the primary key in this case would be the ID field. So there is something here that document databases have in common with key value databases, that each record, even if it has flexible format, still the primary key is an ID.
29:23
And the simple use case would still be to just fetch these JSON documents with the ID, kind of like a key value database would be a simple way to use this. So what are the use cases for a document database?
29:42
So actually in the NoSQL space, this is the category where these are general purpose databases. So yeah, you have records with fields, you can have arbitrarily many, you can have primary key, secondary key.
30:01
So you can do all kinds of, you know, querying and sorting. So in many cases, to some extent, these databases or MongoDB competes with relational databases, since some years ago also added transactions report and so on.
30:22
So what would be the main selling points to choose this over a relational database? So first of all, many developers love using JSON, and they might use JavaScript or they might use REST APIs in their architecture.
30:41
So it's very natural to also store JSON in the database. Flexible schema can be powerful, can also get you into trouble, but definitely again, if you just want to quickly get going, it allows more iterative style of development.
31:01
And then of course, sharding, which if you now look at this presentation, you might say, what is so special about sharding, because wide column and key-value databases are also good at sharding. But compared to relational databases, if we think about the document database
31:22
as a general purpose database, this is typically a strong selling point, because even today in 2020, most of our classic relational databases are not so great with sharding, like MySQL and Postgres.
31:42
So then when you think about what use cases they are used for, you should think about these points from before: where would a database with a flexible schema be a strength?
32:03
So one application could be something like a data hub, which is kind of like a data lake, but more operational and more of a classic database. And the point with the data hub is that you aggregate data from many source databases.
32:25
So kind of like a data warehouse, but the difference with a data warehouse is that in a document database, you wouldn't spend time designing a star schema, which might quickly become complicated if your data warehouse has multiple different data sources
32:40
that each have different kinds of data. So how do you get all of these different source databases stored in the same data warehouse? It would require a lot of planning for how you design the columns and data types. And also the source databases might evolve and change all the time.
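As a plain Python sketch of this data hub idea: records from two hypothetical source systems with different shapes go into the same collection, because there is no schema to violate, and the reader deals with the differences at query time. All the names here are invented for illustration.

```python
# Two hypothetical source systems with differently shaped records.
crm_records = [{"first_name": "Anna", "last_name": "Schmidt", "email": "a@example.com"}]
shop_records = [{"name": "Jonas Weber", "orders": 3}]

# The data hub just accepts everything; no schema prevents the insert.
data_hub = []
data_hub.extend(crm_records)
data_hub.extend(shop_records)

def full_name(record):
    """Schema-on-read: the problem is postponed to query time,
    where we must handle both shapes of record."""
    if "name" in record:
        return record["name"]
    return f"{record['first_name']} {record['last_name']}"

print([full_name(r) for r in data_hub])
```

Note how the awkwardness lands in the read path: every query function has to know about every shape that ever arrived.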
33:00
So a database, a NoSQL database with a flexible schema, has a strong advantage here, because if two source databases are different, and also if they are different over time, then you can just continue inserting them into your data hub because there is no schema
33:20
that would prevent you from doing so. Now of course, this is not a magic solution where everything becomes very easy. The difference with a relational database is: yes, we could easily insert data into the database, but it might be more difficult to query,
33:42
because all the records are not the same. So if we want to search by first name and last name, for example, there might also be records there that don't have those fields. They might have the name in just a single field called name. So it's a bit of a mess, and the strategy here is to postpone the problem.
34:04
So instead of it being difficult to get data into your data warehouse in the first place, it's a bit more difficult to read it out. So in this category, what is there more?
34:24
I actually struggled to think about it, because with MongoDB, for example, for a long time it was missing transactions. Now it has them, and it also has views and other things. So I think actually this is a fairly mature category, and a lot of future development
34:42
will be more incremental innovation, such as just better performance or some better tools for analytics or integrations with other technologies and so on. Okay, so graph category, Neo4j, of course,
35:08
is the leader in this category and has, I would say, pioneered a lot of it. Because of some specific features, I added as a second option just our own product, DataStax Enterprise,
35:20
which embeds the Apache TinkerPop project for a graph query language. So what do you do with graph databases? So in a graph database, obviously the data structure is a graph.
35:42
So the records are nodes and nodes are connected by edges. And in fact, both of them can have properties. So if you do that, then the edges start looking like records as well. But typically you think about the nodes as the main records and then the edges are what join them together
36:02
to use a relational term. And of course, to do efficient queries, you also need to use indexes here, just like in the document database or relational database. So what does it look like? So because it's a bit simpler,
36:21
I actually used the DataStax example, which uses the Gremlin language from the Apache TinkerPop project. And yeah, so there is some initialization here to create a graph session.
36:41
But then on the bottom there, you can see this query where we query for a vertex, which is a node, where the name property is Marko.
37:01
And then there is an out edge: he knows some other people, and we want to output their names. And in this case, it finds two other people that Marko knows, Vadas and Josh. So this was a very simple graph query. It's almost like we could have done this also with a join in a relational database.
37:23
But why I wanted to show this example, just this fluent syntax where you have like a dot and a function and then a dot and another function. So for complex graph queries, I personally like this Gremlin approach of using a fluent syntax.
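To illustrate what such a fluent traversal computes, here is a rough Python equivalent over a tiny in-memory graph. The data is a fragment of the well-known TinkerPop toy graph, but the Traversal class itself is purely my own sketch, not the Gremlin API.

```python
# Tiny in-memory graph: nodes with properties, edges with labels.
nodes = {
    1: {"name": "marko"}, 2: {"name": "vadas"},
    4: {"name": "josh"}, 3: {"name": "lop"},
}
edges = [  # (from, label, to)
    (1, "knows", 2), (1, "knows", 4), (1, "created", 3),
]

class Traversal:
    """A minimal fluent traversal, loosely mimicking Gremlin's chained style."""
    def __init__(self, ids):
        self.ids = list(ids)

    def has(self, key, value):
        # Keep only nodes whose property matches.
        return Traversal(i for i in self.ids if nodes[i].get(key) == value)

    def out(self, label):
        # Follow outgoing edges with the given label.
        return Traversal(dst for i in self.ids
                         for (src, lbl, dst) in edges
                         if src == i and lbl == label)

    def values(self, key):
        return [nodes[i][key] for i in self.ids]

# Roughly: g.V().has("name", "marko").out("knows").values("name")
g = Traversal(nodes)
print(g.has("name", "marko").out("knows").values("name"))  # ['vadas', 'josh']
```

Each dot in the chain is one traversal step, which is why the syntax reads naturally as you walk the graph.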
37:43
Because it's kind of easy to read as you traverse the graph. It makes sense to me. So what do we use graph databases for? So often, and especially in the case of Neo4j, these are used for analytics.
38:01
So you have some data set which is a graph, then you put it in a graph database and then you do these queries like, yeah, I want to find all of the friends of my friends who own a cat, for example. This has also been used, by the way,
38:20
in some of these journalistic cases. In the Panama Papers, for example, I believe they used Neo4j, because they wanted to find connections: if, say, Putin had hidden his money in Panama, who is he connected to, and which lawyers and which other people were connected to a given bank account.
38:46
So you traverse this kind of network to understand the network of shell companies where they hide their money. Of course, any social media is a network.
39:01
So this alone explains why this is a meaningful category. There are a lot of data sets today which are graphs. You can also use these databases a lot for recommendation engines and so on. Because again, recommendation engines often follow this kind of logic that,
39:20
yeah, Amazon was early on famous for this recommendation that there are some other customers who bought this book and they also bought some other book. So this is actually a graph query. And when we talk about analytics, this is used a lot in national security.
39:41
If you remember the Edward Snowden revelations, the typical case there is that which person called which other person with their mobile phone. So again, it's a graph query.
40:01
So what does the future look like for graph query languages? An interesting observation here is that there are many different ones. So I showed you Gremlin, which we use at DataStax. Neo4j has developed one called Cypher, which is completely different; it kind of looks like ASCII art,
40:22
where you even draw arrows in different directions. And now what is becoming popular is GraphQL from Facebook, which is kind of like a REST API, but more graph-like. It's an interesting combination. So an interesting question for the future is,
40:42
will there be one standard language? Currently, maybe GraphQL seems to be the most popular one. But it has, I think it has some limitations for really advanced graph analytics. So it's more maybe targeting operational applications.
41:05
Okay, and I said that graph is mostly used for analytics. And in the case of Neo4j, for example, I would say their database's internal architecture
41:22
is definitely more optimized for read queries, or analytical queries. It is also used for operational applications; I'm not sure it's very optimal for that. In DataStax, our graph database is a bit more optimized
41:45
for OLTP, because it is running on top of Cassandra, which of course is very much an OLTP database. Sharding in general is a difficult problem
42:03
for a graph. So in the case of our product, for example, which is essentially Apache TinkerPop combined with Cassandra, we store the graph as Cassandra records, where you then have this partition key
42:21
and then the partitions are sharded over a large cluster. So it's possible. But this is still kind of a hash-based sharding. So all of the records in the graph are equal and they're just spread out based on consistent hashing.
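A minimal Python sketch of what hash-based sharding means here: every vertex key gets a position on a hash ring, independent of any graph structure. This is deliberately much simpler than Cassandra's real partitioner; the node names and ring layout are my own illustration.

```python
import hashlib
from bisect import bisect

def token(key):
    """Deterministic position on the hash ring for a key."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

# Each storage node owns several points on the ring (virtual nodes).
ring = sorted((token(f"{node}-{v}"), node)
              for node in ["node-a", "node-b", "node-c"]
              for v in range(8))

def shard_for(key):
    """Walk clockwise on the ring to the first node token at or past the key."""
    i = bisect(ring, (token(key), ""))
    return ring[i % len(ring)][1]

# Graph vertices are spread out with no regard for which vertices
# are neighbors; the hash treats all records as equal.
placement = {v: shard_for(v) for v in ["marko", "vadas", "josh", "lop"]}
print(placement)
```

Because placement depends only on the hash of the key, connected vertices routinely land on different shards.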
42:41
So an interesting unsolved problem, I think (unsolved in the sense that no product actually exists that you can buy and use), would be how to do optimal sharding for a graph database. So this means, if you think about a social media graph, for example,
43:02
I have some friends, maybe some hundred of them in this data set. So queries starting with me, of course, are likely to traverse to my friends and friends of friends. So an optimal sharding would mean that nearby nodes that are likely to be accessed in the same query
43:22
would also be stored in the same shard and in the same disk page. And this is not how any of these graph databases that I mentioned actually work today; they access all data more or less equally. So in the case of Neo4j, for example, you typically want your active data to be in RAM
43:45
anyway, in which case, of course, everything can be accessed and each hop can be made in constant time. Okay, query engines. I used to call this category Hadoop many years ago.
44:03
But in reality, what people use today is Spark, which has more or less replaced Hadoop. And we shouldn't forget Presto, which is actually powering Amazon Athena. So it's used quite a lot. Presto was published by Facebook.
44:23
So now if you talk to some analysts or other people with opinions, they would say that this category doesn't belong in this talk at all, because these are not NoSQL databases, because they are not databases at all. So this is actually true.
44:42
They are query engines. So Spark and Presto will query data that is stored somewhere. So in the Hadoop case, of course, it used to be Hadoop file system. But today, maybe the most common place to store data is in S3 in Amazon.
45:01
And a use case could even be that you have just stored data in S3, like log files or something. And then later on, you realize that maybe you should analyze something in these log files to understand your users better. And then you can just put Spark or Presto on top and start analyzing your data.
45:22
But you could also use databases as a data source. So both Cassandra and MongoDB, for example, have a Spark connector. Oh, sorry, use cases come later.
45:42
So to go back: definitely, these are used for batch queries. These are not OLTP databases, because like I said, typically the data already exists in files,
46:02
for example, in S3. So these are just used for read queries, for analytical queries, and sometimes can be really long running queries as well. If you have lots of data. So here is what it looks like in Spark.
46:21
So actually, there is a lot of code here to create a Spark session. This is actually a shell, and the language the shell uses is Scala, I believe. So even in the shell, you then create a session
46:41
and you have to create what in Spark is called a data frame. So you have to connect or create a data frame that maps to a file, in this case, a JSON file. And then out of this data frame, you can create a view which you can query with SQL.
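The same pattern, mapping semi-structured records into something you can query with SQL, can be sketched with Python's standard library sqlite3 instead of Spark. This only illustrates the idea; it is not Spark code, and the table and file contents are invented.

```python
import json
import sqlite3

# Some newline-delimited JSON records, as they might sit in a file on S3.
raw = """
{"name": "alice", "age": 34}
{"name": "bob", "age": 28}
"""

# Map the records into a queryable table, analogous to a Spark data frame
# being registered as a view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
for line in raw.strip().splitlines():
    rec = json.loads(line)
    conn.execute("INSERT INTO people VALUES (?, ?)", (rec["name"], rec["age"]))

# After the mapping step, analysts can use plain, familiar SQL.
rows = conn.execute("SELECT name FROM people WHERE age > 30").fetchall()
print(rows)  # [('alice',)]
```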
47:03
So after you have done those first lines, then you can actually use something that is familiar: SQL. And in my experience, for analytics in companies, the people that want to do business intelligence actually like SQL,
47:23
because they learned it a long time ago, it's their standard, and they prefer it over learning some other language. In MongoDB's case, for example, the query language is completely different. So for developers, that was usually okay,
47:41
but for the people who want to do analytics, they definitely wanted SQL. Okay, so what are the use cases? Well, this is the category where we speak about data lakes which used to be a Hadoop thing, but today, if it's just S3, I think if you have data in S3,
48:02
maybe people don't call it a data lake anymore. So what would you do? Well, analytics: machine learning is used for a lot of things nowadays. So all kinds of personalization, fraud detection, again national security, but then in the end,
48:20
it might just be classic reporting just like old school data warehouses. At the end of the month, you want to provide some kind of report or maybe like a live dashboard. So it's not at the end of the month anymore, but it has to be constantly updated.
48:41
Yeah, this is the modern version of a classic data warehouse, I would say. Spark also has a streaming version, which is interesting. So real-time processing of data that is happening currently
49:01
or that arrived from someplace just now. But the Spark streaming version really still runs batch queries, just on a very small batch of the recent data. But it's nice. It allows you to use the same interface and same SQL queries on a real-time stream.
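The micro-batch idea can be sketched like this: the same batch query function runs unchanged over small slices of an incoming stream. A toy illustration, not Spark's actual streaming API; the stream data is invented.

```python
def count_errors(batch):
    """The same 'query' works for one big batch or many small ones."""
    return sum(1 for line in batch if "ERROR" in line)

stream = ["ok", "ERROR disk full", "ok", "ERROR timeout", "ok", "ok"]

# Micro-batching: cut the stream into small batches of recent data
# and run the ordinary batch query on each slice as it arrives.
batch_size = 2
results = [count_errors(stream[i:i + batch_size])
           for i in range(0, len(stream), batch_size)]
print(results)  # per-batch error counts: [1, 1, 0]
```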
49:23
And I guess I already mentioned Amazon's Athena is based on Presto. And then for search, the crown jewel here is really the Apache Lucene project, which is used in Solr,
49:43
which is a server; Lucene is the engine that stores the data and indexes it. But now I would say the market leader has become Elastic, which is a younger product
50:01
compared to this. And also Elastic is based on Lucene as a data engine. So in both cases, Lucene is the real winner here, really valuable Apache project. So what does a search engine do? Well, you can search for words.
50:21
So in all the other databases that I showed, you search for fields. So if you have a name, like Henrik Ingo, if you want to search only for the last name, typically it means you have to store the last name in a separate field and the first name in a separate field. But with text search, you can have a body of text
50:40
that in a database would be stored as a single field, but the search engine Lucene will actually index each word separately so that you can search for individual words and maybe even like some wildcard patterns and so on. And you can get results ranked. So if you search for multiple words,
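A minimal Python sketch of the inverted index idea behind Lucene: the whole name sits in one field, but every word is indexed separately. This is a toy version, nothing like Lucene's real data structures.

```python
documents = {
    1: {"name": "Henrik Ingo"},
    2: {"name": "Ada Lovelace"},
}

# Inverted index: each individual word points to the documents containing it.
inverted = {}
for doc_id, doc in documents.items():
    for word in doc["name"].lower().split():
        inverted.setdefault(word, set()).add(doc_id)

# Searching for just the last name finds the record,
# even though the name was stored as a single field.
print(inverted["ingo"])  # {1}
```

Ranking, wildcards, faceting and so on are all layered on top of this basic word-to-documents mapping.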
51:03
if some record matches all of those words, or if let's say all of those words are in the title rather than in the end of some long text, then it gets more points and it gets higher up in the results. And you can do faceting or highlighting.
51:23
So basically, this is kind of like an indexing use case, but it's more complicated than what you typically do with the B-trees in relational databases. Okay, so what does it look like? This is the Elasticsearch example.
51:41
So it's actually a REST API with JSON records. Possibly this is already one reason why it became so popular. So in the first row there, we post, that is, we insert, in REST terminology,
52:04
one record. And notice here that the name is in a single field. And then in the second row, the second query, we use GET. So this is a query, and we search only for my last name, but because Elastic has indexed each word separately,
52:23
it actually does find this record. And in the result set there at the bottom, you can see that it's not just the name that is returned;
52:41
there is also the index name, which in other databases would be called a table, but a search engine focuses on the index, so it's called froscon here. Or sorry, the index would maybe be like what other databases call the database. And then the type of record would be people.
53:04
And then the ID of the record here is 1. And this is because in the first row I used froscon, people, and ID 1 in the REST URL. So what are the use cases?
53:21
Well, one is a search engine. So if you want a search box for your website, kind of like a mini Google, then these are the products you use. But yeah, also some other complicated queries that a typical relational database or document database B-tree doesn't cover.
53:46
And then Elastic has this product in its suite called Kibana, which is like an analytics solution. So for log files, for example, you can use it for the same purpose as Splunk. So you have some text data, you put it in Elastic,
54:02
and then you can immediately see, for example, word frequencies, or search for errors and so on. Kind of out of the box, it comes very easily from what this search engine supplies.
54:22
So you don't need to spend a lot of effort creating specific indexes or so on, because you can just index all the words. And well, log files is one thing, but actually also security monitoring,
54:40
for example monitoring your firewalls or networks or so on, or physical security. And, I don't know why, probably national security again is using these kinds of solutions. So those were the categories. And yes, just one reason I started taking this seriously as well:
55:03
Elastic is actually one of the biggest companies in the NoSQL space. It's younger than MongoDB, but at some point already had an even higher valuation than MongoDB. Both of them are public companies, so this is how we know.
55:22
So it's definitely a big thing and growing fast. So those were the technical details of all of them. I also wanted in the end to say a few words. And if this is like too much information on a single page,
55:44
don't worry. But two years ago, there was a lot of discussion in the database space when a lot of these products changed licenses.
56:01
And well, I think mostly in reaction to Amazon taking some of these open source databases and offering them as a service on the Amazon cloud. And of course, Microsoft, Google do the same
56:20
to some extent, which is what they do. And some of these companies felt that that was a threat to them. So many of them changed licenses. And as you can see, the trend is to the right. So they moved from a more open licensing
56:43
to a more closed licensing. But in the case of Redis, yeah, Redis moved actually to the right and then came a little bit back to the left. But I should also point out this is simplified table.
57:00
So many of these products actually have more than a single edition, like a community edition and an enterprise edition, and often even something more. So this is just to give a picture of which of these changed. On the last row, there is an interesting development. So Elastic didn't change as such.
57:25
Well, Elastic did also change. They always had some closed-source components, but they changed the architecture and how they store them in the repository.
57:41
In response to this, Amazon actually launched the Open Distro for Elasticsearch. So in this case there are, for example, some authentication and security related features which are only available commercially in Elasticsearch.
58:02
Amazon of course wanted to develop them, and they have open sourced them for their own use. So in this case, it's actually the opposite of what the discussion was in 2018: Amazon actually has the more open source version compared to Elasticsearch. So on the other hand,
58:22
you can see many of these didn't change. One thing you can of course see that those projects that are governed by the Apache Foundation, for example, cannot change the license. They will always have, they will always have Apache license
58:41
because that's the only possibility for the Apache Foundation. All right, that was all. And I see there are some questions. So maybe I will take away this image. And then we have a few minutes for questions.
59:02
Where would you put InfluxDB? So I saw a little bit of that talk, but I don't know if there was some architectural explanation in the beginning; if so, I missed it. But I think generally there is a class of databases which are so-called time series databases.
59:21
And there are various techniques where you store data so that it's efficient to query large amounts of data, and also to compress it on disk, if you know ahead of time
59:41
that they are ordered by a timestamp, for example, or it could be other similar use cases as well. So in a way, it is like a data warehouse, but it's not the same as doing a star schema on Oracle or Postgres;
01:00:01
the performance difference can be huge, like several times or even tens of times. And then of course, often they might have some sharding or parallelism as well. So yeah, I would maybe call it a separate category, time series databases. And could that be like a seventh category
01:00:21
in this presentation? Maybe, but time series databases, from a user experience point of view, are not that different from a relational database. So typically you use SQL, and then it's just faster and you can store more data. So it's somewhere in between. At some point many years ago,
01:00:43
somebody proposed that there should be a category called NewSQL, because it was between traditional relational databases and NoSQL databases, which were very different. And where would you place Apache Ignite? I think I have forgotten what Apache Ignite does, I'm sorry.
01:01:07
If you want to expand in the chat, you can. And there weren't any more questions, so I will just have to wait.
01:01:22
What do you think, Rafael? Are there more questions or should we start to wrap up? Yeah, feel free to put more questions in the chat. We still have some time for them. Yeah. Ah, okay. Distributed in-memory key value store with SQL on top.
01:01:43
Yeah, I see. Ah, yes, yes. And there are others. Yeah, there are others with this kind of combination as well. So, okay, I think you answered your own question very well:
01:02:04
a distributed in-memory key value store with SQL on top. This reminds me of one category I was also thinking about when authoring this presentation. There is a class of distributed databases, like Google Spanner, CockroachDB, FaunaDB,
01:02:26
which all try to present a SQL interface to the user, and try to match what the good old relational databases do
01:02:41
in terms of supporting transactions, OLTP, and providing a very high level of consistency. So many NoSQL databases had this concept of eventual consistency where you use various techniques to deal with the fact that data arrives
01:03:01
at different times on different servers. So in this presentation, I mentioned that in the wide column database case, inserts and updates are actually both upserts, because they might arrive out of order in that architecture. So yeah, these newer distributed databases
01:03:21
typically try to provide a higher level of consistency, so that they are distributed in their internal architecture, but would actually be more similar to classic relational databases in the user experience. And this category I think is quite interesting
01:03:42
for people like me who are interested in database internals. So I try to read up on them, but at the same time, the category is still quite small and growing. So it's interesting to see where it's going in the future.
01:04:08
Okay: should the consumption of memory and CPU be seen as an important aspect to differentiate database systems? Yes.
01:04:21
So performance is an interesting question and often you do trade-offs in many directions. So for example, I mentioned that some of these databases are write optimized. This means then when you read the data back,
01:04:42
actually there is more work. When you have in-memory databases, well, I mentioned Redis as an example of an in-memory or memory-oriented database; but also, like I mentioned, some databases like Neo4j,
01:05:00
I think work optimally when the data fits in memory. So this is of course a choice for you. Probably your queries will be faster, but the architecture is also more expensive because RAM is expensive.
01:05:20
Some other databases then, like Cassandra and MongoDB, are more disk-oriented, so similar to typical relational databases. So you can have decent performance even if a lot of your data is not in RAM. And then of course, at the other extreme,
01:05:41
you have Spark, which can read huge amounts of data that resides on disk and is so-called cold data. But then Spark again, or all of these query engines, is a good example of something where they typically consume quite a lot of CPU
01:06:00
because the data might not be indexed, for example. So yeah, memory consumption, CPU consumption, and by the way, also disk consumption. Coming back to the question about InfluxDB: time series databases typically achieve a very high compression ratio, because with a columnar storage model they
01:06:25
can use different kinds of compression, like run-length encoding. So one database might use more or less disk as well. It's kind of a space inside which you optimize, and it always depends on your application and what kind of data you have.
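Run-length encoding itself is easy to sketch: on a column with long runs of repeated values, like sensor tags or timestamp buckets, it collapses each run into a (value, count) pair. This is a toy version, not InfluxDB's actual encoding.

```python
def rle_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

def rle_decode(pairs):
    """Expand the pairs back into the original column."""
    return [v for v, n in pairs for _ in range(n)]

# A column of sensor tags, as a time series database might store it:
column = ["cpu", "cpu", "cpu", "cpu", "mem", "mem", "cpu"]
encoded = rle_encode(column)
print(encoded)  # [['cpu', 4], ['mem', 2], ['cpu', 1]]

assert rle_decode(encoded) == column  # lossless
```

The more sorted and repetitive the column, the better this compresses, which is exactly why columnar time series layouts benefit from it.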
01:06:45
So yeah, okay, I think there was a comment there, low for NoSQL, high for relational. Ah, yes, okay.
01:07:00
Yes, good point. So I think in the beginning I mentioned this as well, that many NoSQL databases might actually be more efficient than doing the same thing in a relational database, because the data is stored as an object, for example in a key value database, or as a denormalized document
01:07:22
in a document database. And the same with Cassandra, actually: typically a partition will store more data together, so again in a denormalized way. Another question, about NoSQL data protection.
01:07:44
This could mean two things now, either you are referring to like security features or durability, which means that your data is safe on disk. So if we speak about the latter,
01:08:00
it is true that originally, of course, NoSQL databases, the products were new and immature, let's say 10 years ago. And MongoDB, for example, still suffers from this reputation. I would say my current database, Cassandra, never had such a reputation.
01:08:23
In fact, in the relational database world, I would say MySQL and Postgres have a similar difference between them. Okay, so let's talk about security instead. It's different, but I think security,
01:08:46
so having different users, having different permissions for users is something that has evolved. I would say today, actually, if I think about Redis, Cassandra, MongoDB,
01:09:02
all of them are fairly close or equal to the relational database world. I would say security nowadays is quite good with NoSQL databases. But this is, again, we started 10 years ago with simple databases that were focused on scaling
01:09:23
and didn't have many other features. And yeah, user authentication and security features were definitely among the things that were developed over time. But today, I would say the situation is good. ACID compliance, so this is another interesting topic.
01:09:47
In relational databases, we talk about ACID compliance when we want to say a database is good. So when I write the data to the database, it's safely durable on disk,
01:10:01
and there is some consistency and atomicity and isolation so that my queries and user experience are what I would intuitively expect to experience. And it's, yeah, already for a relational database, there was like decades of research
01:10:21
to get to the point where we are today with isolation levels. But with distributed databases, this is a whole new area. So in the SQL standard, you have four isolation levels. And then there are also many relational databases that
01:10:43
support snapshot isolation, which is not in the standard. So then you have like five. But with distributed databases, you have like 15 or 20 different isolation levels, and some of them can be very low, because if you insert data here on one server,
01:11:03
and then issue a query and it goes to another server, then the data that you just inserted isn't there, and from an application or user point of view, it might create situations that are unintuitive.
01:11:21
So for example, as an application, if I post something on a social media site, and then I reload the page, I would expect my post to be there. But in a distributed system, it could actually happen that my post that I just sent isn't actually there
01:11:41
and I can't see it, but five seconds later, I can. So this is like, even if we had a lecture of another hour, it would still only be a small introduction to this area. I actually have such a presentation as well.
01:12:02
But for example, I mentioned the Dynamo protocol, or Dynamo high availability, in this presentation, which for me was a really inspirational paper at the time. And it's a very smart solution where internally
01:12:21
a Cassandra cluster, for example, uses Dynamo. Internally there is this eventual consistency, so data updates arrive at different times on different servers, but using the Dynamo protocol, you can then also issue reads in a way that compensates for this. So that, yeah, if you specify
01:12:43
certain consistency levels in the Dynamo protocol, then you can actually have a consistent experience. So it means that if you write something, you can also read it back, and it's guaranteed to come back, or to fail if that cannot be guaranteed.
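The intuition behind those consistency levels can be shown with a tiny simulation: with N replicas, writing to W of them and reading from R of them guarantees overlap whenever R + W > N, so every read quorum contains at least one replica with the newest write. This is a toy model, not Cassandra's actual implementation; the replica states are invented.

```python
from itertools import combinations

N, W, R = 3, 2, 2  # replicas, write quorum, read quorum
assert R + W > N   # overlap condition: every read quorum meets a write quorum

# Each replica stores (timestamp, value). The write reached W = 2 replicas;
# the third replica still has the old value.
replicas = [(2, "new post"), (2, "new post"), (1, "old post")]

def quorum_read(answers):
    """Keep the value with the newest timestamp among the replicas asked."""
    return max(answers)[1]

# No matter which R replicas the read happens to hit,
# at least one of them has the newest write.
for subset in combinations(replicas, R):
    assert quorum_read(subset) == "new post"
print("every read quorum returns the newest write")
```

With R + W <= N, some read quorum could miss all updated replicas, which is exactly the unintuitive read-your-writes failure described above.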
01:13:01
So very interesting area. Okay, thank you. I enjoy the questions. It's usually better than the actual slides. So it's good that you are active on the chat. Encryption at rest, I think also belongs under the security heading and I didn't mention it.
01:13:24
I have to apologize, now this is an open source conference. I don't remember if Cassandra has it. We definitely offer it at DataStax in our commercial version of Cassandra.
01:13:41
MongoDB, I think it's also only in the commercial version, but I wonder if Percona has an open source version of encryption at rest for MongoDB. So you can ask Peter Zaitsev, who is after me in this same Hörsaal.
01:14:01
I don't know, I imagine Neo4j might have it as well, because their customer base is security conscious. And I also don't know about Redis. Security in databases is a top priority. It is; it depends a little bit on your customer base,
01:14:22
but in Europe, of course, we also have the European Union, which sets the bar quite high. So encryption at rest, for example, is often required for user data, so that you guarantee a certain level of data protection
01:14:41
for data that has privacy implications. So I totally agree. And it's again a big topic that could be a lecture of its own.
01:15:03
Okay, are there any question left? Okay, yeah, there, okay. Hans has more comments about the database breaches. I think we will have to find, maybe we can go to the Innenhof or something to continue this discussion
01:15:24
in a separate channel. And there is another comment: a useful presentation for NoSQL newbies like me. So I'm glad, yeah; this is hopefully a useful framework that you can use when you look at NoSQL databases.
01:15:42
I'm glad you find it useful.