Everything You Know About MongoDB is Wrong!
Formal Metadata
Number of Parts | 130 | |
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/49998 (DOI) | |
Transcript: English(auto-generated)
00:06
Right. This is Mark Smith. I don't think Mark needs an introduction. He's working for MongoDB. He's going to tell us lots of things that you probably don't know about MongoDB. So off you go. Thank you. Hi everybody. I'm known as judy2k online generally for reasons that aren't that interesting.
00:25
And I joined MongoDB last December. So I've been there around seven or eight months and I'm reasonably experienced. In the past, I've built systems on top of MySQL and Postgres. Lots of stuff on SQLite, Redis, CouchDB, Solr, but I'd never actually used MongoDB
00:44
before. So I've been learning a lot over the last seven to eight months. And there were two primary things that I really learned. One is that MongoDB is really powerful and kind of fun. And the other is that almost everything that anyone says about MongoDB online is wrong.
01:02
So I'm going to spend the next 20 to 25 minutes trying to bust a few myths and give you an idea of what MongoDB is, what it isn't, and how it works. So primarily this talk is aimed at people who don't currently use MongoDB and might be interested in what it is and hopefully
01:22
sort of working their way through some of the misinformation that's online. If you use MongoDB regularly, you probably know what it does, but it is a big and complex product. So hopefully you might learn something new anyway. So next slide please. So before I really get started in the real myths, there's this video on YouTube.
01:44
And it involves two dogs kind of having an argument and one uses a lot of technical jargon and a lot of slogans to describe how amazing MongoDB is and the other dog is actually a bit more down to earth and increasingly gets frustrated with the first dog and how unrealistic it is.
02:01
And when I announced that I was going to work at MongoDB, a friend of mine sent me a link to this video immediately, just in case I hadn't seen it. And the thing is, this video was published in 2010. It's 10 years old and you would think that no one at MongoDB has seen it, but nothing could be further from the truth. Next slide please.
02:24
This is my colleague Max and like me and everyone else at MongoDB. He has seen the video on YouTube. He has also bought the t-shirt. He's wearing his MongoDB is web scale t-shirt. If you were to walk around our head office in New York, you would see this sticker on about half the laptops in the office. So the next time
02:41
you're tempted to send this 10 year old video to somebody who works for MongoDB, maybe if you could think twice. I'd really appreciate it because it was funny at the time, but it's almost all incorrect now and I'm going to counteract a lot of the misinformation that it's been so easily spreading for about 10 years. So next slide please.
03:02
So now that I've got that out of the way, I'm just going to give a very high level overview of what MongoDB is in case you really haven't touched it at all. Next slide please. So MongoDB is a clustered database. The minimum number of machines you can have in a cluster is three if you want to have high availability and you don't want to lose any data.
03:25
You need an odd number of machines in your cluster so they can have a little conversation among themselves and elect a primary. Now the primary is the machine that essentially handles all the connections to the cluster. So once a primary is elected, all the clients, whether it's your Python client or any of
03:42
the other language clients that can be used to connect to MongoDB will connect to the primary. And it's the primary's responsibility to stream data down to the secondary. So they all store the same data, but the secondaries can be slightly out of date. So this obviously is for redundancy. It's for data resilience rather than scalability
04:01
because the primary essentially acts as a bottleneck to the cluster. You can force your client to connect to one of the secondaries to do, say, analytics queries on a machine that's under slightly less read load. And that's kind of an interesting model, but you have to understand that you will be working with stale data if you're talking to the secondary because the data comes in through the primary first.
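The point about needing an odd number of machines to elect a primary comes down to majority arithmetic. The following is a minimal sketch of that arithmetic only, not of MongoDB's actual election protocol; the function names are made up for illustration.

```python
def majority(members: int) -> int:
    """Smallest number of votes that is more than half of the members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be down while a majority can still vote."""
    return members - majority(members)


for n in (2, 3, 4, 5):
    print(f"{n} members: majority={majority(n)}, can lose {tolerated_failures(n)}")
```

Under this arithmetic a fourth machine adds cost without letting the cluster survive any more failures than three machines do, which is one way to see why odd sizes are the usual choice.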
04:22
Next slide, please. I have to keep moving backwards and forwards between my notes and what's on the screen. So what do we store in this database cluster? Well, MongoDB is a document database, so we store collections of documents. Now a collection is like a table in a relational database and a document looks a bit like this on the right. It's a map of keys and values.
04:44
This document in particular is from our sample movies database. It's for a movie called Blacksmith Scene that was filmed in 1893 and it's one minute long and it involves a blacksmith hammering at an anvil and then taking a break, wiping his brow, opening a beer and passing it round and then getting back to hammering on the anvil again.
05:04
It's like a TikTok video from 1893, which I think is kind of cool given I didn't even realize they had movie cameras in 1893. But I'd like to highlight a couple of things about this piece of data that I think are kind of interesting. So the first, next slide, please.
05:22
So the first is that it's hierarchical and kind of multidimensional. So the cast value here is a subarray and you can update individual parts of this document individually. So you don't have to update the entire document each time. So I can append items to this array if I want to or I can insert or delete items from the array.
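As a rough illustration of updating part of a document, the sketch below writes the movie as a Python dict and expresses the changes as MongoDB update documents using the `$push`, `$pull` and `$set` operators. The cast names, rating values and the `apply_update` helper are placeholders invented for this example; the helper only imitates, locally and very loosely, what the server does when such an update is sent with something like `collection.update_one(filter, update)`.

```python
from copy import deepcopy

# Placeholder document shaped like the one on the slide (values are illustrative).
movie = {
    "title": "Blacksmith Scene",
    "year": 1893,
    "runtime": 1,
    "cast": ["Performer A", "Performer B"],
    "imdb": {"rating": 0.0, "votes": 0},
}

# Update documents: each one touches a single part of the movie document.
append_cast = {"$push": {"cast": "Performer C"}}
remove_cast = {"$pull": {"cast": "Performer A"}}
set_rating = {"$set": {"imdb.rating": 6.2}}  # dotted path into the subdocument


def apply_update(doc: dict, update: dict) -> dict:
    """Hypothetical local stand-in for three update operators (not the real server logic)."""
    result = deepcopy(doc)
    for operator, fields in update.items():
        for path, value in fields.items():
            *parents, leaf = path.split(".")
            target = result
            for key in parents:  # walk down into nested subdocuments
                target = target[key]
            if operator == "$push":
                target[leaf].append(value)
            elif operator == "$pull":
                target[leaf] = [item for item in target[leaf] if item != value]
            elif operator == "$set":
                target[leaf] = value
            else:
                raise ValueError(f"unsupported operator: {operator}")
    return result


print(apply_update(movie, append_cast)["cast"])
print(apply_update(movie, set_rating)["imdb"])
```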
05:41
The other thing is the IMDB value is what we call in MongoDB a subdocument, but in Python you would call a dict or in JSON you might call it an object. Next slide, please. And the other thing is that there's some values here that are of types that aren't available in JSON. And I'll talk about that a little bit more in a moment.
06:02
The _id is an ObjectId type, which is a special type we use for generated IDs in the cluster, and released is displayed here as a native Python datetime object. It's not actually stored in the database as a Python datetime object, but it is a native MongoDB datetime object
06:20
and the Python driver converts that into something that's useful for you as a Python developer when it's retrieved from the database. So next slide, please. And I'll be covering why some of this is important in a later slide. So the first myth about MongoDB is that it's on version 2.4. You won't see people kind of promoting that fact online.
06:44
But if you install a relatively recent version of Debian, like Debian Jessie, and run apt install mongodb, you will get version 2.4. The problem is that version 2.4 was released in 2013. So it's getting on for eight years old.
07:02
And there have been seven or eight major releases since then, each of them fixing bugs and adding features and improving performance. In fact, the entire storage engine has been rewritten since then. So MongoDB 4.4, next slide, please, is almost a completely different database.
07:25
So it's moved on a long way from 2.4. And if you're interested in actually installing an up-to-date version of MongoDB, Google MongoDB Community, follow the first link that goes to the MongoDB website and follow the instructions for installing it on your favorite Linux, Mac, or Windows distribution.
07:44
Next slide, please. So the first myth, the second myth, sorry, that I've already alluded to is that MongoDB is a JSON database. You will see this quite a lot. It's a very pervasive myth. Next slide, please.
08:02
In fact, at the moment, it's on the MongoDB homepage that we store rich JSON documents. But this isn't actually true. Next slide, please. MongoDB is a BSON database. And this may sound like a sort of minor technicality because a BSON, it stands for binary.
08:20
Actually, I don't know what the S stands for. Anyway, it's a binary object notation. It's not just a sort of binary version of JSON. It's much more efficient to store and traverse than JSON, which you would kind of expect from the fact that it's binary. But it also includes extra data types like the ones that I showed you before.
08:40
And the ones you'd really care about as a developer are the ability to efficiently store binary blob data. It stores date times natively, so you can query against different aspects of a timestamp that's stored in the database. And it also includes various different numeric types like decimal numbers that are good for storing currency. This is important because BSON is totally fundamental to MongoDB.
09:02
It's the format of the protocol that's used to talk to the server. It's not a REST query that you use to talk to MongoDB. There's a binary streaming protocol built on BSON. Database queries are actually BSON structs. Database results are BSON. MongoDB is fundamentally a BSON database. Next slide, please.
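To make the difference from JSON concrete, here is a small sketch using only the Python standard library to show which values plain JSON cannot represent. The field names and values are placeholders. The second function assumes PyMongo's `bson` package (version 3.9 or later) is installed and is only a sketch of a round trip; it is defined but not called here.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

# Placeholder record with the kinds of values mentioned in the talk.
record = {
    "released": datetime(1893, 1, 1, tzinfo=timezone.utc),  # a datetime
    "ticket_price": Decimal("0.05"),                        # a decimal number
    "thumbnail": b"\x89PNG",                                # binary blob data
}


def json_unsupported_fields(doc: dict) -> list:
    """Return the keys whose values the standard json module refuses to encode."""
    unsupported = []
    for key, value in doc.items():
        try:
            json.dumps(value)
        except TypeError:
            unsupported.append(key)
    return unsupported


def bson_round_trip(doc: dict) -> dict:
    """Sketch only: encode and decode with PyMongo's bson package, if it is installed."""
    import bson
    from bson.decimal128 import Decimal128

    # Python's Decimal has to be wrapped in BSON's Decimal128 type before encoding.
    encodable = {
        key: Decimal128(value) if isinstance(value, Decimal) else value
        for key, value in doc.items()
    }
    return bson.decode(bson.encode(encodable))


print(json_unsupported_fields(record))
```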
09:26
Another thing you will hear about MongoDB everywhere is that it doesn't support transactions. That it's a, I forget what BASE stands for, basically available, soft state, eventual consistency database. And two of those have never been true.
09:41
And there's a reason for this myth existing. It's that essentially, until two years ago, we didn't support transactions. Next slide, please. Two years ago, we added transactions to MongoDB in version 4.0. And then last year, transactions were extended to support sharded clusters. So now whatever type of cluster you have for MongoDB will support transactions.
10:02
But despite this, it's not so necessary to use transactions in MongoDB as it is in traditional relational databases. Because we have a rich document format that allows you to store nested related data together in a document, you don't need to do so many joins across collections or tables to update data.
10:22
So you don't need a transaction to ensure that those data updates are done atomically. Updates within a single document are atomic by themselves. So if you have a good database design and you're storing related data together, then you can update all of that data in one atomic operation.
10:40
Having said that, now that MongoDB supports transactions, for all those times where you do need to do updates across different collections or across different documents, that facility is available to you. But if you're doing it too much, you probably need to look at your data design, your model design, and try to factor that out as much as possible.
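The two cases can be sketched side by side. In the first, related data lives in one order document, so a single update document changes the line items and the running total together. In the second, two collections are touched, so the writes are wrapped in a transaction. The database, collection and field names are hypothetical, and `place_order` assumes it is handed a connected PyMongo `MongoClient` for a deployment that supports transactions; it is defined here but not called.

```python
def add_line_item_update(sku: str, qty: int, price_cents: int) -> dict:
    """One update document: both changes apply atomically to a single order document."""
    return {
        "$push": {"items": {"sku": sku, "qty": qty}},
        "$inc": {"total_cents": qty * price_cents},
    }


def place_order(client, sku: str, qty: int) -> None:
    """Sketch of a multi-document transaction across two hypothetical collections."""
    db = client["shop"]
    with client.start_session() as session:
        # The transaction commits if the block exits normally and aborts on an exception.
        with session.start_transaction():
            db["inventory"].update_one(
                {"sku": sku}, {"$inc": {"instock": -qty}}, session=session
            )
            db["orders"].insert_one({"item": sku, "quantity": qty}, session=session)


print(add_line_item_update("almonds", 2, 350))
```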
11:01
Next slide, please. Another thing that goes hand in hand with the transactions thing is that MongoDB doesn't support relationships. You can retrieve multiple documents, but you can't join across collections. And this hasn't been true for quite some time. So we support joins, left outer joins, and have done for quite some time using a type of query called an aggregation pipeline.
11:27
Next slide, please. So that doesn't say a lot. So aggregation pipelines, yes, it's been supported since 2.2.
11:40
Next slide, please. So I'll show you just quickly what an aggregation pipeline actually looks like. It's a set of operations you can conduct on a collection. And then they're optimized so that they can be reordered or filtered out based on what will work most optimally with the data and the type of query that you're doing. Now, one of these operations is called a lookup operation.
12:02
So here I'm conducting a single aggregation operation on the orders collection starting at the top. And this is doing a left outer join with the inventory collection where orders.item equals inventory.sku, and then it will embed the resulting documents in a field called inventory_docs.
12:21
Next slide, please. So this becomes a bit clearer. And one more slide, please. This becomes a bit clearer when you can actually see the kind of data that's returned. So here it's looked up some documents matching that relationship in inventory, and then it's embedded them in the resulting order document. So here this happens to only have one embedded document,
12:42
but because this is a one-to-many relationship, it's an array of documents that comes back. So you could have multiple documents in here. And I quite like this because when you do this kind of query in a relational database, it flattens it down into duplicated rows in the result set. So whereas MongoDB actually takes advantage of the fact it's a hierarchical document format and embeds it in there.
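The join described here can be written as a single `$lookup` stage. The sketch below shows the stage as a Python dict, as it might be passed to `aggregate` on the orders collection, and then imitates the result shape in plain Python so the embedded array is visible. The sample orders and inventory documents are invented, and `emulate_lookup` is a hypothetical helper that ignores the finer points of how the server matches values.

```python
# The stage as it might be passed to something like db.orders.aggregate([lookup_stage]).
lookup_stage = {
    "$lookup": {
        "from": "inventory",       # collection to join with
        "localField": "item",      # field on the orders side
        "foreignField": "sku",     # field on the inventory side
        "as": "inventory_docs",    # array field that receives the matches
    }
}

# Invented sample data for illustration.
orders = [
    {"_id": 1, "item": "almonds", "quantity": 2},
    {"_id": 2, "item": "walnuts", "quantity": 1},
]
inventory = [
    {"_id": 10, "sku": "almonds", "instock": 120},
    {"_id": 11, "sku": "almonds", "instock": 5},
    {"_id": 12, "sku": "pecans", "instock": 70},
]


def emulate_lookup(left: list, right: list, stage: dict) -> list:
    """Local imitation of a left outer join that embeds matches as an array."""
    spec = stage["$lookup"]
    joined = []
    for doc in left:
        matches = [
            other for other in right
            if other.get(spec["foreignField"]) == doc.get(spec["localField"])
        ]
        # Every left-hand document is kept, even when there are no matches.
        joined.append({**doc, spec["as"]: matches})
    return joined


for row in emulate_lookup(orders, inventory, lookup_stage):
    print(row)
```

The order for walnuts still appears, with an empty array, which is the left outer join behaviour; the order for almonds carries both matching inventory documents in one array rather than being repeated once per match.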
13:06
So not only does MongoDB support relationships and joins, I think it actually does it quite intuitively. Next slide, please. Another thing people will talk about quite early on if you're discussing the pros and cons of MongoDB is sharding.
13:23
And there's a reason for this. Sharding is a pretty cool feature. Sharding is when you take your entire data set and you divide it into separate pieces. So you take all of your data, you maybe divide it down the middle, and then you have what are called two shards. So they're two separate data sets that are related.
13:41
And then you take one of those data sets and you put it on a cluster, and then you put the other data set and put it on another cluster. So you now have two shards. And when you do queries, you ideally send the queries to the cluster that holds the data that you want to get back. Next slide, please. The problem with this is that I said you need a minimum of three machines in a cluster.
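Sending a query to the cluster that holds the data can be pictured as a lookup against split points on a shard key. This is a simplified sketch of range-based routing under invented names (`surname` as the shard key, one split point, two shards); it is not how MongoDB's router is implemented.

```python
from bisect import bisect_right

# Hypothetical layout: keys below "m" live on shard-a, keys from "m" upward on shard-b.
SPLIT_POINTS = ["m"]
SHARDS = ["shard-a", "shard-b"]


def shard_for(key: str) -> str:
    """Pick the shard whose key range contains the given shard key value."""
    return SHARDS[bisect_right(SPLIT_POINTS, key)]


def shards_for_query(query: dict, shard_key: str = "surname") -> list:
    """A query that names the shard key goes to one shard; otherwise all shards are asked."""
    if shard_key in query:
        return [shard_for(query[shard_key])]
    return list(SHARDS)


print(shards_for_query({"surname": "adams"}))
print(shards_for_query({"surname": "smith"}))
print(shards_for_query({"first_name": "mark"}))
```

The last call is the case to avoid where possible: without the shard key in the query, every shard has to be consulted.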
14:04
As soon as you have two shards, as soon as you have sharding, you need three machines per cluster. So you're looking at a minimum of six database servers. And because you need to send the queries and the data updates to the correct shards, you actually need some servers to negotiate that in front of them.
14:20
So you need a shard server. And because you want redundancy in case one of the machines goes down, you actually need a minimum of two of these machines. So you go from a minimum of three machines, which is a significant cost in a production environment, to a minimum of eight machines. Next slide, please. So what we recommend generally is if you're working with large data
14:40
that isn't currently performant on your cluster or won't currently fit on your current machine, we recommend you look at upgrading your machine first. As soon as you start adding shards, you start to limit the types of queries you can do. You add a huge amount of complexity to your operations cost in terms of actually managing the cluster.
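The machine count from the sharding discussion can be written down as a small calculation. It deliberately follows the talk's own count (three machines per shard plus two routing machines) and leaves out any other components a real deployment may need; the function name and defaults are illustrative.

```python
def minimum_machines(shards: int, members_per_shard: int = 3, routers: int = 2) -> int:
    """Machines needed under the talk's count: no routers until there is more than one shard."""
    if shards <= 1:
        return members_per_shard
    return shards * members_per_shard + routers


for shard_count in (1, 2, 3):
    print(shard_count, "shard(s):", minimum_machines(shard_count), "machines")
```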
15:03
So buy more RAM, for example. If your data isn't fitting in memory, buy more RAM. It's probably cheaper than buying another five servers. Or a faster processor, if that's your bottleneck. Essentially look at the thing that's limiting your scaling and attempt to upgrade that first. Look at the cost of that first. Having said that, sharding is there for you if those aren't solutions at the current time.
15:25
Another really interesting thing that sharding can do is if you shard on the location of the users who will be accessing the data, you can move that shard cluster to be geographically closer to those users so they get less latency when they're doing queries against your database, which I think is kind of cool.
15:42
Next slide, please. But if you are looking at upgrading your cluster, I recommend you look at MongoDB Atlas, which is the database hosting service that MongoDB run. They'll run your database for you. They will handle scaling your cluster up and down as required. So you'll only be paying for the kind of usage you require.
16:02
And it still supports sharding and things like that if you need them at a later date. Also, it takes huge amounts of operation spend away in terms of doing backups and handling redundancy and things. Next slide, please. So the reason people talk about sharding is because micro sharding used to be a big thing.
16:22
So back in 2.4, if I remember correctly, MongoDB had a lock in the storage engine that meant that it could only efficiently use one core on a machine. So it was a bit like the global interpreter lock in Python in that respect. Actually, I apologize. It was before MongoDB 3.2.
16:41
So some enterprising DBAs discovered that if they sharded their data and ran multiple shards on a single machine, they could make use of all the cores on that machine, which was quite clever, but a bit of a hack. And this got known as micro sharding. Since 2015, MongoDB has a non-blocking storage engine called WiredTiger, which doesn't require this anymore.
17:03
So it makes full use of the cores on your machine. So people used to talk about sharding because it was required to optimally work on multi-core machines. But this is another historical thing. Next slide, please. Oops, that wasn't meant to come up. Those are the notes that were actually disabled, and somehow it's exported them to PDF anyway.
17:25
Could we move to the next? I'll keep on saying next until we get to next, next, next, next, next. It was the last slide, please. So myth number six, MongoDB is insecure.
17:41
So there have certainly been data breaches with MongoDB, and it has developed a bit of a reputation for it, which I think is slightly unfair. Next slide, please. So the cause of this is that MongoDB, in most distributions with the older versions, automatically binds to the network and it automatically starts up with no authentication.
18:02
So hopefully when you're hosting a production server or anything on the internet, you have a firewall either on your machine or in front of your machine or both. It's the default on Amazon Web Services to have a firewall that only exposes your SSH port. Unfortunately, less experienced developers, I would say,
18:20
would usually develop their services as a separate app server and database server. And when they found that they couldn't connect to their database server, they would log into all the firewalls that were available and essentially open up the MongoDB port. And at no point think about adding authentication to their database. And what this means when you expose that port to the internet is that a bot will find your instance of MongoDB and essentially steal your data.
18:42
Or these days, what it does is it encrypts all your data within your database and adds a document telling you where to send Bitcoin to get your data unlocked. I would argue that if, as a DBA, you put an unsecured database on the internet and somebody steals your data, then it's more your fault than anything else.
19:00
But modern versions of MongoDB also don't follow this behavior anymore. So nowadays, MongoDB won't bind to the network. It can only be accessed from localhost until you've added some authentication. You can use an override flag to override that security feature if you want to. You still can stick unsecured instances of MongoDB on the internet if you really want to,
19:24
but I strongly recommend that you don't do that. On top of this, MongoDB uses industry standard security. It uses TLS for connections by default. It uses SCRAM-SHA-256 for authentication. There's no reinventing the wheel here. It's no less secure than any other database you might use.
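On the client side, those two pieces (TLS and SCRAM-SHA-256 authentication) can be requested explicitly in the connection string. The sketch below only builds such a string with the standard library; the host, database, user and password are placeholders, and `authSource=admin` is an assumption about where the user was created. The resulting string is the kind of value a driver's client constructor accepts.

```python
from urllib.parse import quote_plus, urlencode


def build_uri(user: str, password: str, host: str, database: str) -> str:
    """Build a MongoDB connection string that asks for TLS and SCRAM-SHA-256."""
    options = urlencode({
        "tls": "true",
        "authMechanism": "SCRAM-SHA-256",
        "authSource": "admin",  # assumption: the user is defined in the admin database
    })
    # Credentials are percent-encoded so characters like "@" or "/" stay unambiguous.
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}?{options}"
    )


print(build_uri("app_user", "p@ss/word", "db.example.internal:27017", "shop"))
```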
19:42
Next slide, please. And another quite persistent myth about MongoDB is that it loses data. And I will try to explain where this myth comes from. Next slide, please. So first, I would just say it's difficult to prove a negative. MongoDB is used and trusted by some big banks like Barclays and Morgan Stanley.
20:04
And if it lost data, then I think that would be unlikely. It's also used in a bunch of other industries that really care about not losing their data. Next slide, please. The reason people lose data in MongoDB is because MongoDB actually allows you to trade off
20:24
between the robustness of your data and performance. And the default is perhaps not ideal. So by default, when you send an update to your MongoDB cluster, it uses a configuration option called a write concern,
20:41
which by default is set to one. And this means that as soon as the data has been accepted by one of the databases, one of the servers in your cluster, which will be the primary, you get an acknowledgement back that the data has been accepted. Unfortunately, this means that if you get your acceptance back and then the primary goes down with a catastrophic disk failure, then you've lost your data. So that's not ideal from a perspective of caring about not losing your data.
21:04
So almost every time, unless you really, really, really care about squeezing out all performance and you don't mind losing some data from time to time, you should use a write concern of majority, which means in a cluster of three machines, it will be accepted by two of those machines and written to disk before you get an acknowledgement back that the data has been accepted.
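The difference between the two settings can be sketched as the number of machines that must accept a write before the acknowledgement comes back. `acks_required` is a simplified illustration that treats every machine as a full voting member. The second function assumes PyMongo is installed and that it is given an existing collection object; it is defined but not called here.

```python
def acks_required(write_concern, members: int) -> int:
    """How many machines must accept a write before it is acknowledged (simplified)."""
    if write_concern == "majority":
        return members // 2 + 1
    return int(write_concern)


def with_majority_writes(collection):
    """Sketch only: return a view of a PyMongo collection that waits for a majority."""
    from pymongo.write_concern import WriteConcern

    return collection.with_options(write_concern=WriteConcern(w="majority"))


print("w=1 on 3 machines:", acks_required(1, 3))
print("w=majority on 3 machines:", acks_required("majority", 3))
```

With a write concern of 1, losing the one machine that accepted the write can lose the data; with majority on three machines, a second machine already holds it before the client is told the write succeeded.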
21:23
It's a little bit slower, but you won't lose your data. Again, this kind of stems from people not quite knowing what MongoDB is or how to use it, but I would say that the default is unhelpful in that case. Next slide, please. So this really sums up all of the other myths that I've covered.
21:43
People will talk about MongoDB being really easy, and I've certainly found it very easy to get started. Storing data in the database is really easy. If you have some JSON data or just some data structures in Python, you can just start storing them in MongoDB without really knowing how to efficiently use the features that are in there,
22:00
how you might lose data, or how you might squeeze out some extra performance. You don't even need to know relational theory like you do with a relational database. You can just get started, and this is kind of a problem, I would say, overall in terms of general image, because you've got a lot of inexperienced users. So its strength is also maybe one of its biggest disadvantages
22:21
from a marketing perspective. So what you really need to do is learn how to use MongoDB properly from an operations and a development perspective, and MongoDB themselves provide kind of two paths to this. The documentation is really comprehensive and it's always being expanded and revised,
22:40
but less known is that MongoDB run this thing called MongoDB University, which is a bunch of online courses that you can do in many cases for free that will dive into different topics around MongoDB in terms of hosting or indexing or complex aggregation pipeline queries and really help you become an expert at using the product. So if you do decide to take up MongoDB and actually use it in production,
23:04
I recommend doing a few of these courses to make sure that you're really using it properly and that you know all the features that are available to you. Next slide, please. I can't remember quite what time I had until. Is it quarter to? You have two minutes left officially,
23:22
but we have a longer break after this and so maybe you can overrun by a minute or two. I think I only have two minutes anyway. So while I have an audience, I'd just like to pitch something that I personally have been working on recently. So my team are taking the Johns Hopkins University COVID-19 data set,
23:41
which is stored as a bunch of CSV files that every so often change format or get updated to GitHub incorrectly and we've turned it into a queryable MongoDB database cluster. It has a username and password of readonly and readonly. If it sounds interesting, the bit.ly link here will take you to all the documentation on how to connect to it and use it from lots of different platforms
24:01
and so if you just kind of want to have a play with MongoDB and with an interesting data set that might teach you something useful, this is a good place for that. If you decide to build something either with it or without it that's based around COVID-19 and in some way good for humanity, we're offering free credits for anybody doing that.
24:20
So if that sounds interesting as well, please do get in touch with me on the Discord. Next slide, please. So this slide had a lovely build up while I was talking, but, you know, we'll just go straight to the end. So now unlike most of the people on the internet, you will hopefully be right about MongoDB at least some of the time.
24:45
Thank you very much. Thank you very much for the excellent talk. That was just in time. I love you, I love you.