Real World Schema Design with MongoDB
Formal Metadata

Title | Real World Schema Design with MongoDB
Number of Parts | 170
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/50795 (DOI)
Transcript: English (auto-generated)
00:02
So, how are y'all? We're good? Who was in here last session? Okay, so back for more pain, awesome. And who's new? This is going to be the exact opposite hands. Awesome. Okay, I just wanted to see y'all. I'm going to go through this again. I did this last time, but who's used MongoDB?
00:25
Who's heard of MongoDB? You can't count like I saw it on the presentation list. That doesn't count. So I'm going to talk about real world schema design with MongoDB.
00:44
This is a little more than just basic schema design, but I'll try to go slow enough that it will be understandable and I'll explain some basic concepts here at the beginning. And I definitely have some extra time built in here, so please ask questions, raise your
01:05
hands, and we'll delve into some interesting things. So the description for this thing says three examples, and I actually have five, and we'll see how far we get, and whatever. So that's that. Questions up front? Questions? Questions? No?
01:22
Okay, good. Because what are you going to ask about, right? So I think the biggest thing that's important to understand about MongoDB is that success with MongoDB is going to come from using the right data structures and the right schemas. You might say that I thought MongoDB was a dynamic schema, and it is.
01:44
You can put all kinds of things together in one collection, in one logical grouping of things, so dinosaurs and dogs and cats and people and cars and fruits. We don't care, but the reality is that your application is going to have a model for
02:03
your entities, and having that model correct is going to be what's going to make your MongoDB deployment really good and work well. If you pick the wrong thing, it's easy enough to come back from it and make changes and evolve over time, but you might suffer some pain, so it's good to think about
02:23
and almost more important to think about this up front with MongoDB. So some terminology here for you non-MongoDB people. A database is still a database. We still call it that. What is a table in a relational system is a collection for us. It's a collection of documents, whereas in a table, it's a table of rows.
02:44
We have a collection of documents, and those documents are JSON. We've all worked with JSON. It's just JSON. Anything you can put in JSON, you can put in MongoDB. We still support indexes, so we have primary indexes and we have secondary indexes,
03:02
and we have something called multi-key indexes, which I'll explain when we get to an example of that. We don't do joins. We don't let you take something from one collection and join it to another, mostly because you don't generally need to do that, because your schema will help you model that correctly such that you don't need a join.
03:24
And in its place, the modeling for that would be something called embedding. So just like in JSON, you can have JSON objects embedded in JSON objects, and you can have JSON arrays embedded in JSON objects and embedded in whatever. So you can do that in MongoDB as well, so embed your stuff. It's called embedding.
03:41
And then we also do something called linking. We actually don't do anything called linking. It's really just you putting an ID in there and calling it a foreign key, but we don't do any referential integrity on that. But we call it linking. We gave it a name. So working with documents. Here's a JSON document. We can put anything in there, and we've got an array with embedded documents
04:04
and an embedded document in there and numbers and dates and strings and whatever. Every top-level document in MongoDB is going to have that underscore ID field. That is your primary key. And it can be of any type. It can be a date. It can be a complex type. It can be another document.
04:24
The only criteria is that it has to be unique. So that's your primary key. It has to be able to uniquely identify this document. And as far as indexes go, you can index any of these things. You can index title, or you can index pages and language, or you can index publisher.location.
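For reference, a top-level document like the one being described might look roughly like this (the collection name and the field values are illustrative, not the exact slide):

```javascript
// A hypothetical entry in a "books" collection. _id is the primary key;
// it can be of any type as long as it uniquely identifies the document.
{
  _id: ObjectId("5368a123c4d5e6f7a8b9c0d1"),
  title: "MongoDB: The Definitive Guide",
  pages: 432,
  language: "English",
  publisher: { name: "O'Reilly", location: "Texas" },   // embedded document
  authors: [                                            // embedded array of documents
    { name: "Kristina Chodorow" },
    { name: "Michael Dirolf" }
  ]
}
```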
04:41
So I can find all the documents with a publisher in Texas, which is the best state in America. I'm from Texas, by the way, in case you couldn't tell that. And you can actually index into that authors array, too. So I could index authors.name. And this is the multi-key index I was saying.
05:00
We'll actually take that out and blow out one index entry for each element in that array, so that when you query it, you can just say, hey, find me all the documents where the author's name is Kristina Chodorow. And we'll find that. And it'll be just as efficient as a normal index lookup. It's called a multi-key index, where we'll blow out the values of that thing.
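A minimal sketch of those indexes in shell syntax (createIndex is the current name; older shells used ensureIndex):

```javascript
// Index a top-level field, a path into an embedded document, and a path
// into the authors array. The last one becomes a multi-key index: one
// index entry per array element, all pointing back at the parent document.
db.books.createIndex({ title: 1 });
db.books.createIndex({ "publisher.location": 1 });
db.books.createIndex({ "authors.name": 1 });

// Both of these queries can then use an index:
db.books.find({ "publisher.location": "Texas" });
db.books.find({ "authors.name": "Kristina Chodorow" });
```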
05:21
So questions on JSON or storage or that kind of thing? Good. Oh, yes, a question: do we have transactions? We don't have transactions. We don't have transactions across collections or across multiple documents in a collection. Within a document, when you're doing an update on a particular document,
05:42
we have something called atomic modifiers. And so if you increment a field and set a field and push something onto an array all in one operation, all of those will happen together or none of them will happen at all. So we guarantee transactions basically at the document boundary, if you want to think about it that way.
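As a hedged sketch of those atomic modifiers (collection and field names here are made up for illustration):

```javascript
// All three modifiers apply to the one matched document atomically:
// either they all take effect or none of them do.
db.players.update(
  { _id: "jane" },
  {
    $inc:  { score: 10 },                // increment a field
    $set:  { lastPlayed: new Date() },   // set a field
    $push: { badges: "first-win" }       // push something onto an array
  }
);
```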
06:01
And yes, I'll ignore that and move on. Any other questions about this stuff? Nope. So whereas in relational systems, you focus a lot on data storage, right? So you get this model or this entity,
06:22
this ERD from your business people or whatever, and you're supposed to model that. And there's a very prescriptive way you do that in a relational system. You make it 17th normal form and you put it in the database. And that's how things work. Or in Mongo, we actually care a lot more about how you use the data. We care about what queries you're going to ask.
06:43
And we care about making those particular scenarios really fast. And not so much, let's make it third normal form or fifth normal form. Another way of saying that is, the traditional approach is kind of: here are all my answers, I have no idea what they're going to ask, so I can't optimize for any of those.
07:01
Where we actually care a lot about the questions you have. What are the questions I'm going to ask? Let me build my schema around answers to those questions so that we can make it really fast. So here's example one, is a message inbox. And that's so descriptive. So there's another big slide with message inbox
07:21
and a bunch of things on it. So think Twitter feed, I guess, or LinkedIn timeline, or Gmail inbox, or whatever. So we're going to talk about how to model this in MongoDB. So right, we're going to send messages. So I need to figure out how, after I send a tweet, or Alvin, Alvin actually works at our company.
07:40
He's our quality assurance guy, CAP. What does CAP stand for? Something available, whatever, it doesn't matter. It's not the CAP theorem, it's a different CAP. And I don't remember what it is. Quality, it's quality, quality assurance. So he's our quality assurance guy, and he's awesome and funny and a good presenter, and I'm going to stop talking about Alvin now.
08:01
And he's going to tweet, and he's going to tweet, and we have to get that to all his followers, right? And then, of course, all his followers have to, like, take all these tweets and put them into one timeline. So how would we model this in MongoDB is the question, right? So here's our design goals.
08:21
We need to be able to efficiently send new messages to our followers. And almost more importantly, we need to efficiently be able to read the inbox. If you think about the load on a system like this, it's more likely that it's read heavy than it's write heavy. So the second point here is actually probably more important
08:41
than something we should focus on. There's many ways to do this. And I can't emphasize enough that each one of these is going to be good given a different use case. So if you have two different applications operating on the same data, it may be very likely that one is write heavy,
09:01
one is read heavy, and that you should model that data differently based on that application. That's why I'm saying that usage is really important when you're doing schema. So there's three approaches here, and each one may be completely valid. In a read heavy schema, we'll see which one actually works out well. So fan out on read, fan out on write,
09:20
and fan out write with bucketing. And I'm going to have to explain this term called sharding. Who knows what sharding is? Think partitioning your data or federating your data on some piece of information. So simplistic form is if I have a three sharded system, I can say, hey, everybody with the name
09:41
that starts with A through F goes on this server, and everybody with G through M goes here, and N through Z goes on this server over here. And this is how we get high scalability in MongoDB. We shard things. And in this particular example, I'm going to shard the inbox on who the message was sent by,
10:03
who was it sent from. And your application doesn't actually need to care a whole lot about this. There's a router that sits in front of all the shards, and so your client application will just send this to the router, and it will handle distributing these things automatically.
10:22
So in this particular case, we're going to shard our collection on who it was sent from, and then we're going to create an index for reading on who it sent to and the time it was sent. And then there's our message. So it's from Joe, and it's to Bob and Jane,
10:40
and new date and hi. So we're saying hi to Bob and Jane. And we're going to save this. Oh, and this is shell syntax, by the way. This is the syntax you would get in the shell. We have lots of drivers. We have 10 officially supported drivers and a number of community supported drivers out there. And their syntaxes would all vary.
11:02
They would mostly look similar in their idiomatic way. But this is shell syntax. This is kind of what's in the docs and all that kind of stuff. So save it. Then when I want to read the inbox, I just go and find all the messages sent to Joe and sort them in reverse order.
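Reconstructed in shell syntax, the fan-out-on-read approach might look roughly like this (database, collection, and field names follow the talk, but the exact slide code may differ):

```javascript
// Shard on the sender; index for reading by recipient and send time.
sh.enableSharding("inbox");
sh.shardCollection("inbox.messages", { from: 1 });
db.messages.createIndex({ to: 1, sent: 1 });

// Send: one document, routed to the sender's shard.
db.messages.insert({
  from: "joe",
  to: ["bob", "jane"],
  sent: new Date(),
  message: "Hi!"
});

// Read an inbox: newest first, but this has to ask every shard.
db.messages.find({ to: "joe" }).sort({ sent: -1 });
```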
11:20
So sort them descending. Questions on that? Fan out on read? Easy. I've got a graphic for you. So this is what it looks like when I send the message. We've got three shards here. And because we're sharded on that from, then all the messages from Joe are always going to end up on that same shard. So we know exactly what shard we need to go to
11:42
to send this message, to put this message into that bucket. And then when we go to read it, what we're actually going to do is we want to find all the messages to Bob. And so because we've sharded on who it was sent from, we have to actually go to every one of the shards and ask, hey, do you have any messages for Bob?
12:03
And every one of these shards will try to figure it out. And there will be a lot of random IO going on on those shards trying to find all this stuff. And it's really not that performant. So sending is super efficient, right? This is great if you're write heavy. If you do a thousand writes for every read, this might be a great use case
12:22
and a great way to model this. If we're very read heavy, this probably isn't the right scenario. Good questions? Oh yeah, I got to go through this thing. Okay, so yeah, the bottom point is the real important one. There's a lot of random IO going on each one of these shards on every one of our shards. If we have a hundred shards,
12:41
we're sending this query to a hundred shards and asking them all to do a whole bunch of work to satisfy one query. And that's probably not good in a large number of situations. So yeah, that's that. So maybe we're going to fan out on write instead. So in this case, let's shard our collection on a new field called recipient
13:01
and also when it was sent, okay? And so here's our message. Again, from, to, and notice that recipient's not in there. But we do have a loop further down. And we're going to loop over everybody in that to field and add a new recipient field. So when I'm sending a message, when I'm sending this message,
13:21
the message will look like, from Joe to Bob and Jane, recipient Bob. So we've actually put Bob in two places. We've denormalized it a little bit. That's okay. That data is never going to change, right? After I send the message, it's always gone to Bob and it went to Bob in the past and I don't have to deal with integrity of that data. If you care about duplicating that data,
13:40
you can always remove it from the array if you want. And then the sent date and the message are in there. And so Jane's message would look similar, from Joe to Bob and Jane, recipient is Jane instead of Bob, right? And then we find it. Find me all the things where recipient is Joe and sort them in reverse order. So that's finding it. And what this looks like is this.
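In shell syntax, the fan-out-on-write send and read just described might look roughly like this (names are assumed; the database is already enabled for sharding as before):

```javascript
// Shard on the recipient and the send time instead of the sender.
sh.shardCollection("inbox.messages", { recipient: 1, sent: 1 });

var msg = { from: "joe", to: ["bob", "jane"], sent: new Date(), message: "Hi!" };

// Send: write one copy per recipient, denormalizing the recipient field.
msg.to.forEach(function (rcpt) {
  db.messages.insert({
    from: msg.from,
    to: msg.to,
    sent: msg.sent,
    message: msg.message,
    recipient: rcpt        // duplicated on purpose; it never changes
  });
});

// Read: targets the single shard holding that recipient, newest first.
db.messages.find({ recipient: "joe" }).sort({ sent: -1 });
```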
14:01
So now we're actually going to fan out on, well, we're going to fan out on write. And so when I want to send to Joe, I've got to go find the shard where Joe is, where Joe's recipient is. Sending a message to Joe will always go to the same shard. But Joe and Bob and Jane may all live on different shards.
14:20
So I basically have to tell all the shards, hey, I want you to send this thing to everybody and they'll figure it out and put it in the right place. But our reads are much better. Because we're always finding the document sent to Joe, we know that Joe always lives on this first shard.
14:40
And so we can go in there and find it. No big deal. The only problem with this is that we're still doing a lot of random IO on that one shard. So we've now taken our random IO from every shard in the system and we've moved it down to just one of the shards. Oh, I got to go through this again. And again, last point on this is the big one.
15:03
We still don't want to do a lot of random IO. So this next solution is interesting. It's not intuitive. It's something you would have to kind of think about to get there. But it makes a lot of sense. And I'll break away from my slides on that
15:22
to prove that this works. Because you might be skeptical. I know you'll be skeptical. So I was skeptical. So we're going to do this thing called fan out on writes with buckets. Okay, so basically what we're going to do is we're going to create a bucket of some n number of messages.
15:42
So a bucket being a single document for Joe. And in that, we're going to push all the messages onto an array. And so we're going to have one bucket for Joe that has maybe up to 50 messages in it. And then when that thing fills up to 50, then we'll create another bucket for Joe.
16:01
And so what we're basically doing is we're putting all the messages we're going to send to Joe in one place on one shard. So that would actually look like this. We're going to loop over that message to field the same way we did. And we're going to build a query here. Find me a document or find me a bucket
16:24
where the recipient is Joe and the message count is less than 50. And if you find one, increment the message count by one, push this message onto the messages array. And then if this happens to be the first document,
16:43
if you're upserting this document, go ahead and set that created date for me. And this happens atomically. So either that increment and that push and that set will all happen or none of them will happen. So this is that transactional boundary at a document level. And then reading is actually really, really simple.
17:03
Still the same thing. Find me all the buckets where the recipient was Joe and sort them by this created date in reverse order. And chances are I might get one or two back depending on how many messages, how popular Joe is. So do y'all believe this works? I actually didn't for a while.
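Pulling the pieces together, the bucketed send and read might look roughly like this in shell syntax (collection name, field names, and the bucket size of 50 are assumptions based on the description):

```javascript
var msg = { from: "joe", sent: new Date(), message: "Hi!" };

// For each recipient, find a bucket that still has room and, in one atomic
// update, bump the count, append the message, and stamp the created date
// if the upsert had to make a brand-new bucket.
["bob", "jane"].forEach(function (rcpt) {
  db.inbox.update(
    { recipient: rcpt, count: { $lt: 50 } },
    {
      $inc:         { count: 1 },
      $push:        { messages: msg },
      $setOnInsert: { created: new Date() }
    },
    { upsert: true }
  );
});

// Read: usually only one or two buckets per recipient, newest bucket first.
db.inbox.find({ recipient: "joe" }).sort({ created: -1 });
```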
17:21
So I can prove it to you. So you'll see the shell there. Okay, so I'm going to copy this in here. Oh, oh geez. Now I have to look at a different monitor. This will be fun. So let's use Inbox.
17:41
So this is a MongoDB. Is that big enough for you guys to see? Can y'all see that? Okay. I'll try to zoom in if I need to. So I'm going to use Inbox. Inbox actually doesn't exist. I can show DBs. Inbox doesn't exist. We're going to create things as you use them.
18:01
So you don't have to go and pre-create collections or create databases. We'll just do them on the fly for you. So I'm in the Inbox database. And so here's my query. So find all the recipients that are Jack. I changed Joe to Jack. I have this thing with J names. I always use J names. So Jack here.
18:20
And I've changed that less than 50 to less than three for demo purposes. So I don't hit enter 50 times to prove this to you. So less than three. So my buckets are going to be size three. And then I'm going to put an update. Oh, wrong screen. Come back over here.
18:41
And I'm going to put this update in there. Okay. Oh, I need a message. Let's say, what do you want the message to say? Anybody? Thank you. Wow. That's pathetic. Really. But I appreciate it. That's what I put in my slide was hi, too. So we're pathetic together.
19:02
Awesome. Okay. So hi. So let me try that update thing again. Okay. So now that works. Yeah, it failed because this message didn't exist, and it didn't know what to put there. So JavaScript failure, which tends to fail a lot, right? JavaScript. Okay. So increment the message count by one,
19:22
push that message onto the message array, and set on insert that new date. Okay. So finally, I'm going to run this. As an update statement. Oh, not foo. Let's do... Oh, no, foo's fine.
19:41
Yeah, we'll do foo. And the trick here is that last one, upsert true. So find it, and do this when you update it. And if you don't find any buckets in there that match this criteria, go ahead and upsert it. Go ahead and put it in there for me.
20:00
Okay. And so this is telling me that it just did an upsert, right? It upserted one, and that's the ID of the new document that was put in there. Let me show you that document real quick. So db.foo.find, and I'll make that pretty for you guys. And I did something stupid.
20:21
Oh, it's not DBO. Y'all are supposed to help me here. Okay. So there's that document, right? So I've got recipient and message count at the top level, and then the messages are kind of embedded down in there. It doesn't have to be a string. It was a document in the other example, but it's a string here, right? And then that created date got put in there, okay?
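Using the field names from the sketch above, the bucket document on screen would look roughly like this after the first upsert (the _id and the date are whatever that run generated):

```javascript
{
  "_id": ObjectId("..."),      // generated by the upsert
  "recipient": "Jack",
  "count": 1,
  "messages": [ "hi" ],
  "created": ISODate("...")    // set by $setOnInsert
}
```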
20:41
So let me go ahead and run this update again. Same update thing. And you'll notice this time it's not upserted, but it says it matched one, okay? And it modified one. So then if we go back and look at that thing, we actually have two messages in there, two hi's, okay?
21:01
I'm going to do this one more time. And again, matched and modified. And if we go back and find this, then we'll see that there's three messages in there, okay? When I run this again, we're going to look for a recipient Jack that has a message count less than three,
21:21
but my message count is three, so it's not going to find one this time, just like it didn't find one when it was completely empty. And I'm going to do this real quick so I get a new created date. And then update it again. So you'll notice this time we upserted instead of matched and modified.
21:40
So when I go and, well, let me clear this out, actually. Let me find a shorter one to delete. So now I've got two buckets for Jack, and I'm going to keep rolling over like this until, you know, if I had a bigger bucket size, it would be 50, right? Does this make sense?
22:02
Maybe it doesn't. Seriously, does it make sense? Because we can keep talking about it. Okay, I'm going to take that as a yes. Yes, you would need to keep specifying the bucket size on each upsert.
22:21
That's a great question. Yeah, so this is going to be part of your application logic. You could put it in a configuration file, right, and read it from there so that you could change it if you wanted, but yeah. And you probably wouldn't actually want to change it after you've already got things in production.
22:40
Like if you changed your bucket size from 10 to 20, right, where we'd started rolling over buckets after hitting that 10, we'd start matching earlier buckets, and our messages would start getting out of order, right? Because now, if we're less than 20, well, we have two buckets that match, so it'll pick one and push that message into it until it hits 20,
23:01
and now I've got messages kind of out of order, right? Great question. Yeah, but if I just do, actually, let's do that. Let's try this. So, sorry, what was the question? The question was what would happen if I changed the bucket size after the fact, basically?
23:24
So if I changed it from 10 to 20, right, we'd start matching previous buckets because the message count was now smaller in all those. And then different IDs.
23:41
Yes, they have different IDs, but we're not querying on ID. We're querying on recipient, right? If we were querying on ID, we absolutely could do that, but we're not. And your recipient is not unique, correct. So when we did an update, it actually might put it in more than one, right, because it matched more than one, right?
24:02
So you probably don't want to change your bucket size. So this is something you would want to think about at the beginning of your application. Again, why you would think about schema design ahead of time. Well, let me change it.
24:27
How do you find everyone? For Jack, yeah. You do the query. Both the buckets have different IDs, absolutely.
24:44
Right, is that your question? Yeah, I'm using Jack as if it's unique, right? Jack's probably not unique. You're going to have some kind of identifier for who that target was. I'm using Jack because I know Jack's unique
25:01
because he's the only guy I'm using. Yeah, yes, you would want that recipient to be unique, right? You wouldn't want to be, yeah, absolutely. Cool, absolutely.
25:21
That would actually work really well. So if you wanted a dynamic thing, then make your update query match only on the first document. Would that work? You think it would. I don't think you can specify a sort during an update.
25:40
Correct, you can't specify, oh, is that true? I'm going to look up the docs after this. Our doc site is excellent. So if you go to docs.mongodb.org, we really have a top-notch doc team, so they have done a great job there.
26:02
And I'm unclear as to whether that's allowed. I don't think it is, but it might be. If you could, yeah, you could do it in two steps. You could always do it in two steps, absolutely. Find the document, find the bucket that I want to put this thing in,
26:21
and then do a second one to push it in. Absolutely, absolutely. More questions on this that I can answer, maybe. Yes, it's great for immutable messages. Yeah, if you're wanting to mutate things or remove or whatever, then this becomes different. And I actually, the next example of this thing
26:42
is based on that. Good questions, good questions. Yes, it is. Yeah, right.
27:11
Yeah, we actually, there's a server-side lock that will prevent that from happening. So the question is, if I had two upserts running at literally the exact same time,
27:21
and then neither one of them match, would they both insert? And no, one of them would happen first, and the second one would run after it. Okay, okay, good. Now I can get rid of this. Move on to things that I might know more about.
27:41
Oh, oh, and I've got diagrams for this too. So sending looks exactly the same as we did with the fan out on write without buckets. So basically I'm going to create a document, or I'm going to modify at least a document per send, per recipient. But our read looks a lot better, right?
28:02
There's no more scatter, like random IO on this shard, because we're sharding on a recipient, and also on that created date, then it's most likely that even if we have multiple documents, they'll appear really close together on disk. So we can probably almost just seek down and find out everything. So we've solved that problem there.
28:24
Oh, and I got this thing again. I don't know why I did that. Yeah, that was it, okay, cool. Oh, now there's one more. And okay, so here's history. When do I go to?
28:40
I go to 20? Is that right? I think I go to 20. So 30 minutes left, cool. Plenty of time, so history. So same kind of example. We're going to tweak the requirements a little bit. I need to be able to purge and query over historical documents.
29:02
So things in the past that have happened may be up to a given point, because governments tend to have these regulations where you have to or you can't keep data for longer than seven years, or for longer than three days, or something like that. And I still need to be queried efficiently by this thing, by either direct match
29:22
or over a range. So how do we model that? And again, here's three approaches, and again, there are more, depending on the use case. You might come up with something I'm not talking about here at all. The first one will look remarkably, if not exactly, like the last one.
29:41
Bucketing, again, fan out on write with buckets. Okay, and then a fixed size array, which is kind of windowed, and then bucket by date with TTL. I'll show you what that means later. So here's bucket by number of messages. It's the same thing we just talked about. So recipient, created, and then we've got messages in there. Excuse me. And then here's a query by date.
30:02
So find me all the messages sent to Joe, which were sent greater than some date or greater than or equal some date. And you can do that here. And you can also remove elements at this point. So find me all the documents with Joe,
30:20
and pull out all the messages in that messages array that were sent greater than or equal to this date. And so we can do that atomically as well.
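A hedged sketch of that date query and the atomic pull, reusing the bucket schema from earlier (field names assumed; flip the comparison operator depending on whether you are selecting recent messages or purging old ones):

```javascript
var since = new Date("2014-01-01");

// Find Joe's buckets that contain any message sent on or after the date.
db.inbox.find({ recipient: "joe", "messages.sent": { $gte: since } });

// Atomically pull the matching messages out of those buckets
// ($lt instead of $gte would purge everything older than the date).
db.inbox.update(
  { recipient: "joe" },
  { $pull: { messages: { sent: { $gte: since } } } },
  { multi: true }
);
```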
30:43
So what are the problems with this? Well, we might actually hit the wrong bucket, because if we're removing things, we might start inserting back into the wrong bucket, the same kind of problem we would have with changing a bucket size after we started. And we need to handle removing empty documents, right? If we go and remove all the messages in a bucket, we should probably remove the whole bucket. And so we'd have to handle doing that. And then we've got some wasted space.
31:02
So MongoDB will try to reuse that space. It'll try to reclaim it and use it again. But over time, some of this will just amortize out, and you'll need to go and tell MongoDB to kind of compact itself. And that's okay. There's a command for it, and that's why it's there. But if you don't, if you can build a schema
31:20
that doesn't need that, it's a lot better. So a fixed size array. You can think of this kind of like windowing, right? I want to keep only the most recent 50 messages, okay? And so here's my message. And I want to say, hey, find Bob's bucket.
31:40
And I want you to push on all these messages. And as you push them on, I want you to sort each by their sent date and keep only the most recent 50, okay? So every time Bob receives a new message, we're going to move this window along, and we're only going to keep the most recent 50. It make sense? Seems pretty straightforward.
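A minimal sketch of that windowed bucket, assuming the same message shape as before:

```javascript
var msg = { from: "joe", sent: new Date(), message: "Hi!" };

// Push the new message into Bob's single bucket, keep the array sorted by
// sent date, and slice it down to the most recent 50 elements.
db.inbox.update(
  { recipient: "bob" },
  {
    $push: {
      messages: {
        $each:  [ msg ],
        $sort:  { sent: 1 },
        $slice: -50          // a negative slice keeps the last (newest) 50
      }
    }
  },
  { upsert: true }
);
```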
32:03
The problem is that I have to know, I have to be able to know how big that bucket needs to be based on my retention period. So if Bob only receives one message every day, and I have to keep messages for 50 days, then I know my bucket size is 50, right?
32:22
But if Bob receives messages, sometimes he receives 10, and sometimes he receives two, and sometimes he doesn't receive any, then I actually don't know what the optimal bucket size is for the retention period I'm going for, right? So that doesn't really work out that well. If you can, this works out really nicely, and it's very efficient, and it's very good,
32:41
because you'll have one bucket per person, and you know there's at most 50 messages there, and they're all the recent ones. So it could work out really nicely. Or perhaps you remove messages from that thing every time Bob reads them, right? This is a good model for that, too. But if you really have a retention period,
33:01
SOX or whatever, then you may have a problem with this. So what's going on here? So we have this feature in MongoDB called TTL Collections. TTL stands for Time to Live. And Time to Live Collections,
33:20
in this particular example, we're gonna create this bucket in here and have a created date, and then we're gonna create an index on it that says, hey, here's an index, and I want you to expire the document, that this index is pointing at after that many seconds, and that many seconds equals one year, unless there's a leap second in that year,
33:41
and then you have to deal with leap seconds. But other than that, this works out pretty well. And what MongoDB is gonna do is it's basically like this cron job that we kind of are doing on your behalf, that says, hey, go and find me all the documents that are expired and delete them out, right? So we're gonna start pushing,
34:01
you can push messages on there, and the key here is that we're gonna push on, we're gonna have a bucket per day. So the first message, is this Joe? Yeah, the first message that Joe receives in a day, we're gonna create a bucket, and if he receives another message that day, we're gonna use the same bucket. And we're gonna use the same bucket the whole time for that particular day. And at the end of the day, we stop using it.
34:22
The next day, we use a different bucket. And if Joe doesn't get any messages on a given day, he doesn't get a bucket for that day. But what's gonna happen is MongoDB at the end of a year or whatever time frame you specify, is we're just gonna delete those documents as they move out, right? Does that make sense? I actually think that's a really nice feature. And it works really well.
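A sketch of the TTL index and the per-day bucket, with assumed names (the one-year figure ignores leap years, as mentioned):

```javascript
// Expire bucket documents roughly one year after their created date.
db.inbox.createIndex({ created: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 365 });

// One bucket per recipient per day: the first message of the day upserts
// the bucket, later messages that day land in the same one.
var today = new Date();
today.setHours(0, 0, 0, 0);

db.inbox.update(
  { recipient: "joe", created: today },
  { $push: { messages: { from: "bob", sent: new Date(), message: "Hi!" } } },
  { upsert: true }
);
```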
34:41
So our retention period's a year, set it to a year, and we know that we don't have any data in here older than a year, or at least in this particular collection. Questions about that? Yep.
35:06
Sorry, the question is, does the expire-after-seconds correlate to the date that created is? And yes, absolutely. You wouldn't be able to create this index, like, we'd refuse to create the index if that wasn't a date or a timestamp.
35:21
You can't create it on a string because none of that makes sense. Expire Bob after, whatever, yeah. So yeah, we're keying off the date that that thing is, which also means, by the way, that you could come back and mutate that date. So if you need, so maybe it's not created date, maybe it's a last modified date. Delete all the documents that haven't been modified for a year.
35:41
Every time I modify this thing, I go ahead and update that modified date. Well, MongoDB won't expire that document, because now the modified date's newer and we'll keep pushing that window, okay? Cool? Okay. Oh, this thing again. Yes, that, we just talked about that.
36:02
Okay, next example. So index attributes, right? So users want to put metadata and they want to do random things and query random things on our collections and how do we index and how do we do that kind of thing. So, oh, man, why do I keep doing this? Oh, there's one more.
36:23
So we need to store random numbers of random types of things, you know, metadata for files. And the example is files: we want to store a file's created date and its size and maybe some tag or tags for it and those kinds of things. And we need to be able to query these files, or the users want to query these things
36:41
on arbitrary metadata kind of stuff, range-based or equality. And we want to be efficient. I mean, obviously we want to be efficient. So here's two approaches and there's more. I would, if this is something that your company needs or you need,
37:01
these are good approaches. And there are other ones for different types of needs. I would encourage you to go search faceted search in MongoDB. And there's some really good blog posts written by some of the MongoDB engineers on how to do that. So, but here are a couple of those. So attributes is a sub-document, right?
37:21
This is natural and makes a lot of sense. So we just put them in this attribute thing. And so attribute.type is text and size is 64. Or its type is binary and its size is 256. And so we're going to kind of store those. And then we'll create an index for every attribute that we want to search on. So we'll create an index on that attribute.type
37:42
and an index on that attribute.size and then we can find things based on it. And we can actually find things based on both at the same time if we want. MongoDB does support index intersection queries starting with the 2.6 version. So you can put them on both and we'll find you all the documents
38:01
that match both, hopefully utilizing the indexes appropriately. So this works out pretty easily. But every attribute has to have an index. How many of these do I have? Just two. Okay, every attribute needs an index. So if the users all of a sudden are creating these kind of on their own,
38:21
they're ad hoc adding attributes in, which is entirely possible with MongoDB, you would have to have some form of I got an algorithm to create an index every time a user has some whim that they want to add into documents. So this works well if you have a finite set, a small number of these things.
38:40
There is a limit to the number of indexes you can put. You probably won't hit it. And if you are hitting it, you shouldn't be hitting it. So rethink your schema at that point. Yeah, but so if this is small and the users aren't changing things, this works pretty well. And this actually probably works the best. But that's not always true.
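A minimal sketch of the sub-document approach just described (collection, field names, and values are illustrative):

```javascript
// Metadata as an embedded sub-document, one index per searchable attribute.
db.files.insert({
  filename: "report.txt",
  attributes: { type: "text", size: 64, created: new Date() }
});

db.files.createIndex({ "attributes.type": 1 });
db.files.createIndex({ "attributes.size": 1 });

// Equality and range queries can each use their own index
// (and, from 2.6 on, potentially an index intersection).
db.files.find({ "attributes.type": "text", "attributes.size": { $gt: 32 } });
```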
39:01
Oh no, there is one more. Yeah, lots and lots of indexes if you have lots and lots of metadata stuff. So we can also store it like this. So attributes is objects in an array. So instead of attributes being a document with type and a size and created, attributes is now an array. And it's an array of little documents where type is this and size is this
39:20
and created is this. And we can arbitrarily add in new things. The users can do this ad hoc, put in new, I don't know, age or whatever things they actually want, or color. They can put a color in for their document. I don't know why that would matter. And we only need one index on this. We index that attributes array.
39:41
And this goes back to what I was mentioning earlier about multi-key indexes. We're going to pull out each one of these documents and effectively index them all by themselves and point them back at the same parent document in the index entry. So we only need one index for this and users can go and go have fun
40:00
and do whatever they want. So we only need one index and we can still support range queries. But we can only use that index once per query right now. So if they want to query on multiple things, then they would have to be very careful about the order that they put things right, the cardinality of the things that would come back.
40:23
We would want to, at the beginning, the first part of our query would want to exclude as many documents as possible so that the rest that we're having to scan over are much less. Does that make sense? So if we can eliminate, if we want to query on two things and we can eliminate 99% of the documents
40:40
by putting this condition first, then we only have to scan 1% of the documents to match the rest. Whereas if you did it in reverse order, you might only match 1% of the documents and you have to scan 99% to match the rest. So put the thing that has the highest cardinality first and then you can exclude things later. Make sense? Cool. Questions about this?
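A sketch of the array-of-attributes variant with its single multi-key index (names are assumptions; how well the range form uses the index will depend on your server version and data):

```javascript
// Metadata as an array of small documents; one multi-key index covers them all.
db.files.insert({
  filename: "report.txt",
  attributes: [
    { type: "text" },
    { size: 64 },
    { created: new Date() }
  ]
});

db.files.createIndex({ attributes: 1 });   // one index entry per array element

// Equality: match a whole array element.
db.files.find({ attributes: { size: 64 } });

// Range: match an element whose size falls in a range.
db.files.find({ attributes: { $elemMatch: { size: { $gt: 32 } } } });
```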
41:01
That's the end of attributes. Yes. So we do have a query planner. And you can see how queries are going to happen. In fact, I can show you that. db.foo... you can call .explain() on queries.
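For instance (collection and predicate are placeholders):

```javascript
// Show how the query would run: which index was used, if any,
// how many objects were scanned, how long it took, and so on.
db.foo.find({ recipient: "Jack" }).explain();
```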
41:21
You'll get back a whole bunch of stuff. How many objects were scanned, what index we hit. In this case, that cursor is basic cursor. That means we didn't use an index and that's because I didn't create an index. But what cursor is used, how many times we yield, how long it took, right? And so we get this. The MongoDB query planner is based on try and see.
41:43
So if you have four possible indexes that might match a given query plan, what we're going to do is we're going to try them all. And whichever one finishes first is the one we're going to remember and use for the next 100 or 1,000 queries, at which point we'll then do this try again.
42:01
Real simple. It makes sense. We just rewrote our query planner in 2.6, so it's much better. It's only going to get better now that we've refactored all that code. Any other questions? Cool.
42:22
Let's do another one. Oh, no. That's the wrong... Go to that. Let's do another one. So multiple identities in this case. What time is it? I have 15 minutes. You know what? I'm going to skip this one,
42:41
because it's not that interesting. Let's do... Oh, geez. See, now y'all can go back and look at this presentation and just pause it on every slide that I just skipped over and see for yourself how to do that one. I want to do this one
43:00
because it's much more interesting to think about. Categories and tags. Switching off, let's think about books. We've got a title for some books, and we need to put them in a category. This book on MongoDB is in the category MongoDB. And then we also have this other collection
43:21
called categories, which has this category, IDs, MongoDB, and its parent is databases. Presumably, databases would have a parent of programming, and programming would have a parent of awesome knowledge or whatever. And let's say I want to find all the documents or all the books
43:42
that are in the category of programming, right? So how would I do that if I modeled it this way? This is a very relational way of modeling this. I would actually have to go and find all the documents that match the programming category. Then I'd have to go find all the categories whose parent was programming
44:00
and find all the books that match that. Then I'd have to go find all the categories whose parents had parents of programming and find all the books that match those. And I'd have to keep going until I found no more categories to use. So it's this recursive problem that's completely awful to do,
44:20
and you would never want to do that. Does that make sense? Why that's bad? Okay, cool. So how else can we do that? Anybody have a guess? Yeah, denormalize, absolutely. Yeah, denormalize, right? So let's go ahead and store categories as an array of things, okay? So MongoDB has a parent of databases, and databases has a parent of programming, right?
44:44
So if I want to find all the books about databases, then I can just find all the books that have a category that includes databases in it, right? And that's really straightforward and really simple to do. And again, utilizing that multi-key index. If we index categories,
45:01
this thing is really, really fast. If databases, for instance, existed in more than one category, then I might actually find books that I didn't actually want, because databases is ambiguous as to its hierarchy. So we can solve that too.
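Before getting to that fix, here is the denormalized-array approach just described, sketched in shell syntax (names and values are illustrative):

```javascript
// Denormalize the ancestor categories into the book document itself.
db.books.insert({
  title: "MongoDB: The Definitive Guide",
  categories: ["MongoDB", "Databases", "Programming"]
});

db.books.createIndex({ categories: 1 });   // multi-key index over the array

// "All books under Programming" becomes a single indexed lookup,
// with no recursive walk over a separate categories collection.
db.books.find({ categories: "Programming" });
```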
45:22
And we can solve that by using a regular expression. Yeah, I know, don't laugh at me. And we can store the category as a path, like a full path, right? As just a simple string. So programming, delimiter, databases, delimiter, MongoDB.
45:41
And then we can query using a regular expression here. And we'll be able to use an index on this. As long as you use that hat, the caret, as long as you anchor the regular expression at the beginning, we can use an index to still match on it. If you don't use the hat, then it's going to be a big table scan. And that's bad if you have millions and millions of documents.
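A sketch of that materialized-path variant (delimiter and names are assumptions):

```javascript
// Store the full category path as a single string...
db.books.insert({
  title: "MongoDB: The Definitive Guide",
  category: "Programming/Databases/MongoDB"
});

db.books.createIndex({ category: 1 });

// ...and query with a regex anchored at the start (the caret), which can
// still use the index as a range scan. An unanchored regex cannot.
db.books.find({ category: /^Programming\/Databases/ });
```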
46:01
So I think that's it. Schema design's a little different in MongoDB, and we were a little past basic here, but I think you'll understand, and I'm happy to answer some questions. I think we've got time. I think we've got about 10 minutes.
46:22
But basic design principles are kind of going to stay the same, right? We don't want you to denormalize when it's going to cause integrity problems, but when it doesn't, it may make sense. We want you to care about your queries based on how they're getting used. So focus on that. Focus on how your application uses and manipulates this data and base your schema around that.
46:43
And then you can rapidly kind of evolve the schema as you need. So that's it, all I have. So do we have questions? Oh, yeah, here's some resources. So we've got user groups all over the world. So find one near you and go to it.
47:02
There's usually lots of things going on in there and lots of good presentations, either by local people or we bring MongoDB people in. It's like, who's from Oslo? Oh, y'all should come to the Oslo MongoDB User Group on Tuesday night. So I'm speaking there about schema design, not the same talk, a different talk. So you can come and see that one, too.
47:22
And anyway, so there's probably one where you are, too. Google Groups, right? So mongodb-user out there has lots of great questions. And it's likely that you're not the first to have this question, and it's probably been answered, either on that user group or on Stack Overflow somewhere. It's the engineers at MongoDB, as well as some of our commercial support engineers,
47:44
who are monitoring these places. So you'll see me and lots of other MongoDB employees answering all these questions, as well as some community members. But we actually review all the community answers to make sure they're correct. So go out to these places and ask your questions or search them, and you'll probably find an answer to your question.
48:04
And so that's that. So what questions do y'all have for me? We've got about 10 minutes, and you can leave if you want, or we can talk, and I've got business cards and whatever if you want to email me. Yeah.
48:20
Yeah, so the question is, if you denormalize data and you want to update it later, what do you do? Right, so you've got a lot of options. So we obviously support, you can update many documents at once. So if you don't update things all that often, then go ahead and denormalize, because running a massive update once a year
48:40
or twice a year or whatever, just isn't that big of a deal. If you're concerned about it, if you're concerned about your referential integrity and that kind of stuff, not referential integrity, if you're concerned, yeah, well, whatever. If you're concerned about the integrity of your data, then don't denormalize, right? We don't want you to do something that doesn't make any sense. But I would assume that data that changes often is less,
49:09
I don't even know what I'm trying to say, that happens less frequently than data that's probably more stagnant. Most data is probably more immutable than you think it is. But yeah, it's a consideration you have to think.
49:22
Is it okay? How often am I going to have to make corrections? How often am I going to have to deal with this? Absolutely something you want to think about. But denormalization is good. It makes things faster. In fact, in relational databases, you have to denormalize to make things faster, too. So it's not like a new concept. It's not something we're prescribing
49:40
that other people haven't seen. Anything else? Yep, bucket example. Yeah, for particular messages from one sender.
50:01
Yep, so the question is, if I want to find a particular message from a given sender, yeah, so you can query, I think the history example showed a query that was reaching into the message and finding only the ones that were sent after a certain date, right? So you can definitely do that. You can even project those.
50:23
Oh, yes, and you could get the bucket back and then look through them, too. But you can actually project only the message. So if you have an array in the document, you can say, hey, find me this document and within this array, only return me the array entries that actually match an additional predicate.
50:42
And so find me only the ones that were sent after this date. So a bucket might have 50 in it, but I may only get the bucket back with, like, three messages in it, because only three of them were sent after a certain date. So you can do a projection like that, absolutely.
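One way to express that in shell syntax, reusing the bucket schema from earlier (note that this projection operator only returns the first matching element; returning all matching elements needs the aggregation pipeline in later versions):

```javascript
var since = new Date("2014-01-01");

// Match Joe's buckets, but project back only a message sent after the date
// rather than the whole messages array.
db.inbox.find(
  { recipient: "joe" },
  { messages: { $elemMatch: { sent: { $gte: since } } } }
);
```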
51:09
Right, you can, absolutely. Yeah, I got it. So the question is, we do support multi-updates.
51:21
So that's what he was asking about earlier. So once a year or twice a year, we go back and blow out a big update. And the question is, is it atomic across all of those updates? And the answer is no, it's not. So yes, it's possible, it's entirely possible that if you've given 100 documents and you run an update that's gonna hit 100 documents,
51:42
that it works for 98 of them and it doesn't work for the other two. It's entirely possible to happen. So when you're doing updates like this, it's important to think about idempotency, right? Can I just, like if I sense a failure, can I just rerun that thing and pick up the last two that maybe failed
52:00
or something like that, right? So if you have idempotent updates, then you're basically good to go. $inc, right, $inc, like increment a field, that's not an idempotent update because if you run it again, then you've now incremented a document twice on somebody. All right, $set, we have a $set operator. That is an idempotent operation. So follow up, yeah.
52:35
It is, yeah. In 2.6, so the question is, now that I've updated 100 and only 98 succeeded,
52:41
how do I know which two didn't? In 2.6, you can, we'll tell you that. We'll actually tell you the documents that failed and then you could retry them in your application. Prior to 2.6, you would have to then go re-query the database and ask. So a strategy for updating might be
53:01
to add a flag in there. Hey, this thing was updated. Like increment this field and set this thing and also create this field updated equals true, right? And so you can keep running that update while updated is false or updated doesn't exist and once updated is true for all the documents, you can go back and remove the updated field, right?
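A sketch of that flag strategy (collection and field names are made up; the point is that re-running it after a partial failure cannot double-increment anything):

```javascript
// Re-runnable bulk update: documents already touched are skipped.
db.stats.update(
  { updated: { $ne: true } },
  { $inc: { hits: 1 }, $set: { updated: true } },
  { multi: true }
);

// Once every document has updated: true, remove the flag again.
db.stats.update({}, { $unset: { updated: "" } }, { multi: true });
```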
53:22
So you can make these things idempotent with a further query. Cool, you had a question? I believe Twitter does.
53:42
I don't work at Twitter and I don't actually know. But I would imagine that that would be a strategy that they have. It really depends on your use case. So I couldn't answer you based on that right now. And I'm saying like Twitter does authoritatively, completely unauthoritatively, they might.
54:01
Let me clarify, they might do that. I don't actually know. But remember space is cheap, right? Space doesn't cost a lot of money. So duplicating data is not gonna break the bank on that. But needing to spin up more servers to handle the load because I didn't do that might cost a lot more.
54:23
So you just have to evaluate it, benchmark it, think about it. So if you're doing, back to the update, if you're doing a bunch of updates and you wanna update many documents at once,
54:43
I think you can tell it. You can say, hey, go ahead and continue on error. Like even after you've errored, go ahead and keep going. And it'll finish that out. And then you can figure out which ones fail afterwards. Or you can say stop on first failure. I think you get the choice. I don't think.
55:01
You get the choice. Anything else? Well, thank you all. I appreciate your time. I've got cards. And I'll be up here. And I'll be walking around all day. So come talk or whatever. Thank you.