Introduction to Python and MongoDB
Formal Metadata
Number of Parts | 118
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/44847 (DOI)
EuroPython 2019, 24 / 118
Transcript: English (auto-generated)
00:03
Good afternoon, everybody. I'm director of developer advocacy at MongoDB, and what that means to most of you people in the room is that I can give you money for your meetup events, so if you follow me on Twitter, I'm happy to help fund meetup events across
00:24
Europe, especially around Python. MongoDB doesn't have to feature, but we like to talk a little bit about it, and that's part of the package. We do talks like this all over Europe, and I did one at EuroPython a couple of years ago, and this is just a basic
00:41
introduction for people who are not familiar with MongoDB. I'm assuming some familiarity with Python. I've been using Python since 2006. I actually built a business in 2006, a SaaS backup service built around Python and Django. I dabble in it. I'm not an expert,
01:00
but I can show you some tricks with MongoDB that should make your life a bit easier. So for those of you who aren't familiar with MongoDB, it's a document store. That means it stores JSON documents. You should be familiar with JSON if you've done any kind of programming at all. One of the excellent things about programming with Python
01:21
and MongoDB is JSON documents look exactly like Python dictionaries, and so there's a one to one equivalence between the types that you're going to use directly in Python and the objects that you can store straight into the MongoDB database. This means there's no wrapping classes, there's no extra code, you can just use the Python objects directly,
01:44
and that turns out to be very straightforward. If you were using C sharp or Java, you'd have to put a wrapping object around it, and that makes life just a little bit more noisy in terms of the code base. We get similar benefits from Node.js, but you know,
02:00
my personal view of Node.js is that it's just too crazy for me. I can't work that stuff out at all. Callbacks are for somebody else who's a better programmer than me. So when you store JSON in a database, it's obviously not pure textual JSON. That would be too expensive in terms of encoding and decoding. You're not going to do that,
02:21
and of course you've got to stuff this stuff onto a wire and send it over a network to the database in the first place. That's what the Python driver is designed to do. So we actually encode it. We encode type and length information, so we understand that things are strings or nested documents, that things have arrays nested inside them,
02:41
that things are integers, and we also understand geospatial coordinates, although I'm not going to demo geospatial queries today. And the way it's stored is BSON. BSON is our own standard. The BSON spec is an open standard, and you can contribute to it. It's effectively a binary encoding of the JSON representation. So if I show you JSON
03:01
hello world, you can see there is a size at the start, then there's a type field, in this case it's a string, so it's a two, and then there's the field name and the field value, hello and world, and then it's terminated by a null at the end. Obviously these can get more complicated as you get nested documents and arrays and
03:21
so on. I just want you to understand that this is what's being sent across the wire, but that's the last time you need to worry about BSON in your life as a Python programmer. For you, you're going to be working with dictionaries and arrays. They're the key types you're going to use. Now, when you download MongoDB, as I hope all of you will do, you're just going to install MongoDB on your local desktop.
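The byte layout just described for the hello/world document can be reproduced by hand. This is a minimal sketch using only the standard `struct` module; the helper name `encode_string_doc` is made up for illustration and it handles string values only. In practice the PyMongo driver does this encoding for you. One detail the description skips: a BSON string value also carries its own 4-byte length before the characters.

```python
import struct


def encode_string_doc(doc):
    """Encode a dict of str -> str as BSON bytes (string fields only)."""
    body = b""
    for name, value in doc.items():
        raw = value.encode("utf-8") + b"\x00"    # string value, null-terminated
        body += b"\x02"                          # type byte: 0x02 means UTF-8 string
        body += name.encode("utf-8") + b"\x00"   # field name as a C string
        body += struct.pack("<i", len(raw))      # string length, counting its null
        body += raw
    body += b"\x00"                              # document terminator
    size = 4 + len(body)                         # total size includes the size field
    return struct.pack("<i", size) + body


encoded = encode_string_doc({"hello": "world"})
print(len(encoded), encoded)
```

The result is 22 bytes long: the size at the start, the type byte, the field name, the value, and the closing null.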
03:44
On Windows it installs as a service; on Linux and variations like OS X, you can just run it as yourself. I'm actually running it on this desktop myself, although it's a Windows desktop, just so you can see the log files. But the production deployment of MongoDB is what we call a replica set.
04:03
Obviously, with a single-node database like MongoDB, if the node dies, the data goes away. We don't have an independent log file on single-node databases the way you do in a relational database. Instead, we keep whole replicas of the data. When you build a replica set, you build
04:21
three instances of MongoDB running in three separate nodes with three separate disks. Then you join them together into a replica set. I'm showing a replica set with three members. They're here. You could have up to 50 members in a replica set. Again, not many people do that. That's a lot of nodes to manage.
04:41
Writes go to the primary. The primary is designed to take the write operations. You cannot write to a secondary, but you'll see how that's managed in a couple of seconds. Once the writes are made to the primary, they're then effectively copied to the secondaries via an internal log called the oplog.
05:00
The cluster, the replica set, manages all this for you. All you've got to do is set it up. You'll see our MongoDB Atlas database in the cloud will do all this for you with one click. You don't have to do this. If you want to set this up locally, there is a Python package called mtools that will allow you to set up a complete replica set, again, with one line of code on the local desktop you're using.
05:24
It's all set up. All the writes are going to the primary. Everything's fine, but what happens if you have a failure? In normal operation, the three nodes connect to each other using a heartbeat that tells each node the other nodes are alive, and with replication streams from the primary to the secondaries.
05:43
Remember, you read and write from the primary for consistent read and write activity. If you don't mind a little bit of eventual read consistency, you can choose to read from a secondary. Of course, this works out great if you're running a distributed cluster where you've got users in New York and London and Basel,
06:03
and you want to be able to eliminate that wide area loop when you're reading from data. In those situations, often it doesn't matter that you don't have the most up-to-date records. In normal operation, this all works away. Again, we all set this up and manage it for you, but imagine the virtual machine that's running
06:23
the primary dies for some reason, and we know nodes die. That's just the life of a node. Somebody kills it. It runs out of disk space. It gets jammed because some process runs and uses too much CPU. Well, eventually, the heartbeat that the other two nodes send is going to not get a response.
06:41
At that point, the remaining nodes, and there must be a majority of the nodes remaining for this to happen, will have an election. It's like any election. You can't have an election if you don't have a majority of people participating. So, the election effectively says which node has the most up-to-date data, and it uses, for those of you who are into this kind of thing,
07:03
a consensus algorithm, which is a small variation on the Raft consensus algorithm. So, they eventually decide, and it depends on the size of the cluster and how many nodes. It takes a couple of hundred milliseconds, and eventually, they will elect a new node, and that node will spring
07:21
to life as a new primary. What happens to the clients while this is happening? Well, the Python driver, the client library that you're going to use, collaborates with the cluster to ensure that no writes are lost. So, even if there's a write in flight to the old primary, and it dies, the driver will restart that write automatically
07:40
and recover and make that write idempotently on the new primary. So, there's no way to lose data. If, on the other hand, you lost the whole cluster, if your data center went on fire, which happens, I know, not very often, but it does, what will eventually happen is your client will time out; it will get a server timeout. That happens after 30 seconds by default.
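The failover behaviour described above maps onto a few connection-string options. The sketch below uses only the standard library to build such a string; the host names and the replica set name `rs0` are hypothetical, and `has_majority` only illustrates the "more than half" rule, not how the server actually runs an election. `replicaSet`, `readPreference`, `retryWrites` and `serverSelectionTimeoutMS` are standard MongoDB URI options.

```python
from urllib.parse import urlencode


def has_majority(alive, total):
    """An election needs more than half of the replica set members."""
    return alive > total // 2


def replica_set_uri(hosts, replica_set, read_preference="primary",
                    timeout_ms=30000, retry_writes=True):
    """Build a MongoDB connection string for a replica set."""
    options = urlencode({
        "replicaSet": replica_set,
        "readPreference": read_preference,         # e.g. secondaryPreferred
        "retryWrites": str(retry_writes).lower(),  # retry a write after failover
        "serverSelectionTimeoutMS": timeout_ms,    # the 30 second default
    })
    return "mongodb://" + ",".join(hosts) + "/?" + options


uri = replica_set_uri(
    ["node1:27017", "node2:27017", "node3:27017"],  # hypothetical hosts
    "rs0",
    read_preference="secondaryPreferred",
)
print(uri)
print(has_majority(alive=2, total=3))
```

With PyMongo installed and the cluster reachable, a string like this could be passed to `pymongo.MongoClient(uri)`.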
08:01
And then, you've got to do something else. My database is down. What do I do in that situation? But with multiple nodes and the ability of the primary to move to the nodes that are alive, that's going to happen much less often than in a single-node database. Now, those of you who have been watching astutely will realize
08:20
that if all the reads and writes are going to a single node, there's a point at which that node is going to saturate. Recovery happens, we're going to go through that. Will it scale? How does it jump from, like, a single node to scaling to millions of nodes, scaling from millions of transactions, millions of users?
08:41
Well, it can, and we do it with sharding. Effectively, we run a partition of the data on multiple replica sets, and we use a separate set of daemons called mongoses to route the reads and writes to those nodes. Going into sharding is a whole talk in itself. Trust me, it works, but you don't have to trust me.
09:03
You can trust Fortnite. Fortnite has 25 million users, 10 million active users. They run on a sharded MongoDB cluster in Atlas. And trust me, that's a high workload. So, enough of the slideware; let's actually see it in action.
09:23
So, I've got an IPython here. I actually wrote from datetime import datetime because every time I did this demo, I forgot to import datetime, and later on, it would call some class. So, I want to actually connect to a database. I've got a server running here. So, it's kind of shrunk down here.
09:42
So, there's a database server running there. It's kind of getting connections or whatever. So, I'm going to do client. I need to import the PyMongo library first. So, import PyMongo, and then I just need to make a client object.
10:02
The PyMongo library handles connection pooling, security, encryption if you're using it, and also encoding and decoding into BSON. That's why you don't need to worry about it. So, we do PyMongo dot mongo client,
10:22
and if you look at the client object, it's actually pointing at local host 27017. By default, all clients, all servers start on port 27017. So, as long as you don't specify anything, it's all going to work, right? If you want to change that, you can. It's a minus port argument.
10:40
Now, I still need to make a database, so I'm going to look at the client object, and I'm just going to make a database off the client. This is the beauty of MongoDB. It's very simple to spin up new databases. So, we're going to make a database called test, and then we're going to make a collection. Think table when you think collection, but instead of rows, we're going to have JSON documents.
11:02
So, we're going to make a collection, and we're going to call that test as well, because we just have no imagination in MongoDB. And so, if I look at the collection, you'll see it's got a client, and it's got a dict, and it's got two things,
11:20
a database and a collection. Now, inserting into a collection is just as easy as making a dictionary. So, I can do collection dot... Let's get that typo out. Collection dot insert one, and I just put in a dictionary. I'm going to make an explicit dictionary here,
11:40
call it username, and the username is jdrumgoole. And we close the curlies. Nobody should do a demo in Python without IPython, because it closes your curlies for you. And it's going to return a result object, which basically says that's been inserted. And we can actually look at that object. We can do collection dot find one.
12:05
And we just do one, because we know we've only got one element in this collection, and there it will pop up, and you can see the username as well. But what's this strange ID? This is added by the Python client library, and it's created on the client, so there's no round trip to the server. The object ID is your unique primary key
12:23
for every object that you insert. So, I can redo this insert now. Let's just do it again. And if I do a find rather than a find one, because find will return all the documents in the database, I'm going to get a cursor.
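The steps so far (insert a dictionary, read one document back, then ask for everything) can be sketched as one small function. It is written against a `collection` argument so it does not connect to anything by itself; the assumption is that you pass in a PyMongo collection such as `pymongo.MongoClient()["test"]["test"]`, which points at localhost:27017 by default. The function name is made up for illustration.

```python
def insert_and_read(collection, username):
    """Insert one document, then read it back two ways.

    `collection` is assumed to be a pymongo Collection object.
    """
    result = collection.insert_one({"username": username})
    # inserted_id is the _id that was generated on the client side
    first = collection.find_one({"_id": result.inserted_id})
    # find() returns a cursor; list() drains it into a list of dicts
    everything = list(collection.find())
    return result.inserted_id, first, everything
```

Calling it twice against the same collection would show two documents in `everything`, each with its own `_id`.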
12:41
That's kind of annoying, because I'm in the Python shell. Normally, for other programmers, they have to use the Mongo shell, which is a Node.js environment. Because Python has this cool REPL, and because Shane Harvey and co., our developers, have done such a great job of building the library, we can look at this stuff, and we get a cursor. So, I mean, I could do something really ugly,
13:02
like get the cursor, and the cursor is an iterable, so I could do X dot next. And, you know, you'd get the object. Like, that's a drag. Who wants to do that? So, I wrote a Python package just to make this stuff easier,
13:22
called mongodbshell. Now, actually, I want to get a particular object from this shell.
13:41
So, I'm going to do from mongodbshell import MongoDB, and this is like a super object that saves you some typing. Now, let's get rid of those curlies. So, now I'm going to make a new client object using MongoDB,
14:02
and I can pass it in the database I want and the collection I want with the right quotes. And now, if I look at the C object, it's kind of similar, but it's just a bit more legible. It just shows you the URL and the database name and the collection name.
14:21
It doesn't put all the other cruft in there. And now I can do C dot find, and bingo, we get, I've done some inserts already, we get the objects displaying. So, mongodbshell effectively wraps the cursor and displays it out, and it does pagination and all that nice stuff. It adds those line numbers. They can be turned off.
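The idea of wrapping a cursor so that it prints a numbered page at a time can be sketched in a few lines. This is not how the package from the talk is implemented; `show_page` is a hypothetical helper, and a plain Python iterator stands in for the cursor so the example runs without a server. A PyMongo cursor can be iterated and advanced with `next()` in the same way.

```python
from itertools import islice


def show_page(cursor, page_size=3, line_numbers=True):
    """Pull up to page_size documents from an iterator and format them."""
    page = islice(cursor, page_size)
    if line_numbers:
        return [f"{n}: {doc}" for n, doc in enumerate(page, start=1)]
    return [str(doc) for doc in page]


# A list iterator stands in for the cursor returned by find()
docs = iter([{"username": "a"}, {"username": "b"}, {"username": "c"}])

first = next(docs)  # the by-hand approach: one document per call
print(first)

for line in show_page(docs, page_size=2):
    print(line)
```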
14:41
We're going to play around with both of these in the future. So, that's find in the shell, but let's have a quick look at what an actual program looks like. So, here's the simplest Python program you can write for MongoDB. It literally just pings the server with an is master command. So, I'm going to run that,
15:02
and you'll see it'll just produce this JSON document, and it's basically systems information about the server. Is master is true, local time, et cetera, et cetera, et cetera. So, you just get a bunch of data about the server.
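A minimal sketch of that ping, assuming `client` is a `pymongo.MongoClient`: the function runs the `isMaster` command against the admin database and keeps a few fields from the reply. The function name and the choice of fields are illustrative. Recent server releases also offer a `hello` command for the same purpose.

```python
def server_info(client, keys=("ismaster", "localTime", "maxWireVersion")):
    """Run isMaster and return a few fields of the server's reply."""
    reply = client.admin.command("isMaster")   # reply behaves like a dict
    return {key: reply.get(key) for key in keys}
```

With a local server running, `server_info(pymongo.MongoClient())` would return a small dictionary describing that server.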
15:20
That's the simplest program you can write, but what if I wanted to make a lot of documents? So, I'm going to build this other program and this is again, let me just put this in a slightly more legible format. Enter distraction free mode.
15:40
So, this is going to make a pile of documents. So, it really just does the imports. We get datetime. We've got random. We're going to make random strings. We're going to make an article which just returns a document which eventually has a bunch of random fields in it, an ID, a title. Note that the ID field which was generated automatically previously
16:01
can also be overridden to insert your own unique ID. Because underscore ID is always indexed, it means you can save yourself an index if you already have a unique ID for the database. And we're going to make user which does the same thing. And then we've got PyMongo client. We're making a database called EP 2019.
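A document generator along the lines described might look like the sketch below. The field names, the ID format `article_<n>` and the function names are assumptions for illustration; the original program is not shown in the transcript. The point it demonstrates is that supplying your own `_id` replaces the automatically generated one.

```python
import random
import string
from datetime import datetime, timezone


def random_string(length=10):
    """Return a random lowercase string of the given length."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))


def make_article(n):
    """Build one article document with a caller-chosen unique _id."""
    return {
        "_id": f"article_{n}",                   # hypothetical ID format
        "title": random_string(20),
        "body": random_string(80),
        "postdate": datetime.now(timezone.utc),  # stored as a BSON date
    }


article = make_article(100)
print(article["_id"], article["title"])
```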
16:21
I'm going to just change that because I don't want to overwrite the demo database I'm going to use. And we're going to drop the users and articles collections. And then we're just going to insert them. And we do something here that is an important performance improvement. Note here, instead of doing insert one, we're building lists of users,
16:42
appending them here, appending make users to the articles. And we're just inserting 500 at a time. Why do we do that? Well, because each insert requires a round trip to the server. And if you have to do one round trip for every document, that's going to take a long time. With this model, you can insert
17:02
any number you want. The nice thing about insert many is you can give it as long a list as you want, and it will internally chunk it if it's exceeding the chunk size that it can use. The default internally is about 1,000 documents. I'm setting it to 500 here so you can see feedback as you insert. Clearly, if you're going to run an insert with a million documents,
17:21
you're going to stall your Python program for quite a while. There are async versions of this library, which, again, I'm not going to get into. If you look at Motor, it allows you to do all this stuff asynchronously. So having run this, we're just going to spin this up and put it back into non-distraction mode.
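The batching idea (collect documents into a list and send 500 at a time instead of one per call) can be sketched as follows. `insert_in_batches` is a hypothetical helper; `collection` is assumed to be a PyMongo collection, whose `insert_many` accepts a list of documents.

```python
from itertools import islice


def insert_in_batches(collection, docs, batch_size=500):
    """Send docs to collection.insert_many() in lists of at most batch_size."""
    iterator = iter(docs)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        collection.insert_many(batch)   # one call per batch, not per document
        total += len(batch)
        print(f"inserted {total} documents")
    return total
```

Because `docs` can be any iterable, a generator works too, so a million documents never have to sit in memory at once.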
17:42
And then we're going to just run that again, and we're going to change it to many docs. And it's going to chug away, whacking those articles in. And it'll bail away doing that. So I'm just going to connect to a similar structure I built already. So let's just create our new database articles,
18:06
which is off DB articles. And we're going to create users, which is a typo there. Users, which is off, again,
18:23
the database and its users. And now, of course, if we do articles.find, we're going to get a cursor. So we're going to... Oh, let's just clear that up. Articles.find, we're going to get a...
18:41
Oh, let's control-C out of that. Let's go back to that. Articles.find. We get our cursor back. So we're going to make our articles view using MongoDB shell.
19:04
And we're going to make that a MongoDB. And it's going to be EP 2019. And the collection is articles. And the same thing with users.
19:36
And now we can do articles view.find,
19:48
and we'll get piles of documents. Now, that's insert, that's query. What about update? How do we update an article? Well, there's an update one call we can use.
20:01
So we can do articles dot update one. And again, we can look for an article, because I know the underscore ID, article 100. And then we're going to do an update operation. And we're going to do a dollar set. And a dollar set just basically sets a field,
20:23
or if the field doesn't exist, it adds it. And we're going to add a comments field. And that's going to be an empty array. We're going to close all our curlies. And one more.
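Written out in full, the update being typed here takes two dictionaries: a filter that picks the document and an update document holding the `$set` operator. The sketch assumes `articles` is a PyMongo collection and that `article_id` is whatever value was stored in `_id`; the function name is made up. Note that `update_one` returns a result object with counters rather than the changed document, which is why nothing is visible straight after the call.

```python
def add_empty_comments(articles, article_id):
    """Add a comments array to one article, or reset it if it already exists."""
    result = articles.update_one(
        {"_id": article_id},          # filter: which document to change
        {"$set": {"comments": []}},   # update: set (or create) the field
    )
    # matched_count is 0 when no document has that _id
    return result.matched_count, result.modified_count
```

A return value of `(0, 0)` is a quick sign that the filter matched nothing, for example because the code is pointed at the wrong database or collection.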
20:41
And again, we've got this problem. We can't see what we've inserted. But that's okay. For now, we're not going to do too much about that. We're going to have a quick look at article view. Find one. And we're going to look at underscore ID.
21:08
And we can see, oh, did I do the insert?
21:20
Hmm. Let's try that again. Let's have a look at articles.
21:42
Ah, it's still on test. That explains it. Articles equals DB EP 2019 articles.
22:07
Okay.
22:29
Just got to set that database up again. And then we will get to articles.
22:43
And then we can do articles dot update one. And we want to pick our ID, which is unique. Articles 100. Close curly. And then we're going to do the dollar set operation as before.
23:02
We're going to set the comment field to be an empty array. Close the curlies. Close the other curlies. Close the bracket.
23:24
I just need to, okay, let's just rename this article to articles view. Make this articles the actual object on the DB.
23:43
And then we will rerun articles update one. We get a result object back. And then we can do articles view dot find one.
24:08
And we'll add underscore ID is article 100.
24:28
Thank you. And you'll see the post date there. Where is that comments field? So the comments should be in a field there. I'm not going to try and do this anymore. So now what we do is we will then append to that comments field.
24:43
So we can say articles dot update one. And we will open the curly again. Underscore ID.
25:02
Article 100. And now we're going to do a push, which actually appends to the array. Dollar push.
25:21
And we would do open a comment. And then we can just effectively specify another document here. So it would be like user name is Joe. And body is hello.
25:51
Close those curlies. And now if we look at articles view, we just want to make sure it's the right database.
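The append being typed above uses the same `update_one` call with a `$push` operator, which adds one element to the end of an array field. A minimal sketch, assuming `articles` is a PyMongo collection; the function name is hypothetical, and the comment fields mirror the ones spoken in the demo.

```python
def add_comment(articles, article_id, username, body):
    """Append one comment sub-document to an article's comments array."""
    result = articles.update_one(
        {"_id": article_id},
        {"$push": {"comments": {"username": username, "body": body}}},
    )
    return result.modified_count   # 1 if the article was found and changed
```

For example, `add_comment(articles, article_id, "Joe", "hello")` would append `{"username": "Joe", "body": "hello"}` to that article's comments.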
26:05
Articles view dot find. Look at the actual object we've created, which is articles 100 on there.
26:31
Underscore ID, articles 100. Okay.
26:42
Let me just try.
27:07
Okay. I seem to have messed something up there. But effectively you'd get a push which adds a comment to the database. And that would give you an update operation. Simple deletes are, of course, just articles dot delete one.
27:23
And you'd specify an object as an underscore ID. And just articles 100.
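As a sketch of that delete, again assuming `articles` is a PyMongo collection and `article_id` is the stored `_id` value: `delete_one` takes the same kind of filter as `find_one` and returns a result object whose `deleted_count` says whether anything was removed. The function name is made up for illustration.

```python
def delete_article(articles, article_id):
    """Delete at most one article by _id and report how many were removed."""
    result = articles.delete_one({"_id": article_id})
    return result.deleted_count   # 0 if nothing matched, otherwise 1
```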
27:49
Article.
28:13
And it would just return the delete result from delete one. So what have we shown today? Well, you've seen how to create a database and a collection.
28:26
You've seen how to read one and read many documents. You've found out how to insert a single document and insert many. And also how to update those documents. Although the update demo didn't quite work out. That's just my fat finger typing.
28:42
I haven't shown you how to check performance and add indexes because we kind of ran out of time. But I'm going to show you one more thing. You can build all of these clusters inside the cloud in MongoDB. You don't have to build these clusters manually. You can set this stuff up in cloud dot mongodb dot com. You can create your own clusters at any level and any scale.
29:04
It's pay as you go. You can turn the clusters off and save yourself money. And if you use the code hack 100, you can go to the top level organization when you create it. It starts you down low once.
29:22
You have to go up to the top level to get to the billing tab. You go into the billing tab. You scroll down. And you apply credit. And put your hack 100 in there. You get $100 of free credit to use MongoDB. That's MongoDB in a nutshell. Free to use.
29:40
And the easiest way to use MongoDB in the world is with Python and PyMongo. Okay, thank you very much.