Data Analysis and Map-Reduce with mongoDB and pymongo

Formal Metadata

Data Analysis and Map-Reduce with mongoDB and pymongo
Title of Series
Part Number
Number of Parts
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date
Production Place
Bilbao, Euskadi, Spain

Content Metadata

Subject Area
Alexander Hendorf - Data Analysis and Map-Reduce with mongoDB and pymongo The MongoDB aggregation framework provides a means to calculate aggregated values without having to use map-reduce. While map-reduce is powerful, it is often more difficult than necessary for many simple aggregation tasks, such as totaling or averaging field values. See how to use the built-in data aggregation pipelines for averages, summation, grouping, and reshaping. See how to work with documents and sub-documents, and how to group by year, month, day, etc. This talk will give many (live) examples of how to make the most of your data with pymongo in a few lines of code.
Keywords EuroPython Conference EP 2015 EuroPython 2015
Thank you all. I'm from Mannheim in Germany. I'm a developer at my own company, an organizer as well, and a speaker sometimes. I'm a MongoDB trainer for the local community and for the Python community. I also served as a co-chair of the program work group building this conference, so if you have any comments or suggestions about what we could do better, just grab me and talk to me.
The topics of my talk today: what is MongoDB, and what is the MongoDB aggregation framework? We're going to cover the pipeline model, the pipeline stages, and map-reduce in MongoDB. But first: who knows MongoDB? OK, awesome. Who is actually working with MongoDB and Python? And who has worked with the MongoDB aggregation framework? OK, cool.
So let's bring everybody up to speed on document-oriented databases in 15 seconds. Basically we work with JSON. A document is a JSON-like object that we can store in the database, with no schema enforcement. A collection is basically just a collection of documents, and multiple collections make up a database. So it's a pretty simple concept.
The MongoDB aggregation framework was introduced about four years ago with MongoDB 2.2. It's a framework for data aggregation: documents are processed through a multi-stage pipeline that gives back aggregated results. It's designed to work in a straightforward way, so there are no joins like in SQL. Technically it looks like this:
We have our documents, we do a match, which is a find, and we get fewer documents because we found a subset. Then we do some grouping and get even fewer. But I thought that's a little bit technical,
because I think it's more like a relay race. You know relay races: the runners pass the baton to each other, and this is basically how the MongoDB pipeline works. We have a match, which is a find, so we say: please, little doggy, get the baton and pass it on to the smart folks doing something smart, which could be a grouping. Then we want to present our data nicely, so we pass it on to the projection stage. Let me tell you a little bit about the stack and the dataset we're going to work with. I have also prepared some live demos.
This is built with MongoDB 3.0 and the WiredTiger storage engine, which gives us compression. PyMongo is obviously the driver we're going to use. It's the one maintained by MongoDB themselves, it's pretty well maintained, and it's always up to date. We're working on a dataset of 37 gigabytes, which is much smaller once WiredTiger has compressed it.
As you might remember, IT is my second career. I used to be in the record industry, and then in a tech start-up business in the nineties, so everything I do in IT is still very close to working with music. We have a project, which is a whole chapter of its own, with a collection of playlists from the iTunes Music Store. A playlist is basically all the information on a release that you can find in the iTunes Music Store. It's a set of playlists that appeared in some chart somewhere in the world over the last three years, and this is what it looks like. It's a lot of data, so I've narrowed it down to what we're going to work with today, just to give you an impression of our document structure.
Basically this is the document. First there is all the release information: the album artist, the album name, when it was released, how much it costs in the store. Then there are the children, which is what we call subdocuments. It's a list of objects, and those are the songs, each and every song.
I was wondering which artist to use for the examples, because it's really hard to choose a music artist that makes everybody happy, and I thought I'd find something neutral. So I chose Taylor Swift. It's not because I like her music; I actually don't know any of her songs. It's because of the post that made Apple pay artists for the trial period of the new Apple Music service, so artists get paid for people trying the new service. She did a good thing, and I think that's worth mentioning.
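To make the document structure described above concrete, here is a rough sketch of what one release document might look like as a Python dictionary. Every field name and value below is an assumption made for illustration; the talk does not spell out the real schema.

```python
from datetime import datetime

# Hypothetical release document. All field names and values are
# placeholders, not the schema used in the talk.
release = {
    "artistName": "Taylor Swift",
    "name": "Example Album",               # album title (placeholder)
    "releaseDate": datetime(2010, 1, 1),   # stored as a real date, not a string
    "price": "USD 9.99",                   # currency prefix and amount in one string
    "songs": [                             # list of subdocuments, one per song
        {"title": "Song A", "duration": 215},
        {"title": "Song B", "duration": 198},
    ],
}

# With a running server, such a document could be stored through PyMongo:
#   from pymongo import MongoClient
#   MongoClient().music.releases.insert_one(release)
```

The later sketches reuse these assumed field names so that the pipelines stay comparable.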
So this is our first pipeline. I've put some comments into the slides to make it easier. A pipeline is basically constructed as a list. The $match is just a find, as you might remember. We match on the artist name, so we're looking for the artist stored in a variable I've already set. Then we do a $project, which is basically a select. All we want to do is get all the releases by Taylor Swift, sorted, and print them out.
Let's switch to the notebook. I just import PyMongo and open a simple database connection to a database which is live on this laptop. This database is only a few gigabytes, so it's small; usually we work with a lot more in MongoDB. We have 1.3 million playlists, and that's tens of millions of songs covered in our dataset. Usually you could also just query with find, and that way we found 931 releases by Taylor Swift. With the aggregation framework, and this is the same code I showed you on the slide before, we do a match, which is the find, and a project, which here is basically just a renaming of the attributes. Then we sort by release date in ascending order, and this is what it looks like. You see many releases; she's quite a busy artist.
What else can we do? We can extend our pipeline and do groupings. Now we want to group everything by name, which is the album name. As you can probably see, we have some duplicates in our dataset. That's because albums are released by different companies or sold in different editions; they are different products although the content, the music, is the same. So we pass in the name as _id. In the grouping operator, the thing we want to group by is mandatory and is always called _id. We also want to count how many elements there are. We don't have a count operator in the aggregation pipeline, so we're just summing 1 for each and every document in our group. Then we re-project and sort, just to make it nicer, and this is what we get.
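The two pipelines described above can be sketched as plain Python lists of stage documents. This is a minimal sketch, not the code from the slides: the field names `artistName`, `name` and `releaseDate` are assumptions, and `collection` stands for any PyMongo collection object.

```python
def releases_pipeline(artist):
    """$match is the find, $project the select/rename, $sort the ORDER BY."""
    return [
        {"$match": {"artistName": artist}},
        {"$project": {"_id": 0, "album": "$name", "released": "$releaseDate"}},
        {"$sort": {"released": 1}},  # 1 = ascending
    ]


def count_by_album_pipeline(artist):
    """Group by album name and count documents by summing 1 per document."""
    return [
        {"$match": {"artistName": artist}},
        {"$group": {"_id": "$name", "count": {"$sum": 1}}},
        {"$project": {"_id": 0, "album": "$_id", "count": 1}},
        {"$sort": {"count": -1}},
    ]


def run(collection, pipeline):
    """Execute a pipeline and read all results into a list."""
    return list(collection.aggregate(pipeline))
```

With a live connection this might be called as `run(db.releases, releases_pipeline("Taylor Swift"))`, where `db.releases` is a hypothetical collection name.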
You still see some different variants of the same album. OK, so now we've used our pipeline and we've got the results, and I think the output is so nice that I want to print it again. And oops, what happens? I just wanted to print it again, and it doesn't give us any result back. That's the first trap I want to show you.
The MongoDB aggregation framework returns a cursor. The cursor just points to the data in the database that you get back from the aggregation. Once we call list on it, the cursor is exhausted: all the data is read in and printed, and then it's gone. So you can't just use it again unless, of course, you stored the results in a new variable.
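The cursor trap described above can be reproduced without a database, because an exhausted cursor behaves like any exhausted Python iterator. In this sketch a plain iterator stands in for the cursor that `aggregate()` returns; the sample rows are placeholders.

```python
def read_twice(cursor):
    """Consume a cursor twice to show that the second pass is empty."""
    first_pass = list(cursor)   # reads every result and exhausts the cursor
    second_pass = list(cursor)  # nothing left to read
    return first_pass, second_pass


# Stand-in for collection.aggregate(pipeline); the rows are placeholders.
fake_cursor = iter([{"album": "A"}, {"album": "B"}])
first_pass, second_pass = read_twice(fake_cursor)

# The fix is to keep the materialized list in a variable and reuse that.
results = first_pass
```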
These are our aggregation stages, and they look a lot like their SQL brothers, shown on the right-hand side. $match is a WHERE or HAVING, so that's pretty obvious. $sort is ORDER BY. $limit also needs no explanation, I think. $project is SELECT, and we can also use it for renaming, as I've just shown. $group is GROUP BY. $unwind we're going to go into; it's a little bit like a join, but not really.
$redact we're not going to cover. $out is basically just an operator that writes the results of the aggregation back to a new collection, to storage.
To make things a bit easier for you to follow, next we're going to work with the artist name and the album title, and we're making our pipeline a little bit bigger, with the group operator I've already shown you. The next step is: how can we work with lists of subdocuments? As you see, we have a list here with all the songs on that album, and we want to do something with it. The natural thing would be to query the database and just iterate over the list in Python. But that's quite an expensive task, and we can do it in the database. There's the $unwind operator. From my experience it is a really confusing step at first sight, because it's quite unusual.
So I think just showing it is probably the best explanation. What $unwind does is take each document, and for each object in our subdocument list it creates a new document. That sounds like a really expensive operation, but I can assure you it does a really good job and is not expensive at all. It's really handy.
As on the slide before, we do a match and then the unwind. So now we have all 1,332 songs by Taylor Swift, and we can immediately work with them here in our grouping stage. As you see, the path has not really changed, although this used to be a list before. There's no need to do anything like iterating over an index. I have prepared a little bit more here to show what's happening.
We just get one release, limited to one playlist, and we unwind, and then I'm just renaming with $project. And this is what we're getting: all of these are single documents, new documents created on the fly, that we can immediately work with. So it's an unwinding of the data. It's a bit of an unusual concept, but it's really simple.
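The effect of $unwind described above can be imitated in a few lines of pure Python, which may help to see what the server does. This is only an illustration under assumed field names (`songs`, `title`, `artistName`), not how MongoDB implements the stage.

```python
def unwind(documents, field):
    """Imitate $unwind: emit one copy of each document per list element."""
    for doc in documents:
        for item in doc.get(field) or []:   # missing or empty lists yield nothing
            copy = dict(doc)
            copy[field] = item              # the list is replaced by one element
            yield copy


# The same idea as an aggregation pipeline (field names are assumptions).
unwind_pipeline = [
    {"$match": {"artistName": "Taylor Swift"}},
    {"$unwind": "$songs"},
    # After $unwind the path "$songs.title" points at a single song.
    {"$group": {"_id": "$songs.title", "count": {"$sum": 1}}},
]
```

Note that the path into the subdocument stays the same before and after the unwind, which is the point made in the talk.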
Back to the stages, let's take another one, a quite obvious one. We want to do a $sort, which is also an obvious pipeline stage. I want all the releases sorted by count descending and by release ascending. This looks really simple, and it returns something like this. But something's going wrong, because I want my count descending over all the data, and then sorted by release in ascending order, and our result is sorted by release first and then by count. I can assure you MongoDB is not broken. It's actually a trap in the pipeline, because we passed in a Python dictionary, and a Python dictionary is of course unordered. So the sort stage gets passed something which is not ordered, and the results get unpredictable. But there is a solution: I encourage you to always use SON from bson, or you can also use collections.OrderedDict, so that the keys are passed in an ordered fashion. Otherwise the sort order won't really work. And now it works.
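A minimal sketch of the fix described above, using `collections.OrderedDict` from the standard library. The field names `count` and `released` are assumptions. Since Python 3.7 plain dictionaries also keep insertion order, so the trap mainly affects older interpreters such as the ones in use when this talk was given; an explicit ordered mapping still makes the intent clear.

```python
from collections import OrderedDict

# Sort by count descending first, then by release date ascending.
# The order of the keys is the order of the sort criteria.
sort_stage = {"$sort": OrderedDict([("count", -1), ("released", 1)])}

# With PyMongo installed, bson.son.SON does the same job:
#   from bson.son import SON
#   sort_stage = {"$sort": SON([("count", -1), ("released", 1)])}
```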
So this is just a really quick introduction to the other stages. There's $limit, which I mentioned before, and $skip, which just skips documents.
$out writes the results to a new collection, and there's $geoNear, which gives you all the documents around a geospatial point. There's $redact as well; some people use it to restrict document access on a document level, if I remember correctly. These are the stages, this is our relay race. Now we have some data, but what we can do so far is very limited; it's just mangling the data around a little bit.
So we have, for example, the $min and $max operators, and the $first and $last operators. For this we're again searching for an artist, and we're using the release date. Luckily the release date is actually a date in the dataset and not a string.
We're building a new pipeline and doing a little grouping: we want to find out what the earliest release of Taylor Swift is, and what the latest release is. So we do a $group, and the _id is empty. That's our new primary key, so how can it be empty? It can, because we want to group over the complete result set. So we can just put None in there. There's no need to look for an attribute that is the same on each and every document; just leave it at that. We introduce two new attributes, min date and max date, and it's a really simple operation: we just pass the path to the release date to $min and to $max, then project, and here we go.
Taylor Swift has been around since 2006. She started releasing stuff really early, and she's been around for a while. So what are $first and $last good for? We have $min and $max, and they would work here as well, but $first and $last are a little bit different and can save you some extra calculations. What's the difference to our previous pipeline? We do the match, then we sort by release date, then we do the grouping, and our instruction is $first and $last. What $first and $last do is very simple: $first gets the first document of the group and $last gets the last document of the group. So there's no need to iterate over the complete set within the group to find min or max values. You just say: OK, I want this document and I want that document, and what's in the middle I don't really care about. So this can be really efficient. And, as expected, we get the same result. With dates we can do even more, because we have some nice date operators.
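The two approaches described above might look like this as pipeline definitions. This is a sketch under the assumption that the release date is stored in a field called `releaseDate` and the artist in `artistName`.

```python
def min_max_pipeline(artist):
    """Earliest and latest release date via $min and $max over one big group."""
    return [
        {"$match": {"artistName": artist}},
        {"$group": {
            "_id": None,  # None groups the complete result set
            "minDate": {"$min": "$releaseDate"},
            "maxDate": {"$max": "$releaseDate"},
        }},
        {"$project": {"_id": 0, "minDate": 1, "maxDate": 1}},
    ]


def first_last_pipeline(artist):
    """Same answer via $sort followed by $first and $last."""
    return [
        {"$match": {"artistName": artist}},
        {"$sort": {"releaseDate": 1}},  # $first/$last depend on this ordering
        {"$group": {
            "_id": None,
            "minDate": {"$first": "$releaseDate"},
            "maxDate": {"$last": "$releaseDate"},
        }},
    ]
```

The second variant only gives meaningful results because the $sort stage comes before the $group stage.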
For the next pipeline the basis is again the release date. We do a grouping, and I want to have the releases grouped by year. I'm a fanboy by now, having talked so much about her, and I want to know everything, so I want to see how many releases she had in which year. We've extended our _id a little bit: now it's a document using the $year operator. We pass in the release date, which is a date, and $year just grabs the year from our date. That is what we want to group by.
And it works like this. It's really easy, and it makes sense if you have data with timestamps. Look at the count: you see she's releasing a lot every year. But what if I want to dig even deeper? I'm interested in the releases by year, and also in the release count for each and every month in which she released something. Of course, as you can guess, when there's a $year operator there's also a $month operator.
The next thing is that the _id, which is our primary key, can also be a compound key. So we have a compound key of year and month, and we do the same as before: we get the year and the month, and the _id now has two attributes. And that's the result. I haven't checked, but there's probably hardly any month in which she didn't release anything.
There are a lot more date operators, as you can guess: there's also $second, $minute and many more. I'm not able to cover them all in this short talk.
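Grouping by year and month with a compound `_id`, as described above, could be sketched as follows. The field names are again assumptions. Sorting on the whole `_id` subdocument avoids relying on key order in the sort specification, since embedded documents compare field by field in their stored order.

```python
def releases_per_month_pipeline(artist):
    """Count releases per (year, month) using date operators in the _id."""
    return [
        {"$match": {"artistName": artist}},
        {"$group": {
            "_id": {
                "year": {"$year": "$releaseDate"},
                "month": {"$month": "$releaseDate"},
            },
            "count": {"$sum": 1},
        }},
        {"$sort": {"_id": 1}},  # compares year first, then month
    ]
```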
But I'm getting a little bit bored with Taylor Swift now, and it's early in the morning and we need some attention. So I thought about who else could join. I just googled "Taylor Swift nemesis", and Google says it's Katy Perry.
So let's bring in Katy Perry. It's really easy: we can extend our match operator. I've stored both artists in this nemeses variable, and we can search with the $in operator. It's basically the same as `in` in Python, so I don't think it's really necessary to explain it to you.
Of course, now we have a competition, and I'm wondering who delivers more music for my 99 cents: Katy Perry or Taylor Swift. So I want to see the average playtime of their songs. I'm interested in who gives me longer songs that I can enjoy for my money. This is not a good metric, it's just my example.
As you see, we now have three unwind stages. First we unwind the songs, then we unwind each song's offers, and within the offers we unwind the assets, which is where the playtime is stored. We want to access this information, and that's why we have a pipeline with one, two, three unwinds. Then we can group by the artist name and take the average of the duration. We sort, and I'll show you what it does.
And something's wrong. Something's broken, and I won't waste time fixing this live. But I can explain it: basically it's the same as what we did before with the release year and counting the releases.
The next step would of course be getting the playtime. I hope my notebook didn't break. Yes, it didn't break. Again we group on the playtime and project it. As a result we can see that Taylor Swift gives us about 10% more music than Katy Perry for 99 cents. That's a really easy operation.
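The comparison described above combines $in with three nested unwinds. The sketch below assumes a nesting of `songs`, then `offers`, then `assets`, with the playtime in a field named `duration`; the real paths in the dataset are not given in the talk.

```python
nemeses = ["Taylor Swift", "Katy Perry"]

avg_playtime_pipeline = [
    # $in works like Python's `in`: match any artist from the list.
    {"$match": {"artistName": {"$in": nemeses}}},
    # One unwind per list level on the way down to the playtime.
    {"$unwind": "$songs"},
    {"$unwind": "$songs.offers"},
    {"$unwind": "$songs.offers.assets"},
    {"$group": {
        "_id": "$artistName",
        "avgDuration": {"$avg": "$songs.offers.assets.duration"},
    }},
    {"$sort": {"avgDuration": -1}},
]
```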
Now something more challenging. I'm interested in getting the prices of the releases of these artists. It's scraped data, so it's not as clean as I would wish. The price is stored with the currency in front of the amount, all in one attribute, and I'm interested in getting the prices in US dollars. That's easy to solve with string operations and compare operations.
To speed up a little bit: we have a project phase, so just focus on the important things here. We have is_usd, which is a comparison on the lowercased first three characters of our price. Those characters give us USD or some other currency, or numbers, or whatever, and the comparison basically asks: is this USD? Pretty obvious. Then we just match for 0. The compare operator, $cmp, gives us 0 back when the values match, -1 when the first value is lower, and 1 when the first value is higher. You could also use $eq; there's also an equal operator, which gives us a boolean, true or false. Then we group, and we can do something else as well: we can push every release we find in our group into a new list, with the prices and the products. That's similar to an append in Python.
And here you go. You see Katy Perry's products and their prices, and the next object is Taylor Swift, and there really is a list of the products.
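The price filter described above could be sketched like this. It assumes the price is a string such as "USD 9.99" in a field called `price`; that format and all field names are assumptions. `$substr` is the operator available in MongoDB 3.0, the version used in the talk.

```python
def cmp(a, b):
    """Pure-Python equivalent of $cmp: -1 if a < b, 0 if equal, 1 if a > b."""
    return (a > b) - (a < b)


usd_prices_pipeline = [
    {"$match": {"artistName": {"$in": ["Taylor Swift", "Katy Perry"]}}},
    {"$project": {
        "artistName": 1,
        "name": 1,
        "price": 1,
        # Lowercase the first three characters and compare them with "usd".
        "is_usd": {"$cmp": [{"$toLower": {"$substr": ["$price", 0, 3]}}, "usd"]},
    }},
    {"$match": {"is_usd": 0}},  # 0 means the strings are equal
    {"$group": {
        "_id": "$artistName",
        # $push appends one entry per release, like list.append in Python.
        "products": {"$push": {"name": "$name", "price": "$price"}},
    }},
]
```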
There are really a lot more operators. Just as a hint: if the aggregation framework is useful for you, go to the MongoDB documentation. It's very well written, it has a lot of examples, and it's quite easy to get into.
One more operator: the $map operator. As you can imagine, it's basically the same as map in Python. What do we do here? We're getting the ratings count, which is how many users had given stars to the product when we scraped the data. We want to adjust it a little bit, because management came back and we need to make it look a little bit nicer. That's not something we really want to do, but it's a good example. So we pass the ratings count into $map as the input, and then we can reuse each value from our list: we just add ten to each and every object we find in the list, and then it's applied. Another thing which is probably not obvious: we cannot use a $sum operator directly on a list, as we can in Python. We have to unwind first. So each and every value in the list we have mangled is unwound into a new document, and then we can do a simple grouping as we've done before. And there we go.
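A sketch of the $map, $unwind and $group combination described above, together with the equivalent plain Python. The list field is assumed to be called `ratings`; the sample numbers are placeholders.

```python
ratings_pipeline = [
    {"$match": {"artistName": {"$in": ["Taylor Swift", "Katy Perry"]}}},
    {"$project": {
        "artistName": 1,
        # $map applies an expression to every element, like Python's map().
        "adjusted": {"$map": {
            "input": "$ratings",
            "as": "r",
            "in": {"$add": ["$$r", 10]},
        }},
    }},
    # In the MongoDB version used in the talk, a list has to be unwound
    # before its values can be summed in a $group stage.
    {"$unwind": "$adjusted"},
    {"$group": {"_id": "$artistName", "total": {"$sum": "$adjusted"}}},
]

# The same computation in plain Python, on placeholder numbers.
python_total = sum(map(lambda r: r + 10, [1, 2, 3]))
```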
Which brings us to the next thing: you can also do map-reduce in MongoDB. How many of you have worked with map-reduce? Who knows map-reduce? And who does not know map-reduce? OK, so let's bring everybody up to speed.
Map-reduce is basically a really simple concept. We have all these documents and we map them. Mapping means we go through them and emit key-value pairs. In our word example, we want to find the most popular words in album release titles, so we just emit the words. Then they go through the reduce phase, which reduces them: it just sums up the counts. So it's a pretty easy operation. For the query we will just use our $in operator again.
You might wonder why we would use map-reduce in MongoDB, since we have a great aggregation framework, with substrings and with so many things we can do to lists. Most of the time you can work with the aggregation framework, and in most cases it's faster and more accessible. However, map-reduce gives us more power, because you can pass in JavaScript, and then you can build much more complex queries. In our example, splitting up the release titles into words to count them would be quite challenging in the aggregation framework.
Let me show you how it works. For the map-reduce we're using Code from bson, which lets us pass in text that is read as a JavaScript function. The map function takes the name of the release, lowercases it and splits it, so it's a really simple operation. Then we check for some punctuation and remove it. It's probably not the best way to do this, it's just for the simple example. If we actually find a word, we emit it. So if there's something like "Teenage", we emit ("teenage", 1), and if an album is called "Teenage Dream", we also emit ("dream", 1).
We send that to the reducer, and the reducer has pretty simple code: we take all the keys and just sum up how often each keyword actually appears.
And here we get a result. I'm going to do a little bit more, because I want to remove stopwords. That's not really part of the aggregation framework, it's just to make the output nicer, and that's why I've added a natural language toolkit to remove stopwords. These are the most popular words in Katy Perry's and Taylor Swift's albums. So you see they probably have a younger audience: "dream", "teenage", "fearless", "speak", "kissed" and so on.
So this is really easy. If we had enough time, we could also run this operation across the complete dataset to see what the most popular words are in album releases sold in the iTunes Music Store.
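The word count described above might look roughly like this. The JavaScript strings are a sketch of map and reduce functions of the kind the talk describes, assuming the title lives in a field called `name`; the Python functions below them mirror the same logic so it can be tried without a server. This is not the code shown in the talk.

```python
import re
from collections import Counter

# JavaScript map function: lowercase, split, strip punctuation, emit (word, 1).
MAP_JS = """
function () {
    var words = this.name.toLowerCase().split(" ");
    for (var i = 0; i < words.length; i++) {
        var word = words[i].replace(/[^a-z0-9]/g, "");
        if (word) { emit(word, 1); }
    }
}
"""

# JavaScript reduce function: sum up the emitted counts per word.
REDUCE_JS = """
function (key, values) {
    return Array.sum(values);
}
"""


def map_title(title):
    """Python mirror of MAP_JS: yield (word, 1) for each cleaned word."""
    for raw in title.lower().split():
        word = re.sub(r"[^a-z0-9]", "", raw)
        if word:
            yield word, 1


def word_counts(titles):
    """Python mirror of the reduce step: sum the counts per word."""
    totals = Counter()
    for title in titles:
        for word, n in map_title(title):
            totals[word] += n
    return totals

# With PyMongo 3.x the strings could be wrapped in bson.code.Code and passed
# to collection.map_reduce(map, reduce, out); that helper was removed in
# PyMongo 4, where the mapReduce database command has to be issued directly.
```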
To finish, I want to give you some best practices and tricks you can use with the aggregation framework.
First of all, think about indexes, especially if you do queries. If you have a huge dataset and you don't have an index, MongoDB does a collection scan, and on a slow computer that takes time and is probably frustrating for you. Think about getting your data into RAM. There's a touch command in MongoDB, which does something similar to the UNIX touch: you touch the data and it fills up RAM with as much as it can get from the system, so that you can work with your data in memory. The result can only be 16 megabytes, because that's the maximum we can store in a BSON document, but mind you, 16 megabytes is huge. Each stage also has a limit of 100 megabytes of RAM, but you will hardly ever hit it. You can also improve your queries: there's a nice explain operation, which gives you information about what MongoDB is doing when you run your query. You get results that show how many documents were scanned, whether indexes were used, and how it sorted. Then you can say: OK, I can optimize my work just by introducing a new index.
Hardware is of course very important, especially RAM: more is better is the simple equation here, mind you. For performance, cloud computing makes it really easy. You can also think about working on a dedicated server. If you have a replica set and a write-heavy database, you can just make another copy, work locally, and do your aggregations freely, without having to worry about the traffic on your database.
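A few of the practices above map onto PyMongo calls. The sketch below assumes `db` is a PyMongo database object and uses the hypothetical field name `artistName`; the collection name is passed in by the caller. The `allowDiskUse` option is one way around the per-stage memory limit.

```python
def index_artist(collection):
    """Create an index on the field used in $match so it can avoid a scan."""
    return collection.create_index("artistName")


def explain_pipeline(db, collection_name, pipeline):
    """Ask the server how it would execute the pipeline instead of running it."""
    return db.command("aggregate", collection_name, pipeline=pipeline, explain=True)


def run_with_disk(collection, pipeline):
    """Let stages that exceed the memory limit spill to temporary files."""
    return list(collection.aggregate(pipeline, allowDiskUse=True))
```

Keeping the $match stage first in a pipeline is what gives an index a chance to be used at all.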
The last slide has some useful resources. As I mentioned, MongoDB has very good documentation. I also want to mention someone who works at MongoDB, also as a trainer; she always has some tricks and tips. Thank you.
We don't have enough time for Q&A, so if you want to ask Alexander questions, you'll find him around the venue; just ask him any time.