Aggregation with noSQL
Formal metadata
Number of parts | 170
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use and modify the work or its content for any legal and non-commercial purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in modified form, only under the terms of this license
Identifiers | 10.5446/50610 (DOI)
Transcript: English (automatically generated)
00:03
All right, how's everyone doing? Feeling a little refreshed after lunch? So this is NoSQL aggregation. So before I get started, just a couple of comments. It's hard for me to see out there, so if you have any questions, I might not see your hand, so just shout out.
00:23
Also, I've been told recently that I speak very quickly for an American, so if I speak a little too quickly, and I realize English isn't necessarily your first language, please feel free to just ask me to slow down. So a quick show of hands, just to get a sense of the room.
00:42
How many people are still using relational databases? Wow, that many. How many people are using NoSQL databases? How many people are at least aware of NoSQL databases? Okay, so just to kind of set the tone, in this talk I'll introduce some of the NoSQL concepts,
01:03
so if you're kind of new to NoSQL, in particular document databases, I'll cover the very basics of it, but it won't be sort of from the ground up, it'll just be here's everything you kind of need to research further to get to the aggregation stuff.
01:24
So a little bit about me. My name is John Zablocki, I'm the director of IT for a company called EF High School Exchange Year, out of Boston. I have a blog at dllhell.net. I'm soon going to be launching a blog on NoSQL at nosqlu.com, short for NoSQL University.
01:42
I wrote a book. You can get my code on github.com/jzablocki. On Twitter, I'm jczablocki, and if anyone has any questions after this, you can go to about.me/johnzablocki, and feel free to drop me a line. I'm happy to answer any questions or follow up on anything that we didn't get to cover in this talk.
02:02
Quick look at the agenda. So like I said, we'll start off with some of the basics, remind ourselves what aggregation is in SQL, which everyone here presumably knows how to do things with SQL aggregation clauses and operators and things. We'll talk a little bit about the basics of document stores, and then we're gonna spend some time on MapReduce.
02:20
How many people in the room have used MapReduce before? How many people who haven't used it have heard of it? So if I say MapReduce, what's the first thing you guys generally think of? What was that? RavenDB? Hadoop, anyone? So Hadoop tends to be the big database that everyone associates with MapReduce.
02:41
That's for kind of big jobs. So we're gonna focus more on sort of real-time MapReduce and real-time aggregation. And then we'll spend most of the time talking about MongoDB and its aggregation frameworks. And then depending on where we are with time, Couchbase will be covered too. Has anyone heard of Couchbase?
03:02
A few people. I used to work for Couchbase, but I also am a Mongo user, so I'm kind of impartial. The other thing I'll mention too is that my assumption in the room isn't that I'm here to sell you on NoSQL.
03:20
I'm kind of hoping most people have already bought into the idea. A few years ago when I started giving NoSQL talks, it was more let me convince you how NoSQL can make your job easier. I'm more than happy to have that conversation with you after the talk, but hopefully most of you have sort of gotten over the fear of a non-relational database and you're ready to move forward
03:40
with something like Mongo or Couchbase. So SQL aggregation is pretty straightforward. We've all been doing it for years. Probably some of us, like myself, more years than I care to remember. You know, if I asked you to count the rows in a table, it's pretty easy to do. Select count(*) from table name.
04:02
Grouping is similarly just as easy. Select column, comma, count of something, and then group by that column. Now these are pretty easy concepts, largely because SQL and relational databases kind of grew up together to perform these tasks. Now it's a little different when we start getting into the document model.
04:22
Document databases were designed for performance, efficiency, flexibility, and all these things, and then over the years, until recently, it felt like aggregation was kind of bolted on. When we look at the new MongoDB aggregation framework, I think you'll see that a lot of thought has gone into that, and it's really a pretty powerful framework that you can more or less achieve everything
04:42
you would with a SQL database. So a little bit about document stores, if you haven't worked with one. So there are three big document databases out there right now. MongoDB is by far the largest. CouchDB is a very popular open source document database,
05:03
and then we have, obviously, Couchbase. So those three databases all are more or less supporting a similar set of features, but there's some big differences, which again, find me after. I'm happy to talk to you about any use cases you might have or questions you might have about which database might fit your needs best.
05:24
In each of those cases, documents are stored as JSON, or BSON in the case of MongoDB, which is binary JSON. Is everyone familiar with JSON? So documents have implicit schemas. That is different from the way a table has a schema.
05:43
It has a set of columns. Every row in the table has the same columns. Each document itself has a set of properties and values, and that schema is implicit to that document, but it has no impact on other documents in the database. So there's no imposed schema at the database level like there is with a relational database.
06:01
There's no referential integrity. This is one of those things that people generally, when they first hear about NoSQL, they say, no way, there's no referential integrity. But something to keep in mind is that as you learn to design documents, that becomes less important. I mean, certainly there are times where you have a natural foreign key in the database, but a lot of times it's just the way you design your documents. You can kind of avoid the problems
06:22
associated with referential integrity not being there. And probably the most important feature of working with a NoSQL database is that you denormalize first. When we design a SQL database, what's the first thing we do? We normalize the heck out of it. And then what happens when it doesn't perform? We try indexes. What happens when the indexes don't work?
06:40
We denormalize the data into a denormalized table. So it's one of these things that when we work with NoSQL, people get uncomfortable with the idea of a denormalized data model. But again, I'm sure, how many people in this room have not denormalized a table or a set of tables? Like I saw one hand.
07:02
So if you're uncomfortable with denormalization, just remember you're already doing it just with a database that wasn't designed to support it. So let's take a look at a very simple JSON document. And for the rest of the talk, I'll work with a very simple model where we have work items and tasks
07:22
associated with those work items. So you can think of it as like a PBI and its tasks if you're into Scrum. But you have work items, which is kind of the master thing that needs to be done. And then tasks are the individual assignments that each developer needs to work on. So in the case here, we have the work item
07:41
which has a description, an owner, and an effort. It has a primary key. So the primary key of this document exists at the root level. And then we have a nested collection. So if this were a SQL database, we'd have a work items table and then we'd have a tasks table that has a foreign key back to work items.
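As a rough illustration of the document shape just described, here is a work item written as a JavaScript object literal. The field names and values (`_id`, `timeRemaining`, `assignee`, and so on) are assumptions for this sketch, not copied from the slide.

```javascript
// Hypothetical work item document: the key lives at the root level and the
// tasks are nested inside the parent instead of living in a separate table.
const workItem = {
  _id: 1, // assumed key name and value
  description: "Design schema",
  owner: "John",
  effort: 13,
  tasks: [
    { description: "Sketch collections", assignee: "John", timeRemaining: 2 },
    { description: "Review with team", assignee: "Molly", timeRemaining: 3 },
  ],
};

// Documents are stored as JSON (or BSON), so the whole parent/child structure
// serializes as a single unit.
const serialized = JSON.stringify(workItem);
console.log(JSON.parse(serialized).tasks.length);
```

In a relational design the two task objects would be rows in a tasks table pointing back at the work item; here they travel with their parent.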
08:02
So I think everyone should be comfortable with that notion. And this is one of those things where you start to get away from the need for referential integrity because you can nest things in a much more natural way. So the parent-child relationship is all part of the same document. Now this won't always be the case when you work with a NoSQL database, but often the right answer is
08:21
to put related items together. All right, before I move on to MapReduce, are there any questions? So just remind me again, how many people have used MapReduce before? So just a few. So key to understanding how you aggregate data
08:43
with a NoSQL database is MapReduce. When we see the aggregation framework in Mongo, they've kind of taken it a step further and made something much more powerful. But generally speaking, MapReduce will work across the three big document databases and the same concepts kind of all come together
09:02
in each of those systems. So a little bit about MapReduce. So the idea was inspired by functional programming languages using the higher-order functions map and fold. A higher-order function in functional terms is a function that either produces a function as output
09:20
or takes a function as input. In the case of map, you write a map function that is applied to each element in a collection or in a sequence. And fold is basically just the idea of taking some recursive data structure and combining it. So if you think of like a tree with keys and values,
09:42
you basically group all the keys together and do some sort of aggregation on the values. Now that's just the sort of basic functional programming definition. Are there any functional programmers in the room? A couple. F#? I'm not a functional programmer, so I don't have much to add to that,
10:00
but I'm always curious. So now when we kind of expand the idea of MapReduce to a NoSQL database system, the MapReduce approach is taken to sort of a distributed level where you have a model for processing very large sets of data
10:20
across a cluster with many nodes, potentially. So the big advantage here is that you can do certain tasks in parallel. Now this is not necessarily 100% accurate across all the systems, but generally speaking, you have a master node that accepts the MapReduce job,
10:41
and it'll take the map function and spread it across the nodes. Each node processes some subset of the sequence, gets the results, and sends them back to the master node, and then the master node is responsible for running the reduce. So you distribute the work across a bunch of nodes
11:00
in the system, send it back to the master, the master does the combination. So to give you a more concrete example of what MapReduce looks like on a JSON document, here's a very simple set of documents. So I took the work items and took them down to just three properties each.
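The distribute-then-combine flow described above can be sketched in plain JavaScript, with arrays standing in for nodes. This is only a single-process illustration under made-up data; real systems differ in how they partition data and schedule the work.

```javascript
// Hypothetical cluster: each inner array is the slice of data held by one node.
const nodePartitions = [
  [{ effort: 5 }, { effort: 8 }],
  [{ effort: 13 }],
  [{ effort: 21 }],
];

// The same map function is handed to every node and applied to each element.
const mapFn = (doc) => doc.effort;
const mappedPerNode = nodePartitions.map((partition) => partition.map(mapFn));

// The master gathers the per-node results and folds them into a single value.
const gathered = mappedPerNode.flat();
const totalEffort = gathered.reduce((sum, effort) => sum + effort, 0);

console.log(mappedPerNode); // [ [ 5, 8 ], [ 13 ], [ 21 ] ]
console.log(totalEffort);   // 47
```

`Array.prototype.map` and `Array.prototype.reduce` are themselves the higher-order map and fold functions mentioned earlier: each takes a function as input.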
11:21
So each of our work items has a description, an owner, and an effort. So three very simple documents. The first thing you do when you write a map function is call emit. Most map functions, when you work with a NoSQL database, will be written in JavaScript.
11:41
So when you write the map function, Mongo and the Couch databases all have this emit method. So what you do is you emit a key and a value pair. So if you want to think of this as a SQL query, this is like saying select owner comma effort from work items.
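To make the emit idea concrete, here is a minimal in-memory sketch. The three documents and the `emit` stand-in are assumptions for illustration; in a real database `emit` is supplied by the engine, and whether the document arrives as `this` or as a function argument depends on the product.

```javascript
// Hypothetical sample documents; the values are made up for illustration.
const sampleWorkItems = [
  { description: "Design schema", owner: "John", effort: 13 },
  { description: "Build API", owner: "Molly", effort: 13 },
  { description: "Write tests", owner: "Molly", effort: 21 },
];

// Stand-in for the database's emit(): collects key/value pairs in memory.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// Map function: the left argument is the key, the right argument is the value.
// Here the current document is passed as `this`.
function mapOwnerEffort() {
  emit(this.owner, this.effort);
}

sampleWorkItems.forEach((doc) => mapOwnerEffort.call(doc));
console.log(emitted);
// [ { key: 'John', value: 13 }, { key: 'Molly', value: 13 }, { key: 'Molly', value: 21 } ]
```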
12:01
The argument on the left is the key, the argument on the right is the value. So what this looks like when you actually run that map function is you take from these documents select owner comma effort and now I have a set of key value pairs where the key is the owner name
12:20
and the value is the effort. Is everyone okay so far? So now what I want to do is in SQL terms do a group by owner where I sum the values. So again we're gonna use the reduce which is effectively select owner comma sum effort
12:43
from work items, group by owner. So the reduce function takes those keys and values as input and then in this case I'm just very simply iterating over the values. On the slide,
13:03
that should say values, not result. I changed the slide, so it should be for (var i in values). So I'm iterating over the values and keeping track of the sum. So without grouping them I get the sum of all of them and depending on which database you're using
13:23
the group happens in different ways but if I were to group it you would see that John has 13 and Molly has 34. So this is not indicative of any particular NoSQL database but it's kind of like a high level view of how it works.
13:42
Most reductions in NoSQL databases happen with built in functions. In Couchbase and Mongo they have some helper functions that kind of do this stuff for you. So writing your own reduce isn't something you may necessarily do. And just to give you a quick example of MapReduce,
14:01
I mean I'm presuming everyone here is a .NET developer and has used LINQ. So if you wanted to do MapReduce on a data set in LINQ, this is how you implement it. Anytime you call .Select, that's effectively a map function, and when you call .Aggregate, that is a reduce function.
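The grouping and reduce steps, and the Select/Aggregate analogy, can be sketched in plain JavaScript as follows. The key/value pairs are hypothetical, chosen so the totals match the John 13 / Molly 34 figures mentioned in the talk, and the explicit grouping loop stands in for work the database normally does for you.

```javascript
// Hypothetical key/value pairs, as a map step might have emitted them.
const pairs = [
  { key: "John", value: 13 },
  { key: "Molly", value: 13 },
  { key: "Molly", value: 21 },
];

// Reduce function in the style described: iterate the values, track the sum.
function reduceSum(key, values) {
  let result = 0;
  for (const i in values) {
    result += values[i];
  }
  return result;
}

// Grouping step (normally handled by the database): collect values per key.
const grouped = {};
for (const { key, value } of pairs) {
  (grouped[key] = grouped[key] || []).push(value);
}

// Run the reduce once per key, like SELECT owner, SUM(effort) ... GROUP BY owner.
const totals = {};
for (const key of Object.keys(grouped)) {
  totals[key] = reduceSum(key, grouped[key]);
}
console.log(totals); // { John: 13, Molly: 34 }

// Without grouping: map then reduce, the roles Select and Aggregate play in LINQ.
const grandTotal = pairs.map((p) => p.value).reduce((sum, v) => sum + v, 0);
console.log(grandTotal); // 47
```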
14:21
So LINQ has full support for map and fold, or map and reduce, and it's done through these different extensions. All right, any questions before I move on to Mongo? So show me again how many people have used Mongo?
14:43
So Mongo is like I said by far the most popular of the document databases. It's been around for a while and I think for a long time most people kind of associated NoSQL with Mongo. They've done a good job at that. I've been using Mongo for probably four or five years now
15:02
in various forms and over time it's gotten much closer to feeling like a relational database. Not necessarily with all the relational features but in terms of query capabilities and just the way you organize your data it feels very relational friendly. So if you're looking for what database should I jump into first to appreciate NoSQL?
15:23
Mongo is probably the easiest one to kind of get your head around as a relational developer. So Mongo is an open source NoSQL database. The company that produces it is called MongoDB. Now it used to be called 10gen.
15:42
So you may find references online to 10gen but now MongoDB is the company. It's a pure document store database. So CouchDB and MongoDB are both pure document stores. Everything is a document versus Couchbase which is a hybrid of key value pair and document database.
16:00
Records are stored as BSON, binary JSON, but effectively when you're working with it you can, nine times out of 10 if not slightly higher, think of it as JSON. Documents belong to collections. So you store your documents in a collection, which is sort of like storing it in a table, but the difference is that you can have a customers collection that has a customer object
16:23
and a product object. You wouldn't want to do that. And generally in your application layer you prevent that but collections don't impose a schema. They're just kind of a logical container. And all the CRUD operations that you perform in Mongo are done on those collections.
16:44
So if you're using Mongo, is anyone in the room using RoboMongo? If you're using Mongo and you're not using RoboMongo, just stop and start using it, because it makes a world of difference not having to use the shell.
17:01
So RoboMongo is kind of like having SQL Server Management Studio for Mongo. It's a free open source project. It runs on Linux, Mac and Windows. It's incredibly useful especially if you're presenting large JSON queries which may not be your use case but it's a very useful tool just for visualizing your database and your documents.
17:23
All of the examples I'm gonna show you in this talk are also available on GitHub along with the sample data. So if you wanna take the data, play with it and run these queries yourself, just go to gist.github.com/jzablocki and I have everything out there.
17:45
So I'm gonna start off by showing you I have a collection called work items. Can everyone see that okay? The one thing RoboMongo doesn't do is let you change the font size. So hopefully everyone can see that okay. If not I can show it in GitHub and make it bigger.
18:05
So I have a collection of work items like I showed you in the beginning and for each work item you'll see that there's a description, owner, effort, is complete, and then there's a collection of tasks.
18:20
So each work item has a collection of tasks and each task has a description, a time remaining, an assignee and a due date. We're not gonna care about all these properties but what I wanna start to show you is how you ask questions about data like this using a non-relational database.
18:45
So Mongo is pretty flexible with its aggregation capabilities. On the collections themselves you have operations distinct count and group so you can run basic aggregation queries on those.
19:01
There's some limitations like for example if you're running, so if you have a Mongo cluster and you've sharded a collection across multiple nodes then you can't use these operators on those collections but the MapReduce and aggregation framework are more optimized for that.
19:22
To save some typing I'm just gonna copy these. So let's run the first query. So it says db dot, and I'm just gonna blow this up
19:40
a little here so everyone sees. Make sure everyone sees it before I run it. The first thing I'm gonna do is just run the distinct function on the work items collection. If you haven't worked with Mongo, the db object there is just a part of the Mongo shell. It
20:01
gives you an object through which you can operate on collections and other parts of your database. So generally it's db dot collection name dot collection method. So if I run the distinct on there
20:20
you'll see that I get a couple of records out where each of the work item owners is listed. So this just gives me a list of the distinct work item owners. Very simple, it's just like running a select query with a distinct operation in SQL.
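As a sketch of what the distinct call computes, here is an in-memory equivalent. The five documents are hypothetical stand-ins for the sample data, and the collection name `workItems` in the comment is an assumption about how the collection is spelled in the shell.

```javascript
// Hypothetical stand-in for the work items collection.
const collection = [
  { owner: "John", effort: 3 },
  { owner: "John", effort: 5 },
  { owner: "John", effort: 5 },
  { owner: "Molly", effort: 13 },
  { owner: "Molly", effort: 21 },
];

// Shell form, assuming the collection is named workItems:
//   db.workItems.distinct("owner")
// In-memory equivalent: pull out the field and keep each value once.
const distinctOwners = [...new Set(collection.map((doc) => doc.owner))];
console.log(distinctOwners); // [ 'John', 'Molly' ]
```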
20:41
I can combine that with a query to say and in this case show me the distinct owner where the effort is greater than 13. So the dollar sign GT is an operator in Mongo
21:02
for greater than there's GTE, LT, all sorts of different operators that as you dig into Mongo you'll see. So if I run that you'll see I just got Molly because the record where the owner was John had nothing greater than 13.
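The filtered distinct can be emulated with a tiny query matcher. This is a simplified sketch that handles only a handful of comparison operators on top-level fields; it is not how MongoDB evaluates queries internally, and the documents are hypothetical.

```javascript
// Hypothetical stand-in for the work items collection.
const workItems = [
  { owner: "John", effort: 3 },
  { owner: "John", effort: 5 },
  { owner: "John", effort: 5 },
  { owner: "Molly", effort: 13 },
  { owner: "Molly", effort: 21 },
];

// A few comparison operators, named after their Mongo query counterparts.
const operators = {
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
};

// True when every condition on every field in the query holds for the document.
function matches(doc, query) {
  return Object.entries(query).every(([field, conditions]) =>
    Object.entries(conditions).every(([op, operand]) =>
      operators[op](doc[field], operand)
    )
  );
}

// Emulates distinct(field, query): filter first, then de-duplicate.
function distinct(docs, field, query = {}) {
  const values = docs.filter((doc) => matches(doc, query)).map((doc) => doc[field]);
  return [...new Set(values)];
}

console.log(distinct(workItems, "owner", { effort: { $gt: 13 } })); // [ 'Molly' ]
```

Only Molly comes back because none of John's hypothetical work items has an effort above 13.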
21:21
Is everyone okay so far? Similarly I can run the count query and can you guys see it says five up there.
21:40
So again I apologize for RoboMongo not supporting a larger font. And similarly I can also count with a query. So you see that there's one record that has an effort greater than 13.
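The two counts can be mirrored in memory like this. The documents are hypothetical, chosen so the results match the five and one seen in the demo, and the shell forms in the comments assume a collection named `workItems`.

```javascript
// Hypothetical stand-in for the work items collection.
const docs = [
  { owner: "John", effort: 3 },
  { owner: "John", effort: 5 },
  { owner: "John", effort: 5 },
  { owner: "Molly", effort: 13 },
  { owner: "Molly", effort: 21 },
];

// Shell form (assumed collection name): db.workItems.count()
const totalCount = docs.length;

// Shell form (assumed collection name): db.workItems.count({ effort: { $gt: 13 } })
const largeEffortCount = docs.filter((doc) => doc.effort > 13).length;

console.log(totalCount);       // 5
console.log(largeEffortCount); // 1
```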
22:07
There's also a group method. When you use the group method you specify a key, which is what I'm gonna group on, and a reduce function like the one we looked at with MapReduce. So I'm already telling
22:22
the group method what to reduce on, and then there's the function to do the reducing. In this case I'm just counting, and setting the initial count to zero. So if I run that you'll see that I got
22:41
a nice group by on owner with a count of work items. So these are the three basic collection based aggregation methods in Mongo. Generally speaking they're great for quick querying and everything but they're not nearly as flexible as MapReduce or certainly not as flexible
23:01
as the aggregation pipeline which we'll see in a couple of slides. So MapReduce in Mongo, so you provide a map function, you provide a reduce function
23:22
and then optionally you can provide an out which is going to be a collection into which you'll put the results of the MapReduce. So if you wanna query a collection after you can do that. So let me run that and show you what that looks like. So this MapReduce is basically the same
23:40
that we looked at with the examples when we were just staring at the documents before. So if I run this, this is the output of the MapReduce. You'll see it has a whole bunch of statistics about what happened and it tells me that
24:05
the collection that it was output to is now work items by owner. So if I refresh my collections and view these documents, you'll see that I have a new collection
24:22
that's been persisted that has the results of my MapReduce. Everyone okay so far? Getting a little more interesting. So now let's say, so it's conceptually pretty easy
24:40
to think of how do I aggregate at the root document, root part of the document, not thinking about the nested collection. Fortunately it's not much more complicated to get into the nested collection as long as you know the basics of JavaScript. So in this case, my map function, instead of outputting something in the root document,
25:03
can iterate over the tasks collection. What I'm doing is iterating over the tasks collection and calculating, for each work item, the sum of the time remaining.
25:21
So again, looking at these documents, what I'm doing is for each work item, iterate over the tasks and compute the sum of each of the time remaining values.
25:44
So it's very simple JavaScript. Then after that loop, I emit the description as the key, so I'm gonna group by the description, and the value that I'm going to aggregate is the sum of all of those tasks,
26:02
their time remaining. So if I run this, and I'll copy this find after just to show you that it outputs to a collection,
26:25
you can see now that for each of my work items, I've grouped and summed the total time remaining. So in the resulting collection, the key is the description and the value is the sum of all of the time remaining numbers.
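The map and reduce steps described above can be sketched in Python. The talk's functions are JavaScript run inside MongoDB; this is only an in-memory emulation, and the sample documents, the field name timeRemaining and the helper names are assumptions.

```python
from collections import defaultdict

# Hypothetical work items with nested tasks (names and values are assumptions).
work_items = [
    {"description": "Build login page", "owner": "John",
     "tasks": [{"assignee": "Ann", "timeRemaining": 4},
               {"assignee": "Bob", "timeRemaining": 6}]},
    {"description": "Write reports", "owner": "Molly",
     "tasks": [{"assignee": "Ann", "timeRemaining": 3}]},
]

def map_work_item(doc):
    """Map step: loop over the nested tasks, then emit one (key, value) pair."""
    total = 0
    for task in doc["tasks"]:
        total += task["timeRemaining"]
    yield doc["description"], total

def reduce_sum(key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)

def map_reduce(docs, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

time_by_description = map_reduce(work_items, map_work_item, reduce_sum)
```

The result maps each description to the summed time remaining of its tasks, which mirrors the key and value layout of the output collection.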
26:49
In all the examples I've shown you so far, I've had one emit per document. There's no rule that says you can only emit
27:01
one record per document. There are owners and there are assignees, so say I want to see, across all of my work items, the time remaining,
27:20
how much work is left for a particular assignee. Again, I iterate over the tasks, but instead of emitting something at the root of the document, I emit the assignee. So the map ends up outputting one record for each assignee with their time remaining for a given task.
27:46
So now, if I run this, you'll see that by assignee, I have a nice sum of everything that they have left in their time remaining.
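Emitting once per task instead of once per document might look like the following Python emulation. As before, the documents and field names are assumptions, and the grouping is done in memory rather than by MongoDB.

```python
from collections import defaultdict

# Hypothetical work items with nested tasks (names and values are assumptions).
work_items = [
    {"description": "Build login page",
     "tasks": [{"assignee": "Ann", "timeRemaining": 4},
               {"assignee": "Bob", "timeRemaining": 6}]},
    {"description": "Write reports",
     "tasks": [{"assignee": "Ann", "timeRemaining": 3}]},
]

def map_tasks(doc):
    """Map step: emit one (assignee, timeRemaining) pair per nested task."""
    for task in doc["tasks"]:
        yield task["assignee"], task["timeRemaining"]

# Collect every emitted value under its key.
emitted = defaultdict(list)
for doc in work_items:
    for assignee, remaining in map_tasks(doc):
        emitted[assignee].append(remaining)

# Reduce step: sum the values for each assignee.
time_by_assignee = {assignee: sum(values) for assignee, values in emitted.items()}
```

The first document emits two pairs and the second emits one, so an assignee who appears in several work items ends up with a single summed total.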
28:07
So kinda the key to understanding MapReduce is just understanding how to unwind a document, how to look at its nested collections, its elements, and project into the map, out of the map, something that you can aggregate.
28:20
You know, I think over time you learn to think of MapReduce in a similar way to how you think of your SQL queries. At the end of the day, data in relational or non-relational databases is, in some sense, just key-value pairs, or some set of keys with an object. A JSON object is a set of properties and values.
28:44
A row in a table is a set of values associated with columns. So the mindset is kind of the same, but the tooling and the way you actually get that data is very different. So now I wanna move on to
29:05
the aggregation framework in Mongo. This is a more recent, and at first potentially more confusing, way of dealing with aggregation.
29:21
Now I'll admit when I first tried this out, I was calling it in my head the aggravation framework, because you end up having to write a lot of curly braces and a lot of JSON, and doing that in the shell, you know, at the command line, is very tricky. But using a tool like RoboMongo, you find that it's actually pretty easy to get your head around.
29:43
So I'm gonna start with a very basic group. So we use the group operator. We specify the column on which we're going to group. So when you're using some of these operators, you'll find this dollar sign owner is basically a way of referencing back into the document
30:01
a property of that document. I'm creating a new property called count, and I'm summing, basically I'm outputting a one for each of the records, and then I'm just summing all those ones. So it's effectively just doing a count of each record.
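The group stage described above can be written as a pipeline document and emulated in Python. The pipeline below uses the real $group and $sum operators; everything else (the sample owners and the in-memory counting) is an assumption made so the sketch runs without a database.

```python
from collections import Counter

# Hypothetical work items; only the owner matters for this example.
work_items = [
    {"owner": "John"}, {"owner": "John"}, {"owner": "John"},
    {"owner": "Molly"}, {"owner": "Molly"},
]

# The stage as a pipeline document: group on $owner and sum a 1 per record.
pipeline = [{"$group": {"_id": "$owner", "count": {"$sum": 1}}}]

group_spec = pipeline[0]["$group"]
key_field = group_spec["_id"].lstrip("$")   # "$owner" refers to the owner property
increment = group_spec["count"]["$sum"]     # the 1 that is output for each record

# In-memory emulation: add the increment once per document, per key.
counts = Counter()
for doc in work_items:
    counts[doc[key_field]] += increment

result = [{"_id": owner, "count": n} for owner, n in counts.items()]
```

Summing a constant 1 per document is what turns the sum accumulator into a count.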
30:26
So if I run this, you'll see that I have a count of two for Molly and a count of three for John.
30:41
Just to give you a sense of what that count, sum one is doing: the output of the map, or whatever's happening behind the scenes, is logically something like this. John comma one, John one, John one, Molly one, Molly one.
31:00
So that count sum one is at some point doing something like that, and then just summing up each of the numbers that I outputted and then associating them with a grouped key. I can do a composite key.
31:20
So if I want to group by two things, owner and effort, the count is one for most of these, but this shows me that John has two work items with an effort of 13.
31:40
So let me show you back here, the thing I added within the ID. So again, the underscore ID in the group operator is the key on which you're grouping, and it doesn't have to be, like in this first example, it doesn't have to be a single property. You can create a projection, which is, in this case, the combination of owner and effort.
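A composite grouping key can be sketched the same way. The pipeline document shows the shape of a two-part _id; the Python tuple key is an emulation, and the sample efforts are assumptions picked so that John has two items with an effort of 13, as in the talk.

```python
from collections import Counter

# Hypothetical work items (values are assumptions).
work_items = [
    {"owner": "John", "effort": 13},
    {"owner": "John", "effort": 8},
    {"owner": "John", "effort": 13},
    {"owner": "Molly", "effort": 21},
    {"owner": "Molly", "effort": 5},
]

# The _id of the group stage is itself a document built from two properties.
pipeline = [{
    "$group": {
        "_id": {"owner": "$owner", "effort": "$effort"},
        "count": {"$sum": 1},
    }
}]

# In-memory emulation: a tuple plays the role of the composite key.
counts = Counter((doc["owner"], doc["effort"]) for doc in work_items)

result = [
    {"_id": {"owner": owner, "effort": effort}, "count": n}
    for (owner, effort), n in counts.items()
]
```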
32:00
So this is basically a composite key, not composite key in the primary key sense, but in that key value pair, it's two things making up the key. Now what makes the MongoDB aggregation pipeline so powerful is that it's a pipeline.
32:24
So you can have any number of operators: the group happens, and then I feed it to a sort, then I feed it to a match, then I feed it to whatever. So you can start to combine operators in a sort of line-by-line way.
32:41
You can do a lot of this stuff in MapReduce, but it's not nearly as straightforward. So in this case, I'm doing the same group that I did above, but then I'm applying a descending sort based on the count. So again, the output of each operation is fed like a pipeline into the next operation.
33:02
So if you've used PowerShell or Linux, you're probably used to some of the pipeline stuff where you output a file into something else. So if I run this,
33:22
you'll see it's the same result, it's just descending now. I can add a query, and this is also a very powerful feature, and I can do this in MapReduce as well, but again, you don't have the whole pipeline model.
33:40
So here, before I do my aggregation, I wanna filter out all records where the owner is not Molly. So find all the records for Molly, and then group it.
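The way each stage feeds the next (match, then group, then a descending sort) can be sketched as a chain of small steps in Python. The pipeline document uses real stage names; the helper function run and the sample data are assumptions, and nothing here talks to MongoDB.

```python
from collections import Counter

# Hypothetical work items; only the owner matters here.
work_items = [
    {"owner": "Molly"}, {"owner": "John"}, {"owner": "John"},
    {"owner": "Molly"}, {"owner": "John"},
]

# The stages in pipeline order, as they would be passed to aggregate.
pipeline = [
    {"$match": {"owner": "Molly"}},
    {"$group": {"_id": "$owner", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

def run(docs, owner=None):
    """Emulate match -> group -> sort; the match is skipped when owner is None."""
    matched = [d for d in docs if owner is None or d["owner"] == owner]
    counts = Counter(d["owner"] for d in matched)
    grouped = [{"_id": o, "count": n} for o, n in counts.items()]
    return sorted(grouped, key=lambda d: d["count"], reverse=True)

all_owners = run(work_items)           # group and sort only
only_molly = run(work_items, "Molly")  # filter first, then group and sort
```

Filtering before grouping means the later stages only ever see Molly's records, which is why the John group disappears from the output.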
34:00
So that eliminated the John records. There's a projection operator, which is pretty powerful. In this example it's not as obvious why it's powerful, but it will be when we get into some more advanced examples.
34:24
Here you'll see that I basically cut out all the parts of my document except for the assignee. When we start to ask more interesting questions through the pipeline, we'll see why we need that.
34:45
Unwind is probably one of the more powerful, but sometimes more confusing, operators, but I also think it's one of the more necessary operators as you start to pick apart nested collections. So in this case, I just wanna show you what unwind does.
35:03
So remember our document structure. We have, for each work item, we have a nested set of tasks. If I unwind the tasks,
35:20
now for each task within a work item, I have a new record in the collection. You probably, at some point in your life, had to do a query where you joined a master detail table together and ended up with a lot of redundant data from the master. So basically, you're repeating data from the master in each row for the child.
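What unwind does to a nested array can be emulated in a few lines of Python. The $unwind stage name is real; the unwind helper, the documents and the field names are assumptions for illustration.

```python
# Hypothetical work items with nested tasks (names and values are assumptions).
work_items = [
    {"description": "Build login page", "owner": "John",
     "tasks": [{"assignee": "Ann", "timeRemaining": 4},
               {"assignee": "Bob", "timeRemaining": 6}]},
    {"description": "Write reports", "owner": "Molly",
     "tasks": [{"assignee": "Ann", "timeRemaining": 3}]},
]

# The stage as a pipeline document.
pipeline = [{"$unwind": "$tasks"}]

def unwind(docs, field):
    """Yield one copy of the parent document per element of the array field."""
    for doc in docs:
        for element in doc[field]:
            flattened = dict(doc)       # shallow copy of the parent's properties
            flattened[field] = element  # the array is replaced by a single element
            yield flattened

unwound = list(unwind(work_items, "tasks"))
```

Two work items with three tasks in total become three documents, and the parent's description and owner are repeated on each one, much like the master columns in a joined master and detail query.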
35:41
That's what's happening here. So I'm saying, unwind this tasks collection and output a set of documents where, for each task, I show just the single task, but on top of it is the data from its work item. So you can see that this appears
36:02
over and over and over again. So I'm about to start digging a little deeper, but I just wanna make sure everyone's okay before I move on. Are there any questions? All right, so now let's combine
36:25
a projection with the unwind operator. So before I unwind this, I just wanna show you the output of the projection. So one of the nice things with working with RoboMongo is it's very easy
36:42
to sort of do these step-wise. When I first started learning the aggregation framework, I was using the MongoDB shell, which is a command line interface, and obviously with a command line it's a lot harder to do things piecemeal. So at the point of the projection,
37:02
I've output the description and the time remaining for each of the tasks. So I have a description and its time remainings. But now what I wanna do is unwind it
37:22
so that I can perform a simple aggregation on the description and sum up the time remainings. So when I unwind the tasks, so again, I projected the tasks out so that I have an array of time remainings for each task. When I unwind it, I denormalize it a bit, and I get the description repeated once
37:41
for each of the time remainings. And now what I can do on top of that is add this group operation. So what I'm gonna do here is group by the description and sum up each description's time remainings.
38:13
So that basically did the equivalent of a SQL select description, comma,
38:21
sum of time remaining from work items, group by description. And then if I wanted to output that to a collection, I can use the out operator.
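The project, unwind and group steps above, with an out stage at the end, can be laid out as a pipeline and emulated in Python. The stage and operator names are real; the output collection name timeByDescription, the documents and the field names are assumptions, and the out stage is shown but not emulated.

```python
from collections import defaultdict

# Hypothetical work items with nested tasks (names and values are assumptions).
work_items = [
    {"description": "Build login page", "owner": "John",
     "tasks": [{"assignee": "Ann", "timeRemaining": 4},
               {"assignee": "Bob", "timeRemaining": 6}]},
    {"description": "Write reports", "owner": "Molly",
     "tasks": [{"assignee": "Ann", "timeRemaining": 3}]},
]

pipeline = [
    {"$project": {"description": 1, "tasks.timeRemaining": 1}},
    {"$unwind": "$tasks"},
    {"$group": {"_id": "$description",
                "timeRemaining": {"$sum": "$tasks.timeRemaining"}}},
    {"$out": "timeByDescription"},  # hypothetical collection name
]

# Project: keep only the description and each task's time remaining.
projected = [
    {"description": d["description"],
     "tasks": [{"timeRemaining": t["timeRemaining"]} for t in d["tasks"]]}
    for d in work_items
]

# Unwind: one row per task, with the description repeated.
unwound = [
    {"description": d["description"], "tasks": t}
    for d in projected for t in d["tasks"]
]

# Group: sum the time remaining per description.
totals = defaultdict(int)
for row in unwound:
    totals[row["description"]] += row["tasks"]["timeRemaining"]
```

Running the steps one at a time like this is the same step-wise approach the talk uses to inspect the output of each stage.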
38:44
Most of the pipeline methods can kind of go in any order. And in some cases, like there's a limit and a skip operator, the query optimizer is smart enough to sort of put some of the operations in the right order for you. But the one thing you have to do last is the out operator.
39:01
If you're going to output to a collection, it has to be the last thing you do. When would you use out?
39:23
So the question is when would you use out. It's basically for when you want to compute something and have it saved in a collection that you can then access from anywhere else. You can run the aggregation over and over again, but it'll potentially be much more efficient to have the results stored somewhere.
39:52
Yeah, so well, it could potentially be, if this is a huge collection and you're persisting a very large collection somewhere, I mean that's going to depend on how much space
40:01
and power you have available. Some of the limitations too, you can't output to a sharded collection and there's some things that'll help make you decide whether to use out or not.
40:32
Yeah, honestly the question is about whether Mongo is caching the query and being able to run it again efficiently.
40:40
I'm not 100% sure, I would imagine there's a query optimizer, I'm imagining it does something like that. So in this case, if I run it, you can see now I have my output collection. Another interesting thing you can do is
41:04
work with arrays in a little bit of a different way. So we're going to do this step by step. So the first thing I'm going to do is just unwind my tasks again. So let's start with the unwind.
41:20
So this is what we saw before, just create a record for each combination of description and task, or work item and task. So now, if I run this, you can see that
41:40
for each assignee, I've pushed the set of task descriptions into an array associated with that assignee. So push lets you kind of create a new array of something within your data to associate
42:03
with one of your properties in your document. There are other operators like push that you can use. There are a bunch of set operators that are usable within your aggregation framework.
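The push accumulator described above can be emulated with a list per key. The pipeline document uses the real $unwind, $group and $push operators; the task descriptions, assignee names and the in-memory loop are assumptions.

```python
from collections import defaultdict

# Hypothetical work items whose tasks carry their own description.
work_items = [
    {"description": "Build login page",
     "tasks": [{"description": "Design form", "assignee": "Ann"},
               {"description": "Wire up auth", "assignee": "Bob"}]},
    {"description": "Write reports",
     "tasks": [{"description": "Draft template", "assignee": "Ann"}]},
]

pipeline = [
    {"$unwind": "$tasks"},
    {"$group": {"_id": "$tasks.assignee",
                "tasks": {"$push": "$tasks.description"}}},
]

# In-memory emulation: append each task description to its assignee's array.
pushed = defaultdict(list)
for item in work_items:
    for task in item["tasks"]:
        pushed[task["assignee"]].append(task["description"])

tasks_by_assignee = dict(pushed)
```

Where sum folds the grouped values into one number, push keeps them all, so each group ends up with an array.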
42:22
And then here, this is similar to an example above, just to show you, you can project a little deeper. So this is similar to the map reduce, the last map reduce I showed, where I dug into the assignee. So when I project, I don't have to just project
42:41
at the root level, I can project a nested element. So if I take the group off of this to show you what this is doing, you'll see that I'm projecting the assignee with the time remaining for each of those tasks. And then I can apply the group on the assignee,
43:00
sum up the time remaining, and see the results here. Now this is a bigger one to show. So you can use variables within your pipeline, which is very nice for certain things. So in this case, I start out by unwinding my tasks,
43:23
and then I create a projection. And within the projection, I'm gonna use variables to decide whether to multiply. So the idea with this query is, let's say that your team performs great for anything that's estimated at an eight or lower, but they're a little slow on the 13s and above.
43:40
So let's really fix the estimate and say that it's one and a half times the effort if the effort's been estimated at 13 or greater. So I unwind the tasks, project the assignee and the time remaining, and this effort colon one is just saying include effort.
44:01
Here, tasks.assignee is a nested element, so I have to reference it with quotes, because the JSON will break on the dot otherwise. The effort I'm just including from the root of the document, so effort colon one will include that. If I don't care about the object ID in my results,
44:22
I just turn it off with underscore ID colon zero. Now I create a new computed property called predicted effort. The way I create the predicted effort
44:41
is start with a let statement. The let statement is how you define variables. So it has a vars property. So I create a variable called multiplier, and the multiplier is computed by saying, okay, the conditional operator takes an if then else, so if the effort is greater than or equal to 13,
45:03
then multiplier equals one and a half, else it equals one. So it's more or less a ternary operator. So now this variable multiplier exists here, and now here, I use the multiplier variable,
45:21
and the effort: I multiply the multiplier by the effort from the document. So multiplier will equal one if the effort is less than 13, and one and a half if it's greater than or equal to 13. So the result here is going to be
45:42
based on that computation. So let me run that and show you what it does. So now you can see predicted effort here is
46:02
19 and a half where it was 13. Where it was eight, it's still eight.
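The let, conditional and multiply expression walked through above could look like the document below, followed by a plain-Python equivalent. The operators $let, $cond, $gte and $multiply are real; the property name predictedEffort and the helper function are assumptions, and the function only mirrors the expression rather than evaluating it.

```python
# The computed property as an aggregation expression (for a $project stage).
predicted_effort_expr = {
    "$let": {
        # vars defines the variable: 1.5 when effort >= 13, otherwise 1.
        "vars": {
            "multiplier": {
                "$cond": {
                    "if": {"$gte": ["$effort", 13]},
                    "then": 1.5,
                    "else": 1,
                }
            }
        },
        # in uses the variable ($$multiplier) together with the document's effort.
        "in": {"$multiply": ["$$multiplier", "$effort"]},
    }
}

def predicted_effort(effort):
    """Plain-Python equivalent of the expression above."""
    multiplier = 1.5 if effort >= 13 else 1  # the conditional, like a ternary
    return multiplier * effort               # the multiply

predictions = [predicted_effort(e) for e in (8, 13, 21)]
```

A variable referenced inside the in part is written with two dollar signs, while a single dollar sign refers to a property of the document.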
46:23
This isn't that different from the sums we were doing before, but just to show you another operator, there's the average operator, the AVG. So I can compute an average of the time remaining by grouping and using the average function.
46:49
I won't run all of these because they're more or less the same, but there's a min, there's a max, there's a first and last, and when you run the first and last,
47:02
they don't make sense if you're not sorting, because it'll be unpredictable what's first and last. So in this case, I'm sorting on time remaining. And if you want to do the equivalent of a SQL having greater than, so again, I unwind my tasks,
47:21
I project out the assignee and the time remaining, like we've done a few times, group on the assignee and average the time remaining, but then I can apply a match filter, or match operation, which says, only show me records where the average is greater than 14.
47:41
So if I run this, you'll see that I've eliminated anything with an average of less than 14. If I take that away, you'll see that I got the seven and the 10 and a half back.
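A match placed after a group behaves like a SQL having clause, which can be sketched as follows. The pipeline document uses real stages and the $avg accumulator; the task rows are assumptions chosen to reproduce the averages of 7 and 10.5 mentioned in the talk, and they are written as already unwound rows to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical task rows, as they would look after unwinding (values assumed).
tasks = [
    {"assignee": "Ann", "timeRemaining": 6},
    {"assignee": "Ann", "timeRemaining": 8},
    {"assignee": "Bob", "timeRemaining": 10},
    {"assignee": "Bob", "timeRemaining": 11},
    {"assignee": "Cat", "timeRemaining": 20},
    {"assignee": "Cat", "timeRemaining": 16},
]

pipeline = [
    {"$unwind": "$tasks"},
    {"$project": {"tasks.assignee": 1, "tasks.timeRemaining": 1}},
    {"$group": {"_id": "$tasks.assignee",
                "average": {"$avg": "$tasks.timeRemaining"}}},
    {"$match": {"average": {"$gt": 14}}},  # runs on the grouped output
]

# Group: collect the values per assignee, then average them.
values = defaultdict(list)
for row in tasks:
    values[row["assignee"]].append(row["timeRemaining"])
averages = {assignee: sum(v) / len(v) for assignee, v in values.items()}

# Match after the group: keep only averages greater than 14.
having = {assignee: avg for assignee, avg in averages.items() if avg > 14}
```

Dropping the final match stage brings the lower averages back, as in the demo.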
48:01
Question? The question is whether there's a profiler to see what the query analyzer's doing, and I can't answer that, but we do have a Mongo employee here,
48:22
so maybe he can answer that. So Craig Wilson up there actually wrote, or is writing, the MongoDB driver, which, when you find that out right before you give a talk on MongoDB, gets a little intimidating.
48:40
Thanks. And then this is a fun and useful feature, I think: there's a redact operator. So let's say you have a document, or some part of a document, that you don't want
49:01
pulled back in the pipeline, based on some condition, like a particular user. You can set a condition in the redact operator that says, in this case, if the owner is the user, then descend, which will give you the document, else prune, which won't give you the document.
49:21
So if I run this, you'll see that I didn't get the Molly records back,
49:43
but if the owner is Molly, then only Molly's records come back. So it's a useful way of sort of adding a little security based on some condition in your environment.
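The redact condition described above can be written as a stage document and approximated in Python. The $redact stage and the $$DESCEND and $$PRUNE system variables are real; the sample owners and the helper function are assumptions. This sketch only decides at the top level of each document, whereas the real stage also re-evaluates the condition on embedded documents when it descends, which is not modelled here.

```python
# Hypothetical work items; only the owner matters for this sketch.
work_items = [
    {"owner": "John", "description": "Build login page"},
    {"owner": "Molly", "description": "Write reports"},
    {"owner": "John", "description": "Fix search"},
    {"owner": "Molly", "description": "Plan sprint"},
]

# The stage as a pipeline document, with the user hard-coded as Molly.
pipeline = [{
    "$redact": {
        "$cond": {
            "if": {"$eq": ["$owner", "Molly"]},
            "then": "$$DESCEND",
            "else": "$$PRUNE",
        }
    }
}]

DESCEND, PRUNE = "$$DESCEND", "$$PRUNE"

def redact_decision(doc, user):
    """Top-level stand-in for the condition: keep the user's own documents."""
    return DESCEND if doc["owner"] == user else PRUNE

visible_to_molly = [d for d in work_items if redact_decision(d, "Molly") == DESCEND]
visible_to_john = [d for d in work_items if redact_decision(d, "John") == DESCEND]
```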
50:07
So this goes till 14:40, right? So 10 more minutes. So that's basically it for the Mongo stuff. Are there any questions before I move on to a quick look at Couchbase?
50:25
I'm sorry? So the question was what happens when you run this against a sharded collection. The aggregation pipeline works with sharded collections.
50:42
The first set of stuff I showed you, the collection methods like group and distinct, will not. And I think there are some limitations; I don't have the details offhand, but I think there are size constraints on MapReduce versus this framework. The docs have all that specified though.
51:03
I'm sorry? No, you cannot out to a sharded collection. And whenever you out, if the collection already exists,
51:21
I believe it replaces it. So the question is whether you can out to a different database. I don't believe that's possible as part of the operation. You'd have to do some sort of copy after.
51:40
So Couchbase is very similar to CouchDB. Let me just log in here. But Couchbase, basically the basics of Couchbase
52:01
is that it started off as a hybrid of memcached and CouchDB. How many people have used memcached? How many people have heard of memcached? How about AppFabric? So if you've used AppFabric, which is a distributed cache on Azure, it's the same idea. In fact, I think they actually support the memcached protocol.
52:21
So memcached originally started off as just a distributed cache, and it was a text-based protocol. Some guys in California that I used to work with created a binary protocol for it and made it much more efficient. They came up with algorithms for distributing it better and actually saving it and persisting it to disk.
52:40
So Couchbase, because it's based so heavily on memcached, is lightning fast. And once version two came out a couple of years ago, the CouchDB stuff got put in there. It was rewritten from Erlang to C, and now you have the ability to create secondary indexes using MapReduce. With a key-value store, generally, the only way to get data out is through the key.
53:02
But now, with a key-value store with document capabilities, you can write a MapReduce function to query on secondary indexes. So you have a concept of a view, and within the view, you write a map function,
53:25
and optionally a reduce function. So this is the same map function we've looked at before. So I won't go through all the details of Couchbase and its MapReduce,
53:40
but just understand that I can group it; I just clicked a group button. Now if I show the results, it's grouped by the two owners. What I do wanna show you from Couchbase, and this is not production yet, is actually a pretty interesting technology. So there's something that Couchbase
54:01
is working on called N1QL, pronounced nickel. I forget what it stands for, but it's a query language for Couchbase documents. So to understand Couchbase, one way it's very different from Mongo is that there's no concept of a collection. Everything sits in a bucket, everything exists as a document in a bucket
54:20
without any sort of abstraction over it. Now there are advantages and disadvantages to both of those approaches. Like for Mongo with the collections, you have a nice sort of way of thinking about it in terms of how it relates to a table. But you can't do clever things with, you can't do some clever things with MapReduce that you can do with Couchbase, which is like to create master view records
54:40
in a logical way, because you can look at all your documents, not just documents in a collection. You could do the same with a collection that's a little bit bigger in Mongo, I guess, but generally, Couchbase has some nice features for how you can order your documents as the result of creating MapReduce. I won't get into that now, because we only have a few minutes left, but I want to show you how with N1QL I can,
55:03
can everyone see that okay? This is how I can use N1QL in Couchbase to count how many tasks an owner has. So this is not something you can do with the drivers,
55:21
it's not something you can do without installing like a dev version, the CBQ or whatever executable, but effectively you can see what they're building is a SQL-like language on top of their MapReduce engine. I don't know if it's actually on top of the MapReduce engine, but it's a query language that lets you write SQL-like code instead of MapReduce.
55:43
So it's interesting to see the two different approaches from the different companies. Couchbase is obviously trying to move a step towards what Cassandra does. Has anyone used Cassandra? Cassandra was originally a Facebook project, and now it's a very popular open source database,
56:02
or NoSQL database, and they have something called CQL, Cassandra Query Language, so they have this sort of key value-like database that has a SQL-like language on top of it, and Couchbase is obviously moving towards something similar, and as you can see the Mongo folks are kind of moving
56:22
in a very different way and kind of keeping the pure JSON feel of the database. I honestly don't know enough about N1QL to show you more than that, but I just wanted to show you kind of what's happening. So in the last couple minutes, are there any questions?
56:43
So I would definitely recommend going out and trying Mongo if you haven't, and definitely try Couchbase, they're two great products, two great databases. I'm happy to again discuss any sort of questions you might have on that stuff. Just find me after, let me just show you
57:01
some links where you can get more info on this stuff. One thing to add with Couchbase, you do get built-in reduce functions, so you have count, sum, and stats, and stats will give you min, max, count, sum,
57:21
sum of squares, so it's kind of nice to have those built-in reduce capabilities. So in C#, if you were to call these methods, this is what it looks like in Couchbase. I'm not gonna dig in too much; you just create a Couchbase client, get the view,
57:41
then you get a list of key-value pairs. In Mongo, and I've been told recently by Craig up there that this will be a little cleaner, right now, in the current version of the MongoDB driver, when you work with all this aggregation stuff,
58:00
you work heavily in BSON documents, and it's a little confusing when you're in Visual Studio and seeing lots of curly braces, but that should be hopefully a little easier in the future. So where you can get all this stuff, again, gist.github.com slash Jsablocky,
58:21
that has the sample documents, the sample queries, everything that I showed you today is there, and this presentation you can grab on oneDrive.ms, lowercase L-M, capital M, capital Q, lowercase J-L-R. So this presentation is there, you can grab it, I mean, everything that's in the presentation is generally in the samples, too, so.
58:44
And that's it, thank you guys.