
If you can't Beat 'em, Join 'em!


Formal Metadata

Title
If you can't Beat 'em, Join 'em!
Subtitle
Integrating NoSQL data elements into a relational model
Alternative Title
If you can't beat 'em, join 'em (... a pun)
Series Title
Number of Parts
29
Author
Contributors
License
CC Attribution - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute and make the work or content publicly accessible in unchanged or adapted form for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in adapted form, only under the terms of this license.
Identifiers
Publisher
Publication Year
Language
Production Place
Ottawa, Canada

Content Metadata

Subject Area
Genre
Abstract
Why, when and how you can integrate documents and key-value pairs into your relational model. There is a pitched battle going on between the relational, document-based, key-value and other data models. PostgreSQL is uniquely capable of leveraging many of the strengths of multiple data models with JSON(B), HSTORE, XML, ltree data types, arrays and related functions.

This presentation outlines the use cases, benefits and limitations of document-based, key-value and hierarchical data models. It then presents practical advice and code snippets for incorporating them into PostgreSQL's relational framework, ending with SQL examples and code snippets for loading, accessing and modifying (where possible) JSON, HSTORE, XML, ltree and array data types.

The presentation begins with a very quick review of the rationale, benefits and implications of the relational data model, then does the same for document-based and hierarchical models. The balance of the presentation works with three publicly available data sets (world-wide airports, Wikipedia Infobox key-value pairs, and Google address JSON objects), showing how they can be incorporated into a simple relational model. The presentation also includes snippets of code for loading the files and accessing elements. The full SQL and shell code will be available on the web site.
Transcript: English (auto-generated)
All right. I think, hopefully, we'll all get over the post-lunch lull. I tried to jazz this up a little bit. It has been pointed out that the presentation isn't exactly as described on the PGCon schedule. But we're all here, so it's sort of tough.
Anyway, Jamie Hansen: if you can't beat 'em, join 'em, a pun on integrating NoSQL into Postgres. And you'll see. Anyway, let's just jump in. All right. We're going to start with, well, about a quarter of the
presentation is unavoidably sort of theoretical and conceptual. And the first is the NoSQL hype check. Everything from a magic panacea to a hot mess. And the rest of the presentation is sort of on the assertion that it doesn't matter. It's sort of really useful in some situations, not in others, but it's here to stay.
That's really the premise. This isn't an argument for or against. NoSQL just is. And it's our job, if we want Postgres to thrive, to be able to use Postgres with NoSQL. So how can we thrive in this environment? And I'm asserting that the answer is to know where our
sweet spot is in the relational model. And we have a lot of capabilities. And except for the edge cases, Postgres isn't going to run Google search. But I'll show you here how it goes. All right. So rather than the hype curve, this is my stages of NoSQL acceptance; it's like fighting the tide.
So again, NoSQL is. It doesn't kill relational databases. It certainly doesn't kill Postgres. We have more capabilities than other relational databases in the marketplace. And I want to show you ways that we can thrive. Go ahead. Oh, I'm sorry. All right.
So I'm going to try and turn the traditional presentation, or the traditional focus on its edge, saying that ordinarily we say, given that I have Postgres, because we are a Postgres conference, how can I leverage NoSQL? And what I'm trying to do for at least 40 minutes is to reverse, which is to say, given that NoSQL is out there
in the business world, how can I leverage Postgres to make it work? All right. Much of this stuff, Martin Fowler's presentation on YouTube is fantastic. And his book is also pretty darn good. He coined the term polyglot persistence, which he's
selling it. I bought it. And I'm sharing it here. So there's me. OK. Here's part of our conceptual piece. NoSQL, as I said, can be all these other things. For the next 40 minutes, it's just these five things.
Document store databases: we're going to look at JSON for us; it could be XML as well. A wide columnar store, we're not going to do, so that didn't make the cut. Key-value pairs, that's the hstore data type for us. A graph database, that's really not a data type, but it's a recursive query; we're going to go through one of those at the end. And Apache Solr, sort of as a search engine. A plug
for that: on SlideShare, from PGConf New York City, we had a talk on generating ranked results with natural language queries with Postgres full text search. But that's not in this presentation. So anyway, we're going to focus on the document, key-value pair, and graph stores.
All right. So part of it is, and we've got to go Debbie Downer just for a moment here. Basically, I need to make sure we all know the three critiques against relational databases. Because we have to know what other technologies are saying about us in order to refute it. The first is, you've got to go to NoSQL because there's
too much data for one or a few machines to process. I know there are smart people who disagree with this, but for the rest of the presentation I'm going with: if it doesn't fit on a few machines, it's not a good candidate for relational. That's really an edge case for our purposes.
I'll point out, for sharding: I know that sharding can be done in a relational model, but it's very difficult, and NoSQL has the expectation of sharding. I realize there's a lot more to this story, but for our purposes, it's a common critique. A second piece.
Because the relational model stores data in little tiny pieces in lots and lots of different places. It gives us transactions, a lot of power in this, but it doesn't match the object-oriented approach. The example here is, if I was building a Death Star app, I would like to have a method that says, get Death Star or
post Death Star. But in a relational model, you really got to touch each of the little pieces, and that's a bit of a pain in the arse. And then the last critique is because it's difficult over time as your data structure or your business needs change
to keep it all in a relational model. In other words, you can't have one table that says, as of now, I'm changing the structure, but I want to keep all the old records in the same table. You didn't come to Postgres conference to hear bad stuff, so that's the end of that. Just wanted it out on the table. All right. The other side, though, is we're a strong arse contender
ourselves. The first is, we're an incumbent. We are everywhere. Google has relational databases. We rock the house. We have an extremely powerful interaction language, SQL. It's widely known. You can get expertise in it all over the place. It's mature, and it still has current development.
Witness this, because we have transactions, real ones, down at the data element. Not pseudo-ish transaction-like eventual consistency, but real no-crap transactions. You make a change here, it's visible everywhere. And then this last one, because I get that the cool kids play with Google-sized data, but most of us really
don't, you can handle comfortably multi-terabytes of data. You really only need that on the petabyte side, and there's not that many data sets that large around. And then this last piece is big. We're better at finding relationships between data elements.
And there are times when, no crap, you do want to stop bad data from loading. The wonderful thing about a glorious, flexible data model is you don't have to think about anything, and you can put all kinds of crap in. But you don't always actually want that. So anyway, I wanted to give us a framework of where the current state of
relational versus no-SQL. All right, so where does all that stuff lead us? Martin Fowler coined the term polyglot persistence. And the short answer is his argument, and I agree with, is that we're going to have relational databases and no-SQL databases. And our job is to match the business needs plus the data
to the technology. And I'm going to argue that a lot of that does fit very, very well into Postgres. And then this presentation is really about finding where it's a great fit and demonstrating how to integrate it. And again, this is really what it is.
I got four kids, two in college, one in a private high school. I need Postgres to not just survive; I need it to thrive, very selfishly, very personally. So that really means we have to make this no-SQL stuff work. So we want to thrive, not just survive, in a world that includes no-SQL.
Questions on this stuff? All right, so anyway, here's all that stuff in one chart. I got amount of data, size of business transactions, basically Postgres relational database model, bigger than one machine, but not really suited if it goes beyond a few machines. We'll talk about transaction sizes,
roughly document size. Important to point out, no-SQL really doesn't have the capability to change data elements smaller than a document. I realize this is an evolving sort of technology, but it's not well suited. That's not its sweet spot. So there's where we live.
All right, we're done with that. Now let's go the technical part and hopefully the more fun part. So our scenario is we were just given a million JSON files and we want to know what to do with it. And by we want to know what to do with it, that's the universal sort of,
do I really need to go to MongoDB? So I just got two questions. Can our data set easily fit on one or a few machines? Heck yeah. And then how large are the business transactions? And the almost universal answer to that is, hell, I don't know. Answer to that is we're perfect for Postgres.
I point out the business transactions because we get the object relational mapping and you want to have a business transaction. If you're a Death Star app, you don't want to deal with all the individual pieces. But most people don't know all that stuff. And a no-SQL technology doesn't allow you to change individual pieces.
But Postgres allows you to either update the whole document or the pieces so you get more flexibility with us. All right, so quick review. Anyway, JSON, JavaScript object notation. We've all seen it before. Basically, we're looking at key value pairs. I'll show you in here, you can have arrays as elements within JSON.
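The whole-document-versus-single-element flexibility mentioned a moment ago can be sketched in SQL. The table name, column name, and track id below are hypothetical, assuming a one-column `songs(data jsonb)` table:

```sql
-- A NoSQL store would typically replace the entire document;
-- Postgres can do that too:
UPDATE songs
SET data = '{"artist": "Rick Astley", "title": "Never Gonna Give You Up"}'
WHERE data->>'track_id' = 'TRAAAAA0000000000';  -- hypothetical id

-- ...or surgically change one element with jsonb_set (9.5+),
-- leaving the rest of the document untouched:
UPDATE songs
SET data = jsonb_set(data, '{artist}', '"Rick Astley"')
WHERE data->>'track_id' = 'TRAAAAA0000000000';
```

Either granularity works, which is exactly the flexibility a document-only store can't give you.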
Everything we're saying would also apply to XML. Sort of conceptually, they're just different standards for documents. But XML is not nearly as cool. So it's not in here. I hope that captures the level of discussion of XML.
All right. So now the good stuff. What's in our dataset? I wanted to do this presentation with real data because I wanted to force index usage. I wanted to show you real data is always inherently flawed. It's just the way it works. That's a good thing for us. I want to show you what it's like to load. So there's the Million Song Dataset out there.
It's a publicly available project. They have a sample dataset, 1%, so it's roughly 10,000 rows. We got a track ID, an artist, a song title. And then there are two array fields, which we'll convert to hstore. One is similars. Think of it like Pandora.
So this song, this track ID is similar to this track ID with a certain weight. And then you have tags, which is this song is a rock song value 88. This song is alternative value 67. But we'll dive into that. So that's basically, that's our corpus of documents, our popular songs.
All right. So how do we do this? And by the way, we're gonna move through some SQL examples. We'll do them real time, but all of it's posted on the website. So don't worry about any particular syntax piece. Create a table, easy as pie. Just put them all in a JSON file. And we're gonna load them with the copy command.
That syntax right here, that is used to escape. I have embedded tick marks. Think of it as an artist with an apostrophe S. So that's real data. So that's ugly. Anyway, that gets around that problem. We're gonna use JSONB because we should. And that's enough of that.
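The table and load step just described might look like the following. The table name and the exact COPY options are my reconstruction, not the speaker's posted snippet; the control-character quote/delimiter trick is a common way to keep embedded apostrophes from breaking the load:

```sql
-- One jsonb column is enough; each JSON file becomes one row.
CREATE TABLE songs (data jsonb);

-- CSV mode with quote/delimiter characters that can't appear in the
-- data lets whole-file JSON documents (including artist names with
-- embedded apostrophes) pass through COPY untouched.
COPY songs FROM 'songs/TRAAAAA0000000000.json'   -- hypothetical file
  WITH (FORMAT csv, QUOTE E'\x01', DELIMITER E'\x02');
```

A relative path in server-side COPY resolves under the data directory, which is why the symlinks under PGDATA (shown next) are convenient.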
All right. So how do I load this stuff? It requires a little bit of Linux work, but not too bad. It's a zip file; I just unzip it. And then I want to create symbolic links to it underneath PGDATA. I'll show you why it's convenient for me. And I just did it in one xargs command.
Let me show you why. So we are, see, that's the last song. Anyway, it unzips to just a hot mess. It's A songs, A-A songs, A-A-B songs.
It's in this really nasty and nested piece. And then you just get at each piece a bunch of JSON files. And I turn that into symbolic links here.
I did that because, well, there are lots of different ways you could do this, but I want to do it within Postgres. I use the pg_ls_dir function, which, from within Postgres, gives you a listing of the files in a directory. And I use that to generate my copy command.
So in other words, what I want, the whole point of this is I want to be able to say I have a nasty hot mess of this complicated directory structure I want no part of. One Linux command gives me all the symbolic links, and a self-generating SQL command gives me all the loading stuff.
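The self-generating load can be sketched like this; the `songs` directory and table name are assumptions carried over from above:

```sql
-- pg_ls_dir() lists files in a directory (superuser-only; the path is
-- relative to PGDATA, where the symlinks were created). format() with
-- %L builds one properly quoted COPY statement per file.
SELECT format(
         'COPY songs FROM %L WITH (FORMAT csv, QUOTE E''\x01'', DELIMITER E''\x02'');',
         'songs/' || fname)
FROM pg_ls_dir('songs') AS fname
WHERE fname LIKE '%.json';
```

In psql, ending that query with `\gexec` instead of `;` runs each generated COPY statement directly, so one query loads every file.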
All right, so let's go over there. So there's that. We'll create a table. I'll show you here, if you haven't used it, an awesome command. It only works under PGDATA.
So we'll use this to, there's a copy command, control-C, file, control-V.
That's the problem with real-time demos. Anyway, what I wanted to show you: that's all I did. I didn't move files, I didn't copy files. Someone handed me a crap ton of an ugly setup of files in a directory I couldn't see from within PGDATA. Don't move anything, don't do anything. Just, nope.
All right, so let's look at one of those. So you'll see later, but it looks JSON-ish; that's the pretty-printed JSON. And how many rows do we have? All right, so we got 9,300 songs. Cool, all right.
So we've done that part. Now let's browse, because the first question we're gonna get when we get JSON files is: what's in them? And the answer is generally, hmm. So we're gonna use jsonb_object_keys. That gives you all the outer-layer tags in your file, so it's a great way to explore what you have.
Here's two operators to pull values out: the one with the two greater-than signs returns a text object; the other one returns a JSON object. You'll use the JSON-returning one if you wanna nest them. And this is tag matching. So this statement says, hey, give me all the records that have the Arctic Monkeys.
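The browsing operators just described, sketched in SQL (the table and key names are assumptions based on the dataset, not the speaker's exact snippet):

```sql
-- jsonb_object_keys() lists the top-level keys: a quick way to see
-- what is actually inside a pile of JSON you were just handed.
SELECT DISTINCT jsonb_object_keys(data) FROM songs;

-- ->> returns text; -> returns jsonb, so it can be chained to drill
-- into nested structures.
SELECT data->>'artist' AS artist,
       data->'tags'    AS tags      -- still jsonb
FROM songs;

-- Simple tag matching on a top-level key:
SELECT data->>'title'
FROM songs
WHERE data->>'artist' = 'Arctic Monkeys';
```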
So we'll show you this in SQL to be easier. All right, so we got our 9,300 files, so. All right, so what's in our JSON files? There we go.
Those are all the tags. Artist similars is like the Pandora piece. Hey, let's look at how many songs by each artist. Anyway, this is our data set. It's randomly chosen, so I make no comment on that one. But you can see how many different songs from each artist.
And I'm gonna show you some of the ways to index this, to make it faster. But to get fast, first you have to start slow. So this is how many songs: there are four songs by the Arctic Monkeys. One thing I wanted to point out, that is the rubber-mallet search technique, which is convert to text and use an ILIKE.
But it does work. So I'm gonna show you two other options. So here is where, if the value of the artist tag is the Arctic monkeys, there's our four songs, and then the same piece.
What we're looking at is right here: I have a JSON tag. And what I wanted to show you, it varies each time, but this is roughly 90 milliseconds. So you have to trust me on this one, okay, 83 milliseconds, and it doesn't use an index. But that stinks. I want it to go more better faster.
So create an index. One thing I wanna point out, no thought involved in this. It's pretty easy. One index, the whole JSON object, and what I wanna show you is, trust me, it does not speed up the text search. But look what happens here when I use,
so what we're doing in this query here is we have an index on our JSON object. I don't care which tags, they're just throwing index at the whole darn thing. And I wanna say, show me everything that has that tag right there. Wicked fast.
It went from roughly 80 something to .8. So 99 times faster, no thought process, very easy, easy to maintain. So we'll just play around for a little bit. We'll stroll through our data. That question mark operator there is,
that says, show me all records that have this tag. Our data set, they all have the same tag, but that's not required. It's easy to say, hey, give me all the records that have this tag of interest or that tag of interest. And that just says, you know, what are the artists? All right, so let's look for a specific song right here.
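The index and the two index-backed operators just walked through can be sketched as follows; index and table names are my own, and the ~100x figure is the speaker's demo timing, not a guarantee:

```sql
-- One GIN index over the whole document; no per-key decisions needed.
CREATE INDEX songs_data_idx ON songs USING gin (data);

-- The containment operator @> can use that index, which is where the
-- roughly-hundredfold speedup in the demo comes from:
SELECT data->>'title'
FROM songs
WHERE data @> '{"artist": "Arctic Monkeys"}';

-- ? asks "does this top-level key exist?", also index-supported:
SELECT count(*) FROM songs WHERE data ? 'similars';
```

Note that the plain `->>` text comparison shown earlier does not use this index; only the jsonb operators (`@>`, `?`, `?|`, `?&`) do.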
You can match by track ID. We've been Rickrolled. You knew we weren't gonna get through a music presentation without that, but c'est la vie. All right, yeah, I can't imagine it was an accident, but anyway, nobody's safe from Rickrolling.
So let's look at the tags right here. Let's explore this. Because we know it's one tag, but it's really a bunch of pairs. So what we're seeing is Rick Astley is 100% 80s music. Can't argue with that. And because this is real data, it means you get things that make no damn sense.
It definitely is not brutal death metal, but we'll accept that as it is. Again, that's the advantage of real data is that, do you see, oh crap, where did I get it?
So let's blow that up a little bit. What I wanted to show you: with one operator, I pulled what had been an array of key-value pairs out into rows. The operator's built in. So here are all my different scores. You can see it has, we'll get to it later,
but it has the structure of an hstore. Again, arrays are perfectly valid to stick within JSON. Hey, but now that we know these are arrays, we now have access to a whole bunch of array operators. So what we're seeing here is pulling out the zeroth element, the first element, and there's the second one.
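The unnesting step just described might look like this; the field names follow the dataset layout described earlier (each tag is a `[name, score]` pair) but are still my assumption:

```sql
-- jsonb_array_elements() turns the tags array into one row per pair;
-- -> 0 and -> 1 then index into each pair (zeroth and first element).
SELECT data->>'title' AS title,
       tag->>0        AS tag_name,
       tag->>1        AS tag_score
FROM songs,
     jsonb_array_elements(data->'tags') AS tag;
```

The implicit lateral join lets each song's own `tags` array drive the row expansion.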
So what I'm gonna do is move it over a little bit. What you're seeing here, think of this: you gave me a file. I say, oh, here's an easy way to explore all the tags on the outside, but it's not fast enough. Easy way to index it, no thought on that one. I see one of those pieces is pretty complicated: it's an array of key-value pairs.
Now I can put them, with one operator, into different rows, and I can pull different pieces out of it. What we're getting at is you can present this however you want to. So here's another logical piece from that. Let's say I want all the rock songs, and really, what we're saying there is
every song with a tag of rock. None of it's optimized, so it's a little slow. What you're seeing here is everything that had, in the tags array, a tag of rock. But I'm not really happy with that,
because some of these aren't, in my judgment, really rock songs. Let's change that. What about really rock songs? And what that says is: all right, songs that are at least two-thirds rock. So it needs a rock tag, and then the second element of the pair needs to be at least two-thirds, so these should be really rock songs.
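The "really rock" filter just described can be sketched like this; the two-thirds threshold (67 on a 0-100 scale) and the pair layout are assumptions drawn from the surrounding discussion:

```sql
-- A song qualifies if some tag pair is named rock AND its score is
-- at least two-thirds of the maximum.
SELECT DISTINCT data->>'artist' AS artist,
                data->>'title'  AS title
FROM songs,
     jsonb_array_elements(data->'tags') AS tag
WHERE tag->>0 = 'rock'
  AND (tag->>1)::int >= 67;
```

Everything here is still plain SQL over the raw JSONB; nothing was restructured ahead of time.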
See what I'm getting at? So now, you have wherever the value is higher. So we're deep into our JSON B array. We're pulling out individual, pulling out and comparing individual values
of this array of key value pairs, all with SQL. All right, and again, the syntax is all posted on that. All right, but let's switch gears now. What if we're working with Hibernate?
Sorry, it gives me a tickle whenever I say the word. An instinctive reaction we'll all get. There are some technologies and a crap ton of old code that need you to present this information in a relational form, and we said we need to integrate it,
so let's explore that. How would I take these JSON files I loaded pretty easily and make them look like they're just a normal relation? You can create a view; we'll also create a materialized view if you want to, and then it will look, to the outside world, just like a table. And the key piece in this one is,
I'm not gonna do anything ahead of time, and you can do this just by exploring your data. All right, I'll go back. So all right, here's a simple one: track ID, artist, and song. Now, same data, but it looks like a table.
Again, I didn't create anything; I'm presenting it. So this allows me to integrate with my API, with other tables, whatever I want. All right, if I want to index it or do other things, a materialized view. And if I had that,
I would definitely index on track ID, because the intention would be, I want it to behave just like a traditional table. That's all you need to do. You can also refresh the materialized view whenever your JSON files change.
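The view-then-materialized-view path just described, as a sketch (object names are mine):

```sql
-- A plain view: the JSON now presents as a normal relation.
CREATE VIEW songs_rel AS
SELECT data->>'track_id' AS track_id,
       data->>'artist'   AS artist,
       data->>'title'    AS title
FROM songs;

-- A materialized view can carry a real index, so to the rest of the
-- application it behaves like a traditional, well-indexed table.
CREATE MATERIALIZED VIEW songs_mat AS
SELECT * FROM songs_rel;

CREATE UNIQUE INDEX songs_mat_track_idx ON songs_mat (track_id);

-- Re-sync after the underlying JSON changes:
REFRESH MATERIALIZED VIEW songs_mat;
```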
So, all right, so let's look at this right here. Crap, they got us again. Anyway, my point on this one is we made a wicked fast query,
less than a millisecond, on 10,000 rows of data that I imported 10 minutes ago as JSON files, didn't copy them anywhere in my operating system, and now it looks to all the rest of my application in the world, just like a regular relation, a table,
well indexed, all that stuff is there. So, let me jump a little bit farther. What I want to get at on that one is you're not really giving up much by keeping it inter-relational. In other words, the fact that you have it in Postgres, we have the operators to explore the JSON, but you also have this extra stuff
that you can't do on NoSQL database. All right, but let's jump back to the slides. So those two fields, those two pieces of our JSON file, the tags and the similars. So remember, the similars is where you want, think of Pandora, you're getting relationship.
Those really look like key-value pairs, don't they? So let's walk down the hstore path. Maybe I can present this data, instead of as a traditional table or relation, using hstore. So I'm going to create this songs table. I'm going to say, hey, I know that track ID, artist, and title are all there,
so I'll make those regular columns. Makes sense? But I'll have an hstore for here's all your similars, and an hstore for here's all the tags. And I will use only jsonb and hstore operators to do the conversion.
So no Python, no code, no PL/pgSQL, nothing. Just pure SQL-fu. And we're going to do it, remember we had jsonb_array_elements? You can pull out the pieces. And you have hstore; hstore is both a data type and a function,
unfortunate, but it is what it is, and array_agg. So we're going to bust them all apart from the JSON with jsonb_array_elements, put them together with array_agg, and then make them into, excuse me, an hstore. All right, this requires the hstore extension.
It's already there. Here's my table. A little bit of SQL-fu, but not too bad. You'll see I had to use a common table expression,
and that's just a syntax piece: I had to put the breaking-apart with jsonb_array_elements into a common table expression. But anyway, you'll see.
And again, none of this is optimized. All right, so we have 4,800 rows. And I should have pointed out, not all songs in our dataset have both of these fields, which is also a sign of real data: it's not uniform. All right, so, let me make sure.
Let's see, which one did I do? Now five. Sorry, I'm running out of screen real estate. All right. Cool.
We got some tags. That's hstore. This is Stream 50, hip hop, 60s. So that's exactly what we had hoped for. And then, check out our work. And then we want to do the same thing with similars.
The SQL got a little weird on me for that one. And by weird, I just mean overly complicated. So I created a view, and this is just sort of an artifact to make the SQL a little simpler to walk through. But I'll show you where this goes in the end. So anyway, uh oh.
I hope we're not out to lunch on this. Anyway, what I wanted to say: this is what the similars piece is. It says that this track ID is similar to that track ID with that weight. Does that make sense? Sort of a traditional graph piece.
This last piece right here brings it all together. And then we will check out hsongs as soon as that comes back from being out to lunch. What do I want to get at here? All right, so we have rows,
and not everything has similar songs. That's what we would expect. All the ones in our set have tags, so we get tags in rows, and we've now built our same data. We've converted it to an hstore model.
All right, which brings us into our next section. This gives us a whole bunch of new operators and functions. Right here, that's the hstore operator for, hey, give me everything with a tag of latin. This is sort of the equivalent that says, I want really Latin music, which is a tag of latin and a value of over 67.
So we'll check those. The first thing is indexes, also simple. In this case, I would index track ID and the two hstore columns. I'm showing there are sort of reasons to choose GIN
versus GiST; I'm putting them out here for context, they both work fine. So let's look, we'll just jump into really Latin songs. Less than a millisecond on this one, by the way. Again, no thought to indexing other than I put an index on my hstore column.
No more thought process than that. Less than a millisecond, I get all my really Latin songs, and I didn't even know what my data set had. All right, yeah, you can see it: again, I chose an arbitrary value, it has to have that tag with a value over a threshold. Pretty powerful stuff that was pretty easy. And again, on the optimization piece,
the answer is, I don't know, I just built an index on it and it seemed to work well, which is what we're after. Simple, simple, simple. All right, so here is another sort of question. What are all the tags on all the songs by the Arctic Monkeys?
God only knows. You'll see, I like this piece about the real data. It stinks, it's got problems, but that's real life for us. In other words, all the syntax examples from the publications don't work, or they're too simple. Anyway, that's what I wanted to show you. This is every tag on every song by the Arctic Monkeys.
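The hstore queries walked through here would look something like the following. The column names, the tag value scale, and the `hsongs` table are assumptions carried over from the earlier conversion sketch:

```sql
-- "Give me everything tagged latin": ? tests for key existence.
SELECT track_id, artist, title
FROM   hsongs
WHERE  tags ? 'latin';

-- "Really Latin": the tag exists AND its value is over 67.
-- -> pulls the value out as text, so cast before comparing.
SELECT track_id, artist, title
FROM   hsongs
WHERE  tags ? 'latin'
AND    (tags -> 'latin')::int > 67;

-- Every tag on every Arctic Monkeys song: skeys() unnests the keys.
SELECT DISTINCT skeys(tags) AS tag
FROM   hsongs
WHERE  artist = 'Arctic Monkeys';

-- A GIN index makes the existence/containment operators fast.
CREATE INDEX ON hsongs USING gin (tags);
```

As the talk says, no tuning beyond the one index; the existence operator plus GIN is usually enough.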
All right, so we just did JSON, which is document store. We pooh-poohed XML, because it's easy to kick it around, but actually all these pieces really apply, with slightly different syntax, to XML. The functions are similar and it is still a valid and useful piece.
We converted that same data set internally over to hstore. Now the last piece I promised to show you was really a graph database. We have similars, so let's see if we can't do that. Neo4j is the market leader in this segment of NoSQL data. I wanted to point out, sort of in fairness,
Postgres and its relational data are suited for graph-lite. Neo4j is a really, really powerful piece, but we're gonna show you here, accepting a little bit of crappy performance,
we're gonna go five links on our graph. That's five degrees of separation; that would at least get us close to Kevin Bacon. So, with that proviso, all right, we're gonna use a recursive query, the recursive format. I took it right out of the book,
or excuse me, right out of the Postgres documentation. I'm gonna filter our data set a little bit. I'm gonna limit it to only rock songs, and to rock songs that have a strong relationship. Because what you're seeing in the tags is that everything is related to every darn thing, and I really wanted to filter our data set a bit to make it more manageable.
So we're going to answer that burning, burning question: is there a path of related songs from Lady Gaga's Poker Face to Justin Timberlake's What Goes Around Comes Around? Now remember, you don't get this kind of insight from MySQL.
I just wanted to make sure. I just wanted to make sure we know where our bread is buttered. All right, so, a little more SQL here. I created a table, strong_rock_links,
and what I'm saying here is, hey, where the weight is greater than 0.15 (it's just a different scale; the link weight goes from zero to one), and where the tag is rock. Easy. So I got 16,000 rows.
I didn't know what I needed an index on, but it seemed like the ID and the link would work. It turns out they're actually not used much. And this is what that table looks like. So Stevie Ray Vaughan's Crossfire is related to a track with this track ID and this weight.
Make sense? And they're all related to themselves with a weight of one. So you sort of get, all right. Now, it's gonna run a long time, is what it is.
I'll tell you about this. I put in the ending song right here, Lady Gaga; I seeded it with that, because I didn't want to ask for all relationships, because otherwise it's a cross join. And I followed the example in the book to make sure that it doesn't cycle. Recursive queries can cycle.
A lot of really smart people wrote the documentation; they can tell you all about it. And I chose ahead of time a depth of five, which just reduces my filter set, because there are other relationships from Lady Gaga beyond a depth of five. And what you're gonna see,
it produces, and we'll run it when the query's there, it goes song by song. I also wrote a subquery, or a different query, to say, hey, where do all the relationships come from?
And that is that piece. This is, I think, what they call awkward. You're gonna have to trust me on this one. All right, maybe we'll come back later. Nope, I don't want to give away the surprise.
Yes, recursion is expensive. And that's the slowness on this; that's what I wanted to say. Postgres is a very, very good JSON, XML, hstore place. It is an acceptable recursive place. But the good thing is, a lot of data sets are fine with just acceptable.
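The recursive walk sketched below follows the cycle-safe pattern from the Postgres documentation the talk refers to: carry the path as an array, flag a cycle when a node repeats, and cap the depth at five. The seed and target track IDs are hypothetical placeholders, as the real IDs aren't shown:

```sql
-- Sketch: five-degrees-of-separation walk over strong_rock_links
-- (track_id, similar_id, weight). 'TRPOKERFACE' and
-- 'TRWHATGOESAROUND' stand in for the real track IDs.
WITH RECURSIVE walk(track_id, path, depth, cycle) AS (
    SELECT l.similar_id,
           ARRAY[l.track_id, l.similar_id],   -- path walked so far
           1,
           false
    FROM   strong_rock_links l
    WHERE  l.track_id = 'TRPOKERFACE'         -- seed: Poker Face
  UNION ALL
    SELECT l.similar_id,
           w.path || l.similar_id,
           w.depth + 1,
           l.similar_id = ANY (w.path)        -- revisiting a node? cycle
    FROM   strong_rock_links l
    JOIN   walk w ON w.track_id = l.track_id
    WHERE  NOT w.cycle
    AND    w.depth < 5                        -- five degrees of separation
)
SELECT path
FROM   walk
WHERE  track_id = 'TRWHATGOESAROUND';         -- target: What Goes Around
```

The `path` column is exactly the array of track IDs described later: the list of hops from one song to the other.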
So, what we're gonna see is, we would have, let me just copy, that didn't work. That, the output of that, that will be, anyway, I forget that.
What you'll get is the relations between the five steps. That leads us to this insightful business knowledge: Justin Timberlake is related with a value of 0.28 to the Black Eyed Peas, who are in turn related with a value of 0.1
to Britney Spears, who goes down to Rihanna, and all the way to Lady Gaga. My point in this trivial example is that's not an easy piece of data to get at. I mean, it sounds goofy, but that's five different layers
in a couple-of-ten-thousand-row record set that, on a VM with no optimization, didn't come pretty, and certainly didn't come in web-click time, but it did come in an acceptable amount of time. Make sense? All right, so that's our, let me see,
I think I dorked that slide up, but anyway, what you get in that path column is an array, a comma-delimited list of the path you walked to get from one to the other. All right, so let's bring it home. Got a couple things. One is the assertion of the presentation,
take it or leave it, but my experience is there's a bit of truth to it: NoSQL is here to stay. It is a pointless and ultimately self-defeating exercise to say how much we hate NoSQL and we want it to go away, and we've seen other contenders and this one's no different
and relational will slay this one too. There are lots of arguments that NoSQL is different, that it will be here to stay, but the other side of that is it's not gonna kick us out. We need to play nice. There are five-ish types of NoSQL databases. Postgres plays well with four of them.
You didn't see the search-engine piece, but we have really a heck of a capability with Postgres full-text search. Really, really nice. Ranked results and all that sort of stuff. So we play really well with document and key-value pair, okay with graph, and really well with search engine.
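The full-text piece mentioned in passing looks roughly like this; the `hsongs` table is carried over from earlier as an assumption, and the search terms are arbitrary:

```sql
-- Minimal sketch of ranked full-text search over song titles.
SELECT title,
       ts_rank(to_tsvector('english', title),
               to_tsquery('english', 'love & song')) AS rank
FROM   hsongs
WHERE  to_tsvector('english', title) @@ to_tsquery('english', 'love & song')
ORDER  BY rank DESC;

-- In practice you'd precompute the tsvector in a column or an
-- expression index:
CREATE INDEX ON hsongs USING gin (to_tsvector('english', title));
```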
Summary two. You can load NoSQL data, but it doesn't cut you off from the relational world. Does that make sense? You don't have to choose either JSON or a table. Postgres gives you both in 30 minutes.
Almost no other technology does that. I can't say officially nothing does, but really that's a pretty remarkable piece, because it means you have a whole lot of embedded code that you don't need to throw away. And most real, no-crap data sets fit; you really have to go above a bunch of terabytes to make this impractical.
Our sweet spot: data that fits well on one or maybe a few machines. Transactions, either on the whole document or a piece within the document. You want to enforce some referential integrity. Remember, in loading the JSON files I didn't enforce any there; that's sort of your business choice,
but we have that option. And you want to find relationships within the data. That's a powerful sweet spot. I want to say most organizations fit these four criteria; call that real data. Do you know what I mean? Google, Yahoo, Salesforce.com: great headlines, wonderful pushing of the boundaries of technology,
but they're also edge cases for most of us. Postgres plays nice with what's in the offices of at least the majority of us. And I'm asserting, if you're buying it, leave the edge cases to NoSQL. It is not useful for Postgres to say "we can do that too" to everything, because we already do so much so well.
And then I guess the last piece is that the rules are more complicated now. It's too simplistic to continue down the thought process of saying, hey, if it's not third normal form, you're an idiot and it's wrong. I don't know how much of that is still out there. I run into it on occasion.
I think the world has changed. We have to expand beyond that. We have more capabilities in Postgres than the other guys; let's leverage them. Another implication is you can't do that unless you know your business needs and your data. The data alone won't explain how to use it.
You have to invest more in that other stuff. I know it's not pretty. It's not even an appealing answer at a very technical conference. But I think it's the answer we're stuck with. If you're going to leverage Postgres's NoSQL capabilities, you've got to invest some in how the business uses it and what the structure is. And my objective, again, remember,
two kids in college, one in a Catholic high school. I need this thing to thrive. It can't just get by. We have to kick some of these other guys' arses. And that doesn't happen if I either say, no, we've seen a lot of other contenders, NoSQL will go away; or we say we can do everything,
Postgres can scale to 10,000-node clusters and things like that. Smart people may disagree. I don't think that's our sweet spot. Questions, follow-up? Go ahead. Oh, okay.
Yeah, yeah, yeah, yeah, yeah. And then Dijkstra's algorithm.
Okay. It got around the recursion. And it was only 100,000 times slower than their product. Yes, yeah, but I like that. It's within reason, even with a full movie database. I like to hear that, because a lot of real problems are manageable like that. And it seems Lady Gaga sort of trivializes the example,
but it's not so easy. Any other questions? Anyone using JSON? Anyone using this stuff? Good. I would assert that at next year's conference we'll see more hands going up. Go ahead.
Yeah.
Foreign data wrappers. Multi-part question.
Unfortunately, a lot of my answers are "I don't know." In terms of updating JSON elements, Postgres 9.5 is just around the corner. Clearly there's a huge sale going on next month, so you're not gonna buy that now. 9.5 has a lot of that stuff you're talking about.
I have not integrated with MongoDB; we're playing with it at home. I have used Python and Google's address lookups. When you look up an address with Google search, they'll give you back a JSON object. And I used the Python interface and it worked great. I just plugged in my JSON object natively.
With the noted exception: every real piece of data has embedded field delimiters. I didn't know about that copy syntax, so I had to work my way around it. But I have fed from a JDBC-type source (it was Python, not Mongo) and loaded it indirectly. And I'm happy to talk to you after about that.
I didn't use it, it wasn't in this conference or in this presentation, it's sort of a side project. But Python talking to Google to get an address, I pull it back as a JSON object and I store it in Postgres. It worked. No problem at all. So it's possible to plug in the external driver, basically? Yes. Any other language? Yes, yeah, my assumption is if it works with Python,
it'll work with a bunch of other things. I did that for the obvious reason, because it's easy. Anything else? Perfect. So that's polyglot persistence.
That's saying Mongo can come to our party, but we're still the party. Because there's still pieces that Postgres can do that the other ones can't do. Any other stories from the, go ahead. Quickly.
Yes. "I'm gonna do the right thing. I'm gonna build this with SQLAlchemy." A week later, you're still looking at the documentation. And they feel like, you know, the Mongo API is simple, right? So, no, that was the critique of
object-relational mapping; I can get a rash if I say the Hibernate word. It's real. And students are being taught it, so new students coming out of college have it. I don't know what to say about that one.
So, one person's experience: all I do for that is I only talk to views whenever I code an application. That's the only thing. That is my object-relational mapping. But I am very disciplined. I never, ever talk directly to a table, because I'm always wrong when I design it in the beginning.
My methodology is to assume I'll do it wrong the first time. That's all I can recommend. I also learned that from Oracle apps; I worked a lot with them and they do the same. Yes, yes. Anything else? Good.
But, so this presentation is on the schedule. Okay. The SQL file is broken on the schedule? Okay, thank you, I'll talk to Dan on that one,
because you need that file. Please download it, please tell your friends. It took me a while; some of the syntax is awkward. And part of that is because, you know,
ANSI SQL doesn't cover this sort of stuff. So we're all in our own wilderness, but it is possible. Shamelessly copy my code, because I shamelessly copied other people's. And it works. And the other piece is going into those arrays. Postgres rocks at arrays. Remember the tags and the similars pieces: we're home once we get to arrays.
So in other words, we went with JSON and we said, hey, tags, that's just an array of key-value pairs. That's my good place. I know how to work with arrays in Postgres. I can pull out individual elements and all that stuff. That's goodness. Anything else? Put a fork in us, we're done.
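The array operations leaned on throughout the talk boil down to a handful of idioms; the values here are arbitrary illustrations:

```sql
-- Element access is 1-based.
SELECT (ARRAY['rock','pop','60s'])[1];            -- 'rock'

-- Array to rows, rows to array.
SELECT unnest(ARRAY['rock','pop','60s']);         -- three rows
SELECT array_agg(x) FROM generate_series(1,3) x;  -- {1,2,3}

-- Membership test, as used for cycle detection in the recursive query.
SELECT 'TRAAA' = ANY (ARRAY['TRAAA','TRBBB']);    -- true
```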