
XML, HSTORE, JSON, JSONB - OH MY!


Formal Metadata

Title: XML, HSTORE, JSON, JSONB - OH MY!
Alternative Title: HSTORE, XML, JSON, JSONB OH MY!
Number of Parts: 31
Author: David Wheeler
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place: Ottawa, Canada

Content Metadata

Abstract
Use the unstructured data type that's right for you. PostgreSQL 9.4 adds a new unstructured data type, JSONB, a variant of JSON optimized for storing objects. JSONB complements the existing XML and JSON document objects, as well as HSTORE. Which one is right for you? We'll take a tour of the features of each, and their advantages and disadvantages for various use cases, all illustrated with real-world examples. There has been a lot of work on the representation of unstructured data in PostgreSQL, culminating in the addition of the JSONB type in the forthcoming 9.4 release. JSONB complements the existing HSTORE, XML, and JSON types, not to mention arrays. With so many options, which do you use? As usual, it depends on your use case. In this presentation, we'll review the unstructured data types in PostgreSQL and look at their advantages and disadvantages for: document storage, configuration management, a "schemaless database", object serialization, entity/attribute/value models, and path queries.
Transcript: English (auto-generated)
Teodor tells me it's time to start, so I guess we will. We all know how well the subways run in Moscow, so I'm just going to follow that example and try to be as on time as possible. My name is David Wheeler. I'm here to talk to you about scary all-uppercase-named data types.
Now there are several semi-structured or unstructured data types that you can use for your columns in Postgres. Probably all of you know about them. There's XML, hstore, JSON, and in 9.4 we'll have JSONB. So you may be wondering to yourself, self, which should I use?
When should I use it? Why? So here now is your brief tour of my opinions.
Let's start at the top. XML was added to Postgres in 8.2. My slides are posted, by the way; they're on Speaker Deck, and there's a PDF linked from the talk page that you can download, so you can take a screenshot. So anyway, it was added in 8.2 as an XML data type, intended to support things like publishing and exporting database values, designed for SQL:2003 conformance, and it has XPath query support built in. The implementation is basically that the XML is validated on input and stored as text in the database. There are no comparison operators; there's no equivalence for XML, no greater-than or less-than or any of those sorts of things. So as a result, there is no native indexing of XML data types. You're essentially putting a text blob in there, and that's what you get. However, you can cast it to text if you need those kinds of comparison operators or indexing.
But even better, you can use XPath and expression indexes, which I'll get back to a bit later. So I had never actually messed much with the XML support in Postgres until I started preparing for this talk, figuring I should probably be comprehensive.
And I was surprised that XML support itself in Postgres is pretty comprehensive. To start with, you have a slew of XML generation functions: xmlelement, xmlattributes, xmlcomment, xmlconcat. There's xmlforest, which strings a bunch of values together into one XML value.
Here's an example where we can use nested function calls. Here we're getting an element named foo, right? This is the funky SQL-standard syntax, right? Its name is foo, and it has the attribute bar with the value xyz, right? That's what it ends up with here. Then xmlelement with the name abc, a comment, another element, and an element within that, and you get back the nicely formatted XML there. So there are pretty good generation tools in the database itself.
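A minimal sketch of those generation functions; the element and attribute names are just illustrative:

```sql
-- Attribute "bar" with value "xyz" on an element named "foo":
SELECT xmlelement(name foo, xmlattributes('xyz' AS bar));
--  <foo bar="xyz"/>

-- Nested calls: an element, a comment, and elements within elements:
SELECT xmlelement(name abc,
                  xmlcomment('a comment'),
                  xmlelement(name def,
                             xmlelement(name ghi, 'text')));
--  <abc><!--a comment--><def><ghi>text</ghi></def></abc>
```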
The XML support also comes with a number of predicates. There is the IS DOCUMENT predicate, which tells you if you have an XML document, which I think means it has a doctype header. There's the xml_is_well_formed function, which is more generally useful: it tells you whether your XML is essentially valid and well-formed.
There's XMLEXISTS, which uses an XPath expression to see if a value exists in an XML value. Here's an example of it: we're passing the XML towns with the town Portland and the town Ottawa, and we use an XPath expression that says any town, anywhere in the XML, whose text value is Ottawa, and it returns true. After last night, some of you may not be sure, but Postgres also has a custom xpath function,
which is probably a little more of the syntax you might expect for Postgres functions. Here you pass an XPath expression as the first argument and your XML to evaluate as the second argument. And it returns a text array of all the values that match the XPath expression.
So here I'm saying: for all the a tags under the p tag, I want to get the text. And so we're getting "test" here and "me", and that comes out in the array here. Similarly, there's xpath_exists, which returns true or false depending on whether the expression you're looking for exists within the XML you're looking at. Both of these functions support namespacing. So if you have namespacing requirements for your XML, maybe you have XHTML embedded in your Atom feed or something, each takes an optional third argument with an array of namespace assignments.
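Reconstructed from that description, the calls look roughly like this (documents abbreviated; they stand in for whatever was on the slides):

```sql
-- Does any town's text equal 'Ottawa'?
SELECT xmlexists('//town[text() = ''Ottawa'']'
       PASSING BY REF '<towns><town>Portland</town><town>Ottawa</town></towns>');
--  t

-- xpath() returns an array of the matching values:
SELECT xpath('/p/a/text()', '<p><a>test</a><b>skip</b><a>me</a></p>');
--  {test,me}

-- xpath_exists() just returns a boolean:
SELECT xpath_exists('/p/a', '<p><a>test</a></p>');
--  t
```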
One interesting feature is table mapping. We have functions called table_to_xml and query_to_xml, and a few others along those lines. In this example, I'm just running a simple query that gets data out of pg_catalog, and it shows you how it dumps it out. It uses the XMLSchema-instance namespace (xsi), which I think is specific to databases, and it just gives me a table element with rows, and each row has the column names with the values I'm extracting right out of the database. So this is all built into Postgres.
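As a sketch, assuming a pg_catalog query along the lines of what was on the slide:

```sql
SELECT query_to_xml('SELECT relname, relkind
                       FROM pg_catalog.pg_class
                      LIMIT 1',
                    true,   -- represent NULLs with xsi:nil
                    false,  -- one <table> element rather than a forest
                    '');    -- no target namespace
--  <table xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
--    <row>
--      <relname>...</relname>
--      <relkind>...</relkind>
--    </row>
--  </table>
```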
You can export to XML right now, today. I did it yesterday. Now, querying. I grabbed all the data from the U.S. Patent Office for grants from this past January, which I think was about 26,000 documents, and loaded it up into a database to run some XPath queries against it, and I wanted to see how well it could perform. So here's what it looks like. It's just a very long XPath expression, but basically I wanted to see a list of all the examiners, the text of their last names, no matter what kind of examiner they are; the star means any sub-element there.
With that, I can say: okay, show me a count of how many documents in the grants database have the two reviewers, examiners excuse me, Brooks and Roxa. And passing that query returns the array as we saw before, so we're just comparing arrays. It finds four documents reviewed by those two examiners. But it takes a little while to run this query: 37 seconds. So one thing we can do to get around that is we can index this column, or excuse me,
index an expression of this column. As I said, you can't index the XML itself, but you can use XPath to extract values and use that in an expression index. So here, I'm running basically the same function as before and casting it to a text array
and indexing on that text array. So with that in place, I run the same query. We're now down to less than a millisecond of performance, which is pretty good. However, you can only compare scalars to scalars in this way.
So we had an array there, and I had to compare the whole array to the whole array. If what I want to know is how many documents there are where any of the examiners is Brooks, we end up doing a table scan, and we're back to 40 seconds of run time here
because it's not able to pull apart the index of the array value to compare it to a single string value. So yeah, it gets a bit nasty. So if you have a very specific need when you're querying your XML, say you know exactly one value you want to get, and any document has only one value,
so it pays to understand the structure of your documents, you can index that instead. So here, instead of getting any examiners, I'm just getting the primary examiner. There should be only one primary examiner per document. And so I'm indexing on that. So this is executing the XPath query, and it's just getting the first element out of the array
because there's only going to be one. This allows me to get 12 back. So there are 12 documents where the primary examiner is Brooks. And of course, we can index that single expression here by creating an index on just the single element of the array,
and now we're back to pretty good performance around a millisecond for that. So you're quite limited in the kinds of indexing you can do to speed up your queries, but if that happens to match the requirements of the way you want to look at your documents, it can work very nicely.
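The pattern looks something like this; the table, column, and element names are my guesses at what was on the slides, not the actual patent schema:

```sql
-- Index the whole array of examiner names (only whole-array
-- comparisons can use this):
CREATE INDEX grants_examiners
    ON grants ((xpath('//examiners/*/last-name/text()', doc)::text[]));

-- Better: index a single scalar, here the one primary examiner:
CREATE INDEX grants_primary_examiner
    ON grants (((xpath('//primary-examiner/last-name/text()', doc))[1]::text));

-- A query with a matching expression can then use the index:
SELECT count(*)
  FROM grants
 WHERE (xpath('//primary-examiner/last-name/text()', doc))[1]::text = 'Brooks';
```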
I just wanted to point out, though, we're just getting the single text value here to index rather than the array of values. So when should you use XML? You should use XML when XML is required. For example, if you have existing documents, you need to support a SOAP API or something.
Maybe you have a bunch of XHTML you want to be able to have in your database. Maybe you have a bunch of Maven or ANT configuration files, God forbid, that sort of thing. It is decent for document storage, for storing a whole document.
And so you have fast IO, so getting data in and getting data out is pretty efficient because essentially on the way in, it's just validating the XML and then storing the text value. And on the way out, it just has to dump the text value. And XPath is awesome. I love using XPath to query XML stuff when I'm forced to use XML.
But it's going to be best performing, obviously, when you have scalar values that you know in advance you need to get well indexed. Otherwise, you need to accept that you are going to run table scans. And for your particular application, perhaps, that's okay. But do know that every XPath function call against an XML document is a new parse of that document and then an execution of the XPath query against it. So you incur the overhead of re-parsing the XML document every time you want to query it.
You want a photo, Oleg? So now let's talk about hstore. hstore was also added in 8.2. Its structure is simply key-value pairs, where the keys and the values are only allowed to be strings. There's no data type support. There's also no nesting support, so you can't have a value be an hstore object itself. But it does have a slew of very useful operators for querying the values in an hstore value.
And it supports GiST and GIN indexing, or GIST and GIN, however anybody wants to pronounce them. In 9.0, Andrew Gierth submitted a patch that improved hstore quite a lot, including a slew of new operators and an increase in the capacity of an hstore value. So it could store a lot more data, I think up to the size of whatever is TOASTable. And it added B-tree and hash indexing, so you actually have equivalence operations. The syntax: well, first you have to create the extension. This is a contrib module that ships with Postgres.
So you always need to create the extension before you use it. CREATE EXTENSION, of course, was added in 9.1, I think. Before that, you would need to load the SQL file directly. The syntax is inspired by Perl hashes: you have key-value pairs, with a fat arrow between each key and its value and a comma between the pairs. And you simply cast your text expression to hstore. There's one special value you can use in hstore, and that's NULL. So you can have strings and you can have null values, although I don't believe you can have null keys, right?
Yeah, that would be goofy. So, like I said, it comes with a slew of operators. The arrow operator allows you to fetch the value for a key. So here I'm fetching the user key and getting back Fred.
You can also use the arrow operator with an array of keys, and it will return all the values for those keys in an array. It also has the containment operators, including this one that says, basically: are all the pairs in the hstore on the right also in the hstore on the left? It compares full key-value pairs; it's essentially a subset test. So the left side not only has to have the id key, it has to have id with the value 1 in order to match.
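For instance, a minimal sketch:

```sql
CREATE EXTENSION hstore;

-- Cast text to hstore; fetch a value with the arrow operator:
SELECT 'user => fred, lang => NULL'::hstore -> 'user';
--  fred

-- Containment: the right side must match key *and* value:
SELECT 'id => 1, user => fred'::hstore @> 'id => 1';   -- true
SELECT 'id => 2, user => fred'::hstore @> 'id => 1';   -- false
```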
There are many more. There's a concatenation operator, and the question mark operator to see if a particular key exists in your hstore value. Then there's question mark ampersand, where you pass an array of keys to see if all of them exist in the hstore, and question mark pipe, which asks whether any of the keys you specify exist. I knew I should have put in examples. There's also a minus operator, which allows you to delete keys.
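Since I knew I should have put in examples, here's roughly what those look like:

```sql
-- ?& : do ALL of these keys exist?
SELECT 'a => 1, b => 2'::hstore ?& ARRAY['a', 'b'];   -- true

-- ?| : does ANY of these keys exist?
SELECT 'a => 1, b => 2'::hstore ?| ARRAY['b', 'z'];   -- true

-- -  : delete a key (cast disambiguates the operator):
SELECT 'a => 1, b => 2'::hstore - 'a'::text;
--  "b"=>"2"
```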
And you can convert it to an array in a couple of different ways, and there are ways to get slices of your hstore. So there's a bunch of cool stuff you can do with it. It also comes with a number of functions. There's a simple constructor where you just pass a key and a value,
and you get an hstore. You can also pass composite values, rows, records, to the hstore constructor, and it will turn those into key-value pairs, where the keys are the column names of your composite type and the values, of course, are the values. So it's pretty easy to go back and forth between records and hstore.
You can also pass in an array of key-value pairs to get that turned into an hstore. And there's also a nested form where you can pass an array of two-value arrays, each with a key and a value, to turn into an hstore. akeys will return an array of all the keys in your hstore. avals will return an array of all the values. You can convert hstore to JSON simply by passing your hstore to this function, and it will return the JSON equivalent. Note that all the values, including the numeric one here, come out as strings in the JSON. Again, the only special value is the null.
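Sketches of those constructors and conversions; the row and key values are made up:

```sql
-- From a record: keys are the column names (f1, f2 for an anonymous row):
SELECT hstore(ROW(1, 'fred'));
--  "f1"=>"1", "f2"=>"fred"

-- From a flat array of key/value pairs:
SELECT hstore(ARRAY['a', '1', 'b', '2']);
--  "a"=>"1", "b"=>"2"

-- akeys/avals return arrays of keys and values:
SELECT akeys('a => 1, b => 2'::hstore);    -- {a,b}

-- Everything comes out as a JSON string, except NULL:
SELECT hstore_to_json('a => 1, b => NULL'::hstore);
--  {"a": "1", "b": null}
```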
There are functions that give you sets, which allow you to extract set values from your hstore. skeys returns the set of keys. svals returns the set of values. And the each function returns the keys and the values as rows in a relation, which is great for iterating over a set of values in a function. For a long time, my exclusive use of hstore was to pass key-value pairs to a function call, so I had named parameters.
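A quick sketch of what iterating with each() looks like:

```sql
-- each() returns (key, value) rows you can iterate over:
SELECT * FROM each('user => fred, lang => en'::hstore);
--   key  | value
--  ------+-------
--   lang | en
--   user | fred
```

So let's talk about the performance of hstore for a few minutes.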
CitusDB had a blog post earlier this year in which they published a bunch of reviews from Amazon.com from 1998. I assume Amazon was a little more open about such things back then. It's a JSON file with nested values; I converted it and flattened it into an hstore, so it's just one level of keys and values. To test it (this is almost 590,000 rows of hstore or JSON values), I created a database and loaded the extension
and just created a table with a single reviews column of type hstore. Then I copied the data from the reviews file into it, and it took about eight seconds to copy it all, so a little under 70,000 records a second. This is on my handy-dandy Air, with no fsync, with an SSD.
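The setup was along these lines; the table, column, and file names here are my stand-ins for the converted review data:

```sql
CREATE EXTENSION hstore;
CREATE TABLE reviews (review hstore);

-- In psql; loads one hstore literal per line:
\copy reviews FROM 'reviews.hstore'
-- ~590,000 rows in about 8 seconds on my laptop
```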
So the copy file was 233 megabytes, and the database size we ended up with was 256 megabytes. So that's around a 10% storage overhead, which is pretty good as these things go.
So I adapted a query from that CitusDB blog post. Basically it's a list of products, and I was interested in the average ratings of all the DVD reviews in the database, grouped into buckets by the relative lengths of the titles.
Not super useful, but kind of neat. So we have six buckets here of varying lengths of book titles, and the average review for DVDs, did I say books, I meant DVDs, of those lengths and the number of reviews in each.
Not so many reviews with longer titles. The running time was over 200 milliseconds, which is decent, but it could be better. hstore was the first semi-structured data type to support GIN indexing for fast searches of this kind.
So if we want to, we can create a GIN index on the reviews and let it run for a while. Go get coffee. It took a little while to run on my laptop here. But then I run almost the same query, only instead of extracting the review where the product group was DVDs,
I'm using a containment operator. This is the operator that can take advantage of the GIN index. So by using the containment operator and passing basically the complete structure, key and value that I was interested in, we got the query now down to just a little under 30 milliseconds. So this is a, what, eight-fold improvement?
So with this database, with this index, the size of the database is up from 255 megabytes to 301 megabytes. So we have around a 17%, 18% overhead
for the GIN index on this data type. However, if we know in advance, like with the XML, that there are very specific values we want to get, we can instead create a B-tree index on an expression. So if I drop that index and then I create a new one
in which I'm just fetching a product group key, that's just returning the text for the product group. This runs in eight seconds, so it's substantially faster to create and maintain. So now the database is 268 megabytes,
so we got around 5% overhead for this index. And the performance? We run the query again, back to just extracting the scalar value, and it's down to around 20 milliseconds. So it's a little faster than using the GIN index. The GIN index is going to be more useful
if you don't know in advance what parts of the hstore you might want to query for values in; it basically indexes the entire thing. But if you know there's one key you look at all the time and use for filtering results, it can be more efficient to use an expression on the text value in a B-tree index.
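The two indexing strategies look roughly like this; the key names (product_group, review_rating) are my stand-ins for the flattened review fields:

```sql
-- GIN: indexes every key and value; supports @>, ?, ?&, ?|:
CREATE INDEX reviews_review_gin ON reviews USING gin (review);

SELECT avg((review -> 'review_rating')::numeric)
  FROM reviews
 WHERE review @> 'product_group => DVD';

-- Or a B-tree over one expression, if you always filter on one key:
DROP INDEX reviews_review_gin;
CREATE INDEX reviews_product_group ON reviews ((review -> 'product_group'));

SELECT avg((review -> 'review_rating')::numeric)
  FROM reviews
 WHERE review -> 'product_group' = 'DVD';
```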
Now let's talk about dumping hstore. Internally, hstore is stored in the database in a binary representation. So unlike XML, which is stored as text, this goes right down to binary, and for input and output it must be parsed and formatted.
This does make it much more efficient for operations such as extracting values out of the hstore. But for input and output, we have the overhead of parsing and formatting. Still, it was quite fast on my system: dumping 590,000 records took about 1.4 seconds.
The question is: when should you use hstore? This is a very fast and efficient key-value store, using the binary representation. But the I/O is going to be a little slower than something that uses just a text representation. If what's important to you is getting documents in and documents out as efficiently as possible, hstore may not be the best choice. But if you want to get just particular attributes for particular keys, it's a pretty good choice, with the GIN index support and the fast operations.
However, its utility can be limited somewhat in a number of ways. There is no nesting, as I said. There's no data type support at all other than strings and nulls. And the format is custom, so you would need specific formatting and parsing support built into your client or into your database driver. Those are the costs you run into. Any questions about hstore?
Well, good. Let's talk about JSON. JSON was added in Postgres 9.2. All it did, and this was the simplest thing we could possibly do, was a simple validation of the JSON structure on input, storing the result as text. This is just like the XML type, basically. It uses the server encoding, so in this way it's slightly different from standard JSON, which mandates UTF-8. If you're using a UTF-8 database, you get standard JSON.
If you're using a Big5 database, you're going to get Big5 JSON, which is not technically JSON, but should do the trick. Because it uses a text representation, it preserves the order of the keys in your JSON value. And if you have any duplicate keys, those are preserved as well. Andrew pointed out to me earlier
that it also preserves white space, which didn't even occur to me, because no sane JSON parser should care. But I know a lot of JSON formatters put all sorts of crap into the JSON, so all that stays. In 9.3, Andrew added a bunch of functions and operators to make it much more useful.
Those are actually also available as extensions for 9.2, which is handy. And there are new building functions added in 9.4. So how many of you went to Andrew's talk earlier? Oh, about a quarter. That's pretty good. I thought there'd be more overlap. I thought I was going to have to speed through all this,
because you've seen it already, but the rest of you can take a nap. So like hstore, there's an error operation. The only cool thing here is you can pass an integer to it, and it assumes that what you're trying to get at is an array index. So it will return the value at that array index. So that's this one here.
And of course, you can also pass a string, and it will get the value at that key position. Now, JSON, these operators don't return text.
They return JSON. As you can see here, this is a JSON object, and this is a JSON object. If what you want instead is not JSON but text, use the double arrow operator here. You can use an integer or a string to get back the text representation here. In addition, we have a path operation.
What you pass to this cute operator here, #>, is an array that represents the path to the JSON value you want to get. So here I want to get: under the a key there's an object with a b key, and I want the value of the b key.
So we go A to this object, to the B key, to C with foo, and that's what we get in the result here. If what you want is, again, not a JSON representation but a text representation, use the double arrow here with the path operation.
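Those operators look like this:

```sql
-- -> returns JSON; ->> returns text:
SELECT '{"a": {"b": "foo"}}'::json -> 'a';    -- {"b": "foo"}
SELECT '[1, 2, 3]'::json -> 0;                -- 1
SELECT '{"a": 1}'::json ->> 'a';              -- 1 (as text)

-- #> walks a path and returns JSON; #>> returns text:
SELECT '{"a": {"b": "foo"}}'::json #> '{a,b}';    -- "foo"
SELECT '{"a": {"b": "foo"}}'::json #>> '{a,b}';   -- foo
```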
It comes with a couple of constructors. JSON in Postgres and in compliance, I think with the latest RFC, allows you to have pure scalar values that are neither arrays nor objects but are still JSON. So I pass a string here. It's converted to a JSON value
complete with the quotation marks, and with internal quotation marks escaped. Similar to the record-to-hstore constructor, you can just pass a composite value or a record to the to_json function
and it will return an object of the key-value pairs represented by that value. And, of course, where possible, the JSON data type preserves compatible data types between PostgreSQL and JSON. So here, when I'm passing an integer, a boolean, a null, and a string, we get back a numeric value here, a boolean value here, a null here, and a string, all of which are supported by JSON.
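For example, a small sketch:

```sql
-- A bare scalar is valid JSON, quoted and escaped as needed:
SELECT to_json('say "hi"'::text);
--  "say \"hi\""

-- A record becomes an object of column-name/value pairs,
-- with compatible data types preserved:
SELECT to_json(r) FROM (SELECT 1 AS a, true AS b, NULL::text AS c) r;
--  {"a":1,"b":true,"c":null}
```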
New in 9.4, we have these functions that allow you to build more complicated JSON objects. With json_build_array, you simply pass a list of values, which may be of any data type, including mixed data types, and you get back a JSON array with the data types preserved. json_build_object also takes any number of arguments, as long as there's an even number of them, to build a JSON object. And, again, data types are preserved where possible. The cool thing about these, like the XML construction functions, is that you can nest these suckers to create more complicated, hierarchical JSON structures. So here I'm calling json_build_object: I want a foo key with the number one, and a bar key that has an array with these three values. And that's what I get: foo, one; bar, an array with those three values.
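The nested call looks something like this (the array values are mine):

```sql
SELECT json_build_object(
    'foo', 1,
    'bar', json_build_array('a', true, 3.5)
);
--  {"foo" : 1, "bar" : ["a", true, 3.5]}
```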
Can't see that, can you? Is that better? No? I can move it back. Again, like the hstore support, you can get sets of values out of JSON.
The json_each function will return the keys and values for a JSON object; here, it's returning JSON values. json_each_text will return text values instead. And, of course, if the values you have stored in the JSON are numbers, you can cast them from the text representation: that would be something like SELECT key, value::int, or whatever. And there are a whole bunch of other JSON functions.
json_array_length assumes you're passing a JSON array to it and will tell you how many values are in it. json_object_keys will return a set of the keys of a JSON object. And json_array_elements will return all the elements in the array as a set. Of course, the return values of that are JSON values, so if what you want are text values, you can use json_array_elements_text instead. json_typeof is a handy function that tells you the type of a particular JSON value, which can be string, number, boolean, null, object, or array.
And there's json_to_record. Before, I showed you to_json, to which you could pass a record. Well, you can also go the other way and convert a JSON structure into a record. You call it like a table, so you pass an AS expression and define the columns that you want; you don't even have to define a composite data type in advance for that.
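Quick sketches of the set-returning functions and json_to_record; the column definitions are mine:

```sql
-- Keys and values as rows; the _text variant returns text values:
SELECT * FROM json_each_text('{"a": "1", "b": "foo"}');
--   key | value
--   a   | 1
--   b   | foo

-- Turn JSON into a record by describing the columns in AS:
SELECT * FROM json_to_record('{"a": 1, "b": "foo"}')
         AS x(a int, b text);
--   a | b
--   1 | foo
```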
And yeah, lots more. Andrew's been very busy the last two years adding all sorts of crazy stuff. That's Andrew Dunstan. He's over here. Wave. So let's compare the performance of the JSON data type
to the hstore data type. Here, I'm creating a table called reviews, exactly like what I did before for hstore, only the data type is JSON. And here, I'm copying in the fully nested structure of the JSON, not the flattened one like I did for hstore. So we're preserving the original format provided by Citus.
So this took a little under seven seconds to run, so around 85,000 records a second. And the copy file was 208 megabytes, and the database ends up being about 240 megabytes, so we have about a 15% storage overhead for JSON.
Compare that to hstore: this is faster to load than hstore was, because hstore had the overhead of parsing and converting into the binary representation. But it takes slightly more storage; the hstore storage was more efficient, although that might also be because it was flattened.
It's hard to tell. So let's look at this bucket of books query again. Here, I'm doing a path lookup. So this is the path expression, right? I'm passing an array for product. Under the product key, the group key, I want its value to be DVD, right?
And I'm doing the same thing here. I'm getting the title of the product that's being reviewed and the rating of the review by extracting paths from the JSON. So we get the same results, which is nice, but performance kind of sucks.
So whereas this query without an index on hstore took around 200 milliseconds, this takes around 1,700 milliseconds, so eight or nine times as long to run. So let's talk about indexing. The operations are slower than hstore;
this is because of the overhead of having to parse the JSON value, for every row, for every one of those operator expressions we have. Now, there is no GIN or GiST indexing support for JSON. However, we can use an expression index, similar to how we created an expression index for hstore.
So we don't have the flexibility of being able to say, put an inverted index on this whole thing, and I don't have to worry in advance about what I'm querying. But if you know in advance, you can add an expression index to make it efficient, which is what I do here. Because I know that specifically I want to filter on the product group for DVD, I'm just going to go ahead and index that single expression here.
And I'm using the path query operation that returns a text value. So that takes around 10 seconds to run. So it indexed pretty quickly.
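The index looks something like this, with the column name as in my sketches above:

```sql
CREATE INDEX reviews_product_group
    ON reviews ((review #>> '{product,group}'));

-- Queries filtering on the same expression can use it:
SELECT count(*)
  FROM reviews
 WHERE review #>> '{product,group}' = 'DVD';
```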
But if we run the query again, we're now down to 91 milliseconds. Now, this is still a lot slower than hstore, which we got down to around 20 milliseconds for the same basic thing. And I'm pretty sure that's because, for every one of the rows that we're actually getting here, the filter itself was an index lookup, so that part was efficient. But we have to do a parse of the review JSON here in order to get the title out, and another one here in order to get the rating out. So you have the overhead of two parses of the JSON and two path queries for every single row.
So you're going to pay for it in that way. But let's look at the output. The JSON values, like XML, are stored internally as text, which means there's no need to parse and reformat on output; we can just dump it. And, in fact, dumping is about 50% to 60% faster than it was for hstore.
Here it took about six-tenths of a second. Three-fifths of a second. So what does this mean? When should you use JSON? JSON is useful when what you need is document storage,
when what's important to you is being able to get things in and out quickly. If it's important to you to preserve duplicate keys or the order of the keys that appear in your JSON objects or the white space that's in your JSON, the JSON data type is your choice.
Now, because of the text storage, it's efficient to get in and out. It's quite fast. And the operations you can do on it with the query operators are pretty awesome.
You can fetch keys, you can fetch paths, you can fetch array indexes. This is going to work best on indexed scalars, of course, or when you're okay with table scans. But remember, you're going to pay the price for every single expression you have against a JSON operation, per row.
Sound familiar? These reasons are almost all the same as with XML. The only difference is that it's not XML. Questions about JSON, the JSON data type?
Yes. Well, there's actually some XML Schema stuff you can do with the XML support in here, although it doesn't do schema validation. No, you can store some schema thing or generate some schema thing, but it doesn't do any validation. I mean, the other thing is that these functions are much easier to use than XPath, especially since xpath returns an array, which I found a little annoying. How about the other databases? I don't use those things. Do they have JSON?
I don't know. I know Mongo does. I was trying to find a decent quote for JSONB, and all I could find was this kind of incredulous question on Hacker News from some random person. You mean I can do this? So let's talk about JSONB. It's new in 9.4, coming soon.
This is a full JSON representation, just like the JSON data type. And just like the JSON data type, it uses server encoding. It, however, is inspired by the hstore2 experiment that our friends Oleg and Teodor worked on in the last year,
which I wrote a lot of documentation for that we will never use. Yeah, I know. But like hstore, it uses binary storage. The idea of hstore2 was that it would have basic data types just like JSON and nested structure just like JSON, and it ended up being just like JSON, only with the binary representation advantages you have with hstore. So in addition to the JSON operators and functions, you have hstore-style query and containment operators you can use against it. Unlike the JSON data type, there is no preservation of key order or duplicates.
If you have duplicate keys in an object, the last one to appear in the object wins. And the order of the keys can come back in any way, which if you're used to using a dictionary or a hashmap object, you should be pretty used to. But like hstore, it has very fast access operations to the binary data type,
and it has GIN indexing. So for operators, we have all the JSON operators, plus equality. Yes, you can test that two JSON objects are equal.
And note, here's a demonstration of no order preservation. There's A and then B, and then B and then A. But yeah, they're the same, as God intended. And like hstore, we have the containment operation, where again, you pass a full path to the thing you're interested in to find out if the value is there.
And also like hstore, you have the question mark operation to find out if a key exists in the JSON. And I think you can also pass an integer here, and it will return true if that array index exists in your JSON array object. No?
Oh, OK. OK. In addition, and unlike hstore, we have nested operations here.
So for example, with equivalence, we have to look not only at whether we have all the same keys in the object, but whether we have the same array values. And note here, oh yeah, this is equivalence: they have all the same stuff.
And with the containment operation, you're basically specifying the basic structure you want, but order, again, doesn't matter. So even though I'm looking for an A key with an object, with a C key with a value of 3, it doesn't matter that the C key appears after the B key. And it doesn't matter, oh, there's not an array here.
But if there were an array here, I could ask for 3, and there could be an array in here with 1, 2, 3, and it would match. Yeah, here's an example of that. We have an array, 1, 2, 3. Does it contain 3 and 1? It returns true,
even though, of course, the order is different. So order is not what matters; the presence of the value is what matters. Like the hstore data type, you have existence operators: do any of the values in the array you pass exist as keys in this object, or do all of the values in the array you pass exist as keys in this object?
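Sketches of those jsonb operators:

```sql
-- Equality, regardless of key order:
SELECT '{"a": 1, "b": 2}'::jsonb = '{"b": 2, "a": 1}'::jsonb;   -- true

-- Containment is about presence, not order:
SELECT '[1, 2, 3]'::jsonb @> '[3, 1]'::jsonb;                   -- true
SELECT '{"a": {"b": 2, "c": 3}}'::jsonb
       @> '{"a": {"c": 3}}'::jsonb;                             -- true

-- Key existence: ? for one key, ?| for any, ?& for all:
SELECT '{"a": 1, "b": 2}'::jsonb ? 'b';                  -- true
SELECT '{"a": 1, "b": 2}'::jsonb ?| ARRAY['b', 'z'];     -- true
SELECT '{"a": 1, "b": 2}'::jsonb ?& ARRAY['a', 'b'];     -- true
```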
Now, as with hstore and JSON, I did the little performance test with the same data. I created a table with the type JSONB, and I copied all the data in. This took almost 10 seconds to load. So it was a little slower, around 60,000 records a second.
Same 208 megabyte file as for JSON, but the internal storage is at 277 megabytes. So we have about a 30, 32% overhead for the storage here.
Again, for hstore, I think I have these numbers here. Yeah, hstore loaded a lot faster. And I'm going to go out on a limb and guess this is because it was flat, whereas this is a nested data structure. And it only had a 15% overhead. So there's quite a bit more overhead for the storage of this data type. And again, I don't know how much of that is because these are nested values as opposed to a flattened object. However, we go back to running our query, just getting the DVD product reviews.
I get all the same results, and we're at about 381 milliseconds. Remember, with JSON this took 1.7 seconds, and with hstore it took 207 milliseconds. So it's a little slower than hstore, but again, I'm going to guess this is because of the nesting versus the flattened data structure.
Right. But it's in the same ballpark, unlike the JSON data type. This is unindexed. But also like hstore, you can use a GIN index on a JSONB value. The GIN index supports the containment, exists, exists-any, and exists-all operators.
So I go ahead and create an index on here, and to my shock, this was 10 times faster to index than the hstore value was. I don't know, maybe my computer was doing something else during the hstore run. I did it a bunch of times, but... no, they use different structures. Uh-huh. Hmm. Oh, yeah, the path ops is a hash, right? Yeah. I don't know. It's a lot faster than hstore. So the database size here is 341 megabytes afterwards, so we have about a 23% overhead for the index. So I think it's quite a bit smaller... no, it's in the same ballpark as with hstore.
If we execute the same query again, only now we're using a containment operator instead of fetching a text value and comparing it. So I have to say: in the product key, we have an object with a group key whose value is DVD, right? We get the performance down to 35 milliseconds. And the same basic query with hstore, I think, was 28 milliseconds. So right around the same; given all the crap my laptop does, it's probably within the noise. Right. Now, in addition to being able to just use the default jsonb_ops operator class with a GIN index,
there's a special one called jsonb_path_ops. Now, this one is only for the containment operator, not for the exists operators. But if you are only using that operator, as we are in these examples, it's a lot more efficient. There is only one index entry per path through the tree, as opposed to the other one, which indexes everything as an individual value. So if I drop that index and I create a new one using jsonb_path_ops, it takes half as long to build the index on those values. So this is going to be a big win if all you're using is containment; that's the only case where you need to make sure you use this operator class. So now the database size is 323 megabytes, so we have only a 17% overhead for the index on this data.
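Putting it together, a sketch over the same reviews table (the column name, as before, is my stand-in):

```sql
-- Default opclass: supports @>, ?, ?|, ?&:
CREATE INDEX reviews_review_gin ON reviews USING gin (review);

-- Containment-only opclass: smaller, faster to build:
DROP INDEX reviews_review_gin;
CREATE INDEX reviews_review_path
    ON reviews USING gin (review jsonb_path_ops);

-- Either can serve this containment query:
SELECT count(*)
  FROM reviews
 WHERE review @> '{"product": {"group": "DVD"}}';
```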
And if we run the same query again, we find that the performance is about the same, but you have a smaller index, so it's going to take up less room on disk, it's going to stay memory-resident longer, and it's going to be more efficient in general for lookups with the containment operator.
So it's a pretty big win overall. Finally, I wanted to compare dumping the output. Because it uses a binary representation, like hstore, it must be parsed on output and formatted as JSON text. So dumping is actually quite a bit slower than with hstore.
Here it took me a little under two seconds to run. So why is that, Bruce? Well, hstore is not nested. In addition to hstore not being nested, hstore doesn't have any data typing. So I'm assuming it's the nesting, and the numbers and booleans: it probably has a switch statement to figure out what the data types are.
I suspect there is some room for optimization in the text-in and text-out functions. I think Oleg and Teodor told me that when they were working on hstore2 back in the day. But I can live with this. So when do you want to use JSONB? You want to use JSONB when you're on 9.4, of course, so in your future.
When you don't care about the preservation of duplicate keys or the order of the keys or white space that appears outside of your values, which you shouldn't, it's great for object storage because of the binary representation
and the fast operations and the GIN indexing. The operations are really awesome: fetching keys, fetching paths, they're all GIN-index-aware. You can use expression indexes with B-tree indexes, or GIN indexes, to get full paths. In fact, you guys had an expression index on GIN where you indexed a JSON array. That was amazing. That was really cool. But do bear in mind that the I/O is a little slower than it's going to be with JSON. So if what you have is JSON documents and you mostly need to get the whole document in and the whole document out, you probably want to use the JSON data type.
But if you want to be able to query the individual attributes of the JSON object or value itself, you want to use JSONB. So, a quick review, as I'm running over time. When do you use XML? Only when somebody's making you. You have all the advantages of XML with the JSON data type, and all the advantages of hstore with the JSONB data type. When do you want to use hstore? Well, you want to use hstore when you're okay with flat data structures; you're going to get a little more performance out of that. And when you are okay with having only strings and nulls for the values in the structure.
And when you require fast operations. The other thing I would say is: when you are on something earlier than 9.4. When should you use the JSON data type? Basically, when you want to do document storage, or to preserve key order, duplicate keys, or white space.
And use JSONB for everything else, would be my recommendation. So that's what I say. My thanks to Oleg and Teodor for doing all the amazing work to make such an amazing, awesome data type.
And to Andrew Dunstan and Peter Geoghegan for getting it into 9.4, because I think this is really going to rock things for 9.4. And thank you for letting me bore you for an hour.