Schema-less data in PostgreSQL
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Alternative Title |
| |
Title of Series | ||
Number of Parts | 31 | |
Author | ||
Contributors | ||
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/19100 (DOI) | |
Publisher | ||
Release Date | ||
Language | ||
Production Place | Ottawa, Canada |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
| |
Keywords |
PGCon 201429 / 31
1
9
13
14
18
19
20
21
22
26
27
28
29
30
00:00
Subject indexingQuery languageZugangsverfahrenSocial classData modelKey (cryptography)Revision controlDecimalMereologyData typeArray data structureData storage deviceBinary fileoutputTable (information)Server (computing)Software testingBookmark (World Wide Web)Operator (mathematics)Overhead (computing)File formatDensity of statesWeightParsingComputer clusterPlanningStorage area networkElement (mathematics)Uniform resource nameMaxima and minimaDigital filterProgrammschleifeHash functionMetropolitan area networkSpecial unitary groupComa BerenicesElectronic meeting systemPort scannerInformation managementCAN busRankingSummierbarkeitCurvatureSpherical capAmsterdam Ordnance DatumValue-added networkBus (computing)CausalityStructural loadPunched card12 (number)Ring (mathematics)Network topologySoftware developerDatabaseMultiplication signSocial classHash functionOverhead (computing)Flow separationDifferent (Kate Ryan album)Bookmark (World Wide Web)Revision controloutputState of matterKey (cryptography)BitArray data structureElectric generatorValidity (statistics)Pairwise comparisonFile formatQuery languageData storage device2 (number)NumberData typeTable (information)Binary codeSlide rulePresentation of a groupEstimatorStructural loadExtension (kinesiology)MereologyMessage passingString (computer science)SpacetimeCartesian coordinate systemQuicksortData structureCASE <Informatik>Set (mathematics)Total S.A.Element (mathematics)Physical systemIntegrated development environmentEndliche ModelltheorieProduct (business)Direction (geometry)SequenceSquare numberRight angleReplication (computing)Arithmetic meanForm (programming)Wave1 (number)AreaGraph coloringUniverse (mathematics)Position operatorData modelLevel (video gaming)ExpressionWavenumberIterationBit rateProgram slicingCycle (graph theory)Computer animation
09:24
Port scannerSubject indexingRootKey (cryptography)Functional (mathematics)Operator (mathematics)ExistenceEquivalence relationQuery languageMereologyType theoryQuery languageData typeConstraint (mathematics)12 (number)String (computer science)Execution unitRaw image formatMetropolitan area networkSpecial unitary groupElement (mathematics)SummierbarkeitBinary fileCAN busWide area networkArmScalar fieldSoftware testingNewton's law of universal gravitationValue-added networkSineNetwork topologyProduct (business)Range (statistics)CountingFree variables and bound variablesTheory of relativityNumberFreewarePresentation of a groupKey (cryptography)Product (business)RankingQuery languageElement (mathematics)Rule of inferenceSampling (statistics)ExpressionData compressionComplex (psychology)Formal languagePhysical lawNuclear spaceOperator (mathematics)InferenceDecision theoryWebsiteMereologyArithmetic meanExistenceDoubling the cubeForm (programming)CASE <Informatik>String (computer science)WeightLevel (video gaming)Pulse (signal processing)Functional (mathematics)BitGoodness of fitWordProcedural programmingState of matterAcoustic shadowRootParticle systemSimilarity (geometry)Sign (mathematics)Flow separationRight angleMultiplicationSound effectData conversionStatisticsData typeStandard deviationEquals signEquivalence relationSelectivity (electronic)outputArray data structureTable (information)Arithmetic progressionSocial classCoefficient of determinationDiagramRevision controlType theorySoftware testingRange (statistics)Computer animation
18:49
Special unitary groupSummierbarkeitQuery languageQuery languageFormal languageMetropolitan area networkUniform resource nameBus (computing)WaveMatching (graph theory)Hash functionRange (statistics)Personal area networkAlgebraic closureElectronic mailing listSocial classInterior (topology)Wide area networkMereologyExecution unitMagneto-optical drive12 (number)Loop (music)Platonic solidArmCAN busBlock (periodic table)Row (database)Level (video gaming)Arc (geometry)StatisticsProgrammschleifeApplication service providerMaxima and minimaOnline helpSupersonic speedAddressing modeComputer iconComputer clusterCone penetration testConditional-access moduleArtificial neural networkNetwork operating systemExtension (kinesiology)FeedbackSample (statistics)Real numberMaizeBranch (computer science)Data compressionLimit (category theory)Network topologyLengthSubject indexingOperations researchZugangsverfahrenFinitary relationTable (information)NumberExpressionKey (cryptography)Parameter (computer programming)Cartesian coordinate systemGreatest elementLattice (order)Query languageDirected graphSocial classSource codeSparse matrixSet (mathematics)Operator (mathematics)Presentation of a groupTunisEndliche ModelltheorieRange (statistics)1 (number)MereologyConfidence intervalMultiplication signStatisticsTotal S.A.MultimediaStudent's t-testMeasurementRight angleProcess (computing)Product (business)AreaTelecommunicationPlanningAdditionDecision theoryCountingTheoryTable (information)Mathematical optimizationObservational studySummierbarkeitArchaeological field surveyWeightUniqueness quantificationImage resolutionType theoryEmailArray data structureTheoremMatching (graph theory)BitHash functionElectronic signatureData storage deviceQuicksortSound effectNetwork topologyBookmark (World Wide Web)WordPattern languageXMLComputer animation
28:13
Social classString (computer science)Subject indexingMathematicsNetwork topologyZugangsverfahrenE-textContext awarenessData structureUniform resource nameZoom lensWide area networkCAN busOperator (mathematics)Parameter (computer programming)Port scannerQuery languageLogicPairwise comparisonKey (cryptography)ConsistencyNetwork operating systemMetropolitan area networkElectronic mailing listSpecial unitary groupUniform boundedness principleExt functorPersonal area networkLoop (music)Arc (geometry)Floating pointAreaMaxima and minimaPlanningAddressing modeComputer clusterVector potentialTrigonometric functionsLevel (video gaming)SummierbarkeitPhysical lawNewton's law of universal gravitationInformationRaw image formatAmsterdam Ordnance DatumProgrammschleifeWater vaporStudent's t-testAdditionSimilarity (geometry)Set (mathematics)NumberInformationCASE <Informatik>Social classFault-tolerant systemVideo gameTheory of relativityData storage deviceElectronic mailing listSource codeMathematicsInsertion lossCurveContext awarenessQuery languageOcean currentParameter (computer programming)Monster groupMultiplication signFood energyCondition numberRange (statistics)Natural numberMusical ensembleConsistencySpeciesSchweizerische Physikalische GesellschaftBitBusiness objectElement (mathematics)Latent heatGame theorySlide ruleFrequencyCorrespondence (mathematics)PressureGroup actionRight angleConstructor (object-oriented programming)Flow separationState of matterResultantData structureForcing (mathematics)Different (Kate Ryan album)MereologyFunction (mathematics)HypermediaUniverse (mathematics)Network topologyString (computer science)Message passingProcess (computing)Position operatorMatching (graph theory)Equaliser (mathematics)Operator (mathematics)PlanningPartial derivativeBinary treeInterface (computing)Numeral (linguistics)Memory managementKey (cryptography)WordConfiguration spaceCodeCache (computing)TupleComputer animation
37:37
Subject indexingCuboidNetwork topologyCASE <Informatik>Point (geometry)Category of beingNatural numberNetwork topologyComputer animation
38:24
Network topologyPolygonRectangleAlgorithmElectronic meeting systemZugangsverfahrenOperator (mathematics)Social classNP-hardExtension (kinesiology)NumberVacuumRootBlock (periodic table)Interface (computing)Subject indexingFlow separationMaizePrototypeQuery languageBranch (computer science)Theory of relativityNetwork topologyPointer (computer programming)Interface (computing)PrototypeBlock (periodic table)Computer fileCuboidSocial classGeometryMultiplicationOcean currentQuery languageAlgorithmFlow separationOperator (mathematics)ImplementationPolygonOvalRootPlanningSpacetimeMereologyGene clusterExtension (kinesiology)Musical ensembleContent (media)Sampling (statistics)Software protection dongleOpen setRight angleWordValidity (statistics)NumberWater vaporDataflowResultantComputer animation
42:04
Lattice (order)Key (cryptography)Lattice (order)Query languageTerm (mathematics)Revision controlKey (cryptography)BuildingImplementationIdentical particlesFile formatDevice driverSoftwareWindowProjective planeMereologyMultiplication signComputer filePresentation of a groupData storage deviceOrder (biology)Binary fileCasting (performing arts)Universe (mathematics)RecursionFilter <Stochastik>Point (geometry)Particle systemQuicksortParameter (computer programming)Instance (computer science)CASE <Informatik>Formal languageVisualization (computer graphics)Bit rateSinc functionRight angleGoodness of fitVideo gameLevel (video gaming)View (database)Execution unitBinary codeStatement (computer science)Form (programming)Confidence intervalComputer animation
Transcript: English(auto-generated)
00:00
And after that, we'll talk about some exercises on gene indexing with support of GSK query. And finally, we'll come with a vodka access method. So let's start. The first slide is about the history of schema-less data
00:21
in PostgresQL. So the first extension who works with the semi-structural data was hstore, which we designed in Theodore more than 10 years ago. Later, we did the gene index for hstore,
00:42
and it becomes part of PostgresQL. And in 9.2, we got JSON data type, which was just a string data type, textual data type, without any indexing support. But it was a bit more flexible, more richer than hstore,
01:03
because hstore is a key value model, and JSON is document-based model. Last year, we started a new work in Theodore with support of Heroku, with Engine Yard, sorry. We started to work on the binary storage for hstore.
01:25
And we called next generation of hstore. And then we switched to the JSONB data type. And so now we have a structured format for storing JSON.
01:41
It's JSONB. It is submitted and will be available in 9.4. So a lot of community work was done for improving our work, and we are very happy. We want to say many thanks to developers who really did push our work to the production state.
02:02
And I think that it still needs some improving. For people who don't know what is JSONB and how it's different from the JSON, this simple select shows all differences. So JSON is textual storage.
02:22
It's like as is. So if you see, you have white spaces. You see the duplicated keys. All everything is as is. In JSONB, we have no white spaces, no duplicates key.
02:41
And all keys are sorted. So this is all four differences between JSONB and JSON. So why we did this? Because if we see it, because JSONB was designed for performance. And since JSON was already available from 9.2,
03:06
there were many applications who already use these features, like duplicated keys, white space, and so on. So we decided to introduce new separate data types, JSONB.
03:23
It's a bit inconvenient, but JSONB now is, I would say, is a legal data type. And we hope that JSONB will give a push to positive SQL, to the non-struct, no SQL.
03:41
So here is a simple experiment with delicious bookmarks. It's a bit more than 1 million bookmarks. I tested it on my MacBook Air. And we did three tests, input performance, access performance, and search performance. Here is a screenshot of the struct shows one JSON.
04:03
So it looks like a mess, but it contains all the stuff, like objects, arrays, colors, and so on. So we see that storage, which introduce JB overhead,
04:25
it's not so much. It's just less than 4%. It's overhead of binary storage for JSONB. Of course, JSON and text presentation looks the same. And input performance. So we use a copy command to load data into database.
04:45
And we see that the fastest is text. It's no surprise. JSON is second, because it needs to do some JSON validation. And JSONB is the slowest, but not so much, because it also need some validate JSON
05:03
and put to the binary storage. So here is this overhead of JSONB. Access performance. Access performance, we see that now the winners. JSONB is about 20 times faster than JSON. It's just because we have binary storage.
05:22
We don't need to parse. So you see, the base number is just to read data from table. And JSONB is like this and JSON. So we have these numbers. If we subtract base number, we can see this clean number
05:43
and get a 20 times fast performance win. And now we just search performance. So JSON has no contained separator. So we get the optimal estimate.
06:05
We use just this query and check just the first element of array. We did not check all array, because we have not just with JSON, we have no access to the element of arrays.
06:23
Yeah. And this query we got for 10 seconds. 10 seconds. For JSONB, which has binary storage, we have a contains operator, and we got 10 times faster.
06:42
This is just a sequential. There is no indexes involved. But now we can, for JSONB, because JSONB has binary storage, we can have a gene index. This is just a simple key and values indexes separately
07:01
and then joined. And we got more than 150% performance. So yes. So this looks very nice. But we can get 10 times faster if we have a new, another op class for gene, which used on a hash.
07:24
And we got very, very nice numbers. So JSONB really outperforms JSON. And of course we could, yes.
07:40
Yeah, I will show all these numbers. Yes, and of course we can skip a comparison with MongoDB. So we use one of the latest version of MongoDB. The first surprise is that it was so slow to load. You see that? 30 minutes to load data into database.
08:02
And JSONB just took us 43 seconds. And search sequential scan was more or less the same. It's about one second, if you remember. It's 260. I know that there is a 261, but I think it doesn't matter.
08:22
Search performance is very fast. It's just one millisecond. But JSONB is a bit faster. So in principle, it doesn't matter. So we can say that JSONB has performance very close to the MongoDB.
08:41
And let me note that MongoDB uses expression index. So kind of indexes are different in MongoDB and in Postgres. We have a monolithic index, which supports search for all keys. For MongoDB, you need to create a special index
09:01
for this key. So our index is more universal. And still we have comparable performance. So this is a summary. So as we see that JSON is 10 seconds, 6 count, JSONB is 8.5 milliseconds, GIN, op class.
09:22
JSONB is, if we use a JSONB pass operation, which is hash-based indexing op class for GIN, got the best performance. This is among the index size. JSONB op class has 600 megabytes. And thanks to the work we've done for the GIN,
09:42
because before compression, it was just 850 megabytes. So new version 8.4 will have a much better GIN index. And here, you see, we save 160 megabytes.
10:01
And JSONB pass operation is just about 300 megabytes. And if we create expression index on tags, we got just 44 megabytes. And if you go deeper, we get just 1.6 megabytes expression index. We compare the mongo, the index for tags.
10:23
It's just huge, 386 megabytes. And the same, you see, compare 387 with 44 and 100 with 1.6. So I don't know what mongo did, but it's very huge. Yes.
10:40
If you see the table size, progress, it's just 1.3 gigabyte. Mongo is much, much bigger. And input performance, incredible. Mongo is a certain minutes. I don't know what's. I use the standard procedure mongo import command. And I try to optimize using Google, but can't find anything like this.
11:02
Yes. So this is straight apart what we have for 9.4. And thanks for the company engineers who support this work, at least our. So this is what we have. Now, currently, we can search JSONB data
11:22
using contains separator, like this, contains and contain. And this separator has index support, gene index support. Here are the syntaxes, you see. But keys should be specified from root. There is no relative key or wildcard on keys.
11:43
You have to specify all keys, so the full path. Of course, we have equivalence separator, JSONB with gene index support. We have an existence separator, so we can check for the key or array of key, any key, and so on.
12:02
And it also has index support. But we supported only root key. So you can't check the deep case. Only took a root level key. So very limited. And operators on JSONB parts, so we can use like this.
12:25
Of course, you understand. It looks very ugly. And too many functional indexes can be not a good idea. So we decided to work, to continue our work
12:41
and did something like JSON BigQuery language. So we introduced something like JSON BigQuery language, more operators on keys, on values, type support, probably schema support. And of course, this language could have indexes support.
13:02
So we introduced JSQuery. JSQuery is a JSON BigQuery language, which actually consists of textual data type and much operator. So here is a JSONB. I don't know how to pronounce in English. Yes.
13:21
At, at? I don't know. In Russia, we say, dvysabaki, like dog's dogs in this query. But for now, we call this match operator. The idea of syntax was talking from TS, like T search. T search, yes, yes.
13:41
So here is a BNF diagram. Just for reference, we did not want to explain everything here. You can download presentation and look at it. So we just will explain some important parts. Here is what is the path. Even very important, we introduce some signs,
14:03
you know, like a number sign. How you say this is a number sign, pound sign, or letter? Sharp. OK, we see many different. We say this is DS. So it means any element of array. So this select, select produce true.
14:24
You see, we have an array. And we specify keys with the number sharp, number sign, equal to. So we have two, so it's produce true. And person sign means any key. So the same, select, but we specify just here
14:45
is maybe any key. So it's like also produce true. Asterisk or wildcard, it means anything. Or element array or key, it doesn't matter. So this also produces true.
15:03
And dollar sign is a current element. So it's very convenient when you have complex expression. So you can use just like a placeholder, you know. You mean multiple set, right?
15:21
Yeah. You're reading there? Yes. So it's in effect taking A or B and replacing the stats. Mm. Is that right? Yes, that's right. So we check the, you mean with this query? No, it's not. It's the one involved, the one involved. Ah, this is multiple. It's anything, you know.
15:41
Right. Anything, any level. What? It's not simply matching a single level. Not single level. Multiple level. Just multiple levels. Just go deep.
16:01
And dollar is a current element which is very convenient to use in complex expression. And of course, you see, we did not use any quotes. But in some cases, when you want to be sure that it's string, we suggest to use double quotes. Because this can be, I would say,
16:23
looks like a note key. Yeah. Is this in chart 4? No, no, no, no, no. This is a, yeah. So this is what we just did maybe several hours ago.
16:45
A lot of hours. A lot of hours, weeks or months. Yeah, no, no, no, no. And yes, and another is that. Value expression. You see this value expression and something like scholar can be checked for using in operator.
17:03
So it's also true. All expressions produce true. Test for key existence. So we can just test, just if there is a key a.b, it's just equal wildcard. So instead of question sign, we use this expression.
17:21
And array overlap. So you can check there for overlap in arrays. And array contains and contains. Yes? Have you considered non-greedy wildcard?
17:42
Who's that? What's that? Your star, your star on the graph, is the most greedy wildcard, right? He wants a star. Everyone has got to match his person. Or as few as possible. Few as possible, so right. Not as much as we thought, but it's a little bit more. Any.
18:02
Just query returns you as just true or false. Yes? Just return you which particular path it matches. So it matches any path, then just query returns true. No, no, I understand how it works. Yeah. Yes. It's the same thing. I mean, either, neither, neither, neither, neither. No, no, no.
18:22
I think you're getting that a little more. OK, here is some example how it looks. So if you want to find a product similar to this ID and have a product sales rank and range between this, on SQL you can write like this.
18:44
And I don't know how to write. It's boring stuff. Convert JSON array to the progress array and do some. So I just didn't try. In JS query, it looks like this.
19:00
And MongoDB, it looks also very awkward. And arrays, like a problem, yeah. So this is just example. Yes. And here are some queries. We check the same query I already shown in the first part.
19:21
We have one second. And with the JS query, you see, it looks very clear, very short. And it is even faster somehow. We don't know why, but it is even faster, JS query.
19:40
Yeah, yeah. Teodor has some idea why it's faster. So we did some op classes for June. We support the JS query. One is JSONB value path operations, op classes. The name, we need some suggestion from you,
20:01
because it means just that value before on top of path. And another is just path value, so value under the path. And this is use Bloom filtering for K matching. So this JSON, you see, transform to the value.
20:25
We preserve value. And here is a word Bloom signature of all keys. So if we have this one, we can use Bloom filtering, like we did in Russian doll tree, in RD tree for arrays.
20:41
And this is good for K matching. So we support wildcard support, but not good for the range query. It's work, but not so good as this one, another op class, which is hash path. And this ABC is translated to the hash of this to the path
21:02
and combined with the value. So of course, it has no wildcard support, but it has no problem with ranges. Here's index sizes. So we see that this new op classes has the same size and a bit bigger than hash-based old op class.
21:28
But we see that this index is designed to support much more richer set of operations. So we can now work with values in keys.
21:40
And here are some numbers, sort of how it looks. For example, the same query I've shown you several times already. Now with indexes, it just has a very, very good performance. Again, like the MongoDB, more or less, or better.
22:01
And here we have an op class. We use value path, value path op class. And here we use a wildcard, example of wildcard.
22:21
So as you can see, that's the risk theorem. And the performance a bit still good, but a bit slower because of wildcard. And next we found the CITUS data set, which is a big, big number of reviews from CITUS.
22:42
I don't remember somebody posted in the mailing list, so I found this. So it's a long customer review data set? Yeah, probably. I found, we found, and we got just for three years, it's just about three million reviews. And it has this structure.
23:00
And it's very nice because we have many numbers. So we can show you how to work with numbers, some range operators. And the table size is 1.5 gigabyte. And we run several queries here.
23:22
To find this, the query is very fast. It's just very fast, but here is very slow. And this is a problem we will discuss on unconference because, unfortunately, JSONB has no statistics.
23:44
And more, moreover, it has no access to the planner, doesn't work with JSONB. So this one, this expression that we just
24:00
add, and this is not very selective, this range query. In principle, you do not need to use index for this query. And now we have about 129 milliseconds. Much better. So in principle, yeah.
24:25
The same problem is there is with TS query because the same optimization to take a part of TS query and use an index scan is possible with T-Search too.
24:41
So it's not a single type which comes in the face of this problem. Yes. And MongoDB just very fast because it doesn't use index on this. It uses index only for this key and doesn't use.
25:02
And the index size is 400 megabytes just for these similar products. Yes. But it's much faster. So if we have a planner support, we can rewrite the query like this to emulate that. Imagine that planner has support for JSONB.
25:21
So to emulate, we got a very good number. Just half of millisecond. So what does it show? It means that Postgres has potential to execute query for half of milliseconds when Mongo does for seven milliseconds.
25:41
So we just need statistics and need planner support. This is some roadmap we will discuss on unconference and all people who want to help us, especially Tom Lane, invited to this.
26:01
And it's a nice thing that all this stuff is just contribution. It's a contrib model. So you can download it and use it for 9.4. It is not included to the release. But you can just download, play, and read your feedback, especially we need some real simple data and queries.
26:23
And please excuse us. There is no documentation at all. Actually, you don't need documentation because everything's here. Just download this presentation. But we always need something better.
26:44
We need better indexing. And GIN is proven. It's a very efficient, effective index access method. And tomorrow, Heike and we and Alexandra will talk about new GIN, new improvements of GIN. So I will not stay here about details.
27:03
But we want to use GIN because it's very good for duplicates. It's very good for duplicates. We need indexing for JSONB with operation on path and values. Path we want not hashable.
27:20
We want to have a path as is. But storing path as is requires a lot of storage. We did some experiment. We did extract from the delicious bookmark. We extracted unique path and made a B3 index.
27:42
So we have B3 index text pattern operation O+. It is huge. So the size of table is 900 megabytes. And the size of B3 index is very fast. So B3, no way to work with. So we did something like SPGist.
28:02
We have SPGist, which can be a very good candidate for storing duplicated data, for storing path. And actually, we did indexing, and we got smaller index, like 600 megabytes. And as Teodor said to us, that this night, he decreased
28:25
this size two times lesser. So we just had no time to re-index. But we believe that this number will be much, much lesser. So what's going from this picture?
28:41
We want gene, and we want something like SPGist. So here's a vodka. So we need to provide interface to change hard-coded B3, which used in gene for entry tree, to SPGist.
29:05
Yes. And we may go even further. For example, we may change B3 in posting tree. And then we can get something like GIS-aware full text search, if we use R3, for example, instead of B3.
29:24
So we decided to play with creating new index access method, which we called vodka. And so this is the name of our talk, Create Using. And now, the most part of this work
29:44
was done by Alexander Karadko. So he continues my talk. I don't know how to give me microphone. Left for him.
30:01
And right. OK, no problem. I can stay.
30:35
So did you understand the idea? Why we come with vodka?
30:44
OK. So if we put the full process to gene B3, it would be quite long. And the entry tree in gene would be large.
31:05
So that's why we can replace it with SPGist, SPGist entry tree, which could store prefixes. And then the rest of path doesn't contain this prefix.
31:22
But SPGist for JSON-B would be quite more complicated than just RADIX-B, because RADIX-3 is good for keys. But for Namorix, we need just a binary tree
31:41
to support range queries. And since JS query, for now, doesn't support something other than equality for strings, then we store caches of strings. They are also in RADIX-3.
32:01
And the rest of structure is the same to gene. Each tuple in SPGist have reference to Poisson list or Poisson tree. So yes, this is interface of vodka. We call this vodka distilling constructions.
32:23
So vodka have config methods, like SPGist. And this specify entry tree op class. So this could be SPGist op class, Gist op class, and so on. Any operator class.
32:42
Equality operator to detect duplicates. So they use the same Poisson list or Poisson tree. Compare extract value the same to gene, as in gene. And extract query now return us not just the entry itself,
33:08
but the operator to scan entry tree and argument to scan entry tree. And this operator can return us one or more of entries.
33:22
And the consistent and the re-consistent methods are same as in gene. So the size of index using vodka is about,
33:49
it's a little bit larger than pass value and value pass. But probably it still can be improvement because
34:01
of the work this night. Yes, and it's also constructing a bit slower, but it also can be a improvement. Now we have just prototype. So on simple queries, it show about the same performance
34:31
as gene, and even a bit faster. And on numerics, it have a similar relation
34:47
of size and construction time. These data sets have a lot of numerics. And this is overlap query with the word code class,
35:03
which appears to be faster than gene. Yes, but when we have range queries, it have the same problem with planning. Yes, but also if we could to put this condition
35:26
into a check, then it became very fast. This example, vodka use the same fast scan as gene,
35:45
but gene use fast scan only when you have no partial match. But vodka have no notion of partial match. It just see if we have one entry or more than one entry
36:02
match in the operator. Yes, and in this particular case, in gene, we have partial match because of range. Yes, but in this particular case, only one entry match this condition. And that's why vodka can use fast scan.
36:23
And this query is much faster. But it is about the more advances in gene fast scan, because gene can be as fast as vodka in this case. And also, vodka also need some positional information
36:43
as gene in posting list and posting trees. Because in this query, both quals is about the same element of array. But in index, you can distinguish
37:03
different elements of array. And then you have a lot of recheck, because you retrieve a lot of results from index where different elements of array are matching corresponding quals. So positional information could help us a lot,
37:20
because most part of time, we spend in recheck in heap. So also, there is a different kind of vodka. And we can implement very many useful op classes with vodka.
37:42
For example, as a case that we call spaghetti indexing. So imagine you have a very long spaghetti. And you want to find the spaghetti which will yield to your fork. So if you have MBR approach, each spaghetti
38:04
have a very large MBR with R tree. And then you will have to do almost a sequential scan, even if you have an index. So the spaghetti case is hard for R tree.
38:26
This example in Ottawa, York and George Street, yes, you will have very large MBRs for them. But if you will have not one MBR, but multiple boxes,
38:43
then you can have much better covering without extra space, without so much of extra space. So vodka allow you to write such a class. In the extract method, you divide one geometrical object
39:02
into several bounding boxes. So it's R tree vodka. Yes, entry tree would be a tree-based gist.
39:21
We need some algorithm for converting polygons with rectangles, but there are some researches on it, elected Google, and found several articles. And we need support of this work. We hope that both geos would help us.
39:43
Also, vodka have several problems. Yes, at first, we need to somehow update item pointer in entry tree. Using current interface of access methods, we can only insert new entry and do vacuum somewhere.
40:06
But it's quite wired and expensive. Probably, we need a new method to do this. Also, if you will implement vodka classes in extension, now it's quite hard to get voids of classes and operators.
40:23
But I think it's a minor problem. Current prototype, storing entry tree in separate file with separate feature relation data, we should construct on the fly.
40:42
This is just a prototype. And the question is, what is correct infrastructure to handle? But we need to discuss it on the unconference. And another question is the future of vodka,
41:01
so-called vodka plus, to replacing posting tree with arbitrary access methods as well. Not only entry trees, but posting tree. But in this situation, you definitely should store multiple posting trees in the same file. And there is a lot of questions.
41:22
What interface should handle this? Should we pass to the access method its root block number? How to work on this? So a lot of questions. In the summary, we have JS query for 9.4.
41:43
It have JSON BigQuery language, two Genome classes. And you can find it on my GitHub. We have prototype of vodka access method. We have a lot of plans of improving its infrastructure. And our work was supported by Heroku.
42:02
Many thanks to them. And that's all. Meet us in Madrid. And we hope we will have better vodka then. Can you ask me about how to understand the acronym vodka?
42:31
Yes. It was quite hard to invent this acronym.
42:40
Yes, but we have a recursive acronym, vodka-optimized-then-reform-case-array. Array, yes. So questions? But I believe we already run out of time. So we will be available for any discussion.
43:17
In principle, yes, it's possible.
43:20
But I believe it's hard to do just JS query matching operator, which will have to parse JSON. And I don't think they're allowed to do this. But you just cast JSON to JSON-B. Yes, you can just cast. Because you know if you have, for instance,
43:41
duplicate keys in JSON, JS query anyway can't distinguish them. And can't distinguish the order of keys. So the cast is an option.
44:01
There are many, many languages. We checked them and then decided better to write our one. There is a research paper, you know, big research paper about 11 JSON query language. But they look so ugly. So you have seen the MongoDB query language.
44:25
So you have seen the MongoDB query language. We don't like this query language.
44:44
Sorry, what we can you repeat? I don't know. You probably could write just cast from JSON of MongoDB
45:03
to JS query. I believe JS query supports. How long will it take to try? We can't answer until try. One week, maybe.
45:21
No, no, no, no. JS query. JS query. The first version was about one month. And then the modification was very fast. It was some operation, because the most problem is the implementation of JS query
45:40
to be comfortable to work with. So it can be a separate project to make a driver for MongoDB or cast curation. Can you build something like C++ language? C? C language? What language? Windows. That's a long part.
46:01
I don't know. Maybe we can release it like a PGXN. I don't know. Better. Or just provide data. But first, we need your support to just download test, because we need to find all bugs. We have time to 9.4.
46:26
How many of the details of the format of the binary format from the GCD? We have a separate presentation we can find from Dublin. Yeah, we have it. Just as all of you see, if you have other,
46:43
how do you find the examples already established for a realisation format for a project, such as universal games, or in terms of special things, it's not very likely that it's designed specifically for reducing the storage size for a human being.
47:08
Your question is very long. Too long to read. I mean, it's just a realisation format. It's just very strict. I don't know.
47:21
I'm just guessing. So JSONB is format optimised for fast access. You mean send receive over network or storage in the files on the disk?
47:45
So JSONB is optimised for fast access. So it stores the keys sorted so you can use a binary search to find the particular key, so it's really fast.
48:05
So let's discuss this on unconference.