GIN in 9.4 and further

Video in TIB AV-Portal: GIN in 9.4 and further

Formal Metadata

GIN in 9.4 and further
Alternative Title
GIN - Stronger than ever
Title of Series
Number of Parts
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
Production Place
Ottawa, Canada

Content Metadata

Subject Area
This talk presents set of GIN advances in PostgreSQL 9.4 and further which brings GIN to new level of performance and extendability. Most important advances are: posting lists compression, fast-scan algorithm, storing additional information and index-based ranking. This talk presents set of GIN advances: Compression posting lists. Indexes become 2 times smaller without any work in opclass. pg upgrade is supported, old indexes will be recompressed on the fly. Fast scan algorithm. Fast scan allows GIN to skip parts of large posting trees during index scan. It dramatically improve performance of hstore and json search operators as well as FTS "frequentterm & rareterm" case. In order to use this improvement three-state logic support required in "consistent" opclass method. Storing additional (opclass defined) information in posting lists. Usage of additional information for filtering enables new features for GIN opclasses: better phrase search, better array similarity search, inverse FTS search (search for tsqueries matching tsvector), inverse regex search (search for regexes matching string), better string similarity using positioned n-grams. Index based ranking. This improvement allows GIN to return results in opclass defined manner. Most important application is returning results in relevance order for FTS which dramatically reduces IO load. But there are other applications like returns arrays in similarity order. We present the results of benchmarks for FTS using several datasets (6 M and 15 M documents) and real-life load for PostgreSQL and Sphinx full-text search engines and demonstrate that improved PostgreSQL FTS (with all ACID overhead) outperforms the standalone Sphinx search engine.
Keywords Alexander Korotkov Oleg Bartunov
Point (geometry) Cone penetration test Divisor Integrated development environment Code Data storage device Speech synthesis James Waddell Alexander II
Point (geometry) Presentation of a group Table (information) Electronic mailing list Counting Regular graph Mereology Number Local Group Medical imaging Order (biology) Formal verification Subject indexing Extension (kinesiology) Data type Multiplication File format Interior (topology) Electronic mailing list Content (media) Coma Berenices Ext functor Database Mereology Number Arithmetic mean Word Key (cryptography) Cycle (graph theory) Service-oriented architecture Table (information) Marginal distribution Spacetime Data compression Extension (kinesiology)
Web page Group action 1 (number) Electronic mailing list Fitness function Data storage device Number Network topology Personal digital assistant Network topology Integer Table (information) Row (database)
Tuple Email Overhead (computing) Multiplication sign Pointer (computer programming) Positional notation Network topology Memory management Bezeichnungssystem Physical system Form (programming) Overhead (computing) Key (cryptography) Block (periodic table) Web page Data storage device Electronic mailing list Number Uniform resource locator Pointer (computer programming) Personal digital assistant Network topology Revision control Video game Key (cryptography) Block (periodic table) Table (information) Data structure Row (database)
Point (geometry) Web page Tuple Multiplication sign Range (statistics) Similarity (geometry) Electronic mailing list Dirac delta function Variance Variable (mathematics) Number Pointer (computer programming) Mechanism design Different (Kate Ryan album) Integer Codierung <Programmierung> Kolmogorov complexity Data compression Position operator Algorithm File format Data storage device Electronic mailing list Bit Single-precision floating-point format Number Pointer (computer programming) Revision control Freeware Integer Row (database)
Web page Point (geometry) Tuple Vacuum Group action Algorithm Length Code Line (geometry) 1 (number) Electronic mailing list Raster graphics Mereology Variance Heegaard splitting Pointer (computer programming) Centralizer and normalizer Network topology Different (Kate Ryan album) Spacetime Kolmogorov complexity Codierung <Programmierung> MiniDisc Exception handling Algorithm Email Mapping Web page Electronic mailing list Fitness function Data storage device Coma Berenices Total S.A. Bit Numbering scheme Category of being Causality Personal digital assistant output Integer Curve fitting Data compression Vacuum
Tuple Single-precision floating-point format Pointer (computer programming) Number Codierung <Programmierung> Revision control Set (mathematics) Electronic mailing list Dirac delta function Integer Variable (mathematics) Variance
Presentation of a group Beta function Code Gradient Insertion loss Water vapor Mereology Variance Medical imaging Heegaard splitting Exclusive or Different (Kate Ryan album) Personal digital assistant Endliche Modelltheorie Kolmogorov complexity Algorithm Constraint (mathematics) File format Linear regression Web page Electronic mailing list Lattice (order) Numbering scheme Category of being MiniDisc Normal (geometry) Text editor Quicksort Spacetime Row (database) Point (geometry) Web page Reading (process) Table (information) Algorithm Electronic mailing list Revision control Spacetime Software testing MiniDisc Mathematical optimization Form (programming) Scaling (geometry) Uniqueness quantification Code Line (geometry) Causality Pointer (computer programming) Grand Unified Theory Personal digital assistant Network topology Table (information) Integer Vacuum
Algorithm Data storage device Electronic mailing list Variance Medical imaging Network topology Query language Energy level Spacetime Integer Kolmogorov complexity MiniDisc Mathematical optimization Fundamental theorem of algebra Operations research Matching (graph theory) Key (cryptography) Mereology Numbering scheme Causality Word Series (mathematics) Normal (geometry) Key (cryptography) Table (information) Integer Window Row (database) Vacuum
Source code Open source Code Mountain pass Source code Open source Database Database Surgery Open set Word Different (Kate Ryan album) Query language Energy level Row (database)
Functional (mathematics) Beta function Matching (graph theory) Consistency Source code Electronic mailing list Port scanner Database Library catalog Client (computing) Mass Open set Word Word Arithmetic mean Personal digital assistant Query language Resultant Form (programming) Social class Newton's law of universal gravitation
Point (geometry) Matching (graph theory) Mapping Electronic mailing list 1 (number) Database Electronic mailing list Term (mathematics) Open set Word Word Estimator Personal digital assistant Order (biology) Query language Row (database)
Source code Functional (mathematics) Matching (graph theory) Key (cryptography) Information Consistency Consistency Open source Combinational logic Operator (mathematics) Database Rule of inference Open set CAN bus Word Word Function (mathematics) Query language Social class Row (database) output Physical system Social class Row (database)
Source code Functional (mathematics) Matching (graph theory) Open source Mountain pass Consistency Open source Database Database Open set Word Personal digital assistant Query language Operator (mathematics) Query language Condition number
Web page Source code Addition Data recovery Mountain pass Electronic mailing list Code Port scanner Database Mereology Open set Word Crash (computing) Network topology Query language Condition number
Point (geometry) Greatest element Matching (graph theory) Key (cryptography) Mapping Weight Web page 1 (number) Usability Bit Data storage device Number Estimator Network topology Personal digital assistant Query language Energy level Maize Key (cryptography) MiniDisc Extension (kinesiology) Arc (geometry)
Word Mechanism design Mathematics Loop (music) Order (biology) Electronic mailing list Database Electronic mailing list Open set
System call Addition Code Ferry Corsten Multiplication sign Data recovery Bit Workload Pointer (computer programming) Crash (computing) Subject indexing Query language Software testing Code refactoring Information James Waddell Alexander II Pairwise comparison Overhead (computing) Patch (Unix) Data recovery Code Staff (military) Mereology Crash (computing) Function (mathematics) Code refactoring Representation (politics) Mathematical optimization
Building Concurrency (computer science) Code Direction (geometry) Combinational logic Schweizerische Physikalische Gesellschaft Substitute good Mechanism design Mathematics Bit rate Different (Kate Ryan album) Ontology Finitary relation Cuboid Kolmogorov complexity Area Algorithm Mapping Data storage device Electronic mailing list Bit Type theory Data management Process (computing) Normal (geometry) Pattern language Point (geometry) Slide rule Implementation Data storage device Electronic mailing list Graph coloring Binary file Code Network topology Term (mathematics) Subject indexing Mathematical optimization Boolean algebra Pairwise comparison Standard deviation Dialect Matching (graph theory) Key (cryptography) Plastikkarte Independence (probability theory) Field (computer science) Number Word Pointer (computer programming) Personal digital assistant Network topology Interpreter (computing) Video game Extension (kinesiology)
Parsing Overhead (computing) Group action Focus (optics) Data recovery Concurrency (computer science) Disintegration Ext functor Database Attribute grammar 3 (number) ACID Expandierender Graph Web crawler Query language CAN bus Subject indexing Query language Ranking Kolmogorov complexity James Waddell Alexander II
Pairwise comparison Run time (program lifecycle phase) Moment (mathematics) Core dump Ultraviolet photoelectron spectroscopy Limit (category theory) Port scanner CAN bus Order (biology) Read-only memory Uniform resource name Memory management Vector space IRIS-T Row (database) Ranking Key (cryptography) Musical ensemble MiniDisc James Waddell Alexander II Quicksort
Building Table (information) Patch (Unix) Hoax Building Real number Physical law File format Operator (mathematics) Student's t-test Bookmark (World Wide Web) Rule of inference Number Data mining Query language Personal digital assistant Function (mathematics) Military operation Universe (mathematics) Query language Subject indexing Video game Key (cryptography) James Waddell Alexander II
Personal digital assistant Multiplication sign Subject indexing Row (database) Key (cryptography) Procedural programming Counting Electric current
Reading (process) Slide rule Table (information) Disintegration File format Computer-aided design Data storage device Limit (category theory) Special unitary group Variance Order (biology) Flow separation Military operation Personal digital assistant Vector space Subject indexing Query language Ranking Information Lie group MiniDisc James Waddell Alexander II Window Default (computer science) Parsing Overhead (computing) Torus Touchscreen Patch (Unix) Building Interior (topology) Operator (mathematics) Bookmark (World Wide Web) Open set CAN bus Number Multi-agent system Function (mathematics) Uniform resource name Key (cryptography) Block (periodic table) Mathematical optimization Electric current Data compression
Multiplication sign Database Limit (category theory) Special unitary group Rule of inference Emulation Order (biology) Degree (graph theory) Inverse problem Memory management Ranking MiniDisc Conditional-access module James Waddell Alexander II Video Genie Metropolitan area network Overhead (computing) Attribute grammar Mass Menu (computing) Computer network Web crawler CAN bus Word Thumbnail Kolmogorov complexity Key (cryptography)
Operations research Weight Artificial neural network Operator (mathematics) Library catalog Online help Color management Ultraviolet photoelectron spectroscopy Mereology Special unitary group Area Type theory Insertion loss Network topology James Waddell Alexander II
the you yeah OK it started on 2 minus taking environments and I'm going to talk about sturgeon improvements that were made in my point for most of the code is written by Alexandra growth cone of I did a lot of work on refactoring them and discussed and list on so the store of these factors but I also spend a lot of fun working on them so that's why I'm speaking I might lose my voice at some point I did last nite so if that happens I'm going to chew over to him so most
resources so that those who don't know I'm taking the work for the American click on my work and all kinds of back and stuff and posters have been doing that for 10 years on this is a lot of stuff I work from most format for anyway we did 2 major improvements through June another for the cycle 1 is that what we call compressed posting lists which basically means that in mind when foraging indexes are going to be much smaller than before and are generally smaller is better when you thought about databases because smaller means this faster and the scan can them fit more fit in RAM takes less space back out so offensive benefits when your data is smaller became so that's a simple significant thing we did is what we call the fast scan that list of the patters called I tried to avoid using that name anywhere in the comments or documentation because I hate that name of fast can't see anything on what will the feature is really all about is that when you have to search through search for multiple keywords I so that 1 of the key words just happens brokers were frequently and 1 of them is very rare we do have a lot smaller right now so that that's faster before would scan the posting lists of both items which means that you get the performance of you know you're bugged by the frequent item you have scanned items of all those things that happen all of them were now we just you know stand the rare items and verify that the frequent item access to I'll come back later that so part 1 of this presentation is going be compressed posting on just to give you an idea of what this does migrants simple table could numbers has 1 column and I am sure that a bunch of numbers that which goes from 0 to 9 on 1st of all I use the extension between Jennifer not familiar with that is that you should become familiar with that it's really cool 1 false you to use Union indexes on 1 images and text and all that kind of regular of this allows also to use gene which is as a replacement for between anyway once you do that I create a created of normal B-tree index on this and I the Gini index and
just to give you an idea of a table looks like of contents numbers from 0 to 9 and there's 1 million of each of these roles OK so there's a lot of duplicates that's the point of this margin images very at and duplicates that's what I'm trying to demonstrate here so there's only 10 distinct values in the table but then Israel's
on the outside 89 . 3 you can see that it's much better than a B-tree at storing the integers only 58 megabytes were B-tree indexes over 200 the table table size was what 350 megabytes of member all 9 . 4 it's only 11 megabytes for the gene index over there so that's that's that's a big improvement I think this is pretty much the best case scenario for for this fact so I wanted to show up but in any case never going to be worse than it was before it's always going to be the same size in the worst case but is used going be somewhat smaller or even a lot smaller happens to be effective on your so can explain how this
works to select basically this is what the gene index looks like 1 way to think about the gene index is that it's a normal but it's heavily optimized for storing duplicates are so we have a normal B-tree what we call an engine tree and whenever there's a unique value that means that the the thing that this be treated just like any other he works like a normal between index of but if the 1st through items with the 2 2 rows of the same value then instead of inserting the 2nd entry into the the military we just create a list of all the ones that have and if that least euros big so that it doesn't fit in the page anymore we create another page are just to hold hold the list of items that all have the same key OK and when we feel that creates 2 group we create another 3 of 4 pages that all content just duplicate items that all have the same key value that's that's what this looks like on this so each of these
items that is stored index is what we call internalize pointer by life consists of the block number which block on the the actual cable this role is and those a 2 byte offset within the table where where and that of where that faces us so we require a normally 6 bytes per every role in the store the pointer from the index on this is the user location we use a few of these key ideas from the system we use the notation we block number and of the problem and I use that efficient on so in a normal B-tree index we repeat the key for it in the world have 4 rows in the table and they all have a key value for bar and all of the tree is going to contain for also all had the same key for which is 1 pointing to 1 of the whole thing and also because the Gini index we just in store the key once and then we have a list of items which makes perfect sense and why would you store this on and actually i've been wondering for a long time why don't we do this for the normal between index that makes perfect sense but we don't know why but gene index does this and that's what makes it so much smaller but so for example in
this case where you have 4 roles in the heat that have the same key along the index when it's going to require 37 bytes to store the list of pointers are you have some fixed couple overhead and yet the key which is full and 24 bytes to store all of sudden interest and the form that
we use in i . 4 1st of all some guy and this is the list of item pointers sorted by the physical position the which becomes important because now we can take advantage of the fact that if you have on pointers that all point to the same page we can compress them very efficient because they will have similar values on base with reduced the store a list of pointers absolutely for use for the 1st buy them as usual but the others are just stored as delta lost from the previous 1 but the difference from the previous and that itself doesn't help or much but then well this they'll loss but
typically very small integers if for example if you have 100 couples on the same he page you can have 1 1 1 1 1 1 1 that's the difference 1 fact and those are small integers and there's a lot of algorithms on how to compress a list of small integers and the the the compression algorithm we chose something that's called for by encoding all it's basically means that there a number of smaller than 128 we have the 1st bit set 0 and the rest of is the number if it's more larger than that then you add the bytes with the 1st bits into 1 and so forth and you can encode encoding of these things but it just so happens that in this can coding this it just happens that these so that you can encode the full range of values that we need in 6 bytes even though they know every time you compress there's going to be something that for some values but it just happens that old format was so propose and the rows of free bits available that between never need more than 6 bytes and in and went you always need 6 by stories of this up so you're never going to lose with this compression mechanism could go compared to what we we had 3 which is very nice is that indexes are never going to be larger for so I
mentioned that if this list grows very large that doesn't fit in this page and more than we create multiple pages and each page is just a large list of items compress the way shown for all it's not actually just 1 list it's a splitting the multiple segments so that you have to update you don't need to read court all hold page just the part that you updated on we do this for both posting lists that are in the book lists and that can and should be 3 and also this all of which are about the compression
algorithm I spent quite a lot of fun reading the literature on on different algorithms that they're bobwhite encodings what Alexandre implemented originally and so that's there in the 1st version of but then I went and the fact that the code heavily to make it easier to experiment with different different algorithms and unattractive you once 1 of the ones that I tried was called Simple Minds not forget how it works but it it's supposedly faster to operate in a more compact than the war by encoding and I think it was somewhat more compact but it didn't really wasn't a big difference along interesting point here is that we could use any encoding for example you could use a bit mac and that DSS effectively give us a bit mapping discussed this looks exactly the same as a bitmap index dust except that the users worldwide encoding and other bit-mapped store them to put on so the input what I made earlier is that with the target which shows the so this is another point of with this algorithm whenever you remove an item from the middle of the list of all the total is never gets larger many other algorithms don't have this property which means that if you delete the role in the middle of the vacuum if you would have to make the list longer just encoding just works that way for example if you did run length encoding and you remove something in the middle of you have to split into 2 things that are to get a larger than what had origin which is kind of annoying because that means that when you vacuum you might need to do pay splits evacuation so forth but this out that have from this of the of the of the group OK so for example if you wanna let's say that you have they're facets 1 2 3 4 5 and you want compress those all 1 1 way to implement that would be to have a big mac or world world said encode that on all those in the UK central Brooklyn e-mail explaining he came up with a pretty good prove that that is the case but basically if you have both
sets of war but encoding I
probabilities work of let's pretend yet you can not be explained currently related to keep all of this is what we would like to know if he was 1 of the so I what did know what was going on all of the use of 1 on what they're going to get the fact so this was a very
nice property of this this property was really nice and that immediately ruled out many other algorithms I think simple line had this problem and the worldwide sold on this immediately ruled out many algorithms that we might have to on but if the code still kind of model and it's it would be produced 2 experiments deal with different algorithms and you can even use different algorithms depending on whether it's 80 or not but then it gets really difficult to ensure that you keep that property of a someone who was going to be yeah maybe quantity with the size of the of the whole the editor 1 idea I've had is that they could keep a separate bit-mapped of problems that have been deleted or something like that and you can get around that way that gets a lot more complicated so no even if you can guarantee that it can only increase by some value then as soon as it can increase of all you need is that you have to be able to speak during backing which kind of sorts because it means that during which might grow when you're trying to make room for example if you run out of disk space you might then backing to make some space and now you can just vacuum on things because there's no that's based on there are actually at some other code path through he held that can still happen for example the fast optical we did earlier which means that when you insert the gene tree we just put it in a list and then we go and merge it later at that actually has the problem that it's possible that the run of the space when you back you use you would have to require temporarily some more space but that's a different problem I avoided of so that's
part 1 of my presentation that this was the postings wasn't really just a summary of the gene images will be much smaller smaller their code can still read the old format which means that PC operates still works but obviously you're indexes not magically gets magically gonna get smaller renewed PT operate on there will be converted as you update them so when feel if you insert new new rows to the table we will convert the pages on the fly ash as necessary our blood really using gene exists and you upgrade I highly recommend that you read index sometime after the upgrade just to get this benefit the much smaller didn't have to if you don't want to really can do it we can draw some on the this compression makes updates Randall updates of anything in some was somewhat more expensive because though format we did was just plain list of flight and pointers and it was pretty simple to just insert something in the middle and make room on the new format on you have to be called to segment and the nascent with back again we could optimize that on so the orginal version in that election as submitted did something more clever at split the the compressed list insert 1 value in the middle and then form and back again on don't do that currently but we could if if it becomes a problem but in general Jimmy's isn't June isn't very fast in front of updates a laser on thing it's a huge problem and it's not that much slower and yes the all yeah from the dust is are going be much smaller now so you want to choose his genes in more cases than before in the world I couldn't give you any numbers but by tested it but I tried to watch made it as fast as it was before my gut feeling it's like 10 % lower in the worst cases and so not too bad but I tell if you have a new scale if you have a test case it would be great if you could guess that water still in beta and if there's a huge regression somewhere in some corner cases be it's still possible to fix it just sensitive the bolting all no don't think there's any difference there it's small at the beginning so if you have some below that's OK does not occur so that you have meeting in the gene you need indexes you can't you can't build a normal unique constraint on engine in it but you can do an exclusion constraints that is the same but if you want if it's a unique index this this stuff helps just the point of the optimization is to optimize the lists yeah the question is if you can do a unique indexes origin yeah but it wouldn't you wouldn't want to do that because in if you needed unique index than the B-tree is much chose gene is optimized for the non unique case so that kind of thing OK any more questions about the posting list compression before I move on to the best of this 1
and this level now so the 1 thing I'm going to come back to that the Gini index the B-tree just indexes 1 value per row can the gini index can index you can have can values for every row in the table for example you build text search window 1 to see that the so
Eric him out to continue come back to that so there are 3 fundamental things that the Gini index dust of 1st of all when you build the index for every row in a table is going somehow extract the index keys from the roles in normal BG you have a B-tree over an integer for example is that the images that index but the Gini index if you build a text search index for example and you have to start extract all the words and it's every 1 of those roles in the index so if you have 10 words you can have 10 entries in the index for every row in the table and that's why usually end up with a lot of that gets to so that's 1 thing that the gene doesn't extracts these keys the G 1 index then stores among and that's where the will of the compression posting optimization comes that it's much more efficient at doing that now I and then the 3rd thing that dust so once you when you search and you figured out which he's you're searching for it then combines of the matches and it tries to do that as efficiently as possible and that's where the 2nd path comes in OK
so here's an example if you have a like this on an an advanced PostgreSQL well open source database what is going to do it can extract these words and then it's going to do an index surgery searches for posters and it searches for advanced and open and source and database and then it combines the results so that it only returns the worlds that matched all of along so so yeah
that's that's so that's 1 difference between gene and animal between you have multiple but at the on this level there is no difference if you could be answered multiple I consider normal B-tree index and the code wouldn't care basically with our plenary do that for animals between that's all they point to the heat couple yes so
when you run a query like that so what we do is that we collect these lists for every 1 of these keywords by doing an index can induce catalog that we actually do 5 scans of the beta and X 1 form of words in your query and then we combine the results in this case the the query will look like this the
green means that the customer at all of these words for the role mass so the way it works get this lists of items for interest and they're nobody compressed and compact but and then we walk through all of this lists starting from the meaning mind and the smallest by which in this case is 0 . 1 and then we check dust dust dust that satisfy the query became 0 . 1 and only match 1 of these roles we call over over the classes of consistent function which in this case it's the same note this doesn't match because it only matters 1 of these of keywords then we will move on to the next 1 which is 0 . 2 and now we find OK that matches open-ended source and then we call again the consistent function and it says no it still doesn't matter it's the loan amounts to of this in what to the next 1 0 1 3 again only 1 match and a consistent return false and finally we get the whole thing that matches with 0 . 8 to find that OK we have a match for all of roles and in this case the constant function will say yes and the gini index will say greater than going return this to the clients and you finally get the result and the thing is pretty obvious that this is a this is a way to do it when you need to think it's
quite obvious more how it should work in retrospect so in . 4 fewer case like that where you have the you only have 2 matters
for the work posters and if you know that because all of these words it kind of makes sense that you should only matter you just stick those 2 roles why do you bother going through all these other lists and that is exactly what we did in point 4 1st of all the
order of these words of by an estimate of how many items there are so we begin with the word posters you take that 1st row and we check is there a match for all of these other ones and it turns out yes there is so we immediate return that a then you move on to the next next for over all that matters PostgreSQL and we check history maps for all of these other rows and we find that there is and then we need to give up and downstream country that's for explicitly know that there are no more matches and we return this is all there is a lot faster than going through all of this list some of which can have thousands of entries of so there's a trick here
I mentioned that the says it has to match all of these roles but Jin index doesn't actually know that it has to call the consistent function without combination of things that the match and you will return true false saying well was the matched doesn't automatically know that it has not all of these rules if this was an awkward custom at any of these words we obviously
couldn't do this we couldn't steep entries and can be would have this on so for that purpose
we introduce the new function that the over the class can defined as a sculptor try consistent I of try that no consistent value you just give it the true and false for it's either and this 1 you true false or maybe which means that the gene can ask the crime consistent function that OK I know that this match Pope stressed but I don't know if this rule matches any of these other things is it even possible that this row can match regardless of if it matches the lower and the consistent try consistent function can return yes this is definitely a match no matter what you have all these other keys within the downfall saying you know who there's no way this can match or it can be turned maybe it says I I don't know you need more information and we use that to determine
whether it's safe to do this keeping a of the simpler case which I
actually committed 1st because it just makes more sense this 1 doesn't require the consistent function if you do the crew like that we test the match if you use the and operator directly in your query and not hidden inside the T a search query the search for post-disco and open source database then the gene machinery knows immediately that it's in and caught has multiple things and are this you can do even without the try consistent function I that's
what can and questions yes that's a good question so how do we know
how many interest there on the spot all the these scans 1st so we have the sent the tree the entry tree and there immediately if if the if you immediately have there the list of items that matter and it fit on the same page 3 which just kind of know or already how many there are but if it's a longer list so that you end up with the earlier if it's if
you have so many entries with the same key like can pass and that's the end up with this post injury then what we do is that these go true the posting tree and traverse the bottom layer level and estimated OK if if it was 3 levels deep we use that as an estimate of how many many matches there are it's not totally accurate but it's it's good enough do basically trying to do is to separate the very rare items we have got to matches from the ones that match 10 thousand entries and its is accurate enough for the day all I can be it can be used for indexing search as it has been implemented or has had no way of thinking about what or we provide true but if I'm also somewhat full-text search now but used to be changing for example in it dust so that the overall know about so long as you want and you will not mind was there's a good point only supports bit mapping extensive and here you know what you want to know but aside from that problem that I don't see any if this was not possible in every case but there are cases where it would be possible like the example I used the earlier which is kind of silly just and it's of numbers but they do have those numbers in the index you could return them on J. just hasn't been implemented for the special case where it's possible for somewhat full-text search you can't get Peter yeah what this was the
much of this kind of works
like a marriage joined that is 1 way to put this on the problem with this method of right yes OK so I guess 1 way to put it is that the 9 . 3 way was like a merge join the scan through all of the lists and in order they are already in order to read and so on so it's like merge join and and the the as and fast scan mechanism is actually more like a nested loop changes you just take 1 of them and you then go check the others so that's as well as for this in general that is you can on the so
just 1 more thing you're along the way I did all of refactoring of just the existing code and making this readable and possible and 1 thing I also changed by Mary Jane the crash recovery works of splits and I did the same thing for B 3 and 9 . 4 so that the very internal this gearbox fixed to be honest but such a rare obscure Bach and back that that up until I'm just trying to say is that if you have workloads that you can test with please don't hold 9 . 4 be done and test judging indexes and you'll be trained actors on so yeah there's
some more stuff we do 1 thing imagine this like a joint and 1 thing i've been wondering for a long time is that gene is actually like a mini planner N-termini executor of it's all this kind of silly and so you would be nice to actually pull some of this fast scan staff and could just pull it in the exit main executor give you joined together with other indexes and all kinds of interesting started with 1st didn't they call of this into the gini index code on another thing where you can be couple things is you had all like and others talk about what kind axes and I don't know how many people understood what that was all about home if you were and what kind of it but if you how many understood what was a lot so long as the other 2 there is not that I expected
even less as so what what what guys above all 1 aspect of that is that they mentioned that this is a B tree and this is a the 3 but all of these stuff that we do there's no reason why it has to be a big victory it could could this upper layer into tree could be just index or it could be an SPG standards or any other index type you come come up with and we could still do this compression stuff with any other index and when gene extracts these keys like splits so full-text search into a words and index all of them you can do the same thing with index so what what guys all about this to separate these concepts so that you can build gene index over is produced origin index over something else combined it's different combinations it's interesting there's a lot more use cases and just gazing that they cover because like built to a very efficient 3 of points that's really good at storing duplicates if that makes sense I don't know if it was on there's all kinds of interesting stuff you tend to give the couple these things by values you have shown earlier how interesting have thought of that yet this practical in the in the in the middle of the 1 of the things you can do yes there's actually another point I want to make and I think I have a slide that on definitely
yeah my final gene yet this is are lot smaller than the 2 index indexes funeral of duplicates normally be normally people think of gene Hassan and access method for full-text search for kinds of beer stuff it's actually very useful just discipline replacement for the 3 you have people to because of same kinds of use cases that people use bin packing boxes for said area if we replace the compressed compressed posting lists with bit-mapped we would have exactly was a bit mapping at yeah you could make use of the considered the user-controllable there's also no fundamental reason why all you have to use the same mechanism for all of the entries so you could have a bit-mapped for stuff that Bovary contact and list of some other kind of stuff and the gene machinery could just to deal with that just fine yeah you can do that in the all and there is that even then there is no if you use it just as a replacement for tree in the upper enter trees just a B-tree it's just trans implementation of the trees so we just worked on that it could be there's no reason why it has to be smaller than 1 so that is 1 future direction we could go and enhance that and add all of those concurrency optimization and all of them are stuff that we work over the years from all regions that are the way to go would be to just throw these away and make the normal Beecher codes smarter 1 storing duplicates and there is no fundamental reason that you couldn't store multiple-item pointers for the same properly normal the tree index to be trained X code doesn't care that you could when you have multiple gets rid of the it's the normal B-tree code to then do this list stuff and then the compressed list stuff and you make a generic and again that is kind of when you end up with this what kind accessory have yet yes that's kind of yeah I few patterns more compactly than you have comparison with that so you don't want to so both your indexes just through this convention it but the and this is a good thing especially . 4 but even nine-country if you have those column like status that only has a few values on or off and then you should use the tree for that managing for that there's a not only 2 of the that leads to the for following the rate of growth for that that that world yeah so there's some more smarts in the B-tree could that be just hasn't been implemented in Jindigo because of the way that it will ya that's not the my wife we will be immersed in or we could become post these things and do what color rather that I'm open to ideas I I do looking forward to discuss this with William Alexander and interpret tomorrow what are we going to do what kind it is because it seems like this as a set the gene index is like a maniac executor and plenary itself and I'd really like to speed up the process really whereas changed my life or change in original independent yeah I think we I think we always do it so I don't think it's ever going to be it's quite significantly slower there's like this was a bad ontology it's not exactly it's like a merge join with keeping the no it's not the same algorithm and they had actually skip entries and not 1 of that and have all at the same as for a more normal merge join you have to stand both of the lists from beginning to end to find matches but even again this is 1 reason we must put this stuff in the numerical executed this is a trick the normal merge join but also do you have a 5 and 10 David immediately steep everything after 5 up to point 10 on the 1 side so if you're if 1 side of the emergence coming from an index indexical filled the index to reposition needed the tenants to call this fall in between no no that I guess the room generally and yeah I I guess the way to put it is that it's not really a nice little chance it's like merge joined with skipping that but the characters remember through anything you want just like other terms that would also like to add
by the way the here in this kind of thing and so is
that the roads were so I don't know if you want to hear what we will also
hold it and you know and also we will restore their uh how to hold a certain borders is very flexible but it's not very fast because of the use of full full stop the basic which is only a 3rd of the year here and we go out of focus which the soul of the union of the people in the following way so we must also go with the group all full-text search so we can compare all of the little and this is 1 of the
Greece I should go wrong and this resulted In this which is very fast the people so so he was a whole lot of musical and scholars would have us so if you were going this
book so we just isn't music and it was always the reason for all of June in moment you know comparison of the old approach and you will see that a little fast and then we give a
real what we this was the the over the years like queries that will be needed the thing is that although the rebels in which we really want to and know all the boarding it and the use real life we all around with all the all of of the University of all with you will see that that since the last 15 million queries and how would you June provide us with people who would have thought that it would be interesting to important because we use you don't move alterations in all of the search and experience in the region of
the world and because of so that it has all along with of of what they all these 2 of students and the case where there are several other questions still only for the about what we will call that and we all know and that's why I only invited by looking at the number of books like looks like very slower the building next now that true yeah yeah it's the just the of the rule of law mining and the 9 . 3 plus hatch sense that what can I didn't know or see that broken and the relative that had a historical review and here there was nobody interesting
because original called you know this last column is and what it up I will show you the institution like case that was the frequently cited the regret is at peace the is that well so we felt would be used in his of all of that we have to know what what the father and the father of the of so that occasional false so it's usually the user from medieval times glass you this procedure Oscar of goes beyond the standard usually heats up quickly and the will use this
is true examples of all that actually work with all the the and
now all of was that this is what it's
all over the world all the do you
want but just the slides feels that you have both like this doesn't look
right here have right because of the fact
that and technical problems can
just saying you want to hear those lies above 0 his but it goes out up work the full screen there are
the system in large the or with the
required to remove the word of the of the of involved in the almost of the we have to solve for like the and and
that is what the user wants to go with all the time the world will tell you or the of the genes that many of the of the use of what knowledge what will show that in a it's not a it's not what kinds of new rule I get the following you don't want to hear so it during the middle of a word is
just the of the goal of this also holds in the old age you know is that we can with the help of the of the of the weight of the wall at of the age of 18 years who was that there was a lot of of of the city and again and you


  630 ms - page object


AV-Portal 3.20.1 (bea96f1033d39fbe77f82542458e108105398441)