The State of (Full) Text Search in PostgreSQL 12
Formal Metadata
Title: The State of (Full) Text Search in PostgreSQL 12
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46908 (DOI)
FOSDEM 2020, 147 / 490
Transcript: English (auto-generated)
00:05
So we're very happy to have you, Jimmy Angelakos, please give him a round of applause. Thank you. Before we start, how many of you have a background in mathematics? So a few hands, many hands.
00:21
And how many of you have a background in linguistics? Oh, okay. Nice to see some people here with both kinds of skills. It's going to be boring for both of you because it doesn't have enough maths and it doesn't have enough linguistics. But it does have some samples.
00:42
So the motivation for this talk is I was speaking to some, let's say, colleagues, customers, community members, and I was surprised to find out that these things aren't all that well known. There are people that use Postgres that actually use search engines to index their text,
01:07
like external search engines. And it's all basically the same code base because it's always Lucene and Solr, which is Lucene, and Elasticsearch, which is Lucene. And these things aren't databases.
01:22
They create indexes for your text, but they're not databases. Your text is in your database. So why do you have an extra thing when you can index your text in databases? And that's the motivation. Let's see how you can search for text in Postgres. Now, we're going to look at what full text search means.
01:45
We're going to look at operators and functions, what dictionaries are in Postgres, a few examples, how to index text in Postgres, how to deal with text that isn't human language or natural text,
02:04
what collations are, what other things you can search for that are not strictly text, and how to keep this whole thing running. Now, are you excited for full text search? Yes.
02:21
You shouldn't be. Your trust is misplaced. So we do have things like this appearing in the presentation, and they have been known to induce drowsiness and inappropriate sleep. So if you feel like sleeping, nobody will blame you. Let's look at what is text.
02:45
Let's look at the representation of text in Postgres. In Postgres, we have the well-known char with a character limit that says you can only store this many characters. This is padded, so if you define a char of 10,
03:02
and insert five characters in that field, you will get five blank spaces at the end because it's padded. These blank spaces do not count for anything. They are ignored by Postgres. Then you have varchar, which is variable length but still has a fixed limit.
03:25
And then you have text, and, confusingly, varchar without a limit, which is the same thing. And you should really use the word text, I believe, if you want to store an unlimited amount of text, because it makes it clear that it is not varchar.
03:41
It's a different type of column. So this is a character large object. You can store a lot of text in there. And in variable length text, in varchar, we see that trailing spaces are significant. If you leave a space at the end of your string,
04:03
then that counts for LIKE expressions or for regular expressions. How are they stored? It depends on the character set. There are multi-byte encodings like UTF-8 that will use one byte for Latin text
04:21
or two or more bytes for other languages, or as many as needed for all these emojis and things that they keep adding. The storage is one byte of header plus the length of your string in bytes, not characters, up to 126 bytes.
04:43
And if you go over that limit, the storage changes to four bytes plus the length of your string in bytes. But there is no guarantee that this is the amount of space that it will take up on disk because of compression.
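As a quick illustrative sketch (the table and column names here are invented for the example), the padding and trailing-space behavior can be seen directly:

```sql
-- char(n) is blank-padded; varchar(n) is variable length with a limit;
-- text is variable length with no limit
CREATE TABLE type_demo (
    a char(10),
    b varchar(10),
    c text
);
INSERT INTO type_demo VALUES ('abcde', 'abcde ', 'abcde ');

-- The padding in char(n) is ignored in comparisons...
SELECT a = 'abcde' AS char_eq FROM type_demo;          -- true

-- ...but a trailing space stored in varchar/text is significant
SELECT b LIKE 'abcde' AS varchar_like FROM type_demo;  -- false

-- pg_column_size() shows the stored size (header + data, before TOAST)
SELECT pg_column_size(c) FROM type_demo;
```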
05:02
And we also have TOAST, The Oversized-Attribute Storage Technique, which puts the value in a separate table if it goes over the maximum length of your row. So what is text search?
05:23
We're talking here about information retrieval and more specifically, text retrieval. This is a well-known domain and we mainly use it for searches on metadata, which can be like descriptive, bibliographic, there can be tags, tagging objects,
05:42
and are used for discovery and identification. You are generally not, when you're talking about text search and text matching, you are not trying to identify, let's say, the web pages or the articles that you're interested in
06:01
through their content. You're looking at metadata fields and you're trying to match words out of the whole thing. So, matching, substring search, data extraction is useful for, you can clean your data, you can use it for data mining, and these are the supported operators in Postgres
06:22
for this sort of thing. You can use LIKE, or double tilde (~~), and case-insensitive ILIKE, or double tilde star (~~*), which are basically wildcard expressions.
06:40
If you want to match something against a wildcard with the percent sign or underscore, percent sign for any number of characters, underscore for just one character, then that is the SQL way to do it. You also get regular expressions, and the tilde (~) and tilde star (~*)
07:03
are POSIX-compatible regular expressions. So, not quite Perl, but very standard, and Postgres implements them out of the box. There's also the function regexp_match, where you can give it a string of text and a pattern
07:21
and it will return whatever matched the pattern from that original string that you gave it. But are these things enough for searching for text? You can't have, with regular expressions and likes,
07:43
you cannot get ranking of results, because nothing has meaning, it's just characters, it's not language. You can't have relevance in this. So, no concept of language when you're matching with regular expressions, therefore, if you search for a ship
08:03
and you search for shipping, it will be two totally different things that will never be related, they will not match. And also, they can't be indexed, which is bad, if you have a lot of text that you want to search for.
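A minimal sketch of these pattern operators and of regexp_match:

```sql
-- SQL wildcards: % matches any run of characters, _ exactly one
SELECT 'PostgreSQL' LIKE 'Postgre%';     -- true
SELECT 'PostgreSQL' ILIKE 'postgres_l';  -- true (case-insensitive)

-- POSIX regular expressions: ~ is case-sensitive, ~* case-insensitive
SELECT 'PostgreSQL' ~ '^Post.*SQL$';     -- true

-- regexp_match() returns the capture groups of the first match
SELECT regexp_match('release 12.1', '(\d+)\.(\d+)');   -- {12,1}
```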
08:20
Okay, you can index these in some limited way. And there's also SIMILAR TO; you'd better forget about that one. It's a mix between POSIX regular expressions and SQL that everyone is better off not using.
08:42
So, just forget about it. Now, we have inserted the word full. What do we mean by full text search? Now we're going a step beyond, from information retrieval and text retrieval, we're going to document retrieval. So, based on our predicates, or what we are searching for,
09:04
we want to return results that are documents, articles, web pages, whatever contains text. So, full text search is a search of words, or tokens, as we call them, in a database.
09:22
So, the sum of all documents that you have. If you don't have an index for this, this will degenerate into a serial search. So, when you start searching, it won't be faster than grep,
09:41
because it has to go through the entire document for every document. So, you do need an index for this, because you don't want to scan through the entire document. There are also techniques that, let's say, make it easier to search
10:01
and reduce the amount of text based on natural language processing algorithms. And we also have to talk about the famous trade-off of precision versus recall. What is precision? It's how accurate your search results are.
10:24
Therefore, how many false positives you're getting, or how many false negatives you're getting, when you're retrieving your result set from your query. And recall is how many results you're getting back
10:41
based on that query. How restrictive is your query? And these are influenced by things called stop words and also stemming. And we'll look at those. But first of all, let's look at documents and tokens. So, a document is a chunk of text. It's a field in a row in your database,
11:06
in a table, of course. Any type of text is considered a document for this context today. Parsing of documents into classes of tokens
11:22
is what needs to be done next. So, tokens can be strings, they can be alphanumeric strings, they can be numbers, they can be any sort of character group that you define that is interesting to you for your own purposes.
11:40
So, breaking up the text into tokens and classes of tokens is parsing. And we're fortunate enough, as a result of serious effort and time that has been spent, we do have a parser that is very good, that comes for free with Postgres. Or, if you have different requirements,
12:02
you can write your own parser, as long as you can write it in C. Now, in order to do anything useful with our dataset, we need to convert the tokens from these documents into lexemes.
12:22
Therefore, we have to perform what is called the normalization of our strings. We have to turn them into some more usable form from their original form. And what is a lexeme? It's an abstract lexical unit that is representing the set of related words,
12:46
so you can call it the word root, but a word root only applies for human language. So a lexeme, in the context of natural language, is the word root.
13:02
Therefore, the lexeme search, excuse me, the lexeme search can stand for the words searched and searcher because you can reduce these words down to that lexeme.
13:21
What are stop words that we mentioned? Stop words are very common words that appear in our text. Therefore, they have no value. Like, in English, articles like the and a do not have meaning for our search most of the time. So we filter them out,
13:43
and by filtering out stop words, that increases the precision of our search, right? Because we're getting more relevant results. It does. And how do we remove them?
14:02
We remove the stop words based on dictionaries. Some dictionaries have stop lists, and these are words that are checked and removed and eliminated from the document
14:21
before we can do anything with it. But if you do remove stop words, you have to consider one fact. What happens to phrase search? If you are looking for something that includes articles, for instance, or very common words, as a phrase,
14:40
then you have to consider other things. Let's talk about stemming. Stemming is reducing, as we said, the words to lexemes, and it does increase the number of the results. So it does increase recall.
15:02
And if we have something that matches multiple words, we're going to get more results. It goes without saying. So the algorithms that we can use for stemming are, as we said, normalization using dictionaries, where we can find out what the word stems are.
15:22
We can strip prefixes and suffixes that we know exist in a specific language and break down the word into its constituent parts. We can also use automatic production. So maybe if you find the word stem
15:42
by attempting to add 'ing' to it, you are creating a possible result by producing a new word based on the stem. We also have lemmatization rules that define how words are stemmed in each language,
16:04
and we also have n-gram models, which are probabilistic models that show how you can reduce words. But we mentioned languages here, and it is hard to do this across multiple languages
16:24
at the same time. So how is full text search represented in Postgres? We have a data type that's called tsvector, which represents a document
16:43
stripped down to its essential form. It's a pre-processed document stored as a separate data type in Postgres, and then we have tsquery, which is our search query, our predicate,
17:01
and it's normalized into lexemes. So before you can use Postgres's full text search features, you have to convert everything to one of these types. And these come with utility functions like to_tsvector, which turns your document into the tsvector form by normalizing it,
17:25
and we also have to_tsquery, which expects specific tsquery syntax that you have to know, or you can just use plainto_tsquery if you have some text that you want to turn into a tsquery, and that ignores capitalization and punctuation and all that.
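For instance, the conversion functions behave roughly like this (assuming the english configuration):

```sql
-- Normalize a document into a tsvector: lexemes plus their positions,
-- with stop words removed
SELECT to_tsvector('english', 'The quick brown foxes jumped');
-- 'brown':3 'fox':4 'jump':5 'quick':2

-- to_tsquery expects tsquery syntax (&, |, !)...
SELECT to_tsquery('english', 'quick & foxes');      -- 'quick' & 'fox'

-- ...while plainto_tsquery accepts free text and ANDs the lexemes
SELECT plainto_tsquery('english', 'Quick foxes!');  -- 'quick' & 'fox'
```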
17:47
If you want to find out how these things happen under the hood, you can use ts_debug, and that will show you how everything is converted and what token type it's been converted into
18:01
by the dictionary that you're using. So the operator types that we have for full text search are the double at sign (@@), which means the tsvector matches the tsquery on the other side, and you can also turn it around and make it tsquery matches tsvector.
18:21
It works both ways. We have tsvector concatenation (||), and for tsqueries we have the operators && (AND), || (OR) and !! (NOT) that help you formulate your query. We also have the followed-by operator (<->), which means that tsquery 1 must be followed by tsquery 2,
18:42
and that's a way to define order in your search, word order. We also have contains (@>) and is contained in (<@). So we mentioned dictionaries multiple times. What are dictionaries in Postgres? They're programs.
19:00
They accept tokens as input, and their output is the normalized text that we need. They improve the search quality, as we said, by eliminating stop words and normalizing into lexemes. They reduce the size of your tsvector, which is good,
19:21
and you can create search dictionaries with CREATE TEXT SEARCH DICTIONARY. You can choose one of the many available templates; simple doesn't attempt to do anything smart. It just separates words by whitespace, which is useful if you want to match proper names
19:43
that are not words in any language. You can also define the set of stop words for your text search configuration. You can also decide to chain those in the order which is convenient for you.
20:02
You can have it first perform the dictionary normalization using Ispell, and then the simple one, and you can assign different weights to each one of those, so you can tell apart the proper names,
20:21
let's say, that were kept by simple, but were eliminated by the English Ispell dictionary, and you can assign different weights to them in your results. These come with your system because we can use the open source Ispell, MySpell, and Hunspell dictionaries for this, on top of the included ones in Postgres.
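A sketch of such a chained setup; the dictionary, configuration, and file names below are assumptions, and the Ispell files must already be installed in Postgres's text search data directory:

```sql
-- An Ispell-based dictionary (DictFile/AffFile/StopWords name files
-- on disk; 'english' here is an assumed file name)
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = english,
    AffFile   = english,
    StopWords = english
);

-- A configuration that tries Ispell first, then falls back to simple,
-- so proper names that Ispell doesn't know are still kept
CREATE TEXT SEARCH CONFIGURATION my_search (COPY = pg_catalog.english);
ALTER TEXT SEARCH CONFIGURATION my_search
    ALTER MAPPING FOR asciiword, word
    WITH english_ispell, simple;
```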
20:45
Let's do some text matching. We select to_tsvector('a nice day for a car ride') as our document, and our query is 'I am riding'. Does it match?
21:03
Yes, it matches, according to our English dictionary rules. But why? I want to find out what to_tsvector generated that made it match. to_tsvector generated these lexemes:
21:23
car, day, nice, ride, and to_tsquery reduced everything down to ride. So, yes, it matches; it's one of those.
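The example from the slide, written out:

```sql
SELECT to_tsvector('english', 'a nice day for a car ride')
       @@ plainto_tsquery('english', 'I am riding');    -- true

-- What each side was reduced to:
SELECT to_tsvector('english', 'a nice day for a car ride');
-- 'car':6 'day':3 'nice':2 'ride':7
SELECT plainto_tsquery('english', 'I am riding');       -- 'ride'
```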
21:42
Let's see something that doesn't match. So, if I select to_tsvector('a nice day for a car ride'), and my tsquery is 'I am riding a bike', does that match? No, it doesn't match, because plainto_tsquery converted my query:
22:06
it stripped it down to the two essential words, ride and bike. And bike isn't included, so therefore, the document isn't relevant to our search, it's not returned.
22:21
Another example we can look at for matching is 'Starman' and 'star'. They do not match because, according to our dictionary as always, they don't belong to the same word root. But I can cheat, and I can say I'm interested in all words that begin with 'star'.
22:42
So I can say to_tsquery('star:*'), star followed by a colon and an asterisk, and that matches, because that makes it accept everything that begins with star.
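Written out (with an invented document), the prefix trick looks like:

```sql
SELECT to_tsvector('english', 'Starman waiting in the sky')
       @@ to_tsquery('english', 'star');      -- false

-- :* turns the term into a prefix match
SELECT to_tsvector('english', 'Starman waiting in the sky')
       @@ to_tsquery('english', 'star:*');    -- true
```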
23:01
Now you will notice that I am not using to_tsvector and to_tsquery here, because Postgres does it automatically for you sometimes. When it is able to, it will cast these to the proper types automatically.
23:21
We also have the wonderful function websearch_to_tsquery, which accepts a Google-style query, and you can use things like and, or, and you can use double quotation marks to mark phrases, and you can also use minus signs to remove the results that you don't want,
23:44
like a web search engine. So this makes our tsquery become stray followed by cat, and it must not have cat followed by shelter, because I'm only interested in the stray cats, but not cat shelters,
24:03
for the purposes of our search, of course. Let's look at an example of immediately followed by. Sorry, the question was, does it mean followed by,
24:26
or does it mean immediately after. This operator (<->) means immediately followed by, but if, instead of the minus sign in the middle of the operator, you insert a 2 or a 3 (<2>, <3>), that means 2 tokens away or 3 tokens away.
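Both of these can be seen in a couple of one-liners (the second document is invented):

```sql
-- Google-style syntax: quoted phrases become <->, minus becomes !
SELECT websearch_to_tsquery('english', '"stray cat" -"cat shelter"');
-- 'stray' <-> 'cat' & !( 'cat' <-> 'shelter' )

-- <N> matches lexemes exactly N positions apart
SELECT to_tsvector('english', 'error code red')
       @@ to_tsquery('english', 'error <2> red');   -- true
```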
24:41
So you can define the distance that you want. Let's use an example table of the contents of pgsql-hackers, the mailing list. And the rows have an ID, a parent ID, which email you replied to,
25:02
a sent timestamp. What we're interested in here is the subject, the author, and the body of the email in plain text. So how big is this? This is 478 megabytes. We also mentioned how relevant our results are and ranking them.
25:33
So for ranking we have ts_rank, and its cover density variant ts_rank_cd.
25:42
Cover density just takes into account how close your terms are in the text, and it bumps up the relevance of things that are closer together. But anyway, let's use ts_rank for now. So from our example table, if I select the subject and the ts_rank
26:05
of our email body converted to a tsvector, and you'll notice that I'm putting coalesce in here so that I can have meaningful results if I have a null text body in my email,
26:26
I am trying to match this to aggregate, the word aggregate. So I'm trying to find the top five subjects, because I'm doing an order by descending limit five, I'm trying to find the top five subjects of emails that contain the word aggregate.
26:46
And when I run it, it produces a rank. And it tells me that these were the most relevant subjects, excuse me, these were emails that in their body were most relevant to the word aggregate.
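Spelled out, the ranking query is roughly the following; the table and column names (mail_messages, body_plain) are assumptions based on the slide:

```sql
SELECT subject,
       ts_rank(to_tsvector('english', coalesce(body_plain, '')),
               to_tsquery('english', 'aggregate')) AS rank
FROM   mail_messages
WHERE  to_tsvector('english', coalesce(body_plain, ''))
       @@ to_tsquery('english', 'aggregate')
ORDER  BY rank DESC
LIMIT  5;
```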
27:04
I know that's not a very useful search, because it's going to show up a lot, but these are the subjects of the emails. Sorry, the question was about the 32: it's a normalization option for the result.
27:20
So it's a bit mask that defines how the rank is scaled by the ranking algorithm. These are all documented. As you know, Postgres, or you may not know, Postgres has excellent documentation on the web, and there are going to be links to that at the end of the presentation.
27:42
Let's look at some statistics. We have the function ts_stat that can be used for verifying that our configuration for text search and dictionaries and everything else works fine for our own purposes. So if we select star from the function ts_stat,
28:03
and we pass it a query returning a tsvector, because that's what it needs, it will find the number of documents, the number of words, et cetera, and it returns, according to my configuration,
28:24
the top words that it found in the documents. So the most common word was use, wrote, would, think, patch, and so on, which is something that you might expect from the pgsql-hackers mailing list.
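A sketch of such a ts_stat call, again with the mail_messages table and body_plain column assumed:

```sql
-- ts_stat takes the text of a query returning one tsvector column
SELECT word, ndoc, nentry
FROM   ts_stat($$ SELECT to_tsvector('english', coalesce(body_plain, ''))
                  FROM mail_messages $$)
ORDER  BY nentry DESC, ndoc DESC
LIMIT  5;
-- ndoc = documents containing the word, nentry = total occurrences
```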
28:45
Now, we mentioned indexing, and we have to recognize that the normal default of B-tree, which is our index type in Postgres, our default, isn't all that suitable for full text search.
29:02
You can use it for a limited sort of text matching and searching. If you create the index specifying varchar_pattern_ops, then you can match left-anchored text,
29:20
or if you reverse the string in the index expression, then you can search from the end of the string forwards. It is going to be less useful than full text search, but it is one way of doing it. For full text search, we have the GIN index,
29:44
which is an inverted index, which means that each lexeme has one entry in the index. It is quite large, and it's slow to update, so it is better used on less dynamic data.
30:04
And you apply it on tsvector columns. We also have GiST, the generalized search tree index, which is lossy. It is going to be smaller on the same dataset, but it is also going to be slower,
30:23
because as a lossy index, it produces false positives. And in order to eliminate those false positives, Postgres has to go back to the row and actually determine whether the row is a useful result or not. So it is a bit slower. So it is better suited for fewer unique items.
30:43
If you have, let's say, a limited number of things that you are searching for, then it just might be a better index for you. And you can apply it on TS vector or TS query columns. Now, let's see what unindexed full text search looks like.
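The B-tree tricks mentioned a moment ago might be sketched like this. Table and column names are my assumptions; for a text column the operator class is text_pattern_ops, for varchar it is varchar_pattern_ops:

```sql
-- Left-anchored matching: the planner can use this index for LIKE 'Re: %'
CREATE INDEX subj_prefix_idx ON mail_messages (subject text_pattern_ops);
SELECT count(*) FROM mail_messages WHERE subject LIKE 'Re: %';

-- Suffix matching: index the reversed string, so the anchor moves to the front
CREATE INDEX subj_suffix_idx ON mail_messages (reverse(subject) text_pattern_ops);
SELECT count(*) FROM mail_messages WHERE reverse(subject) LIKE reverse('%patch');
```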
31:03
I will do an EXPLAIN ANALYZE, and that will show me how my query is being executed. And my query is going to be a SELECT count(*) FROM mail_messages: find me all the text messages, all the emails, that contain the word aggregate.
31:23
When I run it, it will do a parallel sequential scan, because in recent versions of Postgres, you can have parallel scans of tables. And specifically in version 12, you have the just-in-time compiler enabled,
31:42
which also speeds things up. That is why I have highlighted it here. The execution time is horrible. It took almost 27 seconds to look through these 485 megabytes of emails and determine the ones that contain aggregate.
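A hedged reconstruction of the unindexed query from the slide (table and column names assumed):

```sql
EXPLAIN ANALYZE
SELECT count(*)
FROM mail_messages
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'aggregate');
-- Without an index, this falls back to a (parallel) sequential scan
-- over the whole table.
```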
32:06
Indexed, all we have to do is CREATE INDEX on mail_messages USING gin. That is the syntax for specifying the GIN index type. And then we convert our subject concatenated with our body.
32:26
You can do that. I haven't done it in the examples that I'm actually running, but that is a possibility. You could want to index the subject as well as the body, because people will often use words in the subject that they don't repeat in the body of the text.
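An expression index over both subject and body might look like this (names assumed; note that a query must use the exact same expression for the index to be usable):

```sql
CREATE INDEX mail_fts_expr_idx ON mail_messages
  USING gin (to_tsvector('english', subject || ' ' || body));
```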
32:41
So that is a possibility: you can index both of them as a tsvector using the English-language rules. But in Postgres 12, you can also use a generated column. And because you can convert everything into a separate tsvector column,
33:01
instead of doing it on the fly like this, you can alter the table and add a column of type tsvector, and the new syntax is GENERATED ALWAYS AS ... STORED. For now, Postgres only supports stored generated columns.
33:20
That means they stay on your disk; they're not generated on the fly. So: as tsvector, with coalesce to get rid of NULLs, the subject concatenated with a blank space and the body. And that will keep your tsvector column updated all the time, even though you may only be updating the email columns.
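A sketch of that generated column, with table and column names assumed:

```sql
ALTER TABLE mail_messages
  ADD COLUMN fts tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english',
                coalesce(subject, '') || ' ' || coalesce(body, ''))
  ) STORED;
```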
33:45
And then you just create an index on that generated column. Now let's look at GiST. Does it help us for our search? It's the exact same search; we're searching for the word aggregate.
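Assuming the generated tsvector column is called fts, the two index variants being compared might be created like this:

```sql
-- GIN on the tsvector column:
CREATE INDEX mail_fts_gin ON mail_messages USING gin (fts);

-- The lossy GiST alternative, for comparison:
CREATE INDEX mail_fts_gist ON mail_messages USING gist (fts);
```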
34:03
It did help. We can see that a bitmap index scan was used. And our search was much faster. But you will notice that we have rows removed by index recheck. So that means that it had to go back and figure out whether these results were useful or not.
34:23
So it is a bit slower than it could be. It's not optimal for this sort of search. The execution time was only 5.6 seconds. So about 4.8 times faster than not using an index.
34:41
It's still slow because it's not the suitable index type for this search. Let's use gin instead. So explain analyze again. 5.6, wow, milliseconds, not seconds here.
35:03
So we're seeing a significant difference here. It uses a bitmap index scan of our index column. And it's approximately 4,700 times faster than not using an index. So you know you're doing something right
35:21
when you're getting things that run in milliseconds. So, GIN- and GiST-indexed operations. We have these operators for tsvector.
35:42
GIN indexes the @@ operators, so it indexes matching, as we saw. For jsonb, binary-stored JSON in our database, it supports the existence operators and also the containment operator. So these are operations that will be very fast
36:02
if you've indexed your jsonb column with GIN. And GiST supports, again, the @@ operator for text matching. But it also supports the containment operators for tsquery,
36:21
whether one contains the other. Some super useful modules that come with Postgres are pg_trgm, which we'll look at in the next few slides;
36:41
unaccent, which removes accents and diacritics from your text, so that you can match a u with an umlaut to a u without an umlaut, if that's something that you need to do. For instance, many search engines do that: they remove accents before returning results.
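For instance, a quick look at what unaccent does:

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;
SELECT unaccent('über');   -- 'uber'
SELECT unaccent('Hôtel');  -- 'Hotel'
```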
37:03
And fuzzystrmatch gives you useful metrics such as string similarity. It can return things like the Levenshtein distance between two strings, how different the strings are. It also supports Soundex, Metaphone, and Double Metaphone.
37:23
To be totally honest with you, I haven't used it for that purpose. And there is a warning in the Postgres documentation that Soundex, Metaphone, and Double Metaphone may not work well with UTF-8. It works like this.
37:41
SELECT name FROM users WHERE the function levenshtein, applied to the column name and to Stephen, is less than two. So if something is less than two edits different from Stephen, please return it. So I'm essentially looking for Stephen with a V.
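Spelled out as SQL, with the users table assumed; note that levenshtein('Stephen', 'Steven') is actually 2 (one substitution plus one deletion), so a threshold of <= 2 is what catches that spelling:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Return names within edit distance 2 of 'Stephen'
SELECT name FROM users WHERE levenshtein(name, 'Stephen') <= 2;
```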
38:02
Other index types that exist follow the drinks naming paradigm. There is a proof-of-concept index called Vodka. I do not know the state of its development right now, but it offers advanced features.
38:20
And may I mention here that most of this text search functionality was implemented by our friends in Russia, community members who have contributed significant portions of code for indexing and full text search. And we also have RUM.
38:41
RUM is currently maintained, and you can find it on that dodgy Russian website. And it stores positional information for lexemes, which means that it can perform faster ranking because our indexes are not that useful for ranking, and faster phrase search because you're looking for consecutive things.
39:04
It also indexes distances between timestamps, floats, and money, so operations with the distance operator become very fast. Now, one use case that I came across this year was free text that was not natural text.
39:23
It was a text column that contained arbitrary strings, non-human readable generally, that contained keywords that needed to match something. So basically the user was searching for strings that he did not know beforehand.
39:45
So you couldn't index based on those strings. You had to search through the entire thing, such as keywords in device logs. So dictionaries are not very helpful here, because we're not talking about language. And I created an arbitrary example with around 10 million rows
40:07
of about 100 character IoT device log entries. Fake, of course. Some contain strings that are significant to the user, as we said, but the user doesn't know which strings until the last moment at runtime.
40:23
So we populate the table with random hex codes, and we make 1% of these log entries contain a significant keyword. And I chose keywords from /etc/dictionaries-common/words. So they look like this: they have something hidden in the middle of all the non-human-readable stuff.
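A hypothetical way to build such a test set; the names, positions, and sizes here are my guesses, not the speaker's actual script:

```sql
CREATE TABLE log_entries (id bigserial PRIMARY KEY, message text);

-- ~100 characters of random hex per row
INSERT INTO log_entries (message)
SELECT md5(random()::text) || md5(random()::text) || md5(random()::text)
FROM generate_series(1, 10000000);

-- hide a real keyword in ~1% of the rows
UPDATE log_entries
SET message = overlay(message PLACING 'source' FROM 40)
WHERE id % 100 = 0;
```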
40:46
So, SELECT message FROM log_entries. They look like this: barely readable, barely useful for full text search. And it's about 1.4 gigabytes.
41:00
Our query that we're interested in speeding up is SELECT * FROM log_entries WHERE message LIKE '%source%'. So you don't know anything; you're interested in the word source, but you don't know what precedes it and what follows it. So how long will this take? The parallel sequential scan shows that it took 9.6 seconds.
41:22
As we said, too slow. How do trigrams help us? We mentioned that it's a probabilistic language model based on Markov chains, and an n-gram of three characters is a trigram.
41:44
What it can show you is the similarity of alphanumeric text, by counting, for instance, the number of shared trigrams between two strings. You use it by creating the extension pg_trgm and by calling show_trgm.
42:04
You see what it converts your source word into: it splits it into these trigrams. You'll notice that some of them are padded with white space. Then CREATE INDEX on log_entries USING gin, and I specify the gin_trgm_ops operator class here.
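The steps just described might be sketched as follows (the log_entries table matches the earlier example):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Split a word into its trigrams; note the blank padding at the word edges
SELECT show_trgm('source');  -- seven trigrams, including '  s' and ' so'

CREATE INDEX log_trgm_idx ON log_entries USING gin (message gin_trgm_ops);

-- Now an unanchored LIKE (or a regular expression) can use the index
EXPLAIN ANALYZE SELECT * FROM log_entries WHERE message LIKE '%source%';
```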
42:24
Did it help? It's about 37,000 times faster than looking through the text. So this index type helps you search for non-language text.
42:41
It also works with regular expressions, so instead of LIKE you could have used a regular expression here. But this comes at a cost: the index is 1.6 gigabytes, as big as the table itself. But if you want rapid, immediate responses, that's the tradeoff.
43:05
Other tricks that pg_trgm gives you are the similarity of two texts as a number, the distance between two pieces of text, and a boolean that tells you if the text is more similar than the similarity threshold that you configure,
43:25
and it's supported by GIN and GiST indexes as well. Postgres supports, as we mentioned, nearly every character set that is out there. You can find out what your client is using by calling pg_client_encoding(),
43:45
and you can convert strings between source and target encodings. But you can also have your client do automatic character set conversion by setting your client encoding to whatever you want.
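A quick sketch of those encoding facilities, assuming a UTF-8 database:

```sql
SELECT pg_client_encoding();  -- what the client is currently using

-- Decode raw bytes as a given encoding; these bytes are UTF-8 for 'über'
SELECT convert_from('\xc3bc626572'::bytea, 'UTF8');

-- Ask the server to convert result sets to the client's charset on the fly
SET client_encoding TO 'LATIN1';
```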
44:03
And that will attempt, if it is possible, to convert from the character set in the database to your client's character set automatically. Now we have to speak about collation, which is the sort order; even with the same alphabet, it's different in different languages.
44:22
So you can define the collation order and the character classification for a specific language per column, by specifying it at creation time, or during an operation.
44:41
So you can select with a specific German-language collation from a table. That means that you are not restricted by whatever configuration your database has in lc_collate and lc_ctype.
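Both forms might look like this; the table name is assumed, and the de_DE locale must exist on the system:

```sql
-- Per column, at creation time:
CREATE TABLE people (name text COLLATE "de_DE");

-- Or per operation, overriding whatever the database default is:
SELECT name FROM people ORDER BY name COLLATE "de_DE";
```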
45:09
In Postgres 12, we have non-deterministic collations that can also ignore accents when sorting text. And that is a new feature.
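A sketch of a nondeterministic collation, following the example in the Postgres 12 documentation (requires a build with ICU support):

```sql
-- A case-insensitive ICU collation
CREATE COLLATION case_insensitive (
  provider = icu,
  locale = 'und-u-ks-level2',
  deterministic = false
);

SELECT 'ABC' = 'abc' COLLATE case_insensitive;  -- true
```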
45:21
We mentioned other types of documents, such as JSON. It supports indexing, and it can be converted into a tsvector with the function jsonb_to_tsvector. That will classify strings, numeric values, booleans, and keys differently if you want it to.
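For instance, a small sketch of that conversion, filtering on which JSON value types get indexed:

```sql
-- Index only the string and numeric values of a jsonb document
SELECT jsonb_to_tsvector('english',
                         '{"msg": "full text search", "count": 3}'::jsonb,
                         '["string", "numeric"]');
```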
45:44
Also in Postgres 12, we have the SQL-standard SQL/JSON path language that you can use to perform searches on JSON. And we also have jsquery, which is a different query language.
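A tiny SQL/JSON path example:

```sql
-- SQL/JSON path, new in Postgres 12
SELECT jsonb_path_query('{"a": [1, 2, 3]}', '$.a[*] ? (@ > 1)');
-- returns the values 2 and 3
```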
46:00
Finally, make sure you VACUUM ANALYZE so that your table statistics are up to date and your GIN entries are merged into the main index: for update speed, GIN keeps a pending list of new entries, and only vacuum merges it into the main GIN index.
46:20
You can set your statistics. You need to keep them accurate: by setting the number of distinct values if you know it, or by declaring correlated columns that influence the results of your searches. And you do need to run EXPLAIN ANALYZE from time to time, because a query that works perfectly now may not work as well in a year's time.
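Those maintenance steps might be sketched like this; the table and column names are assumed:

```sql
VACUUM ANALYZE mail_messages;  -- merges the GIN pending list, refreshes stats

-- Sample a column more thoroughly than the default:
ALTER TABLE mail_messages ALTER COLUMN sender SET STATISTICS 1000;

-- Tell the planner two columns are correlated:
CREATE STATISTICS mail_dep (dependencies) ON sender, list_id FROM mail_messages;
ANALYZE mail_messages;
```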
46:43
And maintenance_work_mem, because we're talking about huge index creation times here, is very significant, especially for GIN indexes. Now, one final thing is the curious case of text and name. I found in the code that I was maintaining,
47:01
I found something like CREATE TABLE users (id serial, text name). Wait. That is the type name, and it works. It appears to work fine, but it looks like instead of typing name text, a column named name of type text,
47:23
they typed text name, a column named text of type name. And that is an internal type that will actually store strings, but it will only store up to 64 bytes, so you'll have nasty surprises if you use the type name for your text. So don't be sleepy when you're creating tables.
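The two statements side by side:

```sql
-- What was typed: a column called "text", of internal type "name",
-- which silently truncates long strings
CREATE TABLE users (id serial, text name);

-- What was meant: a column called "name", of type text
CREATE TABLE users2 (id serial, name text);
```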
47:42
Thank you very much.
48:08
So the question is whether there's a difference between indexing very large values
48:28
or very many values with small content, right? So, as we said, GiST is better at indexing things that are fewer but unique,
48:42
whereas GIN is better with large pieces of text that are non-structured or random. Sorry, can you repeat that?
49:05
So the question is how you index documents if you don't know which human language they are in. And the answer is complex. You either have to use all the dictionaries that you know, in order to be able to match,
49:21
and then produce weights so that you can decide based on the ranking which is the most probable language that is being used in the text.