
Manipulating text with PostgreSQL - lesser known PG jewels


Formal Metadata

Title
Manipulating text with PostgreSQL - lesser known PG jewels
Title of Series
Number of Parts
351
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2022

Content Metadata

Subject Area
Genre
Abstract
PostgreSQL is the most advanced opensource RDBMS. As GIS folks, you most probably use it in combination with PostGIS, its geospatial plugin. When dealing with geospatial data, we usually focus on geometries. But most feature attributes are text data. Of course, filtering on these text data with standard SQL capabilities is a day-to-day operation for database users. But PostgreSQL provides many more capabilities when it comes down to text data management. In this presentation, we will go through a few of them. After a quick look at standard text functions in PostgreSQL, we will discover the lesser known fuzzy matching modules: - `pg_trgm` extension allows for string searches using trigrams to determine a similarity rank between text items - `fuzzystrmatch` extension provides fuzzy matching functions like soundex, Levenshtein, metaphone Then, we will explore *Full Text Search (FTS)* PostgreSQL capabilities. Last but not least, we will peek inside the PostgreSQL collation concept, which has nothing to do with your lunch. Collations are a powerful feature in PostgreSQL allowing you to adapt the way you deal with text data according to the localization. Like trying to answer this - apparently - obvious question: is '12' before or after '2'? And, because we can, display all of this on a map :-)
Keywords
Transcript: English(auto-generated)
So I will talk to you about PostgreSQL and PostGIS, and especially today, PostgreSQL features. Raise your hand if you use PostgreSQL? Whoa, that's a lot. And PostGIS? Ah, good. Okay, so I will focus on more advanced features.
So today I won't exactly talk about GIS, because in GIS we usually deal with features, and features are geometry and attributes. And actually most attributes are textual data. So it's very important to be able to handle this textual data.
So I will focus today on some text search capabilities in PostgreSQL. I will use a data set which is a list of addresses in the Drôme, which is a département of France. And in this data set you have points, and in the attributes you will find the name of the commune, the city, the name of the street, etc., etc.,
which we will use for text search. So I will present you some lesser known PostgreSQL jewels, some hidden features. I start with standard pattern matching, which you probably already know, then fuzzy matching, trigrams for text search, and a bit of collations.
So let's start with standard pattern matching. This is something you probably already use, that's the LIKE keyword in SQL. You also have the equivalent operator, which is two tildes (~~). And it's used for pattern matching, just like in the query you see. You can say select all the communes from our data set,
where the name of the commune, the name of the city, begins with B. And you have a few special characters: underscore to match any single character, and percent to match any sequence of zero or more characters. So that's pretty convenient. You also have the ILIKE variation, which is case insensitive. And you can reverse the search with NOT,
just like any standard SQL expression. Just be careful with indexing: if you use a reverse search, for example any characters and then a B at the end, it won't be indexed by standard indexes, so you would have to use a few different things.
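The pattern matching just described could look like this. The `addresses` table and the `commune` column are hypothetical names standing in for the Drôme data set:

```sql
-- Communes beginning with 'B'; '%' matches any run of characters,
-- '_' matches exactly one character
SELECT DISTINCT commune FROM addresses WHERE commune LIKE 'B%';

-- Case-insensitive variant, and a negated search
SELECT DISTINCT commune FROM addresses WHERE commune ILIKE 'b%';
SELECT DISTINCT commune FROM addresses WHERE commune NOT LIKE 'B%';
```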
There is also the SIMILAR TO keyword, which is less known. It's regex pattern matching, which is a bit specific, because it's the SQL standard definition of regular expressions. So it's not POSIX, it's not compatible with what you are used to using as regular expressions outside of PostgreSQL, but you can use it as well.
So you have a few different characters which you can use to define a regular expression to search for. Then you have the classic regular expression pattern matching, which is the tilde (~) operator. And that's used for pretty advanced regular expression search.
You can use all POSIX capabilities of regular expressions. And the tilde operator returns true if the text matches. So you can just use it in your where clause and it will send you back the rows matching the regular expression. You also have some other operators for case insensitivity, for
not matching, or not matching with case insensitivity. Same limitations as for LIKE: be careful with indexes, because your queries with regular expressions cannot be indexed by default. But we will see that there is a specific index we can use, based on trigrams.
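A sketch of the regex operators, again against the hypothetical `addresses` table:

```sql
-- '~' : POSIX regex match; here, communes whose name ends in 'eux'
SELECT DISTINCT commune FROM addresses WHERE commune ~ 'eux$';

-- '~*' : case-insensitive match; '!~' and '!~*' are the negated forms
SELECT DISTINCT commune FROM addresses WHERE commune ~* '^val';
```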
So let's go to some more advanced PostgreSQL text matching with fuzzy matching. So there is this extension in PostgreSQL which is called fuzzystrmatch. Fuzzystrmatch, it's unpronounceable. It provides you with a few specific functions
that help to find some text in a corpus. First of all, you have the Levenshtein distance. You have the definition of the Levenshtein distance. It's a bit complicated written like that. But that's basically how many characters you have to change to go from one string to another by inserting, deleting or substituting characters.
For example, if we select the Levenshtein distance between Firenze and Firenza, it's just one letter of difference, so the distance is one. Between Firenze and Florence, the distance is three. So you can determine the kind of similarity between two words or two expressions.
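The Firenze examples above, as actual queries:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT levenshtein('Firenze', 'Firenza');  -- 1 (one substitution)
SELECT levenshtein('Firenze', 'Florence'); -- 3
```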
There's no indexing possible with this function, so you have to combine it with other matching techniques. Another function for matching is soundex. Soundex is in the same extension, and
it gives you the similarity of English text: it actually converts a text to a four-character code, which you can use to compare different words. There is also the difference function, which gives you the number of code characters which are common between the Soundex codes of two different words.
It's specific to English text. So it works well with English text, and you can use indexing using an expression index, because you can say index using soundex on my column. For example, the soundex for Florence, you can see it's F465,
and the soundex for Firenze is F652. So they are different, but for Firenza, you get the same soundex as for Firenze. So if you compare the soundex codes, you will have the same value for Firenze and Firenza, and you can match both words.
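The Soundex comparison above as queries:

```sql
SELECT soundex('Florence');  -- F465
SELECT soundex('Firenze');   -- F652
SELECT soundex('Firenza');   -- F652, same code as Firenze

-- difference() counts matching Soundex code positions, from 0 to 4
SELECT difference('Firenze', 'Firenza');  -- 4
```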
Metaphone and double metaphone are two other sound similarity functions. It's the same principle as for soundex. It's a different algorithm, and the code it gives back is a bit different, but it tries to sound like the original word.
For example, the dmetaphone code for Alexson is ALKS, and for Alex it's ALKS as well. So you can see that both words have the same dmetaphone code. So you can compare or search in your database by dmetaphone,
and it will send you back the names which sound similar to the one you are looking for, so that's pretty convenient also for fuzzy pattern matching. Trigrams. Trigrams are given in the pg_trgm extension.
And the principle of trigrams is a comparison of text based on a three-letter decomposition of words. It gives you a few functions, namely similarity, word_similarity and strict_word_similarity, which behave a bit differently when comparing. The simple one is similarity, and it takes two strings into account.
And it will give you a similarity float number according to how many trigrams of both strings are in common. So you can see the decomposition into trigrams, for example, for the text 'welcome to Firenze'.
There are a lot of trigrams: '  f', '  t', '  w', ' fi', ' to', etc., etc. These are all the three-letter elements, the three-letter items you can decompose the text into. And then PostgreSQL with this similarity function will compare these trigram sets,
and say, okay, we have x amount of trigrams in common, so we give you a similarity score of 0.48. So this is a very good way to compare text, and to search text for a different variation of a word or of a phrase.
And you can either get the score, or you can also use the operators, a true/false version, especially the percent (%) operator, which is configured to give you true if the score is above a specific threshold, which is configurable.
An example: using trigrams is very good to deal with typos, for example when you look for Bordeaux, which is a town in the western part of France and not exactly in the Drôme. I'm looking for a place in the Drôme where the name of the city,
trigram-wise, looks like Bordeaux, but I actually made a typo. It's not Bordeaux, it's Bourdeaux, and you can see that the result gives back the right name of the commune, which is not exactly the one I was looking for, but pretty similar. So it's very good for fixing user typos.
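A sketch of the typo-tolerant search, with the hypothetical `addresses` table again; pg_trgm also ships index operator classes, shown here for completeness:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Similarity score between 0 and 1 based on shared trigrams
SELECT similarity('Bordeaux', 'Bourdeaux');

-- '%' is true when the score exceeds pg_trgm.similarity_threshold;
-- this finds 'Bourdeaux' despite the typo
SELECT DISTINCT commune FROM addresses WHERE commune % 'Bordeaux';

-- A GiST trigram index accelerates %, LIKE, ILIKE and regex searches
CREATE INDEX addresses_commune_trgm_idx
    ON addresses USING gist (commune gist_trgm_ops);
```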
Trigrams are a very good thing for that. The good thing is also that you can index trigrams. You can index with GiST or with GIN, which are two different indexing methods in PostgreSQL. For trigrams, GiST is preferred. The only thing is that you have a limitation
on the amount of text you can put in the index, but usually it's enough for most use cases. So you can create the index using GiST and this index operator class, which is gist_trgm_ops,
and it will create you an index on your column based on trigrams. The good thing is, once you have this, you can make a query just like the one I showed you before, and it will be accelerated by the index for trigram matching with the percent operator. But the trigram index will also be able to index your queries
using LIKE, ILIKE and regular expressions. So sometimes when you look for everything beginning with some characters, then with o, u, r, d, e, and some characters after, that cannot be indexed by a standard text index,
but if you create an index on trigrams, it will be used for this kind of search. So even if you don't use the trigram search, trigram indexing can be useful as well. So one other big thing in PostgreSQL is full text search.
Full text search is a big beast in PostgreSQL. It's usually not very well known because it's pretty complex. The idea is to be able to search documents of text for a specific text given in a query. A document can be any text in your database.
Usually you take a lot of different attributes, you group them together, you concatenate them, and it gives you a text corpus which is called a document, and that's what we are going to use to look for any text. It's a core PostgreSQL feature, so it's available everywhere. It's very flexible and customizable.
I will show some use cases but not everything because you can go very deep into customizing full text search. It's useful for all languages, it's not only English, and the principle is that you pre-process your documents to be able to access full text search features and also for efficiency.
The principle is: take documents. You take the documents and you process them into tokens, which are individual elements in your text. Then you convert your tokens to lexemes, which are normalized single items corresponding to tokens, and you use dictionaries for that.
You can customize dictionaries, and dictionaries are used to normalize the words. For example, when I say shop and shopping, for full text search it can be the same word because it's the same lexeme. When you search for shop, it will also find shopping. You can strip stop words as well:
there are a lot of words which don't actually have a very strong meaning, so we strip them. You can replace synonyms with dictionaries, so that if you look for a specific word, it will find the synonyms as well in the text. There are a lot of things you can do
when converting the tokens into lexemes thanks to dictionaries. That's the second step for pre-processing the document. The next step is storing a specific data type for searching and also to allow for ranking. Full text search in PostgreSQL
introduces mainly two different new data types, which are tsvector and tsquery. The tsvector is the document converted to lexemes, and the tsquery is what you want to search in your text. The document can be any generated text, and you can use generated columns,
which are a feature starting from PostgreSQL 12, to store the tsvector data. Because you have to convert your document into a tsvector, and to do that you can use, for example, the to_tsvector function. You can say it's in French, it will use French dictionaries,
and here I will concatenate a certain amount of attributes: the name of the city, of the commune, the name of the former commune, the name of the locations, the areas, the name of the streets, and also the postcode and the code for the commune itself.
And as you can see, I can also set a weight on different attributes, so that different parts of the text will have a different importance. So the communes will have the importance A, which is the most important one. For example, a title in your document would have importance A, the weight A, and then you have B, C and D,
which you can assign to your text so that you can determine which part of the text has the most importance. And here I add a new column to my table, and it's generated automatically. It's stored as well, it's materialized, so that I will be able to create indexes on it,
and it will be fast as well. And as soon as I add data to my Drôme table, it will automatically create the tsvector data. So that's a very convenient feature from recent PostgreSQL, the generated columns. Before that we had to use triggers to populate the column.
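The weighted generated column just described could be sketched like this. The table and column names (`commune`, `voie`, `code_postal`) are illustrative, not the ones from the talk's slides:

```sql
-- Materialized tsvector column; to_tsvector with an explicit 'french'
-- config is immutable, which generated columns require
ALTER TABLE addresses
  ADD COLUMN doc_fts tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('french', coalesce(commune, '')),     'A') ||
    setweight(to_tsvector('french', coalesce(voie, '')),        'B') ||
    setweight(to_tsvector('french', coalesce(code_postal, '')), 'D')
  ) STORED;
```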
So how do you use it? Once you have generated your tsvector data, then what you have to do is create a tsquery and use the double at sign (@@) text matching operator, so that you can query your text.
So here I select everything from my addresses where my generated column, my tsvector, matches my query, and my query is just the name Henri in French language. So I will search for every address which has Henri somewhere in the document,
so somewhere in one of the attributes, which is a street name, name of the commune, etc., and it will give me back the rows. And since the rows have a geometry, I can map them with QGIS. I can do a heat map, for example, or the different kinds of visualization you can have.
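The match query above, assuming the hypothetical `doc_fts` generated column:

```sql
-- All addresses whose document contains the lexeme for 'henri'
SELECT *
FROM addresses
WHERE doc_fts @@ to_tsquery('french', 'henri');
```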
Of course you can use PostGIS in combination. So here I will do a full text search plus a geometry search. I will say, okay, I'm looking for every address which has Henri in its name, but within a 10 km radius around Valence here. So that's standard SQL,
but you can use the power of full text search and the power of PostGIS all together in a very simple query. You have a few functions for full text search: to_tsvector to generate a document, to_tsquery to generate a query, and then you mix them together. And then you have plainto_tsquery, phraseto_tsquery and websearch_to_tsquery,
so I will get back to them. With to_tsquery, the query itself is text data, and you can use a few operators. For example, if I want to search for rue or chemin, street or path, I can use the pipe operator, which is OR.
I can use an AND operator or a NOT operator. I can say I want rue followed by Henri, in this order, and I can group conditions, so that I can have expressions which are pretty complicated in my query, and it will look for that in the text according to the predicates you have. You can also restrict the query
to specific weights you have defined in your document, and you can also do prefix matching: I want every text starting with Henri, for example. To build more queries you have a few other options: plainto_tsquery combines all words of a standard text query with the AND operator,
phraseto_tsquery combines them with the followed-by operator, and websearch_to_tsquery is a different, google-like syntax, where you can use 'or', the minus sign, and quoted and unquoted text in your query, so that it feels like a web search. So it's pretty convenient.
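The tsquery operators and the query-building functions, side by side:

```sql
SELECT to_tsquery('french', 'rue | chemin');    -- OR
SELECT to_tsquery('french', 'rue & henri');     -- AND
SELECT to_tsquery('french', 'rue <-> henri');   -- 'rue' followed by 'henri'
SELECT to_tsquery('french', 'henri:*');         -- prefix matching

SELECT plainto_tsquery('french', 'rue henri');   -- words joined with &
SELECT phraseto_tsquery('french', 'rue henri');  -- words joined with <->
SELECT websearch_to_tsquery('french', '"rue henri" or chemin -impasse');
```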
You can index all that. Of course we use a GIN index, so it's very easy to create an index using GIN on my generated column. And you can see that when you do a full text search query, it will do a bitmap index scan, so it will be fast.
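The GIN index on the hypothetical `doc_fts` column:

```sql
CREATE INDEX addresses_fts_idx ON addresses USING gin (doc_fts);
-- EXPLAIN on a doc_fts @@ ... query should then show a bitmap index scan
```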
There are also ranking features, so you can get a rank from your search, saying which result has the best score. And the score can be computed either by lexeme frequency, or by frequency
and proximity of the matching lexemes, so you have two functions for that. You can configure options for normalization as well, so you can go deeper into that. But an example: if I get a score when I look for Roche, there is a commune named La Roche-de-Glun,
and there are a lot of streets named Roche, but the commune will come first, because we put more weight on the name of the communes than on the name of the streets. So when I rank them, I will have a commune with this Roche keyword ranked
higher than just a street with the Roche keyword. So that's pretty convenient to order your search results. There are a lot of other things you can do with full text search: highlighting results, you can have custom dictionaries, you can use unaccent for special character handling, you can use it in combination with pg_trgm
to fix typos first and then look in your text, and you can even write your custom parsers for text: if you have very specific data you want to parse, you can write a parser in C and integrate that into Postgres. Last but not least, collations. Collation is another very lesser known feature.
It's for text ordering. Text ordering is a pretty complex problem. For example, if you are German, you may know that the order for the phone book is different from the usual alphabetical order. And that's pretty weird, but it's a rule.
And sorting emoji: how do you sort emoji? Do you know that? Emojis are text, so you have to sort emojis in some way. How do you do that? You can do that with PostgreSQL. So what we call collatable data is everything which is text, varchar and char. A collation defines a sort order
and character classifications as well, and you can define the collation granularity per column or per operation. There are two different libraries for collations, two different sets of collations: one which is provided by the system, which is libc, and another one which has been implemented later on, which is ICU,
which provides you with a lot, lot, lot of collation capabilities. And you can have very specific collations with very specific rules. It's pretty complex, but you can do almost whatever you want. For example, I will create a collation on the ICU locale 'und-u-co-emoji', so this is an ICU collation,
and it applies to the locale below, and it says: order my emoji in the right order. And when I want to use a collation, I can just write my query and say order by name, and I will say you order by name, but you use the collation
'de-x-icu'. I don't remember the exact ICU naming, but it says use the standard Deutsch ordering. Some use cases: why do you want to do that? You want to do that for natural sorts, for example. Here you can see that with a numeric collation,
EM001 will be before EM999: that's natural sort, so the numbers are ordered in numeric order. But if you take only an alphabetical order, below, you have EM001 coming before EM1000,
which is okay, but then you have EM999 which comes after EM1000, which is not okay. You can use that as well for diacritic character handling. You don't know it in English, but for German people, for Finland, or even for French,
that's something which is always complicated. For example, the Eszett (ß) in German is very complicated, because sometimes it is equivalent to two s's, sometimes it's ß, and sometimes you have an uppercase ß, which exists but is never used in classic German. So that's very complex stuff,
but you can do that with PostgreSQL. And for case sensitivity as well: when you want to look for text without taking case into account, you can do that, but you would need an ICU collation for that. And last one, you can sort emojis. There's actually a standard,
an international standard, saying in which order emoji have to be sorted, and ICU implements that. So you know that the happy face comes first. And that's all.
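The two collation use cases from the talk could be sketched like this. The collation names and the `assets` table are illustrative; the ICU locale strings themselves are real:

```sql
-- Natural (numeric) sort: digit sequences inside strings compare by
-- numeric value, so 'EM999' sorts before 'EM1000'
CREATE COLLATION natsort (provider = icu, locale = 'en-u-kn-true');
SELECT name FROM assets ORDER BY name COLLATE natsort;

-- Emoji ordering per the Unicode/CLDR emoji collation
CREATE COLLATION emoji (provider = icu, locale = 'und-u-co-emoji');
```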