
Index support for regular expression search


Formal Metadata

Title
Index support for regular expression search
Number of Parts
20
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Regular expressions (regex) are a powerful tool for text processing. When dealing with large string collections it is important to search those collections fast (i.e. search using an index). Indexing for regex search is quite a hard task. This talk presents a novel technique (and a WIP patch for PostgreSQL implementing it) for regex search using trigram indexes. The proposed technique provides more comprehensive trigram extraction than its analogues, i.e. higher performance. There are two existing approaches to index-based regex search. The FREE indexing engine is based on extracting continuous text fractions from the regex and performing substring search. The Google Code Search approach presents a more sophisticated recursive analysis of the regex with extraction of various regex attributes. This talk presents a novel technique of regex analysis which is based on automata transformation rather than analysis of the original regex. The superiority of the proposed technique will be shown by examples and tests. The talk is organized as follows: Introduction. Regular expressions. Finite automata. The pg_trgm contrib module. Existing techniques for index-based regular expression search: the FREE indexing engine; Google Code Search. Proposed technique: description; examples; comparison with analogues; limitations; performance results.
Transcript: English (auto-generated)
Okay, let's start. Hello, my name is Alexander Korotkov. My talk is about index support for regular expression search, so it refers to the problem of searching with a regular expression over a large string collection. There are two possible approaches to accelerate it. You can do faster matching of the regular expression itself. Or you can avoid matching each string in the collection by using an index. Modern regular expression engines are already well optimized, and even faster matching still doesn't let you avoid a full read of the whole string collection, so index support for regular expression search is very important when you need to do such a search on a large string collection. At first, I would like to do a gentle introduction to regular expressions,
because almost all of us know regular expressions, know how to write them and use them in practical tasks, but not all of us know the underlying theory. Some minimum of this theory is required to understand my talk.
So, we all know that a regular expression is a powerful tool for text processing, but not all of us know that it is based on formal language theory; now we know it.
Formal language theory introduces the notion of a language. A language is a set of strings, a possibly infinite set of strings. This theory also says that regular expressions express the same class of languages as finite automata.
This means that we can transform a regular expression into a finite automaton, and the regular expression will have the same set of matching strings as the corresponding finite automaton.
Moreover, such a transformation is not only valuable for theory; it is actually used by regular expression engines. So let's talk about what an automaton is.
An automaton is a graph whose vertices are called states, and whose arcs are labeled by characters of the string. We say that the automaton reads a string if we can trace the string by transitions between states, from a special state called the initial state to another special state called the final state. Let's consider an example. We have a simple regular expression which reads strings that contain character A, then the sequence of characters BC, which can be repeated any number of times, possibly zero times, and then character D. This regular expression will be transformed into the following automaton.
In this automaton, the initial state is marked by a star, and the final state is marked by a number sign. The initial state has a self-referencing arc which matches any character.
It means that we can start searching for a regular expression match at any point of the string, so we never lose this initial state. Then we can read character A,
and we can read any number of BC sequences by transitioning to the upper state and back. Then we can read character D and reach the final state. The final state also has a self-referencing arc. It means that once we reach the final state, we will never leave it.
That means that once we find a match, it doesn't matter what comes after; you have a match anyway. Your questions, please. Yes, then another state is activated.
So this state was active, then this state becomes active.
If the next character is not C, neither of these states is active. But the initial state remains active during the whole matching process. So at any moment of time, any number of states can be active, from zero to all states.
Wouldn't you want to keep reading the string? Would you want to go back to the initial state? Actually, if no state is active, we can stop, because if no state is active, it will remain so until the end of the string, and we know that we will not reach the final state.
I will consider an example; probably it will clarify this. For example, we have this string, and initially the initial state is active.
Then we read character X, and no other state is activated, only the initial state. The character is not A, but the self-referencing arc works and the initial state remains active.
The same situation repeats until we meet character A. When we meet character A, it activates the state in the middle.
The initial state remains active, and it will remain active during the whole matching process, because it has a self-referencing arc which matches any character. Then we find character B, and the upper state is activated.
And so on. Then we reach the final state, and this state will remain active until the end of the string. When we reach the end of the string and the final state is active,
we finish the matching process and we can say that the string matches the regular expression. Can you stop the first time you get to a final state? Actually, yes, because it has a self-referencing arc.
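The multi-state simulation just described can be sketched in a few lines. This is a minimal model: the state numbering and the transition table are my own encoding of the talk's /a(bc)*d/ substring example, not taken from the slides.

```python
import re

# NFA for the talk's example, /a(bc)*d/ searched as a substring.
# State 0 is the initial state with a self-loop on any character;
# state 3 is the final state, also with a self-loop on any character.
TRANS = {
    (0, 'a'): {1},   # read 'a'
    (1, 'b'): {2},   # start a "bc" repetition
    (2, 'c'): {1},   # finish a "bc" repetition, back for more
    (1, 'd'): {3},   # read 'd', reach the final state
}
INITIAL, FINAL = 0, 3

def nfa_search(s):
    """Simulate the NFA keeping a set of active states, as in the talk."""
    active = {INITIAL}
    for ch in s:
        if FINAL in active:
            return True          # final state's self-loop: once reached, never lost
        nxt = {INITIAL}          # initial state stays active (self-loop on any char)
        for st in active:
            nxt |= TRANS.get((st, ch), set())
        active = nxt
    return FINAL in active

# Cross-check against Python's regex engine on a few strings
for s in ["xxabcbcd", "ad", "abd", "hello"]:
    assert nfa_search(s) == bool(re.search(r"a(bc)*d", s))
```

Note that, as in the talk, the simulation never needs backtracking: the set of active states captures all possibilities at once.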
But I showed this process without any optimization. So now all of us know something about regular expressions. Actually, this picture is not very relevant, because this guy just writes regular expressions.
This doesn't mean he knows the underlying theory. But anyway, it is fun. OK, now about regular expression-based search in a database. The good news is that PostgreSQL can do regular expression-based search.
The bad news is that it is only a sequential scan. All techniques for index-based regular expression search are based on inverted indexes on q-grams.
So I would like to do a gentle introduction to what inverted indexes on q-grams are. A q-gram is a substring of the original string which has length q, and it can be used as a signature of the original string. What does it mean, a signature of the original string?
It means that you can say something valuable about a string without knowing the string itself, by only knowing that some q-grams are present in it and some are not. The q-gram technique is widely used in various string processing tasks,
not only regular expression matching. For example, it is also used in accelerated edit distance search. And the number of matching q-grams can also be used as a metric of string similarity,
as in the pg_trgm contrib module. An inverted index on q-grams maintains an association between each q-gram and all the strings where this q-gram is present.
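A minimal model of such an inverted index looks like this. This is simplified: the real pg_trgm also lowercases the text and pads words with spaces before extracting trigrams.

```python
from collections import defaultdict

def trigrams(s):
    """All 3-character substrings (a simplified model of pg_trgm's extraction)."""
    return {s[i:i+3] for i in range(len(s) - 2)}

def build_index(strings):
    """Inverted index: trigram -> posting list of string ids, in insertion order."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for tg in trigrams(s):
            index[tg].append(sid)
    return index

titles = ["theory of automata", "regular expressions", "the zzz paper"]
idx = build_index(titles)
# idx["the"] -> [0, 2]; idx["zzz"] -> [2]
```

Looking up a trigram returns the posting list of all strings containing it, which is exactly what the index-based search below consumes.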
pg_trgm is an implementation for one particular q, q equal to 3. Here is a little example of the logical structure of a pg_trgm index; it is not the physical structure.
The physical structure is much more complicated; this is just the logical structure. For each q-gram, you can find the numbers of the strings where this q-gram is present. Now, something about q-grams.
Q-grams have different frequencies in strings. For example, the DBLP paper titles, of which there are 2.5 million, contain very many titles with the trigram "the".
It is the most common trigram in the English language. But if we would like to find titles which contain the trigram "zzz", there is only one title containing it.
So this title caught my interest, and I present the information about it. Probably someone else is also interested in what the only title containing "zzz" is.
So, not all q-grams of a fixed q are equally useful. For example, you probably don't want to get the posting list of the trigram "the", the list of all strings where it is present, because it is very large. Probably you can do better without reading such a list.
It would produce more false positives, but it would anyway be quicker. To avoid these drawbacks of q-grams with fixed q, variable-length grams can be used, so-called V-grams or multigrams.
This means that each q-gram can have a specific q, chosen to make the selectivity of all q-grams similar.
We cannot make their selectivities equal, but we can make them fall into some not very wide range. For example, if some trigram is very frequent, we can use 4-grams in its place.
And if some trigrams are very rare, we would probably like to group such trigrams into one V-gram. This can lead to a more effective index search.
But there are some problems with getting this into Postgres. When a paper describes techniques of V-grams or multigrams, it describes them as if we had some fixed string collection: we collect frequencies once and build the index.
They don't describe how to maintain it online, and in Postgres we can't tell the user: once you build the index, you can't insert, update or delete in the table. That's unreasonable. This is the first problem, and this problem is avoidable.
But there is also a second problem. The second problem is patents and patent trolls, because many scholarly papers have corresponding patents which prevent us from implementing this in an open source database.
It is probably avoidable, because we can use some other algorithms, but I'm not a lawyer, and I don't know whether these patents cover the whole idea, or whether there is a way to avoid them.
So I will pose this question to PG Advocacy, and probably they will help to resolve this problem. It's probably not a good idea to mention or read the patents in the United States.
It's too late anyway. I'm not sure; we can discuss this later.
The next question is how to use such indexes for regular expression search.
The general idea is this. When we have some regular expression, we can write a logical expression which expresses which trigrams must be present in a string in order for the string to possibly match the regular expression.
This doesn't mean that the string necessarily matches. It allows some false positives, but it doesn't allow false negatives. For example, in this case, the first character should be A or B.
Then the trigram ACD or BCD must necessarily be present. And the last three characters are just a continuous fraction of the string, a sequence of exactly three characters, which means that the trigram CDE must necessarily be present.
Then we get the posting lists for all three of these trigrams. For all the string numbers, we get a table which shows, for each trigram,
whether it is present in the given string or not. Then we can evaluate this logical expression, and if the logical expression is true, we can fetch the original string from the heap. When we fetch the original string from the heap, we do a recheck using the actual regular expression.
The result can still be false, because our logical expression allows false positives.
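The whole pipeline, posting-list evaluation followed by a recheck with the real regex, can be sketched like this. The helper names and the toy data are mine, not pg_trgm's actual internals.

```python
import re
from collections import defaultdict

def trigrams(s):
    return {s[i:i+3] for i in range(len(s) - 2)}

def build_index(strings):
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for tg in trigrams(s):
            index[tg].add(sid)
    return index

def search(strings, index, regex, required):
    """required is the trigram expression in CNF, a list of OR-groups:
    [["acd", "bcd"], ["cde"]] means (acd OR bcd) AND cde."""
    candidates = None
    for group in required:
        ids = set().union(*(index.get(tg, set()) for tg in group))
        candidates = ids if candidates is None else candidates & ids
    # Recheck: the trigram filter allows false positives, never false negatives.
    pat = re.compile(regex)
    return sorted(sid for sid in (candidates or set()) if pat.search(strings[sid]))

data = ["xacdex", "bcdef", "acid", "cde"]
idx = build_index(data)
hits = search(data, idx, r"[ab]cde", [["acd", "bcd"], ["cde"]])
# hits == [0, 1]
```

Here "acid" is never fetched because it lacks the required trigrams, and any string passing the filter is still verified against the real regular expression.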
That is the general idea, but the challenge is how to do this in the general case. In this example it is very evident, but doing it in the general case is not so easy. First, let's consider the existing approaches to such q-gram extraction.
The first is a paper called Fast Regular Expression Indexing Engine, and this paper is still widely referenced in state-of-the-art work on this problem. In short, the method is this. We extract only continuous string fractions from the regular expression,
ignore the other parts in some way, and then transform these continuous fractions into multigrams.
When we get a logical expression on multigrams, we can use an inverted index. Let's consider an example. For this example regular expression, we get the following tree. It consists of logical operations, continuous strings,
and a symbol X which can be repeated any number of times. We represent that as a star node which has the character X as a child.
Then we replace the star with null. Null in general means something we don't know anything about. So then, if we have something unknown under an OR,
since it's an OR, we actually don't know anything about the whole subtree, and the parent OR will be replaced with null. But when null is a child of an AND, we can just remove the null,
because we know that this part is present, plus something unknown, so we can just remove the unknown part. We can also remove some unnecessary elements, and then replace the strings with q-grams.
In this example, they are trigrams. Another existing approach is Google Code Search. Google Code Search was launched in 2006,
and it supported regular expression search over open source code. This was very interesting for me when I found it, and I thought that the Google guys are smart: it can't be that they do a sequential scan of the whole corpus of source code.
It also seemed to me that they did something better than that paper, something smarter, but we didn't know what. Actually, we didn't know what they did until Google Code Search closed.
Then Google engineer Russ Cox published a description of this indexing technique in January 2012 on his website. So there were more than five years of intrigue, when we could use Google Code Search
but didn't know how it works. Now we can't use Google Code Search, but we know how it works. So, a brief description of this method.
It collects five characteristics about each part of the regular expression. Emptyable is just a Boolean which says whether this part of the regular expression can match an empty string or not.
Exact: possibly this part of the regular expression matches only some exact set of strings; then we fill the property exact. Prefix and suffix mean that this part necessarily requires the string to have some prefix or suffix.
Match means that this regular expression part requires some substring to appear somewhere in the matching string. These characteristics of regular expression parts are then combined recursively by some rules,
and an inverted index on trigrams is used for query evaluation. This index is very similar to pg_trgm. For example, take the same regular expression as in the first example.
At first we get the properties of the continuous strings; that's easy, they are just exact. When we apply the plus symbol, we know that this regular expression requires the string to have a certain prefix and a certain suffix.
It's pretty easy: when we add characters at the start of the regular expression, we add them to the prefix, and when we add characters at the end, we add them to the suffix.
Then the transformation is done: we get the characteristics of the original expression, and we can easily transform them into a logical expression on trigrams.
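A toy version of that attribute composition might look as follows. This is restricted to literals, concatenation and plus, and my combination rules are a simplification; the rules Russ Cox published also track match sets and handle alternation and star.

```python
# Each node yields: emptyable (bool), exact (set of exact matches, or None),
# and prefix/suffix sets of strings every match must start/end with.

def literal(s):
    return {"empty": False, "exact": {s}, "prefix": {s}, "suffix": {s}}

def plus(e):
    # e+ matches one or more copies of e: no longer an exact set, but every
    # match still starts with a prefix of e and ends with a suffix of e.
    return {"empty": e["empty"], "exact": None,
            "prefix": set(e["prefix"]), "suffix": set(e["suffix"])}

def concat(a, b):
    if a["exact"] is not None and b["exact"] is not None:
        ex = {x + y for x in a["exact"] for y in b["exact"]}
        return {"empty": False, "exact": ex, "prefix": ex, "suffix": ex}
    # An exact left side extends the right side's prefixes, and vice versa.
    prefix = ({x + y for x in a["exact"] for y in b["prefix"]}
              if a["exact"] is not None else set(a["prefix"]))
    suffix = ({x + y for x in a["suffix"] for y in b["exact"]}
              if b["exact"] is not None else set(b["suffix"]))
    return {"empty": False, "exact": None, "prefix": prefix, "suffix": suffix}

# For a regex shaped like ab(cd)+ef, the analysis yields the required
# prefix "abcd" and suffix "cdef", from which trigrams can be extracted.
r = concat(concat(literal("ab"), plus(literal("cd"))), literal("ef"))
# r["prefix"] == {"abcd"}, r["suffix"] == {"cdef"}
```

The talk's point is visible here: the plus destroys exactness, but prefix and suffix information survives and still yields useful trigrams.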
Okay, now we have considered the existing approaches, and I would like to talk about the proposed method, the method which I propose for use. My method is not based on analysis of the original regular expression, but on analysis of the corresponding automaton.
We transform it into an automaton-like graph, but this graph has trigrams on its arcs instead of characters.
Then we can possibly simplify the graph, and then use pg_trgm indexes. It is quite difficult to explain the whole method formally, so I would like to explain it with an example.
Let's consider a regular expression. This regular expression reads character A, then at least one character B or C, repeated any number of times, and in the end it reads character D.
There is a corresponding automaton for this regular expression, and now we can start our transformation. The first step is easy: we just put down the initial state of the resulting graph, and then we start to assign keys to this resulting graph.
A key is two things: the pair of last read characters, and the number of a state.
For example, from the initial state of the automaton, we can read character A and then character B, and reach state 2. So our key corresponds to state 2, and the last read characters were AB.
Similarly, we can read characters AB and reach state 4. And so on: we can read AC and reach state 3, and read AC and reach state 4.
Those are all possible keys for the initial state, and then we should look at which other states we can reach from these keys.
If we are in state 2, and the last read characters were AB, then according to the source automaton we can read character B.
This means that the trigram ABB is necessarily present in the string, the last read characters become BB, and we are still in state 2. According to another arc of the original automaton, we can also reach state 4 with the same trigram and the same last read characters.
Also, we can read character B from state 4, and if the last read characters were AB, then
the trigram ABB is necessarily present in the string, the last read characters become BB, and we are in state 5.
An important point is that if the corresponding state of the original automaton is final, then the corresponding state of the resulting graph is also marked as final.
We do a similar thing for the other keys, and when we have finished processing the initial state, we start processing the states which were produced in the previous step.
We can use the self-referencing arc of state 2: we read the trigram BBB and stay in the same state, so it becomes a self-referencing arc of the resulting graph.
The rest is a similar transformation; I will not describe all of it, I think the idea is understandable by now.
Then we get a resulting graph. What is the important property of this graph? If some string matches the regular expression, and so matches the corresponding automaton, then
we can traverse from the initial state of the resulting graph to its final state by only traversing arcs whose trigrams are present in the string.
This property allows us to treat this graph as a logical expression on trigrams, and to use it in the previously described method of regular expression search.
This result can be simplified: if we can get from here to here reading ABB, we
require only the trigram ABB; we could also get there by reading the trigrams ABB and BBB, but that doesn't matter, since ABB alone is enough. So we can simplify this.
The implemented simplification technique is to build the following matrix. This matrix represents all possible minimal paths which can bring us from the initial
state to the final state, and this matrix can easily be transformed into a logical expression. For example, the first row contains a one only for the trigram ABB, so the trigram ABB is the first summand.
The second row of this table has the trigrams ABB and BBD, which means that the second summand is the conjunction of the ABB and BBD trigrams, and so on for the other rows of the table.
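Evaluating such a path matrix amounts to an OR of ANDs over trigram sets. The rows below are hypothetical, written for a regex like /ab+d/ (the example's final character is garbled in the transcript), so they illustrate the shape of the matrix rather than reproduce the slide.

```python
def trigrams(s):
    return {s[i:i+3] for i in range(len(s) - 2)}

# Hypothetical path matrix for /ab+d/: each row is the trigram set required
# by one minimal path through the transformed graph, and the matrix as a
# whole reads as an OR of ANDs.
PATHS = [
    {"abd"},           # exactly one 'b'
    {"abb", "bbd"},    # two or more 'b's (also covers any run of "bbb")
]

def may_match(s):
    """True if the string MAY match: all trigrams of some row are present.
    False positives are possible; false negatives are not."""
    tgs = trigrams(s)
    return any(row <= tgs for row in PATHS)
```

As in the talk's implementation, iteration stops at the first satisfied row; a True result still requires a recheck with the real regular expression.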
Here is a comparison of the different techniques for extracting a trigram logical expression from a regular expression.
For this regular expression, the Fast Regular Expression Indexing Engine, FREE for short, can extract only the trigrams of the continuous fractions of the string.
The Google Code Search method gives a more comprehensive expression, and my method produces a similar logical expression, but not in such a simple form; it has some duplication.
Another example: here there is no continuous fraction of the string, because C++ will be replaced with null,
so no continuous trigrams can be extracted, and the FREE method can't extract anything. Google Code Search can see that the trigrams ABC and CDE must be present, and my method sees something
more: not only ABC and CDE, but also that there must be BCD, or the pair of trigrams BCC and CCD. There are also examples where neither the FREE method nor the Google Code Search method can extract any trigrams.
The Google Code Search method is confused by this star inside a plus; it's probably too complicated, because it can't see any fixed suffix, since this character C can be repeated any number of times, including zero times.
There are also some examples where the FREE method and the Google Code Search method can do something. Now, performance results on the DBLP paper titles with some regular expressions;
these are not real-life regular expressions, just something I wrote for testing. We can see a pretty dramatic difference between the index scan and the sequential scan,
and we can also see that the time increases when the regular expression becomes longer. That is because from a longer regular expression we can extract more trigrams, and we have to read more posting lists.
This is also avoidable, because if we had the selectivity of trigrams, we could ignore
some of them, getting more false positives but having to read fewer posting lists. So some optimizations are possible; even if we don't know the
selectivity of trigrams, we can still decide to exclude some trigrams from our search. A work-in-progress patch was posted to the mailing list; this patch is based on the pg_trgm contrib module, and any feedback on it is welcome.
This patch is a work in progress, and there are still some problems in it. The first problem is that the transformed graph can be very large. In this case, we should probably decide that when the transformed graph reaches some size,
we should stop and just do a full index scan. Also, the process of simplification can produce a very large table of possible paths. In this
case, we probably should not try to create such a table, but use just the transformed graph. And also, the use of trigrams for this particular task is not optimal; it would be better to use variable-length grams.
In my presentation, I made some simplifications for easier understanding, and I would
like to mention them so that my presentation isn't disinformation for anybody. I didn't mention the difference between deterministic and non-deterministic automata. A deterministic automaton can have only one arc marked with the same character from the same state.
Here we dealt only with non-deterministic automata, but regular expression engines actually transform non-deterministic automata into deterministic ones in order to match faster, because in
a deterministic automaton only one state can be active at a time. This is very good for performance. Also, in real life, regular expression engines don't mark arcs with exact
characters, because, for example, there can be some meta-symbol of the regular expression which represents any numeric character, any space character, or
any word character, and this would mean very many arcs. So regular expression engines group characters into colors. A color is a group of characters which are indistinguishable to the regular expression.
And also, regular expression engines have special methods for handling the start and end of the string, either the whole string or a line of it.
In short, it can be done by representing the start and the end as special characters. In our example of how a regular expression works we didn't use start and end characters, but regular expression engines use them.
My project also needs some help. I need string collections and regular expressions from real-life
tasks to prove the effectiveness of my method, and also to find some corner cases and optimize them. So if you have some datasets for me, it would be great; you should contact me.
So now, thank you for your attention. Ask your questions, please. Trigrams, because we already have indexes on trigrams implemented in Postgres.
But it could be done with, for example, 2-grams or 4-grams, and if we can avoid the problems with patents, it would be better to use variable-length grams.
The next question is about the selection of q in q-grams.
So if you built an index that has trigrams and quadgrams and all the lengths, then you could decide which lengths to look up in the index based on the stats.
If there are actually not enough trigrams to narrow the search down enough, then you can expand further and look for quadgrams. But you'd have to look at the stats to know which ones are more common or less common.
You said building an index which contains, for example, both trigrams and 4-grams. Yes, it's also an interesting possible approach. Okay.
These examples. Which particular slide? Next slide. This? Okay. You have some of your examples. Ah, about...
Does that just generate a lot of trigrams? Are you asking for the logical expression on trigrams for that example? What I'm curious about is that most of your examples use specific characters. Is that the reason the very first example there is the slowest?
Actually not. It just ignores this character, because there are too many characters matching "any character".
It's slower just because it's longer. It's not because of this part, but because the whole number of trigrams of the other part is larger,
and possibly they have greater selectivity and larger posting lists. That is why. So you have more trigrams, but they're more selective, like you said.
Not quite. If I have a more complicated expression on trigrams, I have fewer false positives, so I fetch the heap less. But
I have to read more posting lists, and there is a trade-off between the number of false positives and reading posting lists. So you could decide not to read them all? Yes, you don't need to read them all. If some posting list is large, and it doesn't influence the false positives very much, then with some estimation
I can decide not to read that posting list and remove it from the logical expression. Any other questions?
To what? Yes, it's possible. You may remember that pg_trgm converts multibyte characters to hash sums.
Yes, it's also possible to build the index on multibyte characters. It would require making the keys varlena.
It would add some additional index size, but it's definitely possible. And it would also have some advantages, because, for example, if you index trigrams and you would like to find all trigrams that continue some digram which the trigram starts with,
you can do this using partial match. When you have hash sums, you can't do that. So this would possibly be a better solution for that particular task.
Yes. I apologize if I missed it, but did you say how long it took? I didn't say that, because for now it's not a novelty of my work; it's just the same pg_trgm index as it is in HEAD.
Any other questions? Yes? I guess I have one that's a variation on other comments that were made. If you looked at the list of trigrams that you had, and you had frequencies for those trigrams,
then you could start by doing the queries for the least frequent ones. Yes, sure, but it's not so easy to implement in Postgres, because the collection of statistics does not depend on the index now.
And also, from the interface methods of an index implementation, you can't get any statistical information. There are two problems, but ideally we would collect statistics about trigrams when we have the corresponding index,
and could get the statistics from an interface function and do this optimization. Yes, it is the right way, and I would like to do it. Any other questions?
I'm just wondering how a simple positive match like that compares to regular expressions from which you can't extract trigrams. Like if I say backslash d, dot, backslash d...
Yes, there are many regular expressions from which we can't extract any trigrams. In that case we just have to do a sequential scan. For example, if you are searching using a GIN index, the planner calls extractQuery at the planning stage,
and if extractQuery says that it can't do anything better than a full index scan, then the planner decides not to do an index scan and does a sequential scan,
which is faster than a full index scan in this case. So if we have regular expressions for which no trigrams can be extracted, we just fall back to a sequential scan.
I think it's unavoidable. Any other questions? Yes? Okay, I didn't understand.
Please repeat the question. How do you represent that in memory?
I actually represent in memory this table. So I write this logical expression only for better explanation, but I just iterate over lines of table,
and check if all the trigrams marked with 1 are actually present in string. If it's so, I stop the iterating and say that the result is true. So the string is possibly matching regular expression.
And if I reach the end of the table without it, I return false. This means that string definitely doesn't match regular expression. Is it your own operator with your own name? There is an already existing operator in PostgreSQL.
I just add this operator to operator class in PGTRGM contrib model. My page just contains this additional line in the SQL file.
Any other questions? Yes? Is there any way to use this with negative matches? Negative matches. So you mean, for example, that some string which matches the regular expression
should definitely not contain some trigram. Do you mean that? Possibly, but then the regular expression would have to be both a prefix regular expression and a suffix regular expression.
So if it shouldn't contain some part but can contain anything else, and the regular expression contains this part, then no, it's not possible.
Also, if we don't allow this, we deal with a particular class of Boolean functions, the class called monotonic Boolean functions, because there is no negation.
And dealing with that is easier. Possibly I will find some better ways to do the simplification because it's monotonic; it simplifies the analysis. Any other questions?
It seems that no questions remain.