
PostgreSQL and Sphinx


Formal Metadata

Title
PostgreSQL and Sphinx
Alternative Title
Sphinx and Postgres
Number of Parts
25
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Production Place
Ottawa, Canada

Content Metadata

Abstract
Full-text search with the Sphinx extension: how to integrate both tools and obtain the best performance and reliability. This talk focuses on the newest and hottest features in Sphinx (2.1.1 beta) and PostgreSQL. It shows how to combine these tools, presents the new HA features, and explains how to achieve fast, simple text search.
Transcript: English (auto-generated)
Awesome. Well, thanks, everybody, for choosing this talk. I am Emmanuel Calvo, I am from Spain. Actually, I'm moving to Argentina, so I'm changing my country right now.

The talk is about Sphinx and Postgres, and maybe that's not something you usually hear, because in general we hear about Sphinx in relation to MySQL. But Sphinx does work with Postgres, although we have to recognize that Sphinx has more features with MySQL. I'm working for PalominoDB, a company based in Las Vegas. I am an operational DBA; we work with MySQL, Cassandra, Postgres, and other databases. I am also the Spanish press contact, and I work with the Argentinian community as well. This is my public profile. I need to thank Andrew Blad, one of the guys working with us, who helped out with the new features of the new Sphinx version, and all the PalominoDB people, obviously. These are all the service offerings of our company, so if you're interested, don't hesitate to ask more questions at the end of the talk.

Okay. Basically, the agenda is very simple. We know we have full-text search in the database, so why should we use Sphinx? Some of you are using Sphinx in production right now, or in development. What has your experience with Sphinx been?

[Audience] It's very small, pretty small. Yeah, we will see some of that, but yes, one of the features Sphinx has is that it's very small. It's just so simple. That's a difference from Solr; maybe you've heard about Solr, which has a lot more complex features. I don't want to say it's slower, but it's more complex than Sphinx.
So, the main question. Okay, this is the news; I will talk about it later. The goal of full-text search is basically that we need to search text with complex queries in the database. But you can't just run a LIKE with wildcards across one billion rows; that would be suicide. So we use full-text search, a technique built on a special index that contains, for each word, a vector specifying the positions of that word, so you can search using the index and then look up the rows in your table. On top of that, you can run more complex queries, for example with stemming: you want to search for "computers", but your table contains "computers", "computer" or "compute", so you search by the root word across a large data set using the index. The point of all this is to reduce the I/O.
You're using the index, so you're reducing the I/O instead of making full scans of the table. Has anybody here used full-text search before? Okay, awesome. In Postgres we have two options for full-text search: externally, using another tool such as Sphinx or Solr, or internally, using the native full-text search. Another thing full-text search can do is order the results by relevance; you can apply a rank. For example, if you have a text with a title and a body and you search for a word, you probably want the articles that contain that word in the title to come first in the result set, so you can add ranks for that. It is also language sensitive, because it's based on dictionaries: an index built with an English dictionary is not the same as one built with a Spanish dictionary. And obviously it's faster than regular expressions and LIKE operators in general.
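To make that concrete, here is a minimal sketch of the native approach (the articles table and its columns are hypothetical, used only for illustration):

    CREATE TABLE articles (id bigserial PRIMARY KEY, title text, body text);

    -- Expression index over the document; 'english' selects the dictionary.
    CREATE INDEX articles_fts ON articles USING gin (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    );

    -- Ranked search: setweight() makes title matches outrank body matches.
    SELECT id, title,
           ts_rank(setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
                   setweight(to_tsvector('english', coalesce(body,  '')), 'B'),
                   to_tsquery('english', 'computers')) AS rank
      FROM articles
     WHERE to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
           @@ to_tsquery('english', 'computers')
     ORDER BY rank DESC;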
Okay, so that's the question: why Sphinx? If you have native support in Postgres, why should we use Sphinx? The main point of Sphinx is that you isolate all the full-text search queries on other servers. All those heavy queries that would otherwise run on the database, you move away from the database and onto other nodes. And because Sphinx feeds on data that comes from the database, you can crash a Sphinx node and rebuild it without sacrificing any data consistency whatsoever. You don't have to care about the consistency of Sphinx at that point, because you can build up a lot of nodes, replicated nodes, with the data coming from the database. One thing I will say before continuing: Sphinx can feed those indexes in real time, so if you insert something into the database it can update the indexes, but that works only with MySQL. In the Postgres case, we feed everything asynchronously.
These are additional purposes, but basically the only thing you need to keep in mind is that the queries running on Sphinx are not transactional. I will explain how it works in production, but basically, what you have in Sphinx may not match what is in the database at a given moment, because you feed the indexes yourself: every five minutes, every ten hours, whatever you want. As I said, the native support includes things like real-time indexes, but with Postgres we use Sphinx asynchronously. I don't have many customers asking about real-time indexing, because if you have a lot of inserts and a really high load, it would be crazy to feed the full-text search engine with data constantly.

If you still want native support, there are some interesting cases. You can store the text externally: there are options and techniques using large objects in Postgres, so you keep the indexes in the database and keep all the raw text in documents on your file system. That way you don't fill up your database, because the documents actually live on the file system, but you can still search using the full-text search capabilities of Postgres. Or you can do it the other way around: store the whole document in the database because you want the integrity of that document, and index that document with Sphinx. A different approach.

The main thing with native support: try not to index everything. One of the most common mistakes customers make is "I want to search something, so I grab all the data I have and put it all into the indexes." It's a very, very common mistake. Be clever about how you index the data. Indexes are expensive, and searches get expensive if you have a lot of data in the indexes. So select your data: for example, index the titles, the subtitles, and the most frequent words of an article. Don't index words that appear only once or twice in the document.
This is another thing I will explain later, but basically Sphinx identifies every document by an ID, and that ID should be 64-bit if you want to store large data sets. On the Postgres side, ts_stat is what you query to check your full-text search statistics. And what about parsing? If you haven't used full-text search before: there are a lot of functions on the parser side, so you can, for example, parse a phrase and see how Postgres splits the words in full-text search to convert them into a tsvector.
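For example, with the built-in inspection functions (the column and table names are hypothetical):

    -- Word statistics across every tsvector produced from a column.
    SELECT word, ndoc, nentry
      FROM ts_stat('SELECT to_tsvector(''english'', body) FROM articles')
     ORDER BY ndoc DESC
     LIMIT 10;

    -- See how the parser splits a phrase into lexemes.
    SELECT alias, token, lexemes
      FROM ts_debug('english', 'PostgreSQL and Sphinx searching together');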
Okay, Sphinx. It's written in C++ and highly scalable. Highly scalable because you build the scalability yourself: you just pull data from your database and put it on nodes, so you are building up the distribution manually. There is no magic in Sphinx. For example, if you have a big database and you want to index only one table, but it's a very, very big table, you just grab a node and index only that table; then you grab other nodes to index all the other tables. The scalability is done by hand. And it's very scalable precisely because it's simple: you just need a configuration file that tells Sphinx how to reach the data. That's it. There is no other magic here.
But that's it. There is no other magic here. There are other things that have Sphinx. It has some options like you can create the snippets or just not only for the search,
you can, for example, know which are the most common queries or the misspelling database. For example, if you make a typo on the search, you can just query the misspellings dictionary. So Sphinx will give you all the correct,
the correct words. Okay, and this is obviously the return data from Postgres is the way that we are working now. This is one of the new,
This is one of the key new features, the one we will come back to at the end: failover searches. Basically, Sphinx doesn't manage failover in the sense that, if your node is broken, Sphinx won't fail the node over and create another one. The only thing it does is detect that the node is dead and stop using it. It's the same with the distribution: there is an agent mechanism that checks at an interval whether each node is alive or not. That's the only thing the agent does, but it gives you really nice consistency, because if you have a broken node and the search tries another node that is also dead, whatever is left just keeps working standalone. The same agent mechanism that works for the mirroring works for the distribution. You can have, for example, several nodes with several chunks: this node has articles one to one million, the second node has one million to two million. So you can search pointing at a single node, yes.
[Audience question] No, no, it's a single search. That's why I keep saying you do it manually. If you make a mistake building up the indexes, you're screwed, because you will spread the error across the nodes. So the main trick with Sphinx is: think about it before you do it. But the good part is that you're not touching the data; you're just retrieving data from the database, so you don't break anything if you make a mistake here. That's the best thing about Sphinx: you are isolating all this work on other nodes, and if you break a Sphinx node, you can just build up a new one. The other best practice with this approach: always keep the nodes small. Don't have huge Sphinx nodes with large data. Sphinx can handle a lot of data, but its sweet spot is small nodes spread out with the same indexes, with distributed indexes, so that if some node breaks, you can rebuild that node. And the other main thing: when you query the database, use small queries.
Don't grab the whole table into Sphinx. You can configure deltas, incremental indexes, so you don't always index from scratch. And you can build up range queries, which means that instead of pulling a huge result set into Sphinx, you query by ranges, say a thousand documents at a time, and bring over small pieces of the table so you don't overload the server.
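In sphinx.conf that looks roughly like this (a sketch; the documents table is hypothetical):

    # Fetch the bounds once, then pull the table in steps of 1000 rows;
    # $start and $end are substituted by the indexer for each chunk.
    sql_query_range = SELECT min(id), max(id) FROM documents
    sql_range_step  = 1000
    sql_query       = \
        SELECT id, group_id, title, content \
        FROM documents \
        WHERE id BETWEEN $start AND $end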
Well, what's new in Sphinx? This version is not production-ready yet; the latest stable release is in the 2.0 series, and this is a beta. It has some new features, and the one we are after right now is the HA support, for high availability. Another thing that may be interesting, because Sphinx is being used for data warehousing: when you need to know, say, which users are looking for tables, and you want to see the related searches of those users. For example, in general, the users looking for tables are also looking for computers, or for things for the living room. So you can build up related searches and push them to the data warehouse, and you can offer customers specific suggestions: if you buy a table, maybe you want to buy some other thing with it.
that if you buy a table, maybe you want to buy a, for example, other thing. And this one is related with the vrun indexing also,
which is other feature that was added specifically for data warehousing, which is you can specify what are the most common words that you will face on the document. For example, if you have a site that is related to databases, maybe for sure the most common words
will be databases, tables, and schemas, for example. So you want to specify those words here. So it will index different if you just don't say nothing.
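As a sketch, the relevant index directives look like this (the word list is just the example above):

    index idx_documents
    {
        # ... source/path settings omitted ...
        # Index keyword pairs involving these frequent words as single units,
        # which speeds up common phrase matches.
        bigram_index      = both_freq
        bigram_freq_words = database, table, schema
    }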
Okay, and those are the more advanced features in this version.
Wordforms. When you build up wordforms, it's not a synonym; it's different. You can declare, for example, the words inside "iMac": the "i" is the first part, and the second part, "Mac", is the Apple product. With wordforms you specify which words live inside a word. It's just an example; you can build anything with it. Say you have a game called Battle Force: you can build a wordform that says, if you find this word, there are two words inside it. Then when you search, you can search by the internal words that the word contains. If you search "Mac", it will search "Mac" as a word and find it; and if you search "iMac", it will also find "Mac", because of the wordform. It's not simply a synonym. You can build up synonyms too, but that's different: a synonym is just a link, you say this word is that word. A wordform is different: if you have a long word that has several words inside, you can split it into those words and search by them.
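A sketch of how that is wired up (the file path and mappings are illustrative; one-to-many mappings like the iMac example need a Sphinx version with multi-word wordform support):

    # /usr/local/sphinx/etc/wordforms.txt -- "source > destination" rules,
    # applied both at indexing time and at query time.
    walks > walk
    imac > i mac

    # In the index section of sphinx.conf:
    index idx_documents
    {
        # ... other settings omitted ...
        wordforms = /usr/local/sphinx/etc/wordforms.txt
    }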
and search by those words. And the steaming is basically what I told before. When you have, for example, computers, computer, you have the root word is the steaming. So that's why full text search is dictionary-based
because if you search computer in a Spanish dictionary, you won't find anything because computer is English. So basically steaming, what it does is if it found, for example, yeah, computers,
it will just store in the index computer because it's the steaming word of this computers. Yeah, basically it's that. It is, you can see very easily if you build up a full text search index.
I will show some examples now because it's very easy to build up. You just convert that. You just grab a phrase and convert it to DC vector and you will see the steaming actually. You will see that how it converts each word in the steaming.
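For instance, you can see the stemming directly in psql:

    -- All three forms collapse to the same stem.
    SELECT to_tsvector('english', 'computers computer computing');
    --  'comput':1,2,3

    -- The same words against a Spanish dictionary are stemmed differently.
    SELECT to_tsvector('spanish', 'computers');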
How to feed Sphinx. Well, MySQL and Postgres are the main options: whether you install Sphinx from packages or compile it yourself, you will see those two databases among the options. There is also MSSQL support, basically via ODBC, and you can feed Sphinx with XML as well. There's not much science to it. The main things when you configure Sphinx are where to look for the data, meaning where you will grab it from, and how to process it. It's what I told you before: don't index everything; on the Sphinx side you can choose which data you actually want to index. And, finally, where to store the index.
So you can choose in the Sphinx side which data you want to index actually. Where to start index. And this is another funny thing because actually you can build up a Sphinx node and for some reason you want to build up several instances in the same host.
Why you should, why you want to do that? Because if, for example, you made a mistake, building the distribution or building up, for example, the new indexes, you don't need to kill up to kill the whole instance. You kill only the instance that you want to kill up.
So you can store indexes in several places and have isolated every part of the indexes on your disk instead of have everything in one single database. Data is not a database, but it's all,
but basically you can store in different parts the data. The sections. When you configure Sphinx, you will see there are a large configuration file. Basically it has those sections, which is the source, where is the data coming on,
the index, how to process the data, the indexer, which is how many resources I will use to index, and the search, which is the daemon itself. So here you can set which ports to use,
the ports, the logs, the PID files, et cetera. Those are one of the most common full-text search extensions that has Sphinx. As you will see, there is not too much difference
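A stripped-down skeleton with the four sections (all values are placeholders from my test setup):

    source src_documents
    {
        type      = pgsql
        sql_host  = localhost
        sql_user  = sphinx
        sql_pass  = secret
        sql_db    = mydb
        sql_query = SELECT id, group_id, title, content FROM documents
    }

    index idx_documents
    {
        source = src_documents
        path   = /usr/local/sphinx/data/idx_documents
    }

    indexer
    {
        mem_limit = 256M
    }

    searchd
    {
        listen    = 9312            # native API port
        listen    = 9306:mysql41    # SphinxQL port
        log       = /usr/local/sphinx/var/log/searchd.log
        query_log = /usr/local/sphinx/var/log/query.log
        pid_file  = /usr/local/sphinx/var/log/searchd.pid
    }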
Those are some of the most common full-text search extensions Sphinx has. As you will see, there is not much difference from the full-text search in Postgres. The most usual operators are AND and OR, and it's very similar to regular expressions. The other usual search is the proximity search, which is done with trigrams. I don't know if you have heard about the trigram algorithm; Sphinx uses that algorithm for proximity. Basically, trigrams slice a word into runs of three consecutive letters: for "hello world", the first trigram will be H-E-L, the second one E-L-L, and so on, and then it compares proximity using those groups of characters. There is also a NOT operator, but the most common are AND and OR, and they will usually be more than enough, because on a typical site you only search for which words are similar; you don't get very complex queries in practice.
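A few examples of the matching syntax, as used inside MATCH() in SphinxQL, which I will show in a minute (the index name is from my test setup):

    SELECT id FROM idx_documents WHERE MATCH('hello world');      -- implicit AND
    SELECT id FROM idx_documents WHERE MATCH('hello | world');    -- OR
    SELECT id FROM idx_documents WHERE MATCH('hello -world');     -- NOT
    SELECT id FROM idx_documents WHERE MATCH('"hello world"~10'); -- proximity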
The connection. Well, this is interesting. You can connect to Sphinx through an API, like with every other database. Or you can use the Sphinx storage engine for MySQL, which is more interesting; that's one of the best things Sphinx has with MySQL. It means you can build up your MySQL instance, connect to MySQL, and use the Sphinx indexes as tables inside MySQL: you can query the indexes that live in Sphinx as if they were plain MySQL tables. It's like a storage engine, like InnoDB or MyISAM; well, actually it's a plugin, a storage plugin. It's one of the things we have been hearing about on the mailing list, where somebody was talking about pluggable storage, and basically it's related to that, and maybe also related to foreign data wrappers. And the other thing is SphinxQL. Sphinx has its own query language. I don't want to call it SQL, but it's very similar to SQL.
You can use that language through the MySQL client: just install the MySQL client, or if you have it already, connect to Sphinx and you will see that it accepts very similar SQL queries. The only differences are that the source is not a table but an index, and that searches look like a MySQL full-text search without the AGAINST clause. In MySQL, when you do a full-text search, you use two operators: MATCH and AGAINST. Here you don't use AGAINST; you use MATCH, and it matches against the hidden full-text columns. I will show you why I say hidden columns.
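A minimal sketch of a SphinxQL session (the port is the SphinxQL listener from my configuration):

    $ mysql -h 127.0.0.1 -P 9306
    mysql> SELECT id, group_id FROM idx_documents
        -> WHERE MATCH('america') LIMIT 10;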
[Audience] Sorry, using the Postgres client? No, that doesn't work. Well, it could work; I didn't check it, and it would be very tricky, but basically it could, because a storage engine is totally transparent to the user: when you connect to MySQL and query a table, you have no idea which storage engine it's using. So maybe it would work, but it would be very weird to have a Postgres query going through MySQL to hit a storage engine that lives in Sphinx. But basically, it could. The main thing is at the end of the keynote, but that's a good question. I was searching for foreign data wrappers for Sphinx, and I didn't find one, so it would be nice to see why, or maybe we can build a new foreign data wrapper for Sphinx. It's not that complicated, because the queries are very similar to the MySQL foreign data wrapper's, so we could build a foreign data wrapper from that. Or, in the future, if the pluggable-storage project comes along, it would be awesome to have a Sphinx storage plugin on the Postgres side.
One thing I will say, because it's related: don't forget that when Postgres does full-text search, it uses indexes, so you don't see how the data is indexed. You can store the full-text search tsvector in the table, but it's not that useful. Sphinx stores everything in indexes. It's not like Postgres, which has tables and indexes; Sphinx has indexes only. That's the main difference between Sphinx and Postgres, between Sphinx and databases in general. This syntax shows some of the options you have with SphinxQL. There is ranking too, though it's not shown here; for example, you can order by the ranker, which means ordering by rank instead of taking the results in the order they come. But basically, it's very similar to SQL.
Other features: the shield instance, which I never used; it's new in this version, and I didn't have time to test it. Maybe it's interesting for people using Postgres who want a Sphinx node to isolate overhead from the database: you keep that data on the Sphinx side and use it there instead of pushing queries to the Postgres side. The range queries are basically queries by intervals; I don't want to claim anything, but around the same time we see the interval feature in Postgres, we see the interval feature in Sphinx. Anyway, it's cool. The misspelling services: you can actually build a configuration just for the misspellings, and I will show you where to find it in the code. But the first thing to say here is that you need to have the words, and you need to build up the dictionary. It's not magical either: first you need the correct words in the database, and then you build up the dictionary. Also, as I told you before, there are the suggestion services: you can set Sphinx up to show the most popular queries. Like Google, when you type something and it shows you the suggestions; basically, Sphinx can do that. Related searches are used for data mining too. This is what I told you before: if you buy something, maybe you want to say to the customer, "you can buy these related things, because our users buy them together." Okay, the compilation. Why don't I recommend installing it from packages?
Because basically the main package is built against MySQL. So if you want to run Sphinx with Postgres only, and keep it small, compiling is the only way to get Sphinx with Postgres-only support. The command is basically the same as for any other software you compile. The only thing you need to keep in mind is the --enable-id64 option, for big indexes. The other lines are very basic; the libraries and includes are the only other thing to watch. They are maybe not mandatory, because it works without them, but they are definitely recommended. Then make install; and obviously, if you compiled it and the libraries are not found, import the libraries.
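The whole build, roughly (paths vary per system; this sketch assumes the Postgres client headers are already installed):

    # Postgres-only build with 64-bit document IDs.
    ./configure --without-mysql --with-pgsql \
                --enable-id64 \
                --prefix=/usr/local/sphinx
    make
    sudo make install
    # If searchd/indexer can't find libpq at runtime, refresh the loader cache:
    sudo ldconfig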
And then, okay: for this presentation I used a sample SQL file. The main Sphinx distribution doesn't come with a Postgres-compatible sample SQL, so I built one; I can show you the code, which is basically the same. It creates some tables with some data. And I tested it, and it works with Postgres-XC. Does everybody know what Postgres-XC is? The new cluster project; basically, it's a fork of 9.1 with clustering capability: you can build up distributed tables, replication, and so on. This build is just compiled using the libraries from Postgres-XC, but it works; that's the best part. Well, this is the basic schema of how it works: the application queries Sphinx first, and Sphinx always returns IDs. That's the next thing you need to know.
It doesn't return the data; I mean, it doesn't return the additional data. When you build up an index and search for a word in it, Sphinx returns the columns that identify the data: the ID. So if you search for, say, "tables", you will get an ID back from Sphinx, nothing more; you won't get the whole phrase of the data. That's why Sphinx is not a database you can use as your main storage. It has always been used for...

[Audience] Sorry? The order, yes: you can have the rank of each ID, but you don't have the data. When you query Sphinx, Sphinx searches the data but doesn't return the data itself. What you can do is return one or two extra columns: for example, for each ID you can have a related column, such as the group, taken from the data. I will show an example of that too. But basically, it returns the ID. If the application needs more data, it goes to the database; if not, it just uses that ID.

Yes, well, with MySQL you can also build it on the storage engine, which automatically brings back all the data through the engine. That's for MySQL, but the idea is the same. That's why I said: avoid indexing everything, and keep it simple and small, because Sphinx is useful for the most common queries. For example, you need to know whether a user exists. You don't want to query the database for that: you query Sphinx, Sphinx returns an ID, and with that ID you can look up whatever you want in the database. Just imagine how much load you isolate from the database if you serve those simple queries from Sphinx: you have the table with the users, and you just search on the Sphinx side. And the indexing side is obviously just grabbing data from Postgres and pushing it into the indexes. This is the common approach, the basic schema of how to implement Sphinx.
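A sketch of that two-step flow (names are from the demo schema; the returned IDs are made up):

    -- 1. Ask Sphinx, via SphinxQL, for matching document IDs only.
    SELECT id FROM idx_documents WHERE MATCH('america') LIMIT 10;

    -- 2. Fetch the full rows from Postgres using the IDs from step 1.
    SELECT id, title, content
      FROM documents
     WHERE id IN (3, 17, 42);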
[Audience] You mean for debugging purposes? No, Sphinx won't show you the row itself by default, only the matches.

You mean when you are feeding the indexes? Yes, you can build it however you want: if you have a table with several columns, yes, because you manage how to index. In fact, you can build one index from several tables: you can have a huge join across a lot of tables and still have only one index. When you are feeding the indexes, you feed them as you like, because, as I will show in the configuration now, you are just using queries. Sphinx doesn't care how you select that data; it just pushes it into an index.
The Sphinx daemon is always standalone. You can use agents for failover searches, but there is no failover per se. By failover searches I mean: you are searching against a search daemon, one of the nodes goes down, and the daemon just says "okay, that node is out" and keeps working with the other nodes. It doesn't behave like a failover system itself, so be careful with that. Sphinx is basically standalone. The only addition it has is the agent mechanism that knows where to balance the queries; it's more load balancing than failover. This is an example of how to start it. I will go straight to the example so we can move on; we only have 15 minutes. Okay, on this side I have the Postgres database. I don't actually need Postgres right now, because I won't show anything there.
Basically, once you have compiled Sphinx, you will have bin, share, var, and etc. I added a data directory, because you can create a folder wherever you want to contain the indexes themselves, and just point the data there. You don't need to build a cluster like with Postgres; you don't need to run any initdb, nothing. You just create a folder, and when you configure Sphinx, you push all the indexes into that folder, and that's it; you don't have to do anything more. And the binaries, the primary tools: you will see it's not a huge toolset.
You have indexer, the tool that grabs all the data and pushes it into the indexes. indextool can do some specific maintenance tasks. One very useful operation is merging indexes, which runs through the indexer's merge mode: if you have really big indexes, instead of rebuilding the whole index from scratch, you grab two indexes and merge them into one. Less I/O, and faster. The search command is a mysql- or psql-style command-line tool for running searches. searchd is the main daemon. spelldump, and I don't want to say it's like pg_dump because it's not that, is basically the tool for dumping the dictionaries. And wordbreaker is basically what the ts parser is in Postgres: it shows you how Sphinx breaks up the words.
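The day-to-day invocations look roughly like this (the config path is from my test install):

    # Build every index defined in the configuration.
    indexer --config /usr/local/sphinx/etc/sphinx.conf --all

    # Merge the delta index into the main one instead of a full rebuild;
    # --rotate makes a running searchd pick up the merged result.
    indexer --config /usr/local/sphinx/etc/sphinx.conf --merge main delta --rotate

    # Start the daemon once the indexes exist.
    searchd --config /usr/local/sphinx/etc/sphinx.conf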
Okay, and the configuration. This configuration is very basic, because it's a test one. The main part is this one: each source section in Sphinx has a type. In this case we are using pgsql, which is the library used to connect. The other variables are self-explanatory: host, user, password, database, port. Then the main SQL query. And here is another attribute: I have a table, which you will see here, the main table we build the document index from, and we are selecting the ID, the group ID, and then the rest of the data. You can see this is a normal SQL query, and you can put whatever you want here. In fact, instead of having the title and the content separately, you could concatenate both columns and make only one if you don't care about the title. This is how I build up the index later. You will see that there are several queries. Forget about this one, because it's the MySQL one; I keep it in the configuration file to show you how it works in MySQL and how it works in Postgres. Basically, the SQL query is that one. The file doesn't respect the order, but there are several steps. I will show them in the keynote, because it's explicit there.
Oh, I quit the slides. Okay, here it is. When you are pulling data from databases, you have a flow, which starts with the pre-query. I put MySQL here, but it's Postgres; every initial setup happens there. That's why we first look up the max ID: the last-run table will be used for the incremental indexes. Then we run the main query, and after that there are two steps: the post-query and the post-index query. The difference between them is that the post-index query executes only once the indexer process is done and successful, whereas the post-query executes as soon as the main query is done. So if you need to update anything in the database, do it in the post-index step instead of the other one, because if you update a table in the post-query and for any reason the indexer fails, your last-run table ends up with an ID that is not correct. You will see the incremental part here: once the main SQL query has run, it updates that table with the last ID.
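A sketch of that flow with a checkpoint table (names are from the demo schema; $maxid is the indexer macro holding the highest document ID fetched):

    source src_documents_delta
    {
        type = pgsql
        # Initial setup runs first, before anything else.
        sql_query_pre = SET search_path TO public
        # Main query: only the rows added since the last successful run.
        sql_query = \
            SELECT id, group_id, title, content \
            FROM documents \
            WHERE id > (SELECT last_id FROM sphinx_last_run)
        # Runs only after the indexer finished successfully, so the
        # checkpoint is never advanced past a failed build.
        sql_query_post_index = UPDATE sphinx_last_run SET last_id = $maxid
    }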
This is not the best approach, but for a test it worked: it takes the max ID of the test documents table, and when it starts indexing again, it grabs the last-run table and queries the documents table starting from the last ID stored there. It's very easy, but as you can see, it's very manual; there is no magic, and Sphinx doesn't guess anything. You need to set everything up. The main point is: for each table, you need a checkpoint table, the table that contains the last processed ID. Then comes the index section, which is the most interesting part, because once you have the data, you need to process it. Here you specify the synonyms, the exceptions, the wordforms: everything about processing the index goes in that section.
And okay, this is the new feature, implemented in Sphinx 2.1: the agents. I have two hosts here, and you will see that agents work per index. You don't have one agent covering all the indexes; you specify which hosts, which nodes, hold each index. For example, you can have three nodes where the indexes are not replicated across all three: an index can be replicated on only two nodes while the distribution is different. You specify everything here. Yeah, it's a mess, but you can do it.
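A sketch of per-index agent mirrors, as introduced in 2.1 (hosts and index names are placeholders):

    index dist_documents
    {
        type  = distributed
        # Chunk served locally by this node.
        local = idx_documents_shard1
        # Two mirrors for the second shard, separated by '|';
        # searchd sends each query to one alive mirror.
        agent = node1:9312|node2:9312:idx_documents_shard2
    }

    searchd
    {
        # ... other settings omitted ...
        # Ping agents every second to track their dead/alive state.
        ha_ping_interval = 1000
    }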
The indexer section, as I told you, sets the resource usage for the indexer process; and the searchd section sets which ports you are listening on. It's like every other database: always use a different port per instance. And don't forget to change the PID file. If you need several instances on the same node, don't forget to change the PID file, the logs, the query logs. There is a binlog path too: Sphinx has a binlog, like MySQL. Postgres has the WAL; Sphinx has its binary log, and that's it. For example, I have two searchd instances here.
I should have only one, but okay; those are tests, yeah. [Audience] Yeah, yeah, exactly. And it should be like that. Here you can change it like this, just so you don't push anything weird. Oh, and one thing: the very first time you run this, you can't start searchd, because you don't have any indexes yet. You need to index the data first.
Okay, awesome. Here we are. I think this one is the failed one: it started, but with nothing, and the second one started after the indexer. But basically it's like this: it's just a process, and then you can connect to the MySQL... sorry, to the Sphinx. For example, like this, and here we are: MATCH('america'). As you can see, you can search the data with SphinxQL, and we can also look at the data on the Postgres side. Oops, okay. For example, I search for "america".
As you can see, it's case-insensitive, and it returns the ID plus the other attribute I specified in the source. If you remember, when I configured the source in the configuration, one of the attributes was the group ID. Why do I return the group ID as well? Because maybe you have a site where you are searching for some product: each product has an ID, but it also belongs to an area, for example technology, or books, or whatever. So you can create your own group ID and search all the related books, for example, not only the book itself. And you can see the same data here on the Postgres side. I should have another example, but yes, it's grouping by the group ID here.
This table doesn't make any sense; it just has random values, so don't worry. Okay. The range query is what I told you about: if you want to avoid large sequential scans, you use the range query, which sets the step for each result set, so you avoid bringing up one million rows in a single step. It's very useful. Delta indexing is what we saw before: indexing only the incremental data. And the list of searches is this one: the agents use a protocol to check whether the other agents are alive.
There is a setting called ha_strategy. You can balance queries in random mode, which just picks any node; you can have round-robin, using one node at a time; and the most interesting one is the adaptive mode, which scores each node by its condition. If a node is doing badly because it has a really high load average, its score drops, so it gets used less; and when the node starts working better again, the score rises, and Sphinx uses that node more than before, according to the node's load average. That's what happens with the mirrored indexes. Distributed searches are different; this is another example, with mirrors.
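The strategy is one line on the distributed index (a sketch; the exact set of strategy names depends on the Sphinx version, so check the documentation for your build):

    index dist_documents
    {
        type  = distributed
        agent = node1:9312|node2:9312:idx_documents
        # e.g. random or roundrobin; the weighted strategies score
        # each mirror and prefer the healthier ones.
        ha_strategy = roundrobin
    }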
This is the example we saw in the configuration. And the foreign data wrapper: I hope that at today's hacker lunch maybe somebody will speak about the foreign data wrapper, and maybe somebody just wants to help out with that. It would be awesome to have a foreign data wrapper for Sphinx. Any questions? Thank you very much, and I hope to see you over the next days. Thanks.