What is the best full text search engine for Python?


Formal Metadata

Title: What is the best full text search engine for Python?
Author: Soldatenko, Andrii
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Andrii Soldatenko - What is the best full text search engine for Python? Compare full text search engines for Python.

Nowadays we can see lots of benchmarks and performance tests of different web frameworks and Python tools. Regarding search engines, it's difficult to find useful information, especially benchmarks or comparisons between different search engines. It's difficult to decide which search engine you should select, for instance ElasticSearch, Postgres Full Text Search, or maybe Sphinx or Whoosh. You face a difficult choice, which is why I am pleased to share my experience and benchmarks and to focus on how to compare full text search engines for Python.
Nowadays we can see lots of benchmarks and performance tests of Python tools and different web frameworks. Regarding search engines, it's really difficult to find useful information, especially benchmarks comparing different search engines, and that's why it's really difficult to select a search engine for your project, whether you are starting a new project or continuing an existing one. The agenda for today: a little bit about me, what full-text search is, the differences between full-text search engines like PostgreSQL, Elasticsearch, Whoosh and Sphinx, search accuracy and search speed, and what's next.

I'm a backend Python developer [unclear], and with my friends I created a startup [unclear]. I speak at conferences and I have a blog.

I can't believe that 18 years ago there was no Google; there were other web search engines [unclear]. But what is even more unbelievable is that 26 years ago there was no web search at all. Now the world has rapidly changed: the volume of information available and the bandwidth give us the opportunity to get this information, but unfortunately the processing rate at which a human being can consume information hasn't changed much. This inevitably transforms searching from something that only a few people cared about into something that every single person has to do on a daily basis.

Let's start with simple text search. It's not a new problem: every day when we develop something we search our code. Say we have a project code base and we try to find all occurrences of the OrderedDict class. The common tool on every Unix platform is grep, and you see that it finds them in less than 3 seconds on my laptop. If we do the same task using ack, which is like an improved grep optimized for programmers, it takes a little less than 2 seconds. One of my favorites, the Silver Searcher, takes less than 1 second. And my most favorite one is the Platinum Searcher, written in Go [unclear], because it's super fast.
But OK, that is direct search, when you know the particular string, and it's a simple problem. What if we talk about full-text search? Full-text search provides the capability to find natural-language documents that satisfy a query and to sort them by relevance to the query. If you plan to read more, there are many books about search [unclear].

The purpose of storing an index is to optimize speed and performance. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The disadvantages of an index are the additional computer storage required to store it and the time needed to create or refresh it, because data can change.
Let's imagine we have a simple example with 2 sentences and we try to build an inverted index, the core of full-text search, for these 2 sentences. First we split the content field of each document into words, and then we create a sorted list of them: you see the first column with terms like quick, the, brown, etc. Then we mark the occurrence of each term in each document and the place where it occurs. In my example I exclude the positions and keep only the fact that the term exists in the document. If we then try to run a search query, "quick brown", we look up the words quick and brown in the table, and we see that brown exists in both documents and quick in only 1.
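The index-building steps described above can be sketched in a few lines of Python. This is only an illustration: the two sentences are assumed (the slide text is not in the transcript), the tokenizer is a plain whitespace split, and positions are left out, as in the talk's example.

```python
from collections import defaultdict

# Two example sentences (assumed; not taken from the talk's slides).
DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive whitespace tokenizer, no normalization
            index[term].add(doc_id)
    return index


def search(index, query):
    """For every query term, return the sorted ids of documents containing it."""
    return {term: sorted(index.get(term, set())) for term in query.split()}


index = build_inverted_index(DOCS)
result = search(index, "quick brown")
print(result)  # {'quick': [1], 'brown': [1, 2]}
```

Because nothing is normalized yet, "Quick" and "quick" end up as two separate terms, which is why the lowercase query term matches only the first document.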
But you can see that I have a little redundancy in my index. The way to fix it is normalization: we should lowercase everything, convert plurals to singular, etc., and maybe use synonyms, like jump and leap.

OK, let's talk about which search engines we can use at the current moment. There are lots of different search engines, but today we'll talk about only four of them: PostgreSQL full-text search, Elasticsearch, the pure-Python Whoosh, and Sphinx.
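The normalization steps mentioned above (lowercasing, reducing plurals to singular, mapping synonyms) might look roughly like the sketch below. The plural handling is a deliberately crude assumption rather than a real stemmer, and the synonym table is hypothetical.

```python
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"leap": "jump", "jumped": "jump"}


def normalize(term):
    """Lowercase a term, crudely singularize it, then map synonyms."""
    term = term.lower()
    # Very crude plural handling (an assumption, not a real stemmer):
    # "foxes" -> "fox", "dogs" -> "dog".
    if term.endswith("es") and len(term) > 4:
        term = term[:-2]
    elif term.endswith("s") and len(term) > 3:
        term = term[:-1]
    return SYNONYMS.get(term, term)


def tokenize(text):
    """Split text into alphabetic words and normalize each of them."""
    return [normalize(word) for word in re.findall(r"[A-Za-z]+", text)]


print(tokenize("Quick brown foxes leap over lazy dogs in summer"))
# ['quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog', 'in', 'summer']
```

With terms normalized like this at both index time and query time, "Quick" and "quick", or "foxes" and "fox", collapse into a single index entry.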
Let's start with PostgreSQL full-text search. Everybody uses PostgreSQL. It was created by Michael Stonebraker [unclear] in 1986, and an interesting fact is that full-text search has been supported since version 8.3. That means you can use it in every project, because I'm not sure anybody still uses a version older than 8. The latest stable version is 9.5 [unclear].

Let's see examples. We have a simple query and a simple text, and we try to do a full-text search with this query. In the context of PostgreSQL we should use two functions: to_tsvector, the text search vector format into which we transform our data, and the special function to_tsquery for text search queries. The result looks like this: you see that it returns t, which means that we found a match for our search.

OK, next. Full-text search is all about indexes: if you want to understand how it works, you should understand how indexes work. PostgreSQL provides two kinds of indexes. The first is the Generalized Inverted Index (GIN), which I showed before in the example, and the second is the Generalized Search Tree (GiST). The last one is a little bit lossy: a GiST index might produce false matches, because it uses a limited hash function for the text you search, which means that different words can be represented by the same hash and you can get a false match. That's why I would not recommend it [unclear]. The difference between these indexes is simple: when you have static data, meaning that it does not change often, you can use GIN; if you have dynamic data which changes every day, every minute or every second, you should use GiST.
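As a rough sketch of the match test and the two index types described above, the snippet below keeps the SQL in Python constants and runs the match through a DB-API connection. The table `pages(body text)`, the index names and the helper `matches` are hypothetical, and the `%s` placeholder style assumes a psycopg2-like driver.

```python
# SQL for a one-off match test; it returns a single boolean column.
MATCH_SQL = "SELECT to_tsvector('english', %s) @@ to_tsquery('english', %s)"

# Index definitions for a hypothetical table `pages(body text)`.
GIN_INDEX_SQL = (
    "CREATE INDEX pages_body_gin ON pages "
    "USING gin (to_tsvector('english', body))"
)
GIST_INDEX_SQL = (
    "CREATE INDEX pages_body_gist ON pages "
    "USING gist (to_tsvector('english', body))"
)


def matches(conn, text, query):
    """Return True if `text` matches the tsquery string `query`.

    `conn` is assumed to be a DB-API connection to PostgreSQL that uses
    the `%s` parameter style (for example one created by psycopg2).
    """
    cur = conn.cursor()
    try:
        cur.execute(MATCH_SQL, (text, query))
        return bool(cur.fetchone()[0])
    finally:
        cur.close()
```

In a tsquery string, `&` means AND and `|` means OR, so a call such as `matches(conn, some_text, "quick & brown")` asks whether both terms occur in the text. Only one of the two index statements would normally be created for a given column, following the static-versus-dynamic rule of thumb from the talk.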
The next important thing is ranking of search results. Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. Sometimes it's very useful. PostgreSQL provides two predefined ranking functions, ts_rank and ts_rank_cd, where cd stands for cover density, and I have a small example for you of how to use them.

Next, last but not least, is highlighting results. Every user wants to see how the results correspond to what he tried to search for. For this PostgreSQL has the ts_headline function: you just use this function and it returns your results with the search terms highlighted.
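A ranked and highlighted query built from the functions named above could be sketched as follows. The table `pages(id, body)` and the helper `ranked_search` are hypothetical, and the connection is again assumed to be a psycopg2-style DB-API connection.

```python
# Rank matching rows with ts_rank_cd and highlight the matched terms
# with ts_headline. The table `pages(id, body)` is hypothetical.
RANKED_SEARCH_SQL = """
SELECT id,
       ts_rank_cd(to_tsvector('english', body), query) AS rank,
       ts_headline('english', body, query) AS headline
FROM pages, to_tsquery('english', %s) AS query
WHERE to_tsvector('english', body) @@ query
ORDER BY rank DESC
LIMIT %s
"""


def ranked_search(conn, query, limit=10):
    """Return (id, rank, headline) rows, most relevant first.

    `conn` is assumed to be a DB-API connection using the `%s`
    parameter style (for example one created by psycopg2).
    """
    cur = conn.cursor()
    try:
        cur.execute(RANKED_SEARCH_SQL, (query, limit))
        return cur.fetchall()
    finally:
        cur.close()
```

Swapping `ts_rank_cd` for `ts_rank` changes only the ranking formula; the rest of the query stays the same.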
Also important are stop words: very common words, in English for example "the" and "and", which are not useful or informative for search. Stop words are included in the PostgreSQL settings, and when you apply the to_tsvector function to a column which holds some text, the result is the special PostgreSQL format that keeps the useful information and excludes stop words, because they are not necessary and search engines do not need them in most cases.

Next, a little bit about PostgreSQL full-text search for those who use Django. There is good news, because Django 1.10 has already added search functionality which relies only on the PostgreSQL search engine, which means it is super fast. If you use an older version of Django in your project, there is a Django ORM extension that was released a couple of years ago and works perfectly [unclear]. Here are some examples of how to apply it to your project. If you already have a model called Page, you can just create a search index, which is a special field, and override the search manager, where you set the configuration and the search fields. After that the search field is updated on each save, update or delete, and the index is updated automatically by PostgreSQL. As a result you can query using the search keyword: you just call search and pass what you are looking for. There are limitations [unclear]: it has a very limited query construction mechanism, which means you can use only two boolean operators, AND and OR, as you see in the second example.

As for Django 1.10 itself, by default it provides the __search lookup for each field, so you can use it without installing anything like what I showed on my previous slide. Or you can take a SearchVector, filter by it and get results. It works because Django converts everything into a text search query and a text search vector, that is, into a SQL query, and PostgreSQL executes it. About the examples: this commit was made only a couple of months ago, so it is super fresh information. There is not any documentation about it yet; the only source of information is the commit itself [unclear], and it is already there.

OK, let's finish with PostgreSQL full-text search. The pros: quick implementation and no extra dependency. The cons: you need to manage indexes manually, because it is not done automatically; it depends on PostgreSQL, so if you use MySQL or another database it will not work; and there is no analytics data [unclear], it is just search and that's all [unclear].
Let's continue with Elasticsearch. Elasticsearch is a distributed, scalable, real-time search and analytics engine. It's very important because it enables you to search, analyze and explore your data. It is based on the Apache Lucene search index, which is among the most advanced and high-performance search libraries [unclear].

Who uses Elasticsearch? GitHub uses Elasticsearch to query code every day [unclear]. Stack Overflow combines full-text search with geolocation, which is sometimes very useful. There are lots of other companies [unclear], and Wikipedia uses it to provide search with highlighted snippets [unclear].

OK, what about the basic idea of Elasticsearch? It's very simple. It's not exactly equal, but it's a parallel so you can understand how it works: where a relational database has databases, tables, rows and columns, Elasticsearch has indices, types, documents and fields.

The most important thing is maybe locks. Elasticsearch uses optimistic concurrency control: when you try to change a document in Elasticsearch, it just updates it and increments the version of this document, and when you search for some document it will use the last version of the document [unclear].

For Python there is a client maintained by Honza Král, and also a DSL with which you can build your queries. If you have worked with Elasticsearch you know how difficult it sometimes is to work with these big JSONs and manipulate them; with the DSL it's pretty easy. Some examples: you can get data; you can create an index and set the number of shards and the number of replicas; you can easily add JSON documents to an index; you can manage stop words, for instance you can redefine the stop words; you can highlight results, which is my favorite feature, and you can select the highlighting yourself, because sometimes it is useful to use your own predefined one and not the default; and you can explain a query and see the weights. I've removed lots of details, but it is the explanation of why this query returns these results and which part each term contributes [unclear]. It's really easy to understand and calculate why a result comes first, if you want, for example, to write your own relevance function or rank function, etc. OK, next.
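The request bodies behind the examples listed above (index settings with shards and replicas, a highlighted query, an explained query) are plain JSON, so they can be sketched as Python dictionaries without any client. The field name `body`, the tag defaults and the helper `build_search_body` are assumptions for illustration; with the Python client these dictionaries would typically be passed as the request body of the index-creation and search calls.

```python
import json

# Settings for a hypothetical index: number of shards and replicas.
INDEX_BODY = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
    }
}


def build_search_body(text, field="body", pre_tag="<b>", post_tag="</b>"):
    """Build a search request: match query, custom highlight tags, explain."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "pre_tags": [pre_tag],    # own tags instead of the default ones
            "post_tags": [post_tag],
            "fields": {field: {}},
        },
        "explain": True,  # ask Elasticsearch how each hit was scored
    }


search_body = build_search_body("quick brown")
print(json.dumps(search_body, indent=2))
```

Keeping the body construction in a small function like this is one way to avoid hand-editing large JSON documents, which is the pain point the talk attributes to working without the DSL.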
Very quickly: Sphinx. I've put in only one slide, with the differences. Sphinx is written in C++, whereas Elasticsearch is written in Java, and it uses, for example, MySQL as a data source [unclear]. Sphinx assumes that you already have a MySQL database and it indexes stuff based on MySQL, but that is not mandatory: you can use Postgres, you can use any provider.

About the Sphinx search server, a little bit about concepts: a DB table is a Sphinx index, rows are Sphinx documents, and columns are Sphinx fields. How to use it: it is similar to PostgreSQL and Elasticsearch, maybe more similar to Elasticsearch in structure. The query language is not SQL, it is SphinxQL, the Sphinx query language, but it is very similar to SQL: you select from your index where your query matches [unclear]. I've put in only the differences; everything else is very similar to Elasticsearch.

And last but not least, the pure-Python Whoosh, which was created by Matt Chaput. His idea was: OK, my clients have no ability to install Java, and that's why he created a full-text search engine in pure Python. It's not super fast, but compared with other pure-Python search engines it is super fast. It has pluggable scoring algorithms and lots of other stuff; by the way, more information you can find in [unclear]. Some small examples: it differs a little from PostgreSQL, for example in stop words. It creates a frozenset, similar to what we saw before with PostgreSQL stop words, but you can select any set of stop words; this is just a sample from the source code. It can also highlight search terms [unclear]. And this is the most interesting: it uses Okapi BM25, which is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is commonly used [unclear], and it was created and developed in the 1970s.
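To make the BM25 ranking function mentioned above more concrete, here is a minimal pure-Python sketch of Okapi BM25. It is not Whoosh's implementation: the whitespace tokenizer, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant), always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the quick brown fox",
    "quick brown foxes leap over lazy dogs in summer",
    "nothing relevant here",
]
scores = bm25_scores("quick brown", docs)
```

In this example the first document scores higher than the second, although both contain each query term once, because BM25 normalizes by document length; the third document scores zero.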
Now, I created a comparison table for you, because when I started to work on my first project with full-text search it was difficult to take in lots of information and work out how to structure it. So I created a table like the following, and you can use it as a reference. You see that most search engines [unclear] have lots of Python clients. Interestingly, PostgreSQL and Elasticsearch both have async clients, [unclear] Sphinx and Whoosh. And I added Django, just as an example, for those who use Django [unclear]. That's why you can find Haystack very useful.

But talking about Haystack: it provides modular search for Django. It's great: one API layer over a couple of different search engines, and it provides Django-ORM-like functionality. But I can't believe it's really useful in practice: in your project you set up search with one engine, and then you decide, OK, tomorrow I will use Elasticsearch, and the day after tomorrow Whoosh, etc. It's strange, because it covers only a very simple set of features, and all the other features are different in each engine [unclear]. It is useful, but not for specific tasks.

I created a small list of pros and cons for you about Haystack. The pros: it is easy to start with, it looks like the Django ORM, it is search-engine independent, and it supports four engines. If we go deeper, the cons: the SearchQuerySet API is very poor, so you can't create very smart queries. It is difficult to manage stop words, because you need to go to the search engine backend and apply them by hand; Haystack itself doesn't care about them. You lose performance, because results have to be converted into a SearchQuerySet [unclear]. It is model-based, while most full-text search engines promote another concept, where a document is built from multiple models, not from one table; that's why it's a little bit difficult. And the worst thing in Haystack, I think, is the hard-coded search engine settings: if you open the source code of Haystack you can find hard-coded Elasticsearch settings, hard-coded settings for Solr, etc., and if you want to change something you need to change the source code [unclear] or something like that.

Let's continue with my table. Next, a very difficult and interesting thing: which index each search engine uses. Elasticsearch uses the Lucene index; you can find more information about it [unclear], it is the default inverted index. As I said before, PostgreSQL uses the Generalized Inverted Index and the Generalized Search Tree. Sphinx has two options: real-time indexes and distributed indexes. By the way, a distributed index is just an index spread across lots of disks [unclear], and real-time indexing is how you can scale your speed. Whoosh is very simple [unclear]: as I said before, its clients could not install Java, so it uses a simple approach.

The last column is interesting. Sometimes when you have a database you need to search without creating an index, and that is possible only in PostgreSQL. I like this feature because you can use it on your existing database with no need to create anything; if you want, you just create an index. With all the other search engines you have to pull the data together from your data sources, put it into the index, wait until it is indexed, and only then can you search, but PostgreSQL can do it in real time.
The next interesting column is ranking and relevance: which scoring algorithm each engine uses. Elasticsearch uses the very common TF/IDF, term frequency and inverse document frequency: how often the terms from your query appear in a document, measured against the whole document database. For PostgreSQL we already talked about the ranking functions. It's interesting that you can pass weights to ts_rank as an input parameter, but you can't influence the formula for how the rank is calculated [unclear]. Sphinx by default uses two factors: the major part is the phrase proximity between the document text and the query, which is called something like longest common subsequence, and the other is the very common BM25. And Whoosh, from my point of view, has the smartest relevance, because it uses an improved BM25 out of the box, and the interesting thing is that you can replace any built-in function with your own.

Stop words [unclear] you can configure in all engines. You can highlight search results in all engines; it's a common feature that you need. Sometimes you want to use synonyms, and you will find that [unclear] supports synonyms only [unclear], but you can do it manually: create a dictionary which associates one word with a set of words, its synonyms.

About scaling, I would say the most scalable is Elasticsearch, because it supports it from scratch. With PostgreSQL you should think about partitioning, table inheritance, etc. The Sphinx search server uses distributed searching, and you can include lots of existing indexes in a distributed index, which is how you can do it manually. Whoosh [unclear].

At the end I would like to present some small tests that I made on a real collection. I have 1 million [unclear], and I ran each search engine against them [unclear], because most of the tests that I found for search engines use something like white noise: they generate random combinations of letters and try to search for them, which doesn't make any sense, and the performance is not interesting on such data. I put the results in one table. For example, PostgreSQL, once I created an index, returns in 4 milliseconds. Elasticsearch, in the latest version which I found [unclear], returns in 9 milliseconds, which is also pretty awesome. Sphinx returns in 6 ms, but I'm not sure that I configured it correctly, so maybe some results are not super useful [unclear]. The only question is what happens if you have more data than fits in PostgreSQL. As my next task I plan to run smarter queries, and I have a database with 300 million [unclear] which I [unclear] put in one table [unclear], and I bet the results will be different there.

And I would like to recommend some books which I found very useful [unclear], and, if you are interested in databases, the book by Stonebraker called the Red Book, Readings in Database Systems.
I created a list of references for you, because it's really difficult to share with you all the details of each index. You can find some very useful links there and read them, of course, when you get stuck or when your customer decides, OK, relevance should work differently [unclear]. You can read about each index and find out in which case which index will be more efficient. The same goes for ranking: ranking is a really difficult part, so I also added links where you can read more about each formula and how it is calculated, etc. Performance will depend on two factors: first, the ranking algorithm, because ranks have to be calculated, and second, indexing, that is, how you build your index.

The slides you can find at [unclear]. Thank you for your attention, and we are hiring.
Questions, please.

We have time for a few questions.

I have a question about operators. In the Django search that you mentioned, which supports only AND and OR, can we combine them? [unclear]

[unclear] Maybe this is a feature for [unclear]. Other questions? Yes, please.

What is a good way to compare the performance of different search engines, not in terms of speed of response but in terms of the quality of ranking?

Yeah, thanks for the question. It's what I do on an everyday basis. I work with applications that are not just full text but also structured data: we try to match users, etc., by their interests, which means that ranking is very important for me. I write tests with ideal queries, with AND or OR [unclear], I prepare the expected results manually, and I compare the actual results with them [unclear]. It depends on your real task. Other questions?

Apart from Haystack, do you have a recommendation for combining Django and Elasticsearch?

From my experience it's [unclear] straightforward. You can create a management task which refreshes the index. Maybe you plan to store your application data in MySQL, and at some point you need to refresh the index and search from Elasticsearch. I found a great solution: you can just use the simple Python client for Elasticsearch, which Honza Král maintains, and put it into a management command for refreshing the index [unclear]. That's it. If you plan to use Haystack, I don't remember the name, but I found an interesting library which overrides some settings from Haystack, so you can change analyzers, configuration, etc., and I recommend using it if you plan to use Haystack. But the problem with Haystack is that it does not support the latest version of Elasticsearch; you will see that it is still on 1.7.5.

Is there a reason why you haven't talked about Solr?

Could you please repeat?

Is there a reason why we haven't talked about Solr as a search engine?

About Solr, I have no experience with it, but [unclear] Solr also uses Lucene. The only difference is that Solr is not so easy to scale: you can, but it's not so easy. If you already use Solr, maybe you should continue, but for a new project I found Elasticsearch more useful [unclear].

