A pythonic full-text search

Formal Metadata

Title
A pythonic full-text search
Subtitle
How to implement full-text search using only Django and PostgreSQL.
Title of Series
Number of Parts
130
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
A full-text search on a website is the best way to make its contents easily accessible to users because it returns better results, and it is in fact what online search engines and social networks use. Implementing full-text search can be complex, and many adopt the strategy of using a dedicated search engine in addition to the database, but in most cases this strategy turns out to be a big problem for architecture and performance. In this talk we'll see a pythonic way to implement full-text search on a website using only Django and PostgreSQL, taking advantage of all the innovations introduced in recent years, and we'll analyse the problems of using additional search engines with examples drawn from my experience (e.g. djangoproject.com or readthedocs.org). Through this talk you can learn how to add a full-text search to your website, if it's based on Django and PostgreSQL, or how to update the search function of your website if you use other search engines.
Transcript: English (auto-generated)
Okay, so Paolo is going to talk to us about how to build a full-text search engine using only Django and Postgres. So I'm going to let you start, all the best.
So hello everyone, I'm very happy to be here with you at EuroPython 2020.
I want to thank all the organizers for making this online edition possible and thank you all for attending from all over the world. If you're asking yourself what a Pythonic full-text search is, I'll show you an example. This is the search function on the Django website.
How many of you have searched for information on it in the past? I think a lot of you. The search function is based only on Postgres and Django itself, and I was the one who built it. So the next question is, who am I?
I'm Paolo Melchiorre and I'm CTO of 20tab. It's a Pythonic software company based in Rome, for which I work remotely. I'm a software engineer and a long-time Python back-end developer. After using Django for a few years, I became a contributor to the project.
And now I want to try to explain a bit more about the title of this talk. A Pythonic Full Text Search. I think you can read the definition of Pythonic by entering import this in the Python interpreter.
These are only the first principles of the design of Python. The most important for me is the third one, and I think it's also the most difficult to follow. Full-text search refers to techniques for searching computer-stored documents in a full-text database.
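For readers following along without the slides, the `import this` step can be reproduced as below. This is a small sketch that assumes "the third one" refers to the third aphorism of the Zen of Python, "Simple is better than complex."; the variable names are illustrative.

```python
import codecs
import contextlib
import io

# Importing `this` prints the whole text, so silence stdout while importing.
with contextlib.redirect_stdout(io.StringIO()):
    import this

# The module stores the text ROT13-encoded in `this.s`.
zen_lines = codecs.decode(this.s, "rot13").splitlines()

# Skip the title line and drop the blank line after it, keeping the aphorisms.
aphorisms = [line for line in zen_lines[1:] if line]

# Assumption: "the third one" is the third aphorism in the list.
third = aphorisms[2]
print(third)
```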
There are a lot of search engines that already provide full-text search, as in this definition. The most popular search engine library is Apache Lucene, open-source software written in Java.
Based on Lucene, there are two very popular search engines that I used in the past in some projects. Solr is the first one, and it's part of the Apache Software Foundation. And Elasticsearch is a product of the Elastic company.
The last big project where I used one of them is Docs Italia. Docs Italia is an Italian government website to find public documents. I worked on this project with my colleagues to improve the search function. Under the hood, Docs Italia is a fork of the open-source project Read the Docs.
So, as the original project, it's a Django-based platform. And it requires a lot of Python packages to access the Elasticsearch instance, asking for results.
Of course, the search function is working very well now, but can we consider this a simple solution? We can say various things about external engines. On the good side, they are very popular.
They have a lot of features, and you can find a lot of online resources about them. On the bad side, you always need a driver to use them from Django. You have to use their specific query language, and it's common to have a synchronization problem. But let's go higher.
Oh, this is embarrassing. Jokes aside, this is something similar to what happens in e-commerce when you find a product in the search results, but it's not available anymore when you click on it. Usually this happens because search results are retrieved from the search engine, which is not yet synchronized with the database.
So why don't we search directly on the database? Maybe a big one, and with an elephant's memory. Like this one. Postgres is a very popular and long-lived database.
It added full-text search years ago with specific data types and special indexes. And since then, many useful new features have been added every year, up to the latest version.
The main concept of full-text search in Postgres is the document. A document is the unit of searching in a full-text system, for example a magazine article, or the union of all its parts: for example, the title, the abstract, and the body text.
But implementing a web search function directly on the database can be a low-level task. To do this, we can use a web framework. Maybe one of the best. Django is a very popular Python web framework.
It added full-text search a few years ago, and it did so in the django.contrib.postgres module. It added specific fields, expressions, and functions. Since then, many useful new features have been added every year, up to the latest version,
which will be released in a few days: Django 3.1. The Django documentation defines document-based search as a full-text search with advanced features.
Weighting, categorization, highlighting, multiple languages. We can implement all of them with Django itself. But to better understand how full-text search in Django works, we are going to see how to perform some queries, from the basic ones to more complex ones
that also perform well with a large amount of information. To do that, we can use the blog models as defined in the Django documentation. Here we have three classes with a few fields in them.
A Blog with a name, an Author with a CharField for the name, and an Entry connected with both of them, with a lot of text in it: the headline, the body text, and other fields. We can perform basic queries on these models using field lookups.
For example, we can search for an author using part of their name. We can get more results by performing a case-insensitive query.
To find more results with accented letters, which are common in Italian and other languages, we can activate the unaccent extension. After that, we can search for an author's name even if we don't know exactly all the accented letters.
To get results even if we don't remember exactly how a name is written, we can activate the trigram extension. Searching for an author, we can get results with similar but not necessarily identical names, as you can see here.
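To give an intuition for what trigram matching does, here is a pure-Python sketch of the similarity measure used by PostgreSQL's `pg_trgm` extension: each word is padded with two leading spaces and one trailing space, split into three-character groups, and two strings are compared by the share of trigrams they have in common. It is a simplification (it splits on whitespace only, whereas `pg_trgm` also ignores non-alphanumeric characters), and the function names are illustrative. In Django the real work is done in the database through the `trigram_similar` lookup or the `TrigramSimilarity` function.

```python
def trigrams(text):
    """Return the set of trigrams of `text`, padding each word like pg_trgm."""
    result = set()
    for word in text.lower().split():
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(first, second):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    a, b = trigrams(first), trigrams(second)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical author names: a near miss scores higher than an unrelated name.
print(similarity("Katy Stevens", "Katie Stephens"))
print(similarity("Katy Stevens", "Terry Gilliam"))
```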
But to use all the above features, we have to add the django.contrib.postgres module to the installed apps. After that, we will also be able to perform a full-text search on a field.
For example, we can search for a word in the plural form and get results in the singular form. To search a text in more than one field, we can use the SearchVector function.
We can define our document as the union of the entry body text and the related blog name. After that, we can search for a word and get more accurate results.
To search using a more complex text, we can use the SearchQuery expression. We can also use common search syntax directly in the query text using the websearch search type. After that, for example, we can search for two words at the same time, potentially getting more results.
To perform a full-text search in a specific language, we can use the SearchConfig expression. We can specify the language in both the document and the query.
After that, we can get more precise results than before in the selected language. To list relevant results first, we can use the SearchRank function. Based on the query text and the document, we can create a rank.
We can order and filter our results using this rank. And we can also show it. To perform a fine-grained full-text search, we can use the SearchVector weight attribute. For example, we can decide that words in the headline are more relevant than words in the body text.
After that, we will see a new rank in our results, even when performing the same search. To highlight results, we can use the SearchHeadline function. We have to specify the field to highlight.
After that, in the results, we will see some HTML tags. All these things can be customized using some attributes. To speed up all these searches and also simplify the query, we can use the SearchVectorField.
We have to manually update our search vector fields before running a query. After that, we will get the same results as before, but way more quickly.
I started using full-text search in Django 1.10, and I frequently searched the Django documentation for information about this new feature. But in the meantime, I started asking myself how the search function in the Django website itself was implemented. I noticed that the search was performed only on English content,
and in some cases there were HTML tags in the results. I then studied the Django website source code and found out that the documentation was generated with Sphinx, and all the data was stored in Postgres.
But the searches were performed in an external search engine. So I proposed to fix that on the Django developers mailing list. A lot of Django developers shared different opinions about the update. The doubts were the amount of work to be done, the equivalence of the search feature, and the increase of the workload on the database.
The benefits on the other side were less maintenance, a lighter setup, and the exclusive use of Django on its own website.
After that, I organized a Django sprint during EuroPython 2017 in Rimini, and some developers joined me to work on this search update. In one sprint day, we created a draft of the Postgres-based full-text search. But we also spent a lot of time trying to set up the Django website locally,
precisely because of the external search engine. In the following months, I wrote an official pull request with a complete version of the full-text search.
I received a lot of suggestions from other developers, and after a lot of comments they merged my pull request. That was the first of several merged pull requests for the same full-text search function. So today, after a few years, the Django website full-text search is multilingual.
It's based only on Postgres. It returns clean results. It's a low-maintenance solution and it's way easier to set up than before, also locally if you want to try to set it up on your PC. As I already said, new full-text search features are released every year in both Postgres and Django,
and I want to add all of them to the Django website search: for example, misspelling support, search suggestions, highlighted results, web search syntax, and search statistics.
After that, I want to show you some useful tips to learn more about full-text search and how to become an expert on it.
As I said before, I think the starting point is reading the Django documentation. The Django documentation on the Django website is full of information about the full-text search feature.
You can read about all the attributes you can use and the functions and expressions you can use in your full-text search. It's well written, and there are a lot of examples, more than the ones I've shown you now.
If you want more details, you have to read the Postgres documentation on the Postgres website. It helps you to understand how it works at a lower level. And for me, it was very useful to understand why Django developers implemented something in a certain way.
After that, you can also read the source code of both projects. You can find them on GitHub, and you can learn something from the source code and find documentation there.
And it helps you to understand more deeply how things work. After that, a strange tip: I suggest you search for questions on Stack Overflow without reading the answers.
Try to answer them by yourself, solving the problem and submitting the answer. It's something that takes you to the next level. Last but not least, you can also study this presentation because it's released with a Creative Commons license.
I'll share the link at the end of this talk and you can reuse it and share with other people. I hope I've been able to show how it's possible to develop a more complete full-text search using less software in the stack.
Doing more with less is the motto of 20tab, and it's our version of Pythonic. You can find out more about our open-source projects and our Pythonic work using these contacts on different social media and also on our website.
To find out more about my work with Python and Django, you can use all my contacts, and using this QR code you can download this presentation from my website. Thanks again from me, and enjoy the next talk in the conference.
Thanks. All right. Thank you for the talk. I think we have two questions. Will you be able to take them? Yeah, thank you. Okay, so here's the first one.
Does the annotate on a search vector involve a massive database overhead to perform the query? So, as I said before... I can't find the right slide.
Hey, I did not understand what you said, I'm sorry. So I'm looking for the slide that answers this question. Okay, got it. As I said before, to speed up the search query and keep the workload on the database very low, we can use the SearchVectorField.
So it stores the whole document we constructed in the search vector, and we can add an index on it. And everything works very fast, like querying a normal column in your database or a field on your model.
So this is the solution to speed up our query. Okay, so the next question is on a similar slide, I think.
So here's the question. When using search vector field, I was unable to populate this field with fields outside the current model. For example, the author's name of a blog if search vector field is inside the blog model. Do you know why and how to include relationship fields?
Yes, I showed this at the beginning. Sorry, this is the one. In this example, I showed exactly this.
So the document I built, the search vector, is the union of the body text of the Entry model and the name of the related object, the Blog.
So, as you can see, we can construct the search vector using both these models. To populate the field, maybe you need something more sophisticated. You can update your SearchVectorField using an update, or with other things such as a routine in the database.
But in theory, you can add more than one field here, and also join many-to-many fields using aggregation.
So everything is possible. Your document can be very big if you want it to be. Okay, so here's the next question. How much more load is there on the Postgres database with this full-text search feature?
Oh, I think this was just asked in other words right now. Yeah, okay. Yes. Actually, there were a lot of people who thought the workload on the Postgres database could be affected by the use of the search.
But in the end, I can say the workload on the database is the same as before, because the SearchVectorField is only a new column, and if you also add an index on it, then when you're searching in this column
you're performing an index scan on the column and everything works very fast. Faster than you thought, faster than I thought before I started using it. And you can check it using the search on the Django documentation website.
So here's the last question. When should Django Postgres search not be used in production? Sorry, can you repeat?
When should the Django Postgres search not be used in production? I didn't understand the first word. Okay, so when should the Django Postgres search not be used in production?
When should it not be used? To answer the question, I think you can use it with no problem in production, because I've used it in a lot of projects and, as I already said, for the last three years the Django documentation search feature has been built using exactly this.
So it runs its queries using Django features, and the search queries run on Postgres. So it has been in production for a long time. Okay, that's awesome. Thank you very much for your talk, it was pretty amazing.
Thank you. Thank you very much. Bye.