The denormalized query engine design pattern

Video in TIB AV-Portal: The denormalized query engine design pattern

Formal Metadata

The denormalized query engine design pattern
Title of Series
Part Number
Number of Parts
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Most web applications need to offer search functionality. Open source tools like Solr and Elasticsearch are a powerful option for building custom search engines… but it turns out they can be used for way more than just search. By treating your search engine as a denormalization layer, you can use it to answer queries that would be too expensive to answer using your core relational database. Questions like “What are the top twenty tags used by my users from Spain?” or “What are the most common times of day for events to start?” or “Which articles contain addresses within 500 miles of Toronto?”. With the denormalized query engine design pattern, modifications to relational data are published to a denormalized schema in Elasticsearch or Solr. Data queries can then be answered using either the relational database or the search engine, depending on the nature of the specific query. The search engine returns database IDs, which are inflated from the database before being displayed to a user - ensuring that users never see stale data even if the search engine is not 100% up to date with the latest changes. This opens up all kinds of new capabilities for slicing, dicing and exploring data. In this talk, I’ll be illustrating this pattern by focusing on Elasticsearch - showing how it can be used with Django to bring new capabilities to your application. I’ll discuss the challenge of keeping data synchronized between a relational database and a search engine, and show examples of features that become much easier to build once you have this denormalization layer in place. Use-cases I explore will include: Finding interesting patterns in your data Building a recommendation engine Advanced geographical search and filtering Reacting to recent user activity on your site Analyzing a new large dataset using Elasticsearch and Kibana.
Software developer Projective plane Bit Mereology Event horizon Product (business) Web 2.0 Software design pattern Process (computing) Computer animation Methodenbank Different (Kate Ryan album) Software design pattern Query language Software framework Right angle Quicksort Local ring
Point (geometry) Sequel Electronic program guide Database Price index Power (physics) Strategy game Different (Kate Ryan album) Synchronization Single-precision floating-point format Software design pattern Query language Physical system Addition Relational database Projective plane Bit Database Flow separation Dean number Subject indexing Computer animation Query language Network topology Computer science Pattern language Quicksort
Building Scaling (geometry) Multiplication sign Range (statistics) 1 (number) Numbering scheme Price index Web 2.0 Query language Elasticity (physics) Damping Search engine (computing) Physical system Vulnerability (computing) Relational database Moment (mathematics) Bit Flickr Entire function Web application Digital photography Process (computing) Website Pattern language Figurate number Row (database) Web page Service (economics) Sequel Calculation Virtual machine Mass Surgery Event horizon Field (computer science) Scalability 2 (number) Number Goodness of fit Software design pattern Scaling (geometry) Key (cryptography) Military base Chemical equation Database Multilateration Cartesian coordinate system Subject indexing Voting Computer animation Search engine (computing) Query language Calculation Table (information)
Trail Multiplication Sequel Multiplication sign Electronic mailing list Database Recurrence relation Digital photography Voting Different (Kate Ryan album) Query language Computer hardware Right angle Quicksort Descriptive statistics Resultant
Point (geometry) Web page Building Divisor Multiplication sign Decision theory Database Price index Plastikkarte 2 (number) Mathematics Different (Kate Ryan album) Query language Elasticity (physics) Diagram Information Physical system Relational database Plastikkarte Database Bit Subject indexing Category of being Digital photography Computer animation Software Query language Blog Sinc function
Web page Shift operator Sequel View (database) Relational database Cellular automaton Projective plane Electronic mailing list Similarity (geometry) Database Staff (military) Login Scalability Event horizon Twitter Wave Computer animation Query language Core dump Website Selectivity (electronic) Figurate number Gamma function Alpha (investment)
Web page Sequel Surgery Event horizon Field (computer science) Wave Blog Term (mathematics) Physical law Formal grammar Computer architecture Scale (map) View (database) Electronic mailing list Menu (computing) Twitter Replication (computing) Subject indexing Type theory Word Event horizon Computer animation Query language Search engine (computing) Right angle Cycle (graph theory) Scheduling (computing) Alpha (investment)
Point (geometry) Addition Open source State of matter Multiplication sign Execution unit Insertion loss Real-time operating system Login Focus (optics) Product (business) Power (physics) 2 (number) Query language Goodness of fit Centralizer and normalizer Different (Kate Ryan album) Elasticity (physics) Scale (map) Programming language Focus (optics) Scaling (geometry) Real number Projective plane Open source Mathematical analysis Analytic set Bit Formal language Computer animation Visualization (computer graphics) Search engine (computing) Query language Interface (computing) Website Problemorientierte Programmiersprache Quicksort Library (computing)
Building Email Computer animation Projective plane Cuboid Bit Social class
Filter <Stochastik> Point (geometry) Email Service (economics) Computer file State of matter Image resolution Source code MIDI Maxima and minima Coma Berenices Number Different (Kate Ryan album) Term (mathematics) Elasticity (physics) Computer worm Information security Email Relational database Interface (computing) Gender Projective plane Electronic mailing list Counting Independence (probability theory) Voting Frequency Computer animation Search engine (computing) Calculation Website Convex hull Pattern language Quicksort Resultant
Point (geometry) Web page Greatest element Multiplication sign 1 (number) Logic synthesis Field (computer science) Number Different (Kate Ryan album) Term (mathematics) Representation (politics) Elasticity (physics) Information security Sanitary sewer Email Sine Block (periodic table) Interface (computing) Counting Independence (probability theory) Bit Subject indexing Word Integrated development environment Query language Video game Problemorientierte Programmiersprache Resultant
Building Code Multiplication sign Database Price index Insertion loss Replication (computing) Mereology Timestamp Mathematics Mechanism design Different (Kate Ryan album) Synchronization Elasticity (physics) Vertex (graph theory) Physical system Scripting language Trail Relational database Moment (mathematics) Sound effect Streaming media Staff (military) Bit Replication (computing) Process (computing) Endliche Modelltheorie Pattern language Quicksort Resultant Row (database) Trail Overhead (computing) Open source Sequel Electronic program guide Streaming media Event horizon Field (computer science) 2 (number) Power (physics) Natural number Subject indexing Queue (abstract data type) Selectivity (electronic) Data structure Stapeldatei Multiplication Code Database Cartesian coordinate system Timestamp System call Subject indexing Word Radius Computer animation Logic Synchronization Table (information) Library (computing)
Group action Existential quantification Code State of matter Multiplication sign 1 (number) Database Price index Mathematics Mechanism design Object (grammar) Query language Elasticity (physics) Endliche Modelltheorie Extension (kinesiology) Descriptive statistics Physical system Relational database Reflection (mathematics) Structural load Moment (mathematics) Electronic mailing list Bit Term (mathematics) Message passing Order (biology) Website Right angle Pattern language Queue (abstract data type) Figurate number Quicksort Resultant Row (database) Web page Dataflow Digital filter Sequel Connectivity (graph theory) Online help Mass Event horizon Field (computer science) 2 (number) Number Revision control Term (mathematics) Subject indexing Queue (abstract data type) Codierung <Programmierung> Message passing Scaling (geometry) Projective plane Code Database Cartesian coordinate system Subject indexing Word Event horizon Search engine (computing) Query language Personal digital assistant Function (mathematics) Object (grammar) Electronic visual display Table (information) Elasticity (physics)
Point (geometry) Filter <Stochastik> Polygon Digital filter Tournament (medieval) Multiplication sign Shape (magazine) Mereology Event horizon Frequency Bit rate Computer configuration Radius Elasticity (physics) Extension (kinesiology) Descriptive statistics Physical system Boolean algebra Execution unit Email Turtle graphics Content (media) Mathematical analysis Similarity (geometry) Word Radius Event horizon Visualization (computer graphics) Search engine (computing) Website Video game Right angle Figurate number Geometry
Point (geometry) Group action Code Multiplication sign Virtual machine Real-time operating system Event horizon 2 (number) Response time (technology) Goodness of fit Bit rate Strategy game Meeting/Interview Different (Kate Ryan album) Term (mathematics) Computer configuration Core dump Query language Queue (abstract data type) Elasticity (physics) Stapeldatei Spacetime Real number Database Timestamp Subject indexing Radius Computer animation Query language Blog Quicksort Physical system Resultant Active contour model
Point (geometry) Preprocessor Mapping Sequel Differential (mechanical device) Multiplication sign Query language
Addition Group action Standard deviation Context awareness Mapping Source code Median Distance Food energy Subject indexing Meeting/Interview Query language Calculation Order (biology) Set (mathematics) Energy level Cuboid Metric system Library (computing)
around the and
and I and
I know and so and so I will start with a little
bit of a career introduction which I promise is is very relevant to the talk I'll be talking about a design pattern that is sort of i've i've Stolcke throughout my career and so I started out and many years ago the long at a tiny little local newspaper in Kansas belongs general world working on a web framework that eventually became Django and about a year after I left that am I moved on to work at Yahoo where was I briefly tinkered with the flick team and worked on various by product development in research projects I did data-journalism the Guardian which was the most fun job at the because it's you get to work with data online journalism deadlines which you know that sort of ties back into the original TIGR like Django as well and then after the garden I did a starter I co-founded lanyard with them my wife Natalie is there in front but rather at 3 years and then sold out to event right so now through various parts of a different and different modulations science and engineer director that point over in some francisco and but the thing I want to talk about today is a design patterns and
design patterns and really the power of design patterns is almost entirely in the name it not nobody really invents design patterns you more sort of look at something that other people are doing you slap a name on it and then it becomes something which people can talk about and the pattern I want describe today is 1 which to my surprise no 1 else seems to slap to name 1 yet so I'm slapping the name on it I want to start getting discussion guide so I think it's a pattern that can help out with a lot of different projects and in low different ways and and the name I picked for
this pattern is the dean normalized query engine which I hope is just not enough that work for people and and essentially this is a way of working of of taking a system built on a relational database and enhancing its using a search index such that you can do a huge amount of additional and interesting things with it and so the key idea is you have your relational databases you single point of trees and you know we we'll go complained of infatuated nice sequel a few years ago I feel like the Federation has worn off a little bit it turns out that 40 years of computer science has made relational databases of particularly like reliable place to keep the data you care about and but anyway you have your data you relational-database and you then indeed normalize the relevant data into a separate search index so you take all of the data you think about the bits that that would make most sense to B and B D normalized to be queriable in different ways you get those into a search index and then you invest an enormous amount of effort in synchronization between the 2 making sure that whenever somebody changes something in that database you get into the search indexes were alive in as quickly as possible and that's the hobbit I'll be talking a little bit more about some strategies for doing this and toward the end of this talk and why would why
would you want to do this well we this is way of addressing some of the weaknesses that made most relational databases have and the first one which I imagine many people well into his later let's bases are really very good accounting things if anyone's ever implemented pagination where you have like 2 100 thousand rows in table you want to do page 1 page 2 page 3 you'll find that the bit way you can select out we you count star against that table is the actually starts to hit the 1st because the database has to scan through all 200 thousand rose just generate that count on as general any time you're doing something that that end users a and a going to be accessing the need to avoid queries that read more than say a few thousand most of the time and date relational databases are insanely fast at Pitt primary key lookups and they're insanely faster range queries against the index but if you've got a query that needs secret needs to invest in these look at 10 thousand rows that's going to add up to 1 2 3 seconds it's going to send you can't deployed in a and in application and uses hitting all the time and this 1 this is current specific my cleverly post-process the Texas but my sequel can only use 1 1 of the index is defined in the database for query that's being executed so you might think you can stock indexes on the age column and on the on and only add and on the the job title column this search cost both those ones but actually will pick 1 of those 2 indexes and we'll use that to speed up the queries so actually the moment you start more complicated lookups the database indexing scheme really start some starts undermining meanwhile search engines have a whole bunch of strengths and firstly most search engines and I'm mainly talking of the meson mass elastic search in this talk with the same is true for solar and other search engines as well as a really good at scaling horizontally like you can take a system like plastic surgery and literally just throw more and machines from all nodes that it will be balance across fact baffle cluster and give you more re-performance more right performance and and just jet general and and and and improvements in your capacity as as you scale that up and they're really good accounting and database is not so it's not a great accounting search engines really really fast at this and the greater aggregations as well shall talk about and in more detail in moment and you you human queries across multiple indexed fields if you have a super complicated query where your your and you're looking at like 4 or 5 different fields in finding those together a search engine will make sure work and their unsurprisingly very good get relevance calculations in scoring because that's kind of the nature of of uh the based and they give you text search you get all of these benefits and you can implement full-text search as well I deliberately left that want to last because some of my interest in search engines goes way beyond just using them to to to search the text that uses event today and I think that this entire design pattern revolves around the fact that search engines have strength beyond just being able to implement a full text search
so I know a well back in time to 2005 to and to to talk about the 1st time I saw this as pattern in the wild and that was that's Africa and the photo Federation sharing site so back in 2005 flicker were having enormous scaling problems because it was the birth of Web 2 . 0 with social needs it was by the accession museum the termite and an enormous quantity of users coming into the and and the service not letting voters and then industry we haven't really figured out how to do this web-scale engineering thing yet sites like Flickr flicker work having to figure this stuff stuff out from from scratch and figure out how to scale up to handle these these giant and giant is this enormous numbers of users in huge amounts of data at the CTR Flickr account Henderson and wrote a book about the school building scalable websites which came out of a decade ago now I think it's still very relevant today because it essentially talks through the lessons they learned at flicker figuring out how to how to scale these things out how to how to build a scalable and a web application the and the technique that they used to flicker that the government the whole and was was a database
sharding and so this is a very common technique and to this day essentially what you do is you say OK we have 1 my sequel database and we couldn't keep up the reduced too many rights coming into this database so what we'll do is we'll splitted into multiple databases and look at different users on different shards so maybe we'll put users 1 through 10 thousand on this database 10 thousand 20 thousand list of database and so on and so forth and this is a very naive and a description of shouting but I have it illustrates the concept so do this like becomes a lot easier because you can as your userbase grows you just add more hardware you add more databases and if you want to do things like show me the most recent photograph uploaded by Simon use their cable Siemens and shot free so that South not select and photos from that wanted by whatever and and that'll give me an answer to my question so that sounds were heard it sounds relatively straightforward but as 1 massive problem which is what you do
with data that have that that lives across multiple different shards a great example of like a flicker and but were very early adopters of the idea of a sort of user provided tags and so you have like at today and you can see all of the voters that Bintang track times with the recurrence time and see that see the most recent uploads and all of that kind of stuff and the obvious problem here is if you've got photos across 5 or 6 or a dozen or hundred different shouted databases need to find all of the rank-and-file all isotactic occasions I you gonna do a query that he its 100 databases of once and try and combine the results of a come back that's not really a sort of practical way of solving this problem so what if like team did is they took advantage of fact that they were now within within Yahoo and they leaned
on a piece of Yahoo until technology called Vespa which was pretty much what elastic searches today that 10 years ago and written in C + + some kind of knowledge to work with and all and so what they did is they said OK we're gonna have a showed a database will have different uses photos will divide divide lim different places that that's all fine and and then we'll have a search index which we load all of the fighters from all the shots into and this was on vast which could scale horizontally and gave them all of those benefits and then we can and when somebody shrugged makes a query against liquor we can make a decision we can say it's it's you and you're looking at you and photos that's going to be a database query if it's you looking for other people's photos of its you looking at every public factor tank raccoons will turn into a search query instead and this is a diagram from Aaron stock out 1 of the engine is like who worked on this time and this turns out to a really really well there's a very there's a key concepts and embedded in
here which is the uh this idea of smart query routing that you're building software that human beings use 1 of the worst things you can do is have a blog where somebody makes it had it and then they refresh the page I get of bit BUY refreshes and that it has isn't shown back to the effect that if a user tags a photo raccoons and then goes and looks the 1st time since it's not that this is about and that that justifiably annoyed by iPod so the the solution flicker and the solutions that used was to say if you're looking at your own data that should be a relational database that because we can't guarantee that search index has got those changes yet generally would be systems it can take a few seconds up to a few minutes for the underlying search index to reflect those changes and if you look the other people's data you're just not can you you'll never going to know if your friends just uploaded tank workings act 5 seconds ago and you can't see yeah that's not something you'll be able to observe so that point it's safe for us to use such would use the search index for public and other people's data in the relational database for data
and so the next and well fast forward a few years to 2010 and when we use as we used solar and another so such urgent to solve a similar kind of problem and landed and so we launched landed in my wife launch landed on our honeymoon and it was supposed to be a side project up growing way way beyond that but the the idea was we we will we were in at Casablanca in Morocco and we have food poisoning and were unable to keep when traveling and then cast it was during Ramadan when none of the restaurants where and so we can get anything to eat anywhere else I see Figure cable that's apartment 2 weeks will look after cells will look at ourselves better and will try and shift side project we've been working on an we made a mistake in building a side project with User Account logins which you should never do because users have expectations and and we also built on top of and so the core feature of land when we launched
was you sign in with your Twitter account and we show you conferences that your Twitter friends the people you follow on Twitter are speaking out for attending and which when you think about enzymes of a database query ends up being a sequel query where you say select staff from events where it's in the future and at least 1 of the attendees is in this list of a thousand ideas like back Twitter and this is the kind of thing that relational data bases are incredibly bad and so we we launched um we got a flurry of an initial activity we ended up on TechCrunch UK unexpectedly andesite just died instantly because the most popular page site was the page with the most expensive database queries and so the way we resolve this was and we started using
solidified searching we thought well maybe this is something we can we define as a search problem
so but we turn this feature from a giant Harry sequel query into a so surgery said hey so I want events that are in future where the attendee Iadies field in so that matches any 1 of these list of up to 2 thousand and ideas cycle back from Twitter and do that and ordered by date we built this thinking this will probably the worst like a few months and so we can figure out a solution and their 5 years later that still have this
page works because it turns out search engines are incredibly good exactly that kind of query and all normally research and expects to be dealing with words because users type words in but actually things like use right these or other terms with things exactly as well and the underlying architecture the search engine is really good at some dividing those up certain uh at and merging together all the documents the index that have and fields that match whatever criteria is your passing it
this is the time to switches talking a little bit about Elastic Search and as I mentioned uh landed with so flicker used but which I looked this morning and says I'm gonna actually an open source state which is kind of kind of interesting that and I don't know if it's gained an enormous amount of community adoption yet and we use so the alignment which is a very fine search engine albeit 1 which was clearly and it was clearly a and it's a sort of products of the time in which it was designed it's very but it's very XML heavy base so that adjacent as a as a as a as a later later detailed elastic search is what you would get if you designed so that today she said I k clearly the wealth speaks Jason and well speaks http and we gonna want things to scale horizontally let's combine all those things together and build a really good search engine and so it's an
open source search engine is built on top of the same Lucene as such library that solar and other projects of use in the past entirely J over HTTP which is great because it means you can talk to it from any programming language that speaks those 2 things which are pretty sure is everything these days and is very and it makes it very easy to use from for it you can use it as a sort of fair central point between different programming languages very easily as well and marketing but all claims to be a real-time search engine in practice it's close enough we're talking a few seconds between you submitting a document into elastic search a document becoming available across the cluster 2 queries against and it has an insanely powerful query language which is a little bit of that and as we go along but essentially there's a a domain-specific language written in J some that lets you construct extremely powerful and complicated search queries and it also has a very strong focus on analytics if you go to the ElasticSearch website you'll see that they mainly talk about it as a analytics engine for things like and low like log analysis and so forth that's where a lot of effort has been an but as I said earlier this is the thing that excites me about search engines is sure full-text search is nice but being able to do these more complicated analytical queries is where they get really interesting and then finally the last it really does mean elastic and elastic search scales horizontally I've actually and in the past i've run a cluster of 4 nodes and just killed 1 of them around them to see what would happen and you can watch the documents rebalancing across the remaining items in real time using various and visualization but inserts it doesn't live up to the to the units 9 so I
talked a little bit about how I'm excited about more than just search on the feature that with specifically excited about is aggregations and the best way to illustrate those is with an example and so
this is a project that I built last year
and it's a little side project embarrassingly since I'm speaking Jan on this is actually the only thing I've ever written flask because I decided to try out and see see what clustered by class work very well it's a very nice way of building this and so what this is is some it's called DC inbox explorer and there's a that's an Stevens University is Stevens and which are called the DC box which
collects the e-mails that and senators and congresspeople send out to their own constituents so this and this research subscribes to all of these different men West and gathers all of those e-mails and puts them in a giant giant M a giant Jason file you can then use to when we search patterns e-mail about what and when and because it was John J. some file it was very easy for me to take this and imported into elastic search the source code for this is
all available available and get help that's not very much of it secured the how work habits works in and get an example of of elastic accepts this is pretty good starting point so basically what this does is it shows you all 57 thousand e-mails and been collected by the project and it lets you search sigh conceptual state security and then I get back 15 thousand results it chosen the number of e-mails and 1 of the top and then down side this is this is my sort of pet favor feature of any piece of the search of where it has these things which sometimes called facets sometimes called filters so this is saying is that without a search term and there are 56 thousand e-mails 36 7 thousand resolution by Republicans 19 thousand some by Democrats 280 was sent by independence you can see the broken down by by represented senators by the states that the and that politician represents an if I then search for say security those numbers updates and I can see that the e-mails that mention the term security and 2005 and versus and by Senator if I click on that announcing e-mails sent by a senator mention security might consider of those 810 Democrats and 16 into Republicans I can now see that the state that is most concerned about the state his services can most trusted to practically any that main and and I get this and so I can keep on drilling down Sarah set e-mails mention security from main by I've got gender and I can look at by by male or female sentences I can see the actual sense themselves so Susan Collins is the most prolific e-mail at from the state of mind on the subject of security but the key thing to a list I'm illustrating here is that when you've got an a search engine like Elastic Search these kinds of calculations these and aggregate counts become essentially for the future that they're very that they become very different confession would be the performance is super is a super fast so you can build this kind of highly interactive interface very easily on top of that underlying engine and if you go on sites like Amazon and booking . com insightful they'll make very extensive use of this faceted navigation patterns and if you try to do this with relational database you're likely to run into some pretty nasty performs problems pretty quickly and what I can also show you
is under the hood I'll show you a little bit of what ElasticSearch itself looks like to work with this tool called sense which is kind of an IDE for talking to Elastic Search and as I mentioned last searches Jason and and and they should be so I'm going do get against the slashed e-mails fashions score search and point this returns essentially all of the e-mails is paginated tend tend to a page and you can see at the very top it says that there are uh 58 that 56 thousand of them this is what the male looks like it's got all of this data that I ingested when I index the documents as a then conceptually actually I want to run search so this is illustrating the Elastic Search and domain-specific language and saying and then you get against females life search I want be and it's a query and i want to match the term security in the body field and i'll run that and that gives me back 15 thousand e-mails and those the ones that that match this to query there's a slight auditive Elastic search this is an HTTP GET which includes the body as if it was an HTTP POST I had no idea this was even possible but apparently this and them ElasticSearch use it for everything so that there's something islands and playing around with us and let's go a step further and say OK we can search for all of the most mention security but I want to also get numbers on broken down by Walter height which is centered synthesis and representative and by party and if I run this search here I get back all of these different search results and then at the bottom I get this aggregations block words as well tonight representative has 12 a half thousand senator has 2 and half thousand Republicans 10 thousand Democrat 55 thousand independent 123 these numbers we saw in the interface earlier this illustrates how it's just a little bit of extra J. on you and your query and the query time for this was full MS and which I think pretty good know and I'll show 1 last example just to illustrate something that I think is unique to Elastic Search which is the last of such lets you take these aggregations and makes them so here what I'm doing is I'm saying I want to aggregate counts by the party and then within that party i'd like to the counts by the role so if I run then this I get my results where you can see that the Republicans have sent that isn't as e-mails all those 31 thousand some by representatives 5 thousands and assesses the Democrats 19 thousand e-mails of which 15 thousand representatives in full thousand sensors then the independence of bottom and is apparently won the e-mails sent by an independent representative and RESTful sent by senators I'm intrigued let's have a look
so independent here the ch and they represents and sure enough that were represented glorious Sandelands um has sent a single e-mail apologizing spam which can considering various if everyone e-mails a little bit surprising and
so that we get them so an but
but sort of some of the power that you get once you start having an an engine like Elastic Search into your stack and I said earlier I talk about the difficult problem which is the set the synchronization structure between your relational database and electrode a whole bunch of different ways of doing this the 3 that I've had the most like with these 3 and so also through these in a bit more detail but to and to sort of 2 to and to repeat the problem we're trying to solve is you've got users who are making changes in a relational database that updating things adding things you want us to be reflected in your search index as quickly as possible because any delay could result in like a strange behavior the users don't understand and once you there's a way that is both performance and efficient and doesn't cost too much overhead on the on the various parts of your so the simplest way to do this and is to basically keep it in the database and it's perhaps a last touched or a changed timestamp India and India L on natural rows of your database which gets updated any time somebody changes that right this is a very common pattern and His what it might look like an agenda or am I got the last touched column it's daytime field DB underscore index equals true it's important sticking index on this because you can be pulling this from a foreign job like once a minute so it needs to be able to return results quickly and and the Senator defaulted to now and that's fine and then if once you've got a set up the simplest thing do is just have a chronic runs once a minute select staff from an from a table where the at which a stays within the last minute and then we indexes those items and the nice thing about having this is a time stamp is that an indexer can keep track of the last time last time that it's Paul last thing so so if you're index doesn't 5 minutes Whitman's again it can capture confined minutes with the changes or wants and there is a subtlety to this which is that quite often when you're building a search index from relational database the changes that happen to other tables which still should trigger an update so money at we have a concept called a dying to guide is AM somebody might create my guides to jobs the conferences in Europe and then have conferences thereby events into that guide the problem at and we try includes the name of that died in searches so if you search for jobs script Europe as long as an event has been added to the job in Europe guided should show up what that means there is any time somebody changes the died we need to we index all of the events that led to that guy because there's a a bit of dependent data this is now been updated using loss such mechanism that's pretty easy you can say anything to guide is adjusted to guide the Conference is that all find overcomes attached and update their last touched dates to the current timestamp as well and for the most part that works fantastically well this means that you get a soda cascading cascading changes happening with the new database which research index command and try and catch up on so sigh more-sophisticated we're doing this is with a Q & as a jet you can have a application logic says any time somebody updates an event like that of or updated document like that document ID into a queue and then have something at the other end of the queue which is consuming from it we index and those documents a really nice side effect of this is that you can have duping all you get DTP would be on the previous mechanisms well you can write your indexes so that it says OK there's been a flurry of activity around this particular document but I'm gonna batch those up and a few seconds later on to 1 in Wong indexing call to recreate that in the last search and so I built this a few times above the top of radius which word great is a little Rocha at and I built this song I'm adding them right we use can before this in fact we we have a slightly more sophisticated system which I'll dive into the moment and the thing about users if you have a persistent Q you get at replayability as well so you can we play all of the indexing changes from the past 5 minutes this is the most sophisticated way that I think the salt this is what we do right and which is turned into the database replication log itself so my sequel has very robust replication it's very easy to have a my sequel leader databases and then sort of multiple replicas that that we apply all of the changes made to the leader and it turns out to be replication stream is this sort of slightly weighted binary protocol but if you know what you're doing you can tap tonight yourself you can write your own code that reacts to changes we made to the database there's a fantastic and open source Python library that we use for this event like call them Python my sequel replication so about why we built
a system called by Libyan and dilithium is essentially a way of listening to so I write this as a way of listening to those some database changes and using them to trigger actions around event the Eventbrite system so what is you have your master my sequel database and with all of the rights going to it you have a replica my sequel database that's that's replicating off of that and then dilithium listens to that replicas so it's replicating from a replica to figure out what changes going on it sees things like an event wrote 57 has been updated and as attending rows sounds as have these fields change it takes the and it takes that it that flow of data and turns into a more sort of fun more in into turns into what we call interesting moments and because we can't use the word events and event right because it's already taken by the bombing the main objects and so those moments that come through at that what we we translate things like event 57 has been updated event 23 has been created an order 37 has been placed on those we then might inter Kafka which is a very and robust high-performance message queue that linked and that a few years ago and the are search indexes a 1 of many different components that can listen to that can kick you decide what decide when they need to we next things so it's a pretty complicated flow data once you stick it in the diagram but on essentially what this means is any time any piece of code eventbrite updates 1 of the rows are events Table B and died will pick that will turn that into account message art index encode will listen that say 0 event 57 has been updated it'll then query the database figure out the current details of that event and then like those changes into elastic search so the end result is we have something which scales extremely well and which can be run on many different genes it once and gives us a very robust and very robust path from initial database changed updates inelastic search index the so I
don't you tips and tricks that I wanted to and dive into and just live off and bits and pieces i've picked up that have helped with implementing this and oval pattern and the 1st 1 is that 1 that can really help avoid solving still date that stale and data to users amounts to do everything with your search engine in terms of object these as opposed to the war data itself generally with with your search index you'll be writing a lot of data into it you know it needs to know the titles of things the descriptions of things any feels you might want to search by voice is a temptation to hit the search index get that data back and then use that to construct uh and objects you would prevent actor uses the moment you do that by yourself yourself up some for some really nasty nasty latency risks because as we so as I said earlier there's gonna be a 3 to 5 maybe 10 second delay between changes new database and change your index you really don't want to be showing that's their later to users so the trick here is very simple and when you when searches you would you ask back from search engine that the ideas of the underlying records so you get you wanna search you get back a list of say 20 into dry these you can then hit the database directly to inflate flows into actual finished objects and and it sounds like me that the downside of this is your adding additional load to database again uses the databases are insanely quick a primary key lookups anytime you know you're doing primary key that's all look up against men that that's the Newton really fast so would Django you can use changes in bulk mass and help method and make extensive use of prefetch related as well which again is a very fast way of retrieving data and you can set set up so that users will never see stale data because that's stale data even if the data was telling indexed by the time is pulled from database it's going to be the most recent version of things and a related concept this is what to do if this is if if you're doing this if you're pulling things directly from a database What if something's been deleted would you do if your search engine gives you back ID 57 and then we can fetching the database ID 57 has been deleted in the time it took future that that that comfort and the way we handle this in the past is an essential to have a self-repairing mechanisms because the cruiser database can notice when there's might that doesn't that that that's missing and stick it on a salary Q or deleted or run or some other mechanism so the search index knows too then remove that document from the index entirely on the downside of this is somebody might ask for pages 10 results on and 1 of those results is missing so you you end up getting that 9 results in quietly filing that can't wait to be deleted the my hunch is that no 1 will ever noticed this I don't think people go around counting the number of results they get on the page so it's probably a no 1 last trick on which again tries into this idea but this is something I've I've been done using a project to work at moment and I'm calling the accurate filter trick essentially this is a way of and solve it and this additional way of solving for this latency between your search index between your database and your search index so imagine if you will you build a system where users can have you you users can save events fall sooner than they want to go to they had a save button answers is saved to their account in some way and I would like to be able to pull work to to answer queries about and what the events this user's site by having search index is like an obsession that's like combine it with all of these other benefits I can let users search with its search for text within the events they save I can be filtered by geography there's all sorts of useful things I can do this and that's easier to invite you have a field on your event are document and last search we containing the Iadies all the users who say bad that many can so by investing in I search for events where 1 of the saved by users and values is the user ID I'm dealing with as just 1 obvious problem of this if the user saves events and then goes looks for a list of side events within a few seconds and it's not the enactment isn't shown to them then obviously something is broken here that there there this is the latency problem we've been fighting since since the beginning of the talk so what we can do is you can say OK anytime I'm running query the 1st thing I'm going to do is hit my relational database to figure out what are the events that this user has saved in the last x minutes let's say the last 5 minutes so this is guaranteed to give me an accurate model of the user's recent activity and it'll give me back safe for 5 and ideas of documents that that we know that the user decide once you've got less these are the ones that saved in last 5 minutes you can construct an elastic search query where you say give me back any events where either be user is listed in list of users who said this event all the event itself was is 1 of these 5 but we know that it's safe recently and as a search query this'll 1 crazy fast give you all of those benefits but this is guaranteed to be exactly up to date with the activity of the users and making this you save save my users 1 obvious and application of this there are a whole bunch of other things where if you want to get a precise up to the dates and reflection of the state your system you can use tricks like best to to pull those out of plastic search so a few more use cases that i've
been applied as in Figure of applied Elastic Search to 1 we use event right is the recommendations because it turns out recommending events to a user is essentially just another such problems and you can there are a bunch of ways you can do this you can say find events where 1 of my friends who say that again this is the way we did our calendar affect the lanyard earlier and find because as they find events that is similar to the last tenements life site a very and straightforward way doing that is to look at the last tournament saved by user collect together the text from the title description of all of those events into a giant blob of words and then just search for those words with a Boolean all clause which is enough for elastic such as relevance to kick in and it'll give you back other events that are that are similar in textual contents the events the user has saved as well and search engines are really good at relevance scoring and boosting so you can't you fine-tune the stuff very out to to to a huge huge extent using the tools that are built into to the search engine but nothing us such a great geographic search and it's got built in support for geo you can answer latitude Chanology points to documents and then you can do things like all of that all documents within 5 kilometers of this rate radius point you gain centered polygonal say here's the shape of Canada give me everything that falls into that that whole gun shape and again a system which if you're not using PostgreSQL be quite difficult to do the relational database and but more importantly you can combine these with all of the other such and filters so far the recommendation system built a massive search I can say recommend the events similar to these events that fall within this geographic area and and and then further combine that with other options as well and that
and search engine general great visualizations you so this earlier with the DC inbox explore I've got this little and part of the talk we actually generated from get from just another 1 of these aggregations research against lasted search I can say give me back counts per month for this time period on simple is constant I can turn those into into a bar chart and if you look at this and the way elastic search is useful log analysis people do some really exciting visualizations on top of the wall data that's been collected by these aggregations more like to
think about this is it's kind of like having a real-time MapReduce engine you have and use MapReduce on something like a date a very powerful way of running a query across many different and many from machines getting results but it's generally something you want to run as a batch job my takes 30 seconds to a minute for it to return results Elastic Search under the hood is doing pretty much exactly that it's got you give to search it will spread out across the core of the nodes in your cluster combined results together and use that to return return and return documents to you but is designed to work in real time you're getting like and response times measured within milliseconds which means you can expose these directly to users so in
summary you should do not always related to a query engine it's definitely a good idea lets you build all sorts of things you can build the for and elastic search it and that's a pretty good option for that's and I've left lots of time for questions and thank you very much at the whole segment of the year I think you know you were talking about how you use a q through duplicate repeated indexer actions can you Spiegel more to the short so the thing you want to avoid is 500 feet like can 500 people interact with it's something new database and then you send 500 updates to ElasticSearch index in in a giant flowers and so really this is an is indeed you may yet GP is rate limiting and it's so it's being able to get smart about this and say OK there were 500 updates within a short space of time but actually convert that to a single combined update the index and and when have built this on top of the queue is your with the code that listens to the queue as well so actually what were they billed this and against radius was to say when an indexing request comes in if I have if the index hasn't done a thing in if in in 5 seconds just index that thing straight away if idiolects has won the past 5 seconds that suggest that as a lot of activity going on then hold out for a couple seconds to see other updates coming for the same events ID and if they do bundle those together and some metal want I think actually be and the DG being becomes a lot easier if you use the last modified timestamp because that's just a contour point 1 wants to minutes and if there were 500 updates to events in the last minute you'll still only we index at once then of I was wondering are you the only person who is like put this into terms and there's and you know about as far as I know I am which I find really surprising because I've seen lots of places doing exactly this understanding was the name on for and so if somebody does have also name that and I don't hear about it we as as far as I can tell I'm the only person who's said let's let's give this a name and and start discussing this as a as a general strategy so there's no books no thoughts or books but it's a books deftly not yet and I I should write a blog entry here the yes please write a blog entry that would be great and so
you talked a lot about getting stuff out of learning uh look specifically for your DC mailbox is people do you have to do a lot of preprocessing of the data before it goes you when you're actually structuring iron documents and stuff like that is is just on its own yet so
1 concept I didn't really talk about is and that's the thing ElasticSearch mapping which is basically the same thing as a sequel schema and you don't have to use mappings you can just start lasting Jason doctrines and it'll it'll work but if you actually want to be able to differentiate dates times from geographic points and so forth you need to use that so is the mapping for B and DC inbox uh thing that
distance there's a lot of data here so it's got caucuses in Congress members and seized and these insightful the source code for this is all available
on get helps you can see so I yeah and by you create actually used a library called ElasticSearch DSL for this which is a Python library that tries to be a slightly higher level way of working with elastic search and you can also compose Jason globin and post that ElasticSearch itself the yes this is the 1st step is going to be designing a mapping few data the actual indexing is pretty trivial once you've got map in place because you really are just constructing Jason documents and posting them back up the server and get a mapping design is is quite important additional talk of legally Bob saw a new year's talking about how outperforming this is due to find documents be somewhere criteria so is equally performance from 1 to do something like energy so you like everage receipts deviations on something inside the document absolutely this is that the strength of aggregation is is that that insanely fast the stuff I mean I it's as you get more complicated the aggregations the performance can be it can start to add up but honestly that the most complex queries I come up with it in the order of 100 ms as opposed to 10 ms so generally the performance is is really good yeah there a lot about aggregation actions so far what are called bucket aggregations we divide documents into different named buckets there also an aggregation metrics like that can calculate things like standard deviations and sons and medians and that even like induced spatial turned the aggregates that will conflict a bounding box around all the documents there's a whole bunch of additional context as you get around them summary of these the names and the few you're end it if you were