
The denormalized query engine design pattern


Formal Metadata

Title: The denormalized query engine design pattern
Part Number: 47 of 48
License: CC Attribution - ShareAlike 3.0 Unported. You are free to use, adapt, copy, distribute and transmit the work or content, in adapted or unchanged form, for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and any adapted form is shared under the conditions of this same license.

Abstract
Most web applications need to offer search functionality. Open source tools like Solr and Elasticsearch are a powerful option for building custom search engines… but it turns out they can be used for way more than just search. By treating your search engine as a denormalization layer, you can use it to answer queries that would be too expensive to answer using your core relational database. Questions like “What are the top twenty tags used by my users from Spain?” or “What are the most common times of day for events to start?” or “Which articles contain addresses within 500 miles of Toronto?”. With the denormalized query engine design pattern, modifications to relational data are published to a denormalized schema in Elasticsearch or Solr. Data queries can then be answered using either the relational database or the search engine, depending on the nature of the specific query. The search engine returns database IDs, which are inflated from the database before being displayed to a user - ensuring that users never see stale data even if the search engine is not 100% up to date with the latest changes. This opens up all kinds of new capabilities for slicing, dicing and exploring data. In this talk, I’ll be illustrating this pattern by focusing on Elasticsearch - showing how it can be used with Django to bring new capabilities to your application. I’ll discuss the challenge of keeping data synchronized between a relational database and a search engine, and show examples of features that become much easier to build once you have this denormalization layer in place. Use-cases I explore will include:

- Finding interesting patterns in your data
- Building a recommendation engine
- Advanced geographical search and filtering
- Reacting to recent user activity on your site
- Analyzing a new large dataset using Elasticsearch and Kibana
Transcript (English, auto-generated)
So I will start with a little bit of a career introduction, which I promise is very relevant to the talk.
I'll be talking about a design pattern that I've sort of stalked throughout my career. So I started out many years ago at a tiny little local newspaper in Kansas called the Lawrence Journal-World, working on a web framework that eventually became Django about a year after I left. I moved on to work at Yahoo, where I briefly tinkered with the Flickr team and then worked
on various product development and research projects. I did data journalism at the Guardian, which was the most fun job ever, because you get to work with data on journalism deadlines, which sort of ties back into the original tagline for Django as well. Then after the Guardian, I did a startup. I co-founded Lanyrd with my wife Natalie, who's there in the front, ran that for three
years and then sold that to Eventbrite. So now, through a varied path of different machinations, I'm an engineering director at Eventbrite over in San Francisco. But the thing I want to talk about today is a design pattern.
Design patterns, really the power of design patterns is almost entirely in the name. Nobody really invents design patterns. You more sort of look at something that other people are doing and you slap a name on it, and then it becomes something which people can talk about. The pattern I want to describe today is one which, to my surprise, no one else seems to have slapped a name on yet, so I'm slapping the name on it and I want to start getting
discussions going, because I think it's a pattern that can help out with a lot of different projects in a lot of different ways. The name I picked for this pattern is the denormalized query engine, which I hope is just snappy enough that it'll work for people. Essentially, this is a way of taking a system built on a relational database and
enhancing it using a search index such that you can do a huge amount of additional interesting things with it. So the key idea is you have your relational database as your single point of truth, and we all got kind of infatuated with NoSQL a few years ago. I feel like that infatuation has worn off a little bit.
It turns out that 40 years of computer science has made relational databases a particularly reliable place to keep the data that you care about. But anyway, you have your data in your relational database. You then denormalize the relevant data into a separate search index. So you take all of that data, you think about the bits that would make most sense
to be denormalized, to be queryable in different ways, and you get those into a search index. And then you invest an enormous amount of effort in synchronization between the two, making sure that whenever somebody changes something in that database, you get that into the search index as quickly as possible. And that's the hard bit, and I'll be talking a little bit more about some strategies for doing that towards the end of this talk.
But why would you want to do this? Well, really this is a way of addressing some of the weaknesses that most relational databases have. The first one, which I'd imagine many people have run into, is relational databases aren't really very good at counting things. If anyone's ever implemented pagination where you have like 200,000 rows in
a table and you want to do page one, page two, page three, you'll find that the bit where you count star against that table is the bit that actually starts to hurt you first because the database has to scan through all 200,000 rows just to generate that count. As a general rule, any time you're doing something that end users
are going to be accessing, you need to avoid queries that read more than, say, a few thousand rows at a time. Relational databases are insanely fast at primary key lookups, and they're insanely fast at range queries against an index. But if you've got a query that needs to look at 10,000 rows,
that's gonna add up to one, two, three seconds, and it's gonna be something you can't deploy in an application that end users are hitting all of the time. And there's one that is currently specific to MySQL. I believe Postgres fixed this one. But MySQL can only use one of the indexes defined in the database for a query that's being executed.
So you might think that you can slap indexes on the age column and on the job title column, do searches across both of those at once. But actually, it'll pick one of those two indexes, and it'll use that to speed up your query. So actually, the moment you start doing more complicated lookups, the database indexing scheme really starts undermining you.
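To make that concrete, here is a rough Django ORM sketch (the model and column names are hypothetical):

```python
from django.db import models

class Person(models.Model):  # hypothetical model for illustration
    age = models.IntegerField(db_index=True)
    job_title = models.CharField(max_length=100, db_index=True)

# Even with separate indexes on both columns, MySQL will typically pick
# just one of them to execute this combined lookup:
Person.objects.filter(age=35, job_title="Engineer")
```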
Meanwhile, search engines have a whole bunch of strengths. Firstly, modern search engines, and I'll be mainly talking about Elasticsearch in this talk, but the same is true for Solr and other search engines as well, are really good at scaling horizontally. You can take a system like Elasticsearch and
literally just throw more machines, throw more nodes at it, and it will rebalance across that full cluster and give you more read performance, more write performance, and just general improvements in your capacity as you scale that up. And they're really good at counting. Database is not so great at counting. Search engines are really, really fast at this.
And they're great at aggregations as well, which I'll talk about in more detail in a moment. You can run queries across multiple indexed fields. So if you have a super complicated query where you're looking at four or five different fields and finding those together, a search engine will make short work of that.
They're unsurprisingly very good at relevance calculations and scoring, because that's kind of the nature of the beast. And they give you text search. You get all of these benefits and you can implement full text search as well. I deliberately left that one to last because my interest in search engines goes way beyond just using them to search for text that users have entered in.
I think that this entire design pattern revolves around the fact that search engines have strengths beyond just being able to implement a full text search. So, I'm gonna roll back in time to 2005 to talk about the first time I saw this pattern in the wild. And that was at Flickr, the photo sharing site.
So back in 2005, Flickr were having enormous scaling problems, because it was the birth of Web 2.0, it was social, I don't think social media was even a term back then. They had an enormous quantity of users coming in, joining the service and uploading photos. And as an industry, we hadn't really figured out how to do
this web scale engineering thing yet. Sites like Flickr were having to figure this stuff out from scratch and figure out how to scale up to handle these giant numbers, these enormous numbers of users and huge amounts of data. The CTO at Flickr, Cal Henderson, wrote a book about this called Building Scalable Websites, which came out over a decade ago now.
And I think it's still very relevant today, because it essentially talks through the lessons they learned at Flickr, figuring out how to scale these things up, how to build a scalable web application. And the technique that they used at Flickr that got them out of their hole was database sharding.
So this is a very common technique to this day. Essentially what you do is you say, okay, we had one MySQL database and we couldn't keep up. There were just too many writes coming into this database. So what we'll do is we'll split it into multiple databases and we'll put different users on different shards. So maybe we'll put users one through 10,000 on this database,
10,000 through 20,000 on that database, and so on and so forth. And this is a very naive description of sharding, but I hope it illustrates the concept. So if you do this, life becomes a lot easier, because as your user base grows, you just add more hardware, you add more databases. And if you want to do things like show me the most recent
photographs uploaded by Simon, you say, okay, well, Simon's on shard three, so I'll go to shard three and I'll select photos from there ordered by whatever. And that'll give me an answer to my question. So that sounds, well, it sounds relatively straightforward, but there's one massive problem, which is what you do with data that lives across multiple different shards.
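A toy sketch of that per-user shard routing (the ranges and names are hypothetical):

```python
# Each user's rows live on exactly one shard, so per-user queries
# only ever need to touch one database.
SHARD_RANGES = [
    (1, 10_000, "shard-1"),
    (10_000, 20_000, "shard-2"),
    # ...more shards as the user base grows
]

def shard_for_user(user_id):
    for low, high, shard in SHARD_RANGES:
        if low <= user_id < high:
            return shard
    raise LookupError(f"no shard configured for user {user_id}")

# "Most recent photos uploaded by Simon" hits a single shard:
# connect(shard_for_user(simon_id)), then SELECT ... ORDER BY uploaded DESC
```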
A great example at Flickr, Flickr were very early adopters of the idea of sort of user provided tags. And so you can go to Flickr today and you can see all of the photos that have been tagged with the raccoons tag and see the most recent uploads and all of that kind of stuff. And the obvious problem here is if you've got photos across
five or six or a dozen or a hundred different sharded databases, and you need to find all of the photos tagged raccoons, are you gonna do a query that hits a hundred databases at once and then try and combine the results as they come back? That's not really a sort of practical way of solving this problem. So what the Flickr team did is they took advantage of the
fact that they were now within Yahoo, and they leaned on a piece of Yahoo internal technology called Vespa, which was pretty much what Elasticsearch is today, but 10 years ago and written in C++ and kind of gnarly to work with. And so what they did is they said, OK, we're going to have our sharded databases. We'll have different users' photos living in different places. That's all fine. And then we'll have a search index, which we load all of the photos from all of the shards into. And this was on Vespa, which could scale horizontally and gave them all of those benefits. And then when somebody makes a query against Flickr, we can make a decision. We can say, if it's you and you're looking at your own
photos, that's going to be a database query. If it's you looking at other people's photos or if it's you looking at every public photo tagged raccoons, we'll turn that into a search query instead. This is a diagram from Aaron Straup Cope, one of the engineers at Flickr who worked on this at the time. And this turns out to work really, really well.
There's a key concept embedded in here, which is this idea of smart query routing. Because if you're building software that human beings use, one of the worst things you can do is have a bug where somebody makes an edit, and then they refresh the page or the UI refreshes, and their edit isn't shown back to them. If a user tags a photo raccoons and then goes and
looks at their photos tagged raccoons and it's not in there, this is a bug. And they're justifiably annoyed by it. So the solution Flickr used was to say, if you're looking at your own data, that should be a relational database hit, because we can't guarantee that the search index has got those changes yet. Generally with these systems, it can take a few seconds up
to a few minutes for the underlying search index to reflect those changes. If you're looking at other people's data, you're never going to know if your friends just uploaded photo tagged raccoons five seconds ago and you can't see it yet. That's not something you'll be able to observe. So at that point, it's safe for us to use the search
index for public and other people's data and the relational database for our own data. So fast forward a few years to 2010, when we used Solr, another open source search engine, to solve a similar kind of problem at Lanyrd.
So my wife and I launched Lanyrd on our honeymoon. It was supposed to be a side project, and it ended up growing way, way beyond that. But the idea was we were in Casablanca in Morocco, and we had food poisoning, and were unable to keep on traveling. And in Casablanca it was during Ramadan, when none of
the restaurants were open, so we couldn't get anything to eat anywhere else either. So we figured, OK, we'll rent an apartment for two weeks, we'll look after ourselves, get ourselves better, and we'll try and ship this side project that we'd been working on. We made the mistake of building a side project with user account logins, which you should never do, because users have expectations.
And we also built it on top of Twitter. And so the core feature of Lanyrd when we launched was you sign in with your Twitter account, and we show you conferences that your Twitter friends, the people you follow on Twitter, are speaking at or attending. Which, when you think about that in terms of a database query, ends up being a SQL query where you say, select star from events where it's in the future, and at least one
of the attendees is in this list of 1,000 IDs that I've pulled back from Twitter. This is the kind of thing that relational databases are incredibly bad at. And so we launched, we got a flurry of initial activity, we ended up on TechCrunch UK unexpectedly, and the site just died instantly. Because the most popular page on our site was the page
with the most expensive database query. And so the way we resolved this was we'd started using Solr to provide search. And we thought, well, maybe this is something we can redefine as a search problem. So we turned this feature from a giant, hairy SQL query into a Solr search where we said, hey, Solr, I want events that are in the future where the attendee IDs
field in Solr matches any one of this list of up to 2,000 IDs that I've pulled back from Twitter. And do that and order by date. We built this thinking, this will probably do us for a few months until we can figure out a better solution. And now five years later, that's still how this page works, because it turns out search engines are incredibly
good at exactly that kind of query. Normally, a search engine expects to be dealing with words, because users type words in. But actually, things like user IDs or other terms work exactly as well. And the underlying architecture of the search engine is really good at dividing those up, at merging together
all of the documents in the index that have fields that match whatever criteria it is you're passing in. So this is a good time to switch over to talking a little bit about Elasticsearch. As I mentioned, Flickr used Vespa, and I looked this morning, and it turns out Yahoo have actually open-sourced it, which is kind of interesting.
But I don't know if it's gained an enormous amount of community adoption yet. We used Solr at Lanyrd, which is a very fine search engine, albeit one which was clearly a product of the time in which it was designed. It's very XML heavy.
They sort of added JSON as a later detail. Elasticsearch is what you would get if you designed Solr today. If you said, OK, clearly the world speaks JSON, and the world speaks HTTP, and we're going to want things to scale horizontally. Let's combine all those things together and build a really good search engine. And so it's an open-source search engine.
It's built on top of the same Lucene search library that Solr and other projects have used in the past. The interface is entirely JSON over HTTP, which is great because it means you can talk to it from any programming language that speaks those two things, which I'm pretty sure is everything these days. And it makes it very easy to use. You can use it as a sort of central point between different
programming languages very easily as well. The marketing blurb claims it's a real-time search engine. In practice, it's close enough. We're talking a few seconds between you submitting a document into Elasticsearch and that document becoming available across the cluster for you to run queries against. It has an insanely powerful query language. I'll show you a little bit of that as we go along.
But essentially, there's a domain-specific language written in JSON that lets you construct extremely powerful and complicated search queries. And it also has a very strong focus on analytics. If you go to the Elasticsearch website, you'll see that they mainly talk about it as an analytics engine for things like log analysis and so forth.
And that's where a lot of their focus has been. And as I said earlier, this is the thing that excites me about search engines is, sure, full-text search is nice. But being able to do these more complicated analytical queries is where they get really interesting. And then finally, the elastic really does mean elastic. Elasticsearch scales horizontally. In the past, I've run a cluster of four nodes and just
killed one of them at random to see what would happen. And you can watch the documents rebalancing across the remaining nodes in real time using various visualization plugins. So it does live up to the elastic in its name. So I've talked a little bit about how I'm excited about more than just search.
The feature that I'm specifically excited about is aggregations. And the best way to illustrate those is with an example. So this is a project that I built last year. It's a little side project. Embarrassingly, since I'm speaking at DjangoCon, this is actually the only thing I've ever written in Flask, because I decided to try it out and see what Flask looked like.
Flask worked very well. It was a very nice way of building this. So what this is, it's called DC Inbox Explorer. And there is a project at the Stevens Institute of Technology called DC Inbox, which collects the emails that senators and
congresspeople send out to their constituents. So this researcher subscribes to all of these different mailing lists and gathers all of those emails and puts them in a giant JSON file that you can then use to run research about who's emailing about what and when. And because it was a giant JSON file, it was very easy
for me to take this and import it into Elasticsearch. The source code for this is all available on GitHub, and there's not very much of it. So if you want to see how this works and get an example of Elasticsearch, this is a pretty good starting point. So basically what this does is it shows you all 57,000 emails that have been collected by this project. And it lets you search.
So I can search for, say, security. And then I get back 15,000 results. It shows the number of emails sent by month at the top. And then down the side, this is my sort of pet favorite feature of any piece of search software. It has these things, which are sometimes called facets, sometimes called filters. So what this is saying is that without a search term,
there are 56,000 emails. 37,000 of those were sent by Republicans. 19,000 were sent by Democrats. 280 were sent by Independents. You can see them broken down by representatives versus senators, by the states that that politician represents. And if I then search for, say, security, those numbers
update, and I can see that of the emails that mention the term security, 2,500 were sent by senators. If I click on that, I'm now seeing emails sent by a senator that mention security. And I can see that of those, 810 are from Democrats, 1,600 from Republicans. I can now see that the state whose senators care the most about security is apparently ME. Is that Maine? And I get this. And so I can keep on drilling down and say, OK, emails mentioning security from Maine. I've got gender in there. I can look by male or female senators. I can see the actual senators themselves. So Susan Collins is the most prolific emailer from the
state of Maine on the subject of security. But the key thing I'm illustrating here is that when you've got a search engine like Elasticsearch, these kinds of calculations, these aggregate counts, become essentially free; they become very easy from an implementation point of view.
The performance is super fast. So you can build this kind of highly interactive interface very easily on top of that underlying engine. And if you go on sites like Amazon and booking.com and so forth, they all make very extensive use of this faceted navigation pattern. If you try to do this with a relational database, you're likely to run into some pretty nasty performance
problems pretty quickly. What I can also show you is, under the hood, I'll show you a little bit of what Elasticsearch itself looks like to work with. This is a tool called Sense, which is kind of an IDE for talking to Elasticsearch. And as I mentioned, Elasticsearch is all JSON and HTTP, so I'm going to do a GET against the slash email
slash underscore search endpoint. And this returns, essentially, all of the emails. It's paginated. I think it returns 10 to a page. And you can see at the very top, it says that there are 56,000 of them. This is what an email looks like. It's got all of this data that I ingested when I indexed the documents.
And so then I can say, actually, I want to run a search. So this is illustrating the Elasticsearch domain-specific language. I'm doing a GET against /email/_search with a query, and I want to match the term security in the body field. And I'll run that, and now it gives me back 15,000 emails.
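That request, reproduced as a hedged sketch from Python (the index and field names are from the demo; the localhost URL is an assumption):

```python
import requests

# Elasticsearch accepts a query body on a GET request, as noted below.
response = requests.get(
    "http://localhost:9200/email/_search",
    json={"query": {"match": {"body": "security"}}},
)
print(response.json()["hits"]["total"])  # roughly 15,000 in the demo
```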
And those are the ones that match this particular query. There's a slight oddity of Elasticsearch here: this is an HTTP GET which includes a body, as if it were an HTTP POST. I had no idea this was even possible, but apparently it is. Elasticsearch uses it for everything. So that's something I learned from playing around with this. But let's go a step further and say, OK, we're going to
search for all of the emails matching security, but I want to also get numbers broken down by role type, which is senator versus representative, and by party. And if I run this search here, I get back all of these different search results. And then at the bottom, I get this aggregations block where it says role type representative has 12,500, senator has 2,500; Republicans 10,000, Democrats 5,000, Independents 123. These are the numbers that you saw in the interface earlier. But this illustrates how it's just a little bit of extra JSON that you add to your query. And the query time for this was four milliseconds, which
I think is pretty good. I'll show one last example just to illustrate something that I think is unique to Elasticsearch, which is that Elasticsearch lets you take these aggregations and nest them. So here what I'm doing is I'm saying, I want to aggregate counts by the party. And then within that party, I'd like to do counts by the role type.
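A sketch of that nested aggregation in the query DSL (the field names follow the interface shown earlier; the exact mapping is an assumption):

```python
# Count emails per party, and within each party, per role type.
# "size": 0 skips the matching documents and returns only the counts.
query = {
    "size": 0,
    "aggs": {
        "by_party": {
            "terms": {"field": "party"},
            "aggs": {
                "by_role_type": {"terms": {"field": "role_type"}}
            },
        }
    },
}
```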
So if I run this, I get back results where you can see that the republicans have sent 37,000 emails. Of those, 31,000 were sent by representatives, 5,000 sent by senators. The democrats, 19,000 emails, of which 15,000 were representatives and 4,000 were senators. And then the independents at the bottom, there's apparently one email sent by an independent representative, and the rest were all sent by senators.
I'm intrigued. Let's have a look. So if I do independent here, and then representatives, sure enough, Representative Gregorio Sablan has sent a single email apologizing for spam, which, considering
he's only sent one email, is a little bit surprising. So there we go. So that's sort of some of the power that you get once you start adding an engine like Elasticsearch into your stack.
I said earlier I'd talk about the difficult problem, which is the synchronization strategy between the search index and your relational database. And I've tried a whole bunch of different ways of doing this. The three that I've had the most luck with are these, so I'll talk through them in a little bit more detail. But to repeat, the problem we're trying to solve here is
you've got users who are making changes in your relational database. They're updating things, adding things. You want those to be reflected in your search index as quickly as possible, because any delay could result in strange behavior that the users don't understand. And you want to do this in a way that is performant and efficient and doesn't cause too much overhead on the
various parts of your stack. So the simplest way to do this is to basically keep it in the database, and it's to have a last touched or a changed timestamp on the actual rows of your database, which gets updated any time somebody changes that row. This is a very common pattern. Here's what it might look like in the Django ORM.
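A minimal sketch of that slide (the model name is hypothetical; the field matches the description that follows):

```python
from django.db import models
from django.utils import timezone

class Conference(models.Model):  # hypothetical model name
    # ...the record's real fields go here...
    last_touched = models.DateTimeField(db_index=True, default=timezone.now)
```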
I've got the last_touched column. It's a DateTimeField with db_index=True. It's important to stick an index on this, because you're going to be pulling from it in a cron job once a minute. So it needs to be able to return results quickly. And you set it to default to now. And that's fine. And then once you've got this set up, the simplest thing to do is just have a cron that runs once a minute,
does a select star from that table where the changed date is within the last minute, and then re-indexes those items. The nice thing about having this as a timestamp is that an indexer can keep track of the last timestamp it saw. So if your indexer doesn't run for five minutes, when it runs
again, it can catch up on five minutes' worth of changes all at once. There is a subtlety to this, which is that quite often when you're building a search index from a relational database, there are changes that happen to other tables which still should trigger an update. So on Lanyrd, we had a concept called a guide, where somebody might create "my guide to JavaScript conferences in Europe". And they'll then add events into that guide. The thing is, we try and include the name of that guide in searches. So if you search for JavaScript Europe, as long as an event has been added to the JavaScript in Europe guide, it should show up. What that means, though, is anytime somebody changes the guide, we need to re-index all of the events that link to that guide, because there's a bit of dependent data that has now been updated. And if you're using the last-touched mechanism, that's pretty easy. You can say, any time a guide is edited, do guide.conferences.all(), find all the conferences attached, and update their last_touched date to the current timestamp as well.
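A sketch of both halves of that, assuming the hypothetical Conference model above and a reindex() helper that writes one document to Elasticsearch:

```python
from django.utils import timezone

def touch_guide_conferences(guide):
    # Cascading touch: dependent data changed, so mark every conference
    # attached to this guide as modified.
    guide.conferences.all().update(last_touched=timezone.now())

def index_changes(last_seen):
    # Run from cron once a minute; catches up however far behind it is.
    for conference in Conference.objects.filter(last_touched__gt=last_seen):
        reindex(conference)  # hypothetical Elasticsearch writer
```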
For the most part, that works fantastically well. And this means that you get these cascading changes happening within your database, which your search index can then catch up on. So a slightly more sophisticated way of doing this is with a queue. You can have application logic that says, anytime somebody
updates an event or updates a document, write that document's ID into a queue, and then have something at the other end of the queue which is consuming from it and re-indexing those documents. A really nice side effect of this is that you can have de-duping, although you get de-duping with the previous mechanism as well. But you can write your indexer so that it says,
OK, there's been a flurry of activity around this particular document, but I'm going to batch those up. And a few seconds later, I'll do one re-indexing call to recreate that in Elasticsearch. So I've built this a few times. I've built this on top of Redis, which worked great as a little Heroku app.
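A rough sketch of that Redis-backed queue (the key name, batching window, and reindex() helper are all assumptions):

```python
import time
import redis

client = redis.Redis()
QUEUE_KEY = "reindex-queue"  # hypothetical key name

def enqueue(doc_id):
    # Called from application code whenever a document changes.
    client.rpush(QUEUE_KEY, doc_id)

def consume(window_seconds=2):
    while True:
        time.sleep(window_seconds)  # let a flurry of edits accumulate
        ids = set()
        while (raw := client.lpop(QUEUE_KEY)) is not None:
            ids.add(int(raw))  # the set de-dupes repeated updates
        for doc_id in ids:
            reindex(doc_id)  # hypothetical: one re-indexing call per doc
```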
I've built this at Eventbrite, where we use Kafka for it. And in fact, we have a slightly more sophisticated system, which I'll dive into in a moment. Another nice thing about queues is if you have a persistent queue, you get replayability as well. So you can replay all of the indexing changes from the past five minutes. The most sophisticated way that I've seen this done is what we do at Eventbrite, which is to tap into the database's replication log itself. So MySQL has very robust replication. It's very easy to have a MySQL leader database and then set up multiple replicas that reapply all of the changes made to that leader. It turns out the replication stream is this slightly weird
binary protocol, but if you know what you're doing, you can tap into that yourself, and you can write your own code that reacts to changes that have been made to the database. There's a fantastic open source Python library that we use for this at Eventbrite called Python MySQL Replication. So at Eventbrite, we built a system called Dilithium. And Dilithium is essentially a way of listening to those
database changes and using them to trigger other actions around the Eventbrite system. The way it works is you have your master MySQL database with all of the writes going to it. You have a replica MySQL database that's
replicating off of that. Then Dilithium listens to that replica. So it's replicating from a replica to figure out what changes are going on. It sees things like event row 57 has been updated, attendee row so and so has had these fields changed. And it takes that flow of data and turns them into
what we call interesting moments, because we can't use the word event at Eventbrite because it's already taken by one of our main domain objects. So those moments that come through, we translate into things like event 57 has been updated, event 23 has been created, order 37 has been placed.
Those we then write into Kafka, which is a very robust, high-performance message queue that LinkedIn put out a few years ago. And our search indexes are one of many different components that can then listen to that Kafka queue and decide when they need to re-index things. So it's a pretty complicated flow of data once you stick
it in a diagram. But essentially what this means is any time any piece of code at Eventbrite updates one of the rows in our events table, Dilithium will pick that up and turn that into a Kafka message. Our indexing code will listen to that and say, oh, event 57 has been updated. It'll then query the database to figure out the
current details of that particular event and then write those changes into Elasticsearch. So the end result is we have something which scales extremely well and which can be run on many different machines at once and gives us a very robust path from initial database change to updates in our Elasticsearch index.
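A minimal sketch of tailing that replication stream with the python-mysql-replication library mentioned above (the connection settings and the downstream publish function are assumptions):

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={
        "host": "replica.internal",  # read from a replica, as described
        "port": 3306,
        "user": "repl",
        "passwd": "secret",
    },
    server_id=100,  # must be unique among MySQL replication clients
    blocking=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for binlog_event in stream:
    for row in binlog_event.rows:
        # e.g. turn "a row in the events table changed" into a queue message
        publish_interesting_moment(binlog_event.table, row)  # hypothetical
```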
So I've got a few tips and tricks that I wanted to dive into, just little bits and pieces that I picked up that have helped with implementing this overall pattern. And the first one is one that can really help avoid serving stale data to your users.
And that's to do everything with your search engine in terms of object IDs as opposed to the raw data itself. Generally with your search index you'll be writing a lot of data into it. It needs to know the titles of things, the descriptions of things, any fields that you might want to search by. So there's a temptation to hit the search index, get that data back, and then use that to construct objects
that you would present back to your user. The moment you do that, though, you're setting yourself up for some really nasty latency risks. Because as I said earlier, there's going to be a three to five, maybe 10 second delay between changes in your database and changes in your index. You really don't want to be showing that stale data to your users. So the trick here is very simple.
When you run searches, all you ask back from the search engine are the IDs of the underlying records. So you run a search, you get back a list of, say, 20 integer IDs. You can then hit the database directly to inflate those into actual finished objects. Now, the downside of this is
you're adding additional load to your database. The good news is that databases are insanely quick at primary key lookups. Any time you're doing primary key lookups or lookups against an index, that's going to return really fast. So with Django, you can use the in_bulk() helper method and make extensive use of prefetch_related() as well, which again is a very fast way of retrieving data.
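A sketch of that inflate step, in the older elasticsearch-py style (the index and model names are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumed local cluster

def search_events(query_body):
    hits = es.search(index="events", body=query_body)["hits"]["hits"]
    ids = [int(hit["_id"]) for hit in hits]

    # One fast primary-key query inflates the IDs into fresh model objects.
    events_by_id = Event.objects.in_bulk(ids)

    # Preserve the search engine's ranking while serving fresh rows;
    # skipping missing IDs also sets up the self-repair trick described next.
    return [events_by_id[pk] for pk in ids if pk in events_by_id]
```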
And you can set it up so that your users will never see stale data, because even if the data was stale in the index, by the time it's pulled from the database, it's going to be the most recent version of things. A related concept to this is, if you're pulling things directly from the database,
what do you do if something's been deleted? What do you do if your search engine gives you back ID 57, and then when you go to fetch it from the database, ID 57 has been deleted in the time it took for that search to come through? The way we've handled this in the past is essentially to have a self-repairing mechanism. The code that queries the database can notice when there's an ID that's missing, and then stick it on a Celery queue, or use some other mechanism, so that the search index knows to then remove that document from the index entirely. The downside of this is somebody might ask for a page with 10 results on it, and one of those results is missing. So you end up giving them back nine results and then
quietly filing that 10th away to be deleted. My hunch is that no one will ever notice this. I don't think people go around counting the number of results they get on a page. So it's probably OK. Then one last trick, which again ties into this idea. This is something I'm using on a project at work at the moment.
And I'm calling it the accurate filter trick. Essentially, this is an additional way of solving for this latency between your database and your search index. So imagine, if you will, that you're building a system where users can save events. So they'll see an event that they want to go to.
They hit a Save button, and that event is saved to their account in some way. I would like to be able to answer queries about what events this user has saved by hitting the search index. Because if I can hit the search index, I can combine it with all of these other benefits. I can let users search for text within the
events they've saved. I can do filters by geography. There's all sorts of useful things I can do with this. And that's easy enough to implement. You have a field on your event document in Elasticsearch containing the IDs of the users who have saved that event. And then you can do a search like this. You can say, search for events where one of the saved-by-users values is the user ID that I'm dealing with.
There's just one obvious problem with this. If a user saves an event and then goes and looks at their list of saved events within a few seconds, and that event isn't shown to them, then obviously something is broken. This is the latency problem that we've been fighting since the beginning of the talk. So what you can do is you can say, OK, any time I'm
running that query, the first thing I'm going to do is hit my relational database to figure out what are the events that this user has saved in the last x minutes, let's say the last five minutes. So this is guaranteed to give me an accurate model of the user's recent activity. And it'll give me back, say, four or five IDs of
documents that we know that the user has saved. Once you've got that list, these are the ones that were saved in the last five minutes, you can construct an Elasticsearch query where you say, give me back any event where either the user is listed in the list of users who have saved this event, or the event itself is one of these five that we know they've saved recently. And as a search query, this will run crazy fast.
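A sketch of the whole trick (the model and field names, SavedEvent and saved_by_users, are hypothetical stand-ins for the ones described):

```python
from datetime import timedelta
from django.utils import timezone

# Step one: the database gives an exact picture of very recent saves.
recent_ids = list(
    SavedEvent.objects.filter(
        user=user,
        created__gte=timezone.now() - timedelta(minutes=5),
    ).values_list("event_id", flat=True)
)

# Step two: OR that guaranteed-fresh list into the search query.
query = {
    "query": {
        "bool": {
            "should": [
                {"term": {"saved_by_users": user.id}},  # index may lag
                {"ids": {"values": recent_ids}},        # guaranteed fresh
            ],
            "minimum_should_match": 1,
        }
    }
}
```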
It gives you all of those benefits. But this is guaranteed to be exactly up to date with the activity of your users. Saved events are one obvious application of this. There are a whole bunch of other things where, if you want to get a precise, up-to-date reflection of the state of your system, you can use tricks like this
to pull those out of Elasticsearch. So, a few more use cases that I've applied Elasticsearch to in particular. One that we use at Eventbrite is for recommendations. Because it turns out recommending events to a user is essentially just another search problem.
There are a bunch of ways you can do this. You can say, find events where one of my friends has saved that event. This is the way we did our calendar for Lanyrd earlier. And you can also say, find events that are similar to the last 10 events that I've saved. A very straightforward way of doing that is to look at the last 10 events saved by a user, collect together the
text from the title and description of all of those events into a giant blob of words, and then just search for those words with a Boolean OR clause, which is enough for Elasticsearch's relevance to kick in. And it'll give you back other events that are similar in textual content to the events that the user has saved as well. And search engines are really good at relevance scoring and boosting.
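A sketch of that blob-of-words recommender (the field and relation names are hypothetical; a match query ORs its terms together by default, and excluding the already-saved events is an extra touch, not something spelled out in the talk):

```python
saved = list(user.saved_events.order_by("-created")[:10])
blob = " ".join(f"{e.title} {e.description}" for e in saved)

query = {
    "query": {
        "bool": {
            "must": {"match": {"text": blob}},  # relevance does the work
            "must_not": {"ids": {"values": [e.pk for e in saved]}},
        }
    }
}
```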
So you can fine-tune this stuff to a huge extent using the tools that are built into the search engine. Another thing Elasticsearch is great at is geographic search. It's got built-in support for geo. You can add latitude and longitude points to your documents. And then you can do things like, give me all documents within five kilometers of this point. You can even send it a polygon and say, here is the shape of Canada, give me everything that falls inside that polygon. And again, this is stuff which, if you're not using Postgres, can be quite difficult to do with a relational database. But more importantly, you can combine these with all of the other searches and filters. So if I've got a recommendation system built on Elasticsearch, I can say, recommend the events similar to these events that fall within this geographic area, and further combine that with other options as well. Search engines in general are great for visualizations. You saw this earlier with the DC Inbox Explorer. I've got this little graph at the top, which is actually
generated from just another one of these aggregations. When I search against Elasticsearch, I can say, give me back counts per month for this time period. And once I've got those counts back, I can turn those into a bar chart. And if you look at the way Elasticsearch is used for log analysis, people do some really exciting visualizations on top of the raw data that's being
collected by these aggregations. And one way to think about this is it's kind of like having a real-time MapReduce engine. If you've ever used MapReduce on something like Hadoop, it's a very powerful way of running a query across many different machines and getting results. But it's generally something you want to run as a batch job, because it might take 30 seconds to a minute for it
to return results. Elasticsearch, under the hood, is doing pretty much exactly that. If you give it a search, it will spread it out across the nodes in your cluster, combine the results together, and use that to return documents to you. But it's designed to work in real time. You're getting response times measured in milliseconds,
which means that you can expose these directly to your users. So in summary, you should denormalize your data to a query engine. It's definitely a good idea. It lets you build all sorts of things you couldn't build before. And Elasticsearch, it turns out, is a pretty good option for this. And I've left lots of time for questions, so thank you very much.
I will take that question. Thank you. You were talking about how you could use a queue to sort of deduplicate repeated indexer actions. Can you speak a little more to that? So the thing that you want to avoid is 500 people interact with something in your database, and then you
send 500 updates to your Elasticsearch index in a giant flurry. So really, is this de-duping? Yeah, it's de-duping, it's rate limiting. It's being able to get smart about this and say, OK, there were 500 updates within a short space of time, but I'm actually going to turn that into a single, combined
update to the index. The way I've built this on top of a queue is the code that listens to the queue. So actually, one way that I built this against Redis was to say, when an indexing request comes in, if the indexer hasn't done anything in five seconds, just index that thing straight away.
If the indexer has run within the past five seconds, that suggests that there's a lot of activity going on. So then, hold out for a couple of seconds to see if other updates come in for the same event ID. And if they do, bundle those together and send that all at once. I think, actually, the de-duping becomes a lot easier if you use the last modified timestamp, because then
it's just your cron job runs once a minute, and if there were 500 updates to an event in the last minute, you'll still only re-index it once. Hi. I was wondering, are you the only person who's put a name to this and is educating people about it?
As far as I know, I am, which I find really surprising, because I've seen lots of places doing exactly this. I just don't think anyone's put a name on it before. So if somebody does have an alternative name, then I'd love to hear about it. But yeah, as far as I can tell, I'm the only person who's said, let's give this a name and start discussing this as a general strategy. So there's no post about it? Definitely not yet. I should write a blog entry. Yes, please write a blog entry. That would be great. So you talked a lot about getting stuff out. I was wondering, specifically for your DC Inbox example,
do you have to do a lot of pre-processing of the data before it goes in when you're actually structuring your documents and stuff like that? Is there an extra step on that side? Yeah. So one concept I didn't really talk about is there's a thing in Elasticsearch called a mapping, which is basically the same thing as a SQL schema.
You don't have to use mappings. You can just start blasting JSON documents at it, and it'll work. But if you actually want to be able to differentiate dates and times from geographic points and so forth, you need to use them. And so here's the mapping for the DC Inbox thing. And actually, there's a lot of data in this one. So it's got caucuses and congress numbers and C-SPAN IDs and so forth.
The source code for this is all available on GitHub, so you can see. So yeah, I actually used a library called Elasticsearch DSL for this, which is a Python library that tries to be a slightly higher level way of working with Elasticsearch. You can also compose this as a JSON blob and then post that to Elasticsearch itself.
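For flavor, a heavily trimmed sketch in the elasticsearch-dsl style (current releases of the library use Document; the handful of fields shown is a tiny subset of the real DC Inbox mapping):

```python
from elasticsearch_dsl import Date, Document, Keyword, Text

class Email(Document):
    subject = Text()
    body = Text()
    party = Keyword()      # exact-value fields are what power the facets
    role_type = Keyword()
    date = Date()

    class Index:
        name = "email"
```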
But yeah, so your first step is going to be designing a mapping for your data. The actual indexing is pretty trivial once you've got the mapping in place, because you really are just constructing JSON documents and then posting them back up to the server. But yeah, the mapping design is quite important. Thanks for the talk. That was really great. So I know that you're talking about how performant this
is to find documents based on whatever criteria. So is it equally performant if I wanted to do something like aggregates, to get average scores or standard deviations on something inside the document? Absolutely. The strength of aggregations is that they're insanely fast for this stuff. As you get more complicated with the aggregations,
the performance cost can start to add up. But honestly, the most complex queries I've come up with are in the order of 100 milliseconds as opposed to 10 milliseconds. So generally, the performance is really good. And a lot of the aggregations I've shown so far are what are called bucket aggregations, where you divide documents into different named buckets. There are also aggregations for metrics
that can calculate things like standard deviations and sums and medians. And even in geospatial terms, there are aggregations that will calculate a bounding box around all of the documents. So there's a whole bunch of additional power and flexibility you get around that. All right, thank you so much, Simon. Thank you.