Mastering Elasticsearch With Ruby
Formal Metadata

Title | Mastering Elasticsearch With Ruby
Title of Series | Ruby Conference 2013
Part | 16 of 50
Number of Parts | 50
Author | Bonmassar, Luca
License | CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/37479 (DOI)
Production Place | Miami Beach, Florida
Transcript: English (auto-generated)
00:16
I'm Luca Bonmassar, even though my badge says Bonsammar.
00:21
My real name is Bonmassar. I'm 31, I'm Italian, and I live in San Francisco. I work at Gild, and today I'll talk about search using Elasticsearch. I have a lot of content to show, and I have a sticker here that tells me the time,
00:41
so I'll just jump into the talk. Let's start by defining what we will discuss here. Search is a very broad topic, so we want to clarify our use case. What we are discussing here is: you are building a product, and you want to integrate a search experience into that product.
01:01
So we are not talking about building a search engine; we are not trying to compete with Google. We want to implement that little search box that every website has. The general use case is: you have user-generated content, and you have other users who have to be able to find and discover this content.
01:24
So why do we have to discuss this? The reason is that search is not easy. It usually starts, when you have to build some search capability into your product, by saying: hey, our primary data store has some search capabilities, why not use that?
01:40
And then you start by adding some sort of SQL query to search in your database. But users are picky, and they want more. They don't just want to search by exact name; they also want to, for example, enter a long string and still be able to find products in your system.
02:02
And then you want to support AND queries and NOT queries, and your little WHERE name becomes a function that has to parse parameters. You don't want to search on only one specific field; you soon start searching on multiple fields in your database, so you need to add index after index.
02:22
And what happens in your product is that you start with a very simple function, and you end up building your own search engine inside your product. That's probably not what you want to do, because you want to focus your development effort on your product's core functionality, not on rebuilding yet another search engine.
02:41
So on the agenda: what I want to show here is not search in general. I want to pick a pet project and talk about search on that project, so that it's easier to discuss the various steps we have to take to introduce search into the project,
03:00
rather than talking about search in the abstract. We will go through the boilerplate: downloading the code, downloading Elasticsearch, installing, scaffolding, configuring, et cetera. And we will see a very simple website that we can build by integrating search functionality,
03:22
and then see how we can refine it, improving and adding more capabilities to our search. And then, as homework, other capabilities that Elasticsearch gives you that I didn't have time to discuss here, but that you can integrate into your product with almost no effort.
03:40
The reason we start from a real project, instead of treating search as a theory topic, is that it's easier to understand each use case, and why certain decisions are taken here and there, if we talk about something concrete
04:03
rather than every possible kind of search. And we will see, for example, a few features where it's not easy to understand why the feature exists, but in the project it makes a lot of sense: ah, yes, you could do this. So our project starts from RubyGems. Everyone, I assume, is familiar with RubyGems.
04:25
RubyGems has functionality to search the gems in its database, and it has implemented search in the same way I described before. If you look at the RubyGems source code on GitHub, what they are doing is a SQL query, name LIKE whatever you're inputting,
04:45
and they detect whether the result is an exact match or just something similar. It's a pity, because they have so much more information in their database. You could look up dependencies; you could look at not just the name, but also the info, the summary, the build.
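As a hedged paraphrase, not rubygems.org's actual source, the LIKE-based approach described here looks roughly like this (class and method names are illustrative):

    # A sketch of the legacy approach: a LIKE query plus an exact-match check.
    class Rubygem < ActiveRecord::Base
      def self.legacy_search(query)
        # substring match on the name column only
        where("name LIKE ?", "%#{query}%")
      end

      def self.exact_match?(query)
        exists?(name: query)  # detect whether the hit was an exact name match
      end
    end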
05:02
So what we want to do is extend the search capabilities so that we can do all of this. Clearly we don't have access to their database, so our project will start by getting the content through a web spider. We will import the data into a Mongo database.
05:24
I'll clarify why we decided to go with MongoDB for this simple project. We will see Elasticsearch in action, and then we'll build a very simple Rails application that exposes the search functionality we want for this RubyGems project.
05:42
So let's start with the crawler. The code is available there; I'm not spending too much time on it because it's not the focus of the talk. But the idea is: rubygems.org/gems provides the list of all the gem names. They are paginated by name, so we can go one by one,
06:01
collect all the names using Nokogiri, and then, using the RubyGems.org API, download the JSON for each gem. All of this, when it runs, ends up like this. I'm not expecting you to parse all the content,
06:21
but the idea is that we now have, inside Mongo, a JSON document representing all the data available for each gem; in this case, the Twitter gem. This also clarifies why we want to go with Mongo: we don't want to map data between what the gems API returns and our own schema. Whatever they return in terms of data structure,
06:41
we just dump the JSON into Mongo, and it's there, available for us to manipulate and work on. So we are now at the stage where we have the crawler running and the data imported into the system; there's also a dump of the data available in case you want to play with it.
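A minimal sketch of that crawler, assuming the mongo 1.x driver of the era and guessing at the page structure (the CSS selector and collection name are assumptions, not the talk's code):

    require 'nokogiri'
    require 'open-uri'
    require 'json'
    require 'mongo'

    gems = Mongo::MongoClient.new.db('rubygems_search').collection('gems')

    page = 1
    loop do
      html  = Nokogiri::HTML(open("https://rubygems.org/gems?page=#{page}"))
      names = html.css('.gems a span').map { |node| node.text.strip }  # selector is a guess
      break if names.empty?

      names.each do |name|
        # The public API returns the full JSON document for a gem ...
        json = JSON.parse(open("https://rubygems.org/api/v1/gems/#{name}.json").read)
        gems.insert(json)  # ... which we dump into Mongo unchanged
      end
      page += 1
    end

With the data in place, let's start building our very simple interface.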
07:04
This is the starting point of our project. There is nothing here but a scaffolded Rails application showing all the gems that are available. We support pagination, you can open any of these, and it basically reports all the data
07:21
that we have. So this is the starting point, my generic product, where I want to implement our little search; let's see how we do that. The first step is to introduce Elasticsearch. We're not gonna implement all the logic
07:41
of a search engine; we'll just use something that does that for us. This is Elasticsearch, which is cool, bonsai cool. What is Elasticsearch? Here is a long list of buzzwords, but let's say it's an open source search engine. It also provides analytics capabilities,
08:00
so you can use the same engine to run a sort of map-reduce over your data. It's distributed, meaning that it's relatively easy to scale, because the data is not monolithic: you can split it into multiple shards, and you can have multiple nodes to distribute
08:21
your data on, and each shard can be replicated, so it's also very good for the resiliency of your application. It supports near real-time search, which in short means that you can index new data and, almost in real time, have that data available for search.
08:41
It's multi-tenant, which means you can have multiple indexes, not just one, and you can do cool stuff like swapping indexes without the application seeing any change, so you can deploy new indexes with almost no downtime for your application.
09:01
That's very cool when you're building your application in iterations: every time you change something in your index, unless you want to re-index all the data and keep the website down for a while, you can just hot-swap the indexes. And the last thing: it's built on top of Apache Lucene.
09:22
There are other projects that use the same technology; Apache Lucene is a very powerful Java library for manipulating text. All this is nice, but as a developer, what is really interesting is that we have this magic box that is able to do search and exposes its capabilities through a REST API, and the language for communicating
09:43
with this magic box is JSON. So we can send JSON documents in and, even with just the curl command, query the server. So here is the list of things we have to do to have Elasticsearch ready to play with in our product.
10:02
We clearly have to download the Elasticsearch code, set it up, and define some settings; we will see some defaults and the things you have to change. Optionally, we have to define a data mapping. In the Elasticsearch world, a data mapping is the equivalent of defining a database or a table in the relational world.
10:25
It's optional because Elasticsearch is schema-less, so you could just ignore it and start injecting data into Elasticsearch. You only have to do it if you don't want the default assumptions that Elasticsearch makes about each field. If you want, for example, a specific field
10:44
to be tokenized or parsed in a special way, you have to define your own data mapping. Then the next two steps: first, we need to load data into the Elasticsearch cluster, so we have to transfer data from MongoDB to Elasticsearch.
11:03
And the last thing, the thing we actually want to do, is to start searching. Since it's a REST API with JSON documents, we can even do that from the command line, and we can parse the results because they are JSON, so they are very handy to read from the terminal.
11:22
So let's start with the boilerplate. This is the procedure that works in any environment. Elasticsearch is a Java beast, so you need Java on your machine; hopefully not just any Java, but the Oracle Java. You can also run it with GNU Java,
11:41
but very often you run into weird issues; the Oracle Java is definitely better. If you're running on a Mac or Linux, you can of course install it using a package manager like Brew, Port, or APT. Then we move on to the configuration.
12:02
The very basic configuration is logging, where you define the verbosity of your logs, but also where to log and what to log in each stage: production, development, staging. And then you have the long list of config settings for Elasticsearch.
12:22
By default, if you want to run it on a dev box, you don't have to configure anything; you can assume all the settings are good enough for a development environment. There is actually only one parameter that you absolutely have to change, and that is the name of the cluster. The reason is that, by default,
12:41
the name of the cluster is "elasticsearch". And as we said, Elasticsearch is a distributed system, so if you run it on a network where your developer friends are also running Elasticsearch, the nodes will start discovering each other and building their own cluster. What that means is that when you think you are operating
13:02
on your localhost, you are actually operating on your entire developer team's data. And it's a nightmare to troubleshoot, because I can wipe out the entire database and everyone else working on it doesn't realize what's happening. So at the very least, change the name of your cluster. Many of the other parameters are one-time-only:
13:23
you set them and forget them. The first ones are the topology of your cluster: what the cluster is gonna look like, how many shards (how you want to split the data), and how many replicas for each shard. You're gonna define where things live
13:41
on your local file system. Elasticsearch is extensible through plugins in Java, so you can either write your own classes and inject them into the cluster, or download any of the many plugins that are available for monitoring and controlling the cluster. The setting where you will spend
14:01
the majority of your time in production is memory. Elasticsearch is a Java beast, and you will need to do a lot of JVM tuning, in particular so you don't run out of memory every time you run a facets query. Everything else, again, is
14:21
set once and forget. So, we're almost there in terms of boilerplate. We can finally start our Elasticsearch cluster, and using cURL we can test if it's alive. There are tons of APIs available to check the health of the cluster and of each node,
14:41
and also the consistency of each index, to see, for example, whether your index is online, whether it's corrupted, or what it's doing; sometimes it's synchronizing data between nodes, and through those APIs you can see that. You can also shut down each node or the entire cluster through the API.
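A minimal liveness and health check from Ruby, assuming the default host and port of a local node:

    require 'net/http'
    require 'json'
    require 'uri'

    puts Net::HTTP.get(URI('http://localhost:9200'))  # basic "is it alive?" banner

    health = JSON.parse(Net::HTTP.get(URI('http://localhost:9200/_cluster/health')))
    puts health['status']  # "green", "yellow" or "red"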
15:02
We're done, so we are finally ready to be Elasticsearch experts. We can tell the world that we are Elasticsearch experts and let our friends endorse us. And I'm sure that as soon as you put it up, all your friends will start saying, yeah, he's an expert. So it's a good thing to put on your resume.
15:23
So, let's take a step back and see where we are and what is missing from our project. We have Elasticsearch running, we have Mongo running. What else do we have to do? We have to tell our project something about Elasticsearch,
15:41
and we have to start moving the data from Mongo to Elasticsearch. And last, the step we really want to get to, is being able to do queries, so that we can implement search capabilities in our product. The first step, then, is to tell our app where Elasticsearch is
16:02
and how to communicate with it: the client side of Elasticsearch. We can use the Tire gem, which unfortunately was renamed Retire in September. The reason is that the Elasticsearch team is now building their own official gem, so the author is deprecating Tire.
16:22
However, in terms of maturity and features, the official gem is probably at least a year or more behind, so Tire is the way to go for now. Tire provides more than just a way to interact: you could do everything over HTTP with your favorite Ruby HTTP library, like HTTParty,
16:44
but unless you want to be at the metal level, you want something that wraps all the complexity of dealing with every single timeout, and Tire can do that for you. It also supports a nice ActiveModel integration, so if you're using Rails,
17:01
you can basically forget about Elasticsearch. You will have a few methods you can call on the class, and all the complexity is totally hidden. And last, it provides a set of utilities and tasks to perform operations that you would otherwise do by hand, like, for example, importing the data,
17:22
and I'll show a couple of cases. So we need to set up the gem: we put it in our Gemfile and bundle install. There is a typo there; if you run it as shown, you'll probably overwrite the entire Gemfile. The configuration is pretty easy.
17:41
It depends on whether it's Rails or plain Ruby, but the idea is that you just set the entry point of your cluster, and that's the only thing you have to do. A second configuration option is logging, which is very, very useful for debugging what Tire is doing.
18:00
If you set it up, you get a client-side log file whose format is cURL commands, so you can cut and paste any of these commands into your terminal, replay every single step of what Tire is doing, and inspect every single result coming from Elasticsearch.
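A minimal sketch of that configuration, assuming a local cluster and a Rails initializer path:

    # config/initializers/tire.rb
    Tire.configure do
      url    'http://localhost:9200'  # entry point of the cluster
      logger 'log/elasticsearch.log'  # client-side log, written as curl commands
    end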
18:25
So now we can start talking about code. The RubyGem class in our little project is the wrapper for Mongo: each RubyGem is a single Mongo document. What we do is extend it with the Tire DSL;
18:42
at lines five and six, we mix in Tire. And everything else here is optional. I like to oversell, but everything here is optional. So we define our own mapping, which is the format of the record in Elasticsearch. Here we basically define a few fields,
19:02
like id, name, original name, info, licenses, and so on. You don't need to do that, because by default, the first time Elasticsearch sees any of these fields, it will define them automatically. It's here just as a sort of live documentation, so that if tomorrow you have to put your hands
19:22
back on the code, at least you know which fields are supposed to be in Elasticsearch. You do have to do this if you need to override any of the properties on some fields: telling Elasticsearch, for example, for a couple of fields like id and original name, please store it, but don't do any logic on top of it.
19:41
We want to keep the field as it is. And this defines the structure of the record. The second thing you should do is override a method called to_indexed_json; that's part of the Tire DSL. The idea is that you have to convert your record,
20:00
which in our case is JSON because it's Mongo, into some other JSON for Elasticsearch. You can also skip this method entirely if you just want a one-to-one mapping, so that whatever is in Mongo ends up in Elasticsearch. However, if you don't want to overload Elasticsearch with every sort of parameter that is in Mongo,
20:21
you want to define the structure of your own record. So here we just define a hash for this record and convert it to JSON.
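A hedged reconstruction of that class, assuming Mongoid (the talk only says "a Mongo document") and taking the field list from the narration:

    class RubyGem
      include Mongoid::Document
      include Tire::Model::Search     # the two Tire lines mentioned above
      include Tire::Model::Callbacks

      # Optional: explicit mapping, also acting as live documentation of the fields.
      mapping do
        indexes :id,            index: 'not_analyzed'  # store as-is, no analysis
        indexes :name
        indexes :original_name, index: 'not_analyzed'
        indexes :info
        indexes :licenses
      end

      # Optional: control exactly which fields are sent to Elasticsearch.
      def to_indexed_json
        {
          id:            id.to_s,
          name:          name,
          original_name: original_name,
          info:          info,
          licenses:      licenses
        }.to_json
      end
    end

To recap what we have done, we can fire a Rails console and take the first RubyGem record.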
20:40
We can call to_indexed_json on that record, and that is the JSON representation of your record for Elasticsearch. If on that record we call update_index, what Tire is gonna do is call your to_indexed_json, so it takes the record, generates the JSON,
21:01
and then executes a POST to Elasticsearch. This is the log we enabled on the client side, and you can see what is happening: it's posting to the rubygems index, for the rubygem type, with a specific ID, and then the JSON payload
21:22
that it's loading into Elasticsearch. Elasticsearch returns a 200, pure REST, so the operation succeeded. Now we know how to index at least one record; we have to replicate this for all the data we have available. The naive way would be to iterate through every record we have in the database,
21:42
call update_index on each, and we're done. It works, particularly in development mode. The way it works is to execute one single POST for each single record; you won't notice any performance issue if you're running everything on localhost. Clearly, if Elasticsearch
22:02
is on one box and you're on another box, the data transfer cost is huge because of all the round-trip time. What you can do is use Elasticsearch's bulk API and upload a thousand records at a time. And here we have a bundle exec rake task provided by Tire
22:21
that does everything for you. You can just fire that command and watch it upload all your records into Elasticsearch.
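A hedged sketch of that import step; the rake invocation follows Tire's documented task, and the console one-liner uses the index DSL's bulk import:

    # From the shell (FORCE recreates the index; records go through the bulk API):
    #   bundle exec rake environment tire:import CLASS=RubyGem FORCE=true
    #
    # Roughly equivalent, from a Rails console:
    Tire.index('rubygems') { import RubyGem.all }

And with that, we're finally done with all the infrastructure setup and can now focus on the search. As usual, I like to put in more things than are needed,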
22:41
just to show you the capabilities of the gem. But everything besides lines 7 to 9 is optional. Here I'm not using, for example, any ActiveModel integration: I'm going to Tire directly and asking for a search. I could have done RubyGem.search, and I wouldn't have had to pass the name of the index.
23:04
In the Tire search, we basically define what our search is gonna look like. The load false there is an option: by default, what Tire does is match each result coming from Elasticsearch back to your original record
23:20
in your primary storage. But that means that if you get 25 results back, for each one it's gonna go to the database and load the original record to give you the result. If you don't need that data, because, for example, the Elasticsearch record has enough data to present in the view, you don't want that; it's much faster to just parse the results coming back from Elasticsearch.
23:42
So in this case, we just say: please don't load the data from Mongo. And here is the query part. In the query part we get a search term, which is a string: whatever the user is gonna type into our search. And we ask Elasticsearch to search
24:01
in name, info, owners, and authors. And that's everything you have to do. Then we also ask Elasticsearch for a few other things. We ask: don't just give me back results, but also tell me, for each result, where did you find the match? Because we don't want to confuse the user, given that we are now also searching,
24:21
for example, owners and authors. Maybe you search for Twitter and you get a gem that's called something else, but the author was called Twitter, and you don't get why this result came back. We also ask for a specific sorting. By default, Elasticsearch provides a score for each record,
24:42
measuring how significant that record is for the search. It's a sort of, not PageRank, but search rank on each document, and it's based on multiple factors: for example, how often that word appears,
25:01
the frequency, the position, and so on. Here we just override that by saying: don't bother sorting by score, sort by the original name. And the last thing is to implement pagination. The two APIs used here are from and size: we define the page size and where we are in the stream,
25:21
so that we can jump to page two, three, four, five. That's all we have to do to implement the search.
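A hedged reconstruction of the slide's simple_search; the page size of 25 matches the demo, everything else follows the narration:

    class RubyGem
      def self.simple_search(term, page = 1)
        Tire.search('rubygems', load: false) do  # don't fetch originals from Mongo
          query do
            # query_string syntax, so "twitter AND bootstrap NOT rails" works as typed
            string term, fields: %w[name info owners authors]
          end
          highlight :name, :info, :owners, :authors  # "where did you find the match?"
          sort { by :original_name }                 # override relevance-based ordering
          from (page - 1) * 25                       # pagination: offset ...
          size 25                                    # ... and page size
        end.results
      end
    end

What we can do next is play with the search from the command line. We can again fire a Rails console and, on RubyGem, call simple_search, this time searching for Twitter and Bootstrap but not Rails.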
25:40
And we can print the first 25 results back. So we are done in terms of logic. We can now go to the UI and implement our little input box that just generates a GET request; whatever the user types in, we pass to the simple_search method.
26:00
And here is the highlighting at work. If you're searching for Twitter and Bootstrap but not Rails, we show, for each record, where the match is coming from, info, name, authors, and where in the string it matched. I just want to show you how the highlighting works, and then I'll jump to the running product.
26:25
So we re-execute the simple search for Twitter and Bootstrap, not Rails, and take the third result. That result, first of all, is not a real RubyGem class: it's an Item wrapping a RubyGem, and it implements other methods decorating the RubyGem class.
26:44
For example, highlight: it gives us back, keyed by field, where the match was found, as a sort of HTML with emphasis tags around the matching parts, so you can easily add CSS to highlight it and show it in your UI.
27:02
You can also change how it's tagged: instead of em, you can use any tag you want.
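A hedged console sketch of inspecting a highlighted result; the field names and output shape are illustrative:

    results = RubyGem.simple_search('twitter AND bootstrap NOT rails')
    item = results[2]      # a Tire::Results::Item decorator, not a RubyGem
    item.highlight.info
    # => ["... <em>Twitter</em> and <em>Bootstrap</em> ..."]  (illustrative)

    # The tag is configurable when asking for highlighting:
    #   highlight :info, options: { tag: '<strong>' }

So if we go back to the app, this is where we are now: we have implemented a very simple search. But there is one problem with the simple search: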
27:22
if you search, for example, for Tor, since we search everywhere, or at least in those four fields, we can get an unexpected result. I'm searching for Tor and I also match authors. That's clear to us, because that's what we built and what we were looking for. But while this can work,
27:42
you want your user to be able to go into an advanced mode where they can specify: I want to search here and there. So this time we implement the feature going the other way around: we start from the interface and work back to the code. This is more or less what a user would see. We continue to show the results, but on top,
28:03
instead of just one input box, we give a list of input boxes so that the user has more control over the search. And here we could also give the user more control, like: what do you want to sort the results by? By name, by something else?
28:20
We could also ask the user how many results per page they want, for example. When implementing this, one of the things you have to think about is: what's the logic across the fields? If I'm searching for something in name and in info,
28:40
what should we search? It could be an OR or it could be an AND: if you fill in multiple fields, do you want that to restrict your search, or to grow, expand it? In our case, we decide that if you search by name and author, so you put something in name and something in authors, we will search with AND.
29:02
So we go back, and this is the interface we have built; let's look at the code. It's not very different: everything looks the same, and I just cut and pasted the code to build the advanced search. The only thing we change here
29:20
is the way we execute the query block. We tell Elasticsearch that this is a boolean search. The search condition is no longer a string; it's a hash of conditions, whatever comes from the form, so a list of different keys and values.
29:41
We just iterate over the fields that the user has set and execute the search, saying: please put this in as an AND condition. That's everything you have to do, and it just works.
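A hedged sketch of that advanced search; the shape of the conditions hash is an assumption, e.g. { 'name' => 'tor', 'authors' => '' } coming from the form:

    class RubyGem
      def self.advanced_search(conditions, page = 1)
        Tire.search('rubygems', load: false) do
          query do
            boolean do
              conditions.each do |field, value|
                next if value.blank?
                must { string "#{field}:#{value}" }  # every filled-in field is ANDed
              end
            end
          end
          from (page - 1) * 25
          size 25
        end.results
      end
    end

So going back to the UI, we can now search for Tor in just the name field,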
30:03
and we filter down to just what matched in name, and we can iterate by searching in something else. If I also search here, it will probably be an empty search, because those conditions are now ANDed and there is no project that is called Tor and has Tor as an author. Success.
30:23
So let's iterate again and make the search interface a little more professional and robust. Let's talk about facets. Facets are a way to organize your results so that when you search for something,
30:40
in this case a LinkedIn page where you search for a Ruby developer, the result set can be organized into a certain number of categories, for example relationship, location, current company. And in real time, the search engine can, for each category, propose specific
31:01
subcategories, like first connection, second connection, and how many results you are gonna get if you click on one and narrow down your search. Facets are a very cool way to explore the data, because given the 100,000 results I got back, I can very quickly filter and narrow my search
31:23
down to a few results. So how complex is it to implement this with Elasticsearch? It's kind of easy. It's the same code as before; the only thing that changes is lines 34 to 38. Here we define facets: if the user has clicked the facets checkbox, we define four categories.
31:43
We want to group our results by license, by version, and by when the gem was built. The difference between the global licenses facet and the current licenses facet is that, by default, facets relate to the results of your query.
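A hedged sketch of those four facet definitions; the field names are assumptions, and as described next, the global variant ignores the query:

    term = 'twitter'
    Tire.search('rubygems') do
      query { string term }
      facet('global-licenses', global: true) { terms :licenses }  # whole dataset
      facet('current-licenses')              { terms :licenses }  # scoped to this query
      facet('current-versions')              { terms :version  }
      facet('built')                         { date  :built_at, interval: 'year' }
    end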
32:01
So whatever you search, it classifies the result set of your query. But you can also say: don't bother about my query, give me facets over the entire data set you have in Elasticsearch. So if we try this on the command line, the only difference is the facets flag
32:22
when we enable that option. The results array gets decorated with a method called facets, and if we inspect what's inside facets, we get back from Elasticsearch key-value pairs where the keys are what we defined in our facets: global licenses, current licenses,
32:42
current versions, and the dates. We get some statistics: how many documents have that property, how many were found, how many don't have anything like it. But in particular, within terms or entries, you get key-value pairs of how many documents matched each specific category.
33:02
So how can you plug this into your view? Well, on the left you can implement the same thing we saw before on LinkedIn. When you run a query, you get the first facet, which is global: a categorization by license across the entire population of gems
33:21
that we have available. Then, based on your query, you get a breakdown of the licenses, in this case for Twitter. And if you click on one, you refine your search by narrowing down to that category. Now, three other things you can explore once you have reached this stage of the project.
33:42
The first one is implementing a did-you-mean capability for things that, like my badge, are misspelled. If you misspell something, it can tell you: hey, you typed in Bonsammar, you probably meant Bonmassar. And it's a simple API: when you execute the search,
34:00
you also ask for suggestions, and it's gonna give you the frequency and probability of, yeah, you probably meant this other thing. Behind the scenes it's using Levenshtein distance to find matches, and you can specify several configuration options for what "similar" means to you,
34:21
because clearly you could say that anything is similar, or that only one or two misspelled letters count. Something else, bonsai cool, that you get out of the box from Elasticsearch is the "similar to this" that you find, for example, in Google, for when you find a result that you really, really like.
34:40
For example, you're building a website for searching apartments, and you finally find an apartment you really like, but unfortunately it's not available today. You can execute a new search asking: give me more results similar to this one. There is an API for that too; basically you say, okay, I really like this document,
35:01
give me something similar. You can specify what "similar" should look like, and again, Elasticsearch will compute the distance from that document to what it has in its database and give you other documents that are very similar. The last bonsai cool API I want to show is percolate.
35:21
Percolate is one of those APIs where, the first time you read about it, you don't understand what you can do with it. It's a reverse search: usually you search for a term and get back a list of documents that match that query. Percolate is the other way around: you register a list of queries, then give it a single document, and you get back which queries would match.
35:43
What can you do with that? For example, going back to the apartment-search product: as a user, you could have a query for apartments in Miami, because that's what you're looking for. What you can do is save that search,
36:01
and every time a new apartment shows up in the product, the product can check it against all the queries saved by users, and when the new apartment matches your query, notify you: hey, come back, there is an apartment that could be interesting for you.
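A hedged sketch of percolation through Tire; the method names follow Tire's documented percolator support, and the apartment fields are invented for illustration:

    index = Tire.index('apartments')

    # 1. Save the user's search as a named percolator query.
    index.register_percolator_query('user-42-miami') { |query| query.string 'city:miami' }

    # 2. When a new apartment arrives, ask which saved queries it matches.
    matches = index.percolate(city: 'miami', bedrooms: 2)
    # => e.g. ["user-42-miami"], so we know which users to notify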
36:22
Closing out, a couple of comments on deployment options; everything so far was more or less about development, but here are some considerations about deployment. Option number one is do-it-yourself. The pro is that you have total control over the installation,
36:42
you can have any topology you specify, and you can also inject Java code and extend the cluster. The con is that, in my experience, it's a nightmare, in particular with the early versions, which were very, very hard to run and manage.
37:01
Some of the lessons we learned doing that: first of all, there is something you have to be aware of when you're moving from a cluster of three nodes to more than three nodes. Up to three nodes, everything is fine, unicorns and rainbows; after three nodes, you have to specify a set of settings,
37:21
and if you forget them, you lose all your data. So be aware of that. The reason is that there is an election mechanism that automatically defines who is master and who is slave. While you are at three nodes or below, everything is fine; beyond that, unless you specify those parameters, which you can find in the documentation,
37:40
weird things can happen. You could end up with everyone being a master, and then everyone starts saying: delete the data, I'm the master, no, I'm the master. So be aware of that. The other consideration is about memory profiling. There are some operations in Elasticsearch, like facets,
38:00
that, unless you read the documentation carefully, load all the data into memory. So if you have enough data, you can run out of memory very quickly. You also have to tweak the garbage collector several times to say: please keep all this memory reserved for me, or the operating system will swap your Elasticsearch out.
38:22
An easy way, in case you just want to spend some money, is to go with a hosted service; there are a few companies offering Elasticsearch as a service. This is really beautiful because you just need a credit card: swipe it, and you have a cluster up and running in a minute.
38:41
You also buy support, which is very important when you're playing around with the API and don't understand why your query keeps running the cluster out of memory. The downside is that it's expensive. The second thing is that you could be in the wrong region: for example, in our case we run in US West,
39:01
but all these companies, and the others you can find, are in US East. You can find something for Rackspace too, but that's also tricky. And the other two considerations are that it's expensive, and that it's expensive, really expensive. So that's all I've got; here is the code, and there is also a dump of the data
39:22
so that, if you want, you can play with it. There is also a machine running all of this; please be nice to it, because it's a Linode micro and everything is running there. And that's it. I have 10 seconds left, so... if you have any questions.