
Data exploration and analytics with elasticsearch


Formal Metadata

Title: Data exploration and analytics with elasticsearch
Number of Parts: 163
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Aleksander will give an introduction to Elasticsearch and the many possibilities Elasticsearch offers in terms of search, data exploration and data aggregation. Aleksander will demonstrate how we can navigate structured and unstructured data for search as well as aggregating and visualizing data for analytical purposes. We will look at case studies beyond traditional full-text-search, and hopefully see that Elasticsearch can help us build so much more than just a search engine.
Transcript: English (auto-generated)
Welcome. Thanks for joining this talk on data exploration and analytics with Elasticsearch. It's not the best part of the day after the party yesterday, so I'm glad to see so many people here. My name is Aleksander Stensby. I work in a company called Monockel.
I've been working with search since 2004, mainly focused on data analytics and textual analysis. So I've got some experience with search.
When I was asked to do this talk, and I decided on the title for my talk, I had to Google the word exploration to see if I could find some images, and I found this one, which is an old Polish explorer who I can't even pronounce the name of.
He was exploring Africa on foot and by bike. And the second image I found was this one, the Curiosity rover taking a selfie on Mars. So I was thinking: these are two different worlds, it's the evolution of exploration, and it's the final frontier.
And I found it very fascinating when I saw this tweet on Twitter back in February. Matt Lenda, he's an engineer at NASA, a software developer, and he tweeted this: Elasticsearch is now powering our Mars Curiosity analytics platform.
Welcome to Mars, freaky fast. So traditionally, when you think about search engines, it's all about helping the users find information. And in the day and age we're living in, with vast amounts of data, this is becoming more and more important.
So everyone knows Google, everyone probably uses Google. And that's where it all starts. My talk is not going to focus on traditional use cases for search. I'm going to try to show some examples of how we can actually use the fundamentals of search and search engines
to explore data and do analysis on data, rather than just helping the user find data, which is obviously also very important. So traditionally, we have this Google search field, and it's really smart. It helps us find information in awesome ways.
And then you also have other companies that are really good at helping our users navigate the data. So here in Norway, we've got one example, which most of you probably recognize, which is from Finn, where we can do a search, and then we can drill down into the data, because we've got information overload.
So this is a really important tool for us as users. And similarly, you have the same thing on Amazon, where you're searching for books, and then you can actually filter and drill down into different departments, different categories.
And this is quite often called faceting. For those of you familiar with technologies like Solr, you're probably also familiar with faceting. How many here have worked with a search engine or a search technology?
Okay, good, good. And how many have heard about Elasticsearch? Excellent. How many used Elasticsearch? Beyond just the tutorial or... Okay, yeah, a few. Not too many. Okay, good. So I find this concept of helping our users navigate data fascinating,
but the fundamental next step is allowing our users to navigate data visually, rather than forcing the user to type things in and navigate through the data manually. We're actually giving them a visual representation of, in essence, the numbers that you see in your facets or categories.
You can present that on a map, on different charts, and you can actually allow your user to navigate interactively through these different visual components. I find that very fascinating. So we're going to look at how we can do some of these things with Elasticsearch,
and then we're also going to focus in on a special use case, which a lot of you have probably encountered. A lot of you have probably solved with traditional relational databases, and that is time series analysis. So we're going to look at how we can actually use a search engine to do these things freaky fast and elegantly.
So my agenda. Good thing a lot of the people in the room have heard about Elasticsearch, because I'm not going to give you a thorough introduction to Elasticsearch. I'm going to focus in on the concept called aggregations, which is facets on steroids.
But I'm going to give you a lightning fast introduction, and then we're going to look at time series analysis, how we can use Elasticsearch to do some of these tasks, and if time permits, I'm going to show you how we can bring that data to life through some visualizations.
Elasticsearch is an open source search engine. It's written in Java, and it's built on top of Lucene. How many people have heard about Lucene? Good. Lucene is what's doing all the magic for us. We can thank Lucene for most of the awesome features we're going to look at today. Elasticsearch adds the elasticity on top of that,
the distributed nature and the scalable nature that we get from Elasticsearch, plus a really awesome RESTful API. I'm going to call it Elasticsearch, but they've changed their name now, the company behind Elasticsearch, the commercial company. So they're now called Elastic, but I'm just going to say Elasticsearch.
So, in a nutshell, Elasticsearch consists of a cluster, which, again, consists of different nodes. So you can run a single cluster on your computer, consisting of one node, and each node consists of shards.
Now, the thing to remember here is that each shard in Elasticsearch is actually a fully working Lucene index, which, again, consists of segments, etc., etc., and lots of cool stuff that we're not going to go into today. Suffice it to say, this is what powers the search
and all the capabilities that we're going to look at. The smallest individual unit of data in Lucene, and thus in Elasticsearch, is a field. So documents consist of a number of fields, and in essence, they're a key-value map.
So each field can have corresponding values. As we will see in this talk, we use a JSON format to create queries, and the results that we get back from Elasticsearch are formatted as JSON documents.
Lucene doesn't have any concept of JSON at all, so it's actually mapped down to key-value pairs in the segments internally in Lucene. Okay, so I'm not going to spend a lot of time on the inner workings, but the two key things that we need to know and understand in Lucene,
and also, of course, in Elasticsearch, is the data structure called the inverted index. This is what helps us, especially when we're doing the aggregations, which we're going to go through today. This is what makes it lightning fast, and this is essentially what makes it different from B-trees
and index lookups in traditional relational databases. So in essence, this structure, there is one inverted index for every field. So we've got documents in an index, and the documents consist of multiple fields
with content. It can be numeric, it can be text, but each of those fields is then broken down into tokens or terms. Now here, represented as words, but it doesn't have to be words, and the reason why it's so fast is that this is a direct lookup table.
So if you're searching for the term blues, you've got one lookup, you immediately see the number of times this term or word occurs, and then what documents. And this is a very simplified version of an inverted index. In addition, you will have information such as the position of those terms
within each of those documents. So you can actually say that I want to have this word, and it should occur before this word or after that word. The other fundamental and critically important part of the search engine is the Boolean model.
The Boolean model is what we use to combine terms in our queries. So we do a single term lookup, we look it up in our inverted index, we do another term lookup in our inverted index, and then the Boolean model determines the rules for what document to retrieve.
So let's say we're searching for the term born, and we use the Boolean operator and, and another term that dictates what documents we retrieve. And this is set theory, so it's very simple to calculate for Elasticsearch because you're just breaking down your query into individual pieces,
and then you combine them with the Boolean model. So that's all the details I'm going to do on the inner workings, but keep this in mind because we're going to see how this is actually done in practice when we're going through the actual queries, etc. So let's get our hands dirty.
So this is the simplest query in Elasticsearch, the format of the most basic query: an empty JSON object. So this translates internally into this, which is called a match-all query.
And it does just what it says. It matches all documents. So there are no criteria at all. It's just returning all the documents in your index. So this is the basic structure of how we write queries in Elasticsearch. You have a query tag,
and then inside of that you can add any number of different queries. So another example of a query where it actually is a bit more meaningful, I can actually say I'm searching for the term monocle in the field content. Is everyone familiar with JSON in the room?
Yeah, good. So Elasticsearch provides us with a bunch of different queries. So this is from the query DSL. There are some changes going on here, and there are some new things popping up as well, but this is broadly the current version of the query DSL. And this is again, of course, provided by Lucene.
So Lucene provides the fundamentals for a lot of these things. Elasticsearch adds some magic on top of it, provides us with some new queries, but they all translate down to either the match query, which we were looking at here, or ultimately something called the bool query, which we'll have a look at in a second.
The other concept in the query DSL is something called filters. Now the key difference between a query and a filter is that queries care about the relevancy. So they calculate relevancy for the documents that your query matches
to provide the user with the most relevant results. So the rule of thumb is, if you need relevancy in your search, so have some way of ranking your results in accordance to what your user is asking for, you need to use queries. But if that doesn't matter,
which is the case when we're doing aggregations, which we'll look at, we don't need to worry about relevancy. So we use something called filters, and filters have the benefit of being cached, or they can be cached in memory, which makes it significantly faster than queries, because we can avoid the whole step of calculating the relevancy.
So keep that in mind. We also have a number of different filters available in the query DSL as well. And a lot of these have a counterparty query. So if you can solve it with a query, you can quite often solve it with a filter as well.
And I recommend always trying to use filters instead of queries. So today, in the current version of Elasticsearch, we use quite frequently the filtered query to combine different queries with different filters. So the structure here is you have a query, and the type of the query is filtered,
and it consists of two elements, the query element and the filter element. And inside here, I've just one example of a match query and a term filter. So if you want to combine more than one query or more than one filter, we've got the amazing bool query, or boolean query,
and the equivalent bool filter. This allows us to combine a number of must clauses, should clauses, or exclusions through the must-not clause. And they all take arrays. Now, if you only specify one, you specify them as an object.
If you have more than one, you specify it as an array. Fairly straightforward once you get the hang of the syntax here. Now, moving forward, just worth pointing out, is that in 2.0, which is the current master, and they're working towards 2.0,
I don't know when that's coming out, but in 2.0, they're introducing an element to the bool query called filter, and the filtered query is being deprecated, so we're not going to use that anymore. So just keep that in mind. Now, that's all I'm going to say about queries and filters.
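Putting these pieces together, a filtered query wrapping a match query and a bool filter might be sketched like this in the 1.x DSL being discussed (the field names country and language are illustrative assumptions):

```json
{
  "query": {
    "filtered": {
      "query": {
        "match": { "content": "monocle" }
      },
      "filter": {
        "bool": {
          "must": { "term": { "country": "Norway" } },
          "must_not": [
            { "term": { "language": "de" } },
            { "term": { "language": "fr" } }
          ]
        }
      }
    }
  }
}
```

In 2.0, as noted, the filtered query goes away and the filter clause moves directly onto the bool query.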
It's obviously a very important part of Elasticsearch, and also in terms of analysis, because we're going to apply filters when we want to drill down into the data, we're doing that by applying a filter. But given the limited time, I'm going to focus on aggregations, which is the really, really cool part.
So aggregations, for those of you familiar with SQL, are roughly equivalent to the concept of grouping functions in SQL. That's, in Elasticsearch, referred to as buckets. A very nice name for that.
So it's buckets, and then we've got different metric functions, as we do in SQL with sum, average, count. We have metrics in Elasticsearch as well. So the definition of an aggregation in Elasticsearch is, in essence, the combination of buckets and, optionally, metrics.
So this brings us back to our facets, or categories, when we're looking at data. This is all dealt with through aggregations. So we have a category, or book category, bucket,
or number of buckets, and each bucket would be science fiction, science fiction and fantasy, et cetera, et cetera. Now, how do we create an aggregation? The syntax is, I don't think, that intimidating. So you use the keyword aggregations, or, since that's a very difficult word to spell,
we have a short version of it, which is just aggs, so that's perfectly valid to use in your query. And then you have to give it a name. With queries, we don't name the queries, but with aggregations, you have to actually specify the name for your group. So I'm just giving it the name here, speakers.
It could be anything. And then you specify the type of aggregation. In this example, we're using a bucket aggregation called terms aggregation. And then I'm providing that aggregation with the parameters, which is the field that you want to run your aggregation over.
So essentially, I'm grouping by speaker. So the speaker in this example is a speaker name. So what does this return? Well, it returns the list of all unique speaker names. And the number of documents for each of those.
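A sketch of that terms aggregation (the field name speaker follows the speaker's example; size: 0 simply suppresses the document hits, since here we only care about the buckets):

```json
{
  "size": 0,
  "aggs": {
    "speakers": {
      "terms": { "field": "speaker" }
    }
  }
}
```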
But that's fairly straightforward. And we could do that earlier with faceting as well, in Solr and in Elasticsearch earlier as well. But the beauty and the real power of aggregations is that we can actually nest or add sub-aggregations, or operate with sibling aggregations.
And this is where it gets really interesting. So what am I doing here? I'm actually building on my previous bucket aggregation, although I changed it to beer instead. So I've got a data set of beer types. I'm running a bucket aggregation on beer types.
And then for each of those buckets, this syntax is saying, I'm going to add a sub-aggregation. So essentially run these buckets. Give me all of the different types of beer. For each of the types of beer, run a new aggregation on that result set. So I'm here using another type of aggregation called a metric aggregation.
In this case, it's an average aggregation. So in essence, what I'm doing here is that I'm running an average on the IBU field in the data set for each beer type. So I'm going to get a table of beer types, and then the calculated average for each of them.
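The nested bucket-plus-metric aggregation described here might be sketched like this (the field names type and ibu are assumptions about the beer data set):

```json
{
  "aggs": {
    "beer_types": {
      "terms": { "field": "type" },
      "aggs": {
        "avg_ibu": {
          "avg": { "field": "ibu" }
        }
      }
    }
  }
}
```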
So again, like with the query DSL and the filters, we've got a bunch of aggregations as well. And this is our tool set for data analysis. So we've talked a little bit about average, for instance. So on the left-hand side here, I've just listed various types of metric aggregations.
And then we've got other aggregations that are bucket aggregations. So we're not going to go through all of these. There's a lot of them in here. But the core ones, as we will focus on, is the terms aggregation, which is probably the most important bucket aggregation we have,
in conjunction with the range aggregation and the histogram aggregation. And then for convenience, we've got some specific histogram aggregations, like the date histogram aggregation, which is our most important tool when we do time series analysis.
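For comparison, a range aggregation lets you define the bucket boundaries yourself rather than deriving them from the data; a sketch over a hypothetical ibu field:

```json
{
  "aggs": {
    "ibu_ranges": {
      "range": {
        "field": "ibu",
        "ranges": [
          { "to": 20 },
          { "from": 20, "to": 60 },
          { "from": 60 }
        ]
      }
    }
  }
}
```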
So, time series analysis. I'm just going to take you through a number of examples and use cases on how we can solve different problems with Elasticsearch. We're going to look at a data set which consists of online news and blog data.
We're not going to worry too much about the data, but it's a data set consisting of news articles and blog articles from the mobile industry. So we're going to see some examples around mobile phones, etc. We're going to try to do some time series analysis on that data. We're going to look at a concept called peak identification.
And then we're going to look at term clouds and see a pretty neat feature in Elasticsearch that can be used for this purpose. Okay, so the format of the data is on the screen here. So we've got the content field, which is the actual news article or blog post. We've got the author.
We've got the time stamp and the date, obviously. And then we've got a bunch of different meta information about this source that we're collecting the data from. So we've got, for a lot of the data, we've got the language, we've got the country, etc.
So how do we do time series analysis in Elasticsearch? Or, not necessarily time series analysis, but how do we actually visualize and represent our data as a time series? So the date histogram helps us with that. As you recall, the format of an aggregation is that you specify it with the aggs keyword
and then you actually have to give it a name. So I've given it the name over time. And then I've said that this should be a date histogram. It should be run against a field called date. And then I'm specifying the interval, or the granularity, of my histogram.
So this could be monthly, it could be yearly. In this case, I'm doing it daily. So I'm getting one bucket per day. That's what I'm doing here. And then for convenience, you can also specify a format for those front-end developers that don't understand time stamps. They quite often want to have this neatly formatted string.
So we can provide that as well. And this is the result set. So we actually get one bucket per day. We've got our neat little key as string. And we've got the count, number of documents falling into each of the buckets.
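The date histogram request and a trimmed response might look like this (the doc_count values are purely illustrative):

```json
{
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "day",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
```

with a response along the lines of:

```json
{
  "aggregations": {
    "over_time": {
      "buckets": [
        { "key_as_string": "2015-05-01", "key": 1430438400000, "doc_count": 42 },
        { "key_as_string": "2015-05-02", "key": 1430524800000, "doc_count": 57 }
      ]
    }
  }
}
```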
Now, what can we do with that? Well, that's our starting point for a trend line, a bar chart or a line chart showing the results that we've calculated. And of course, the aggregation here can be combined with a query or a filter.
So I could say that I want to look at this chart, but I want to drill in on a specific country. So I apply a filter that says country should be Norway. And then I get a new aggregation result based on that query. So the aggregation is executed in the context of your query and or filter.
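Combining the two, drilling down to one country while keeping the daily histogram might be sketched as (the country field name follows the speaker's example):

```json
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "country": "Norway" }
      }
    }
  },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "date", "interval": "day" }
    }
  }
}
```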
So, countries. We can use the other really important aggregation that I was talking about, the terms aggregation. So this one, can anyone guess what output I get from something like this?
No. As Greg Young said yesterday, we're in Norway, so I don't expect anyone to ask any questions or answer. So this gives me a list of all the different countries in our data set and a count of the news articles for each country, independent of time. It's the total count per country.
So the result set looks like this. We've got the key, which is the country name. We've got the document count. And they're ordered so the biggest buckets come first. We can specify that as well. Yes, you're jumping ahead.
Yes, I'm coming to that now. Well spotted. So I've said the field should be country.not_analyzed. The reason for that is, if I'd done this on the analyzed field, it would do a lookup in the inverted index.
So in essence, as some of you may have noticed, this result here looks very much like our inverted index. It's got the different terms and then it's got the term frequency or number of times this term occurs. Or in this case, the number of documents it occurs in. So what I would get with aggregations
is that it will actually count all terms for each field. And if the field is analyzed, which we haven't even talked about, but strings in Elasticsearch, or in Lucene, get tokenized, meaning that a sentence, for instance, gets split into smaller pieces.
So quite often you end up with the different words of a sentence. So in your inverted index, you would actually have the individual terms. So if I did this query here on the analyzed country field, I would actually get back a result that gave me one count for united and another count, well, in this case, the same count, for states.
But it could be that united got a much bigger number, because you've got the United Kingdom and the United States; they would both be counted for united. So that's a very important part of aggregations. So when we do aggregations on textual fields,
the rule of thumb is to use the not analyzed version of a field, which is something you have to specify in your mapping. So we're not going to go into the details of that, but what we're essentially saying is that we've created a subfield of country which does not go through the tokenization and normalization process in Lucene,
or in Elasticsearch. So I could run that as an example. So I could run the country example here.
So this is the one where we were just... Oh, sorry, my bad, PowerPoint. Here we go. Okay, so this is the aggregation that we were just looking at on the slide.
So I've got country.not_analyzed, and then I get the results. We've got the United States. If I remove the not_analyzed, I'm actually going to do it on the tokenized version of the field, and my results would not make as much sense. That's clearly not what we want. We have the same issue if we've got, for instance, an author field,
as we do in our dataset, where we've actually stored the first name and last name. So if we did the aggregation on the name field, we'd actually get the count for the first name and then the count for the last name, and if there are more than one person with the same first name or last name, it would be counted multiple times. So that's a good rule of thumb.
Use the not analyzed field, which makes much more sense. Okay, going back here. Now, so we've looked at the terms aggregation and we've looked at the date histogram aggregation.
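As a hedged sketch of what that mapping and aggregation could look like in the pre-2.x string/not_analyzed style the talk describes (the index type and field names here are assumptions, not the speaker's actual demo code):

```python
# Multi-field mapping sketch: "country" is analyzed for full-text search,
# while the "country.raw" subfield skips tokenization and normalization,
# so "United States" stays one term and is safe to aggregate on.
mapping = {
    "mappings": {
        "article": {
            "properties": {
                "country": {
                    "type": "string",  # analyzed by default
                    "fields": {
                        "raw": {"type": "string", "index": "not_analyzed"}
                    }
                }
            }
        }
    }
}

# A terms aggregation then targets the raw subfield instead of "country".
country_agg = {
    "size": 0,
    "aggs": {"countries": {"terms": {"field": "country.raw"}}}
}
```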
Now, the real power comes into the ability of combining different types of aggregation. So what we're doing here is actually looking at our date histogram, and then for each date, we're going to actually run a new aggregation. So for each bucket, which is a day, we're actually going to run this country aggregation.
And the result of that is we've got our uppermost bucket, which is the date, and then we've got the document count for that date, and then we've got a new aggregation within that, which is called country. And what can we use this for?
Well, that's our fundamental stacked chart. So we can do a stacked bar chart or an area chart or a stream chart, which is really cool. And then you can actually filter down on these combinations. So if I click on one of these peaks in the chart, I could just apply a filter saying that it's this day and it's this country.
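A minimal sketch of that request body, assuming a date field called published and a not analyzed country.raw subfield (both names are assumptions about the demo dataset):

```python
# One bucket per day, and within each day a count per country.
# The nested "aggs" block is what makes country a sub-aggregation.
stacked_chart_body = {
    "size": 0,  # skip the hits; we only want the aggregation buckets
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "published", "interval": "day"},
            "aggs": {
                "country": {"terms": {"field": "country.raw"}}
            }
        }
    }
}
```

Each date bucket in the response then carries its own country sub-buckets, which maps directly onto a stacked bar or area chart.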
So we can add even further to that, to the extreme, we can actually say for each country for each day, I want to get the list of the top authors for each of those countries. So who are the most frequent journalists?
Or if you're looking at the comments to the blog post, we can see who are the most active users that are commenting about us. So I can just add another sub-aggregation on the authors field, and then, just to put it even further,
so I'm adding another aggregation below my author, which is a metric. So metrics can be applied as the leaf node of an aggregation. We can't add sub-aggregations to a metric, but we can add the metrics as leaf nodes.
So in this case, let's pretend we have a field called rating, which is the star rating of all the readers for these news articles. We can actually calculate the average rating for each of the articles per author, per country, per day.
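That full nesting, with the avg metric as the leaf, could be sketched like this (the published, rating, and .raw field names are assumptions; the speaker explicitly says rating is a pretend field):

```python
# Per day -> per country -> per author -> average rating.
# Metrics like "avg" sit at the leaves: they cannot hold sub-aggregations.
leaf_metric_body = {
    "size": 0,
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "published", "interval": "day"},
            "aggs": {
                "country": {
                    "terms": {"field": "country.raw"},
                    "aggs": {
                        "author": {
                            "terms": {"field": "author.raw"},
                            "aggs": {
                                "avg_rating": {"avg": {"field": "rating"}}
                            }
                        }
                    }
                }
            }
        }
    }
}
```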
So you can see who's the most popular authors, not just the ones that produce the most news articles. Quite often when you're looking at trend data, you want to see what's behind this peak. Why are there so many news articles this particular day about this particular topic?
So actually starting to do trend analysis. And we can also do that with Elasticsearch, because we've got the ability to, and in this case I've actually just taken the date histogram and I've added a new type of aggregation, which is called a filters aggregation. So rather than just getting a count for each author,
I'm actually looking in the text field, the content field, and I'm actually doing a query, or in this case a filter, on a term. So I'm saying in this case I'm trying to produce a time series where there are two lines, one for HTC and one for Samsung. So I actually get a trend chart for these two brands.
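A sketch of that filters aggregation nested under the date histogram, one named bucket per brand (the terms and field names are assumptions):

```python
# Two named filter buckets per day: documents mentioning "htc" and
# documents mentioning "samsung" in the content field. Each named
# bucket becomes one line in the trend chart.
trend_lines_body = {
    "size": 0,
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "published", "interval": "day"},
            "aggs": {
                "brands": {
                    "filters": {
                        "filters": {
                            "htc": {"term": {"content": "htc"}},
                            "samsung": {"term": {"content": "samsung"}}
                        }
                    }
                }
            }
        }
    }
}
```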
Now to actually look at the content and the peaks, I can add another really cool feature in Elasticsearch called a top hits aggregator. Now the top hits actually goes into each of your buckets, where you've specified it, and then it pulls out the highest scoring results for that bucket.
So why is this useful and what can we use this for? Well actually we get our buckets, so I've got one HTC bucket here, there's 812 results, and then I'm just producing the top three news articles
based on the content and how they match my query for each of those peaks. And some of you may have seen, for instance, Google Trends, where you can do a search and then see the trend for that search, how many people have searched for this term. And then quite often you've got these annotated spots on the chart
where you can mouse over, and you actually see here's an indicating story for why there is a peak here. So like, for instance, the iPhone launched or something like that. So that's a crude approach to peak identification, but at least it's a very quick and efficient way of getting an idea of what's behind our data.
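A sketch of a top_hits aggregation pulling the three best-scoring articles per bucket (the title field and the query are assumptions about the demo dataset):

```python
# For each bucket this sits in, return the three highest-scoring hits,
# trimmed down to the title field only.
top_hits_body = {
    "size": 0,
    "query": {"match": {"content": "htc"}},
    "aggs": {
        "top_articles": {
            "top_hits": {"size": 3, "_source": {"include": ["title"]}}
        }
    }
}
```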
At some point you probably need to drill into the data and do manual analysis, etc., but this is a really good starting point. For text analytics and online search, we quite often hear about term clouds or word clouds, or buzz clouds. So you could think that the terms aggregation is a good starting point
for making a term cloud, because you're actually just counting the number of occurrences for each term. But that doesn't really work when you're doing it on, for instance, the content field. It's completely uninteresting to see all of these common words
in your corpus of data. So in Elasticsearch, we actually have a very neat feature, which is still a bit experimental, and it's called significant terms. And this is actually on the verge of a bit of black magic.
So what it actually does is try to identify the uncommonly common terms in your dataset. It's actually looking at the statistical abnormalities in your dataset relative to your query. So in this case, when we're looking at, for instance,
searching for the HTC One, which is an HTC phone, and I'm running the significant terms aggregation on the content field, rather than getting back "in" and "a" and "of", etc., I'm actually getting some surprisingly awesome results. For those of you who are familiar with the mobile industry
and this particular device, these are the, well, top four, five, six, seven, eight, nine features or terms from the articles about the HTC One. And why are these selected? Well, these are the terms that are more strongly connected to the documents matching my query, HTC One,
versus the entire corpus, which means that these aren't necessarily the terms that occur most frequently in the documents about HTC One, but they occur more frequently in the documents about HTC One than in the rest of the corpus.
So they are, in essence, unique or contributing factors to these documents. And, not surprisingly, HTC and One, the terms from my query, are the most frequent as well. Then we've got the S4, which is the number one competitor of the HTC One.
So in all the reviews of the HTC One, it's compared to the Samsung Galaxy S4. Then we've got UltraPixel, which is a unique feature that the HTC One came with, which is a branding thing for their camera functionality. Similarly, BoomSound. And then there was a mini version of HTC One as well. So you can actually see here, quite interestingly, unique terms for this result.
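The request behind that could be sketched as follows (assuming the content field from the talk's news dataset):

```python
# Query for the phrase, then ask for the statistically significant
# terms of the matching documents relative to the whole index.
sig_terms_body = {
    "size": 0,
    "query": {"match": {"content": "htc one"}},
    "aggs": {
        "keywords": {"significant_terms": {"field": "content"}}
    }
}
```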
And we can also use this when we want to do peak identification. So rather than showing the top three news articles, we can actually show a word cloud or a buzz cloud when we mouse over our peaks on the chart.
Now this feature has been used in some pretty interesting scenarios as well. For instance, for fraud detection, for credit card fraud detection, they've used this feature. It can be used to provide our users with a better search experience.
So rather than saying you're searching for the swine flu, you can actually propose to your user that if you're searching for the swine flu, you may also want to search for the very cryptic name of the vaccine, H5C3 or something like that. And it will actually determine that by using a significant terms aggregation on the results.
So it's not just about statistical analysis, but it's also about providing your users with a better search experience. Okay, that's the core part of the aggregations we're going to have time to go through now.
As I said, there are more aggregations there. So you should check it out. And to sum it up, it's all about this ability to combine and add sub-aggregations and nest further. So what I want to show you now is how we can actually bring the data to life through some visualizations.
Now Elasticsearch provides us with a really cool analytical tool or dashboard called Kibana. How many people have used Kibana? Good. It's an excellent tool. Lots of things you can do there.
A lot of the aggregations that we've looked at now today, you can use in Kibana and build and add sub-aggregations, etc. I'm not going to talk much about Kibana today. There's another talk later this afternoon, which I encourage you to go to, which will focus more on Kibana.
So what I wanted to do is just show you, beyond Kibana, you may want to use your aggregations and your data and present it to your user in your own applications. So there are obviously a lot of tools available for doing that, building dashboards, etc.
And it's very straightforward since we are dealing with JSON data. So we've got a very simple format to deal with when you're a front-end developer and you want to make use of libraries like D3.js, etc. We haven't even talked about the geo-capabilities of Elasticsearch, but there are also a range of geo-aggregations.
That is really fascinating, so I encourage you to look into that. We're not going to have the time to go through that today, I think. So, beyond Kibana, you may want to combine your own aggregations and provide them to your user in a real-time dashboard, or even as a search engine, for instance, and then visualize your data.
So what I'm going to talk you through now is just show you how we can use something like D3.js to provide our users with an interactive way of navigating data. We've just put together a very simple dashboard application using AngularJS and D3.
I'm just going to pull it up here, see if I can do full screen mode. I'm not a designer, so this doesn't look very good, but it's for the demo purposes here.
I've got my Google-style search bar, and I can type in anything in my news data, so I could search for HTC, and I get the results. The dashboard is built around these visual representations of data, so I'm actually showing the data in real-time
as the user clicks, filters, etc. Rather than having to search in the text box, we can actually navigate through the data through a number of different components. The most basic one is obviously the trend chart, which we built in the first aggregation we did.
That's a very simple date histogram on the date field, and it just gives me the total number of results in context of my query. So you saw that it changed when I searched for HTC. Now, I could then also, if I want to, drill in on one of the dates, and then you see my results also change
according to that date that I've selected. So, I've also done a number of very simple terms aggregations on different fields. So we've got the author field, we've got the country field, and we've got another field called data provider or source,
and then we've got the region. As you can see, there are some missing elements here because not all of the entries in our dataset contains that meta information. So we've got news articles that are actually tagged with the country of the source, etc., while online blogs don't have that information since they're global rather than country-specific.
So, we've got that sort of facet view that we saw on Amazon and Finn. Now, exactly the same output we can actually represent visually in a chart. It could be a pie chart. Right now, I've just chosen a bar chart.
Maybe not that easy to read, but it gives you a pretty good impression of what the big or biggest values are. Now again, I can obviously drill down and slice and dice my data through these charts. It's very simple. These are just plain old D3 charts, and I've just added an action on click,
which just takes the value and applies a filter. So if I say I want to have United States, I actually get the results for the country United States, which I've added as a filter. So that's cool. You provide your user the ability to drill into the data.
Then you may want to see, okay, so let's do that over time chart we did with the top authors. So who are the most contributing authors per day, for instance? So we've just added this little functionality here, which takes the single bucket aggregation on author,
combines it with a data histogram, and actually provides us with an over time chart. So I can actually look at that over time. Now let's get rid of this unattributed author, and then see who are the top authors over time. So that's pretty neat.
And then again, we can obviously drill into the different categories and filter further, or we can remove and then say let's just look at the countries over time and see who's actually producing the most content. Let's get rid of not defined. Then you actually get a very visual representation of your data with just a few clicks,
rather than running all of these big batch jobs in Hadoop or your SQL database or reporting services or something like that. So it's very easy to deal with. And the beauty of this dashboard here, and well, the beauty of having to deal with JSON data,
is essentially that the configuration is minimal as well. So I could just flick over here and have a look at the configuration. So I've got, that's my entire configuration here for the dashboard.
So I'm just saying what fields do I want to provide as, well, I've called it facets because the first version of this was built with facets. So what fields do I want to provide to the user visually in the dashboard? And if I do something like this, I got two.
So I got that dashboard. I want to add them back again. Let's say that was my starting point. I simply add those two back, and then I've got them back again.
And I can add more if I had more fields in my data. So that's pretty cool, huh? So just going back to the presentation.
Normally I talk way over time, so today I've actually managed to finish a bit early. So we've just had a very brief look at how we can visualize data as well. I'm not going into the code for the D3 directives, et cetera, but that's straightforward.
We've looked at news data. So that's just one example of what we can use aggregations and the capabilities of Elasticsearch for in terms of analytics. So hopefully you have a view now of the capabilities. You need to obviously go home and test it out yourselves.
But it's interesting, and I'm an advocate for making people think of a search engine as much more than just a search engine, something you use with your CMS to provide content search. It is so powerful and so fast because of the data structures that we have underneath the hood.
And I just want to encourage everyone to, next time you're being asked to produce a reporting module or some sort of statistical output, to consider stuffing your data into Elasticsearch and using that rather than the good old relational database. So there are a lot of use cases. If you Google around a bit and you can look at the Elastic website,
they have some use cases there as well. There are some interesting things that I've seen Elasticsearch being applied to, like, for instance, recommendations, prediction modeling, fraud detection, as I mentioned. And obviously log analysis is something that probably a lot of you have even done briefly yourselves with the ELK stack.
And there's so much more as well. So, you know, we haven't talked about the geo capabilities, which is really, really cool, really awesome, really fast. And again, it's allowing you to do geolocation data, search, filtering, and aggregation
with the benefits of our search engine. So we're doing string lookups, and they're doing a lot of cool stuff underneath the hood using geohashes, etc., which makes it really, really fast. So, you know, we can provide our users with an application like Yelp, or Gule Sider here in Norway,
where you actually have a map and you search for a coffee shop and then you get all of these dots on the map, coffee shops near you, which is one very, very simple search in Elasticsearch where you sort by the distance from a pinpoint that you provide. And you get the distance, and they're ranked in order of distance. And you can combine that in Elasticsearch with a function score, which actually says that
I want the distance to be weighted this much, but I also want the average price for their products to be rated higher. So let's say I want to have hotels near me, but they should be within 50 kilometers, but I actually want the ones that are cheaper to come up higher on the list.
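A hedged sketch of such a query using a function_score with gauss decay functions; the location and price fields, the coordinates, and the scales are all illustrative assumptions:

```python
# Hotels ranked so that being close to the given point and being cheap
# both boost the score; "scale" controls how quickly each boost decays.
hotel_query = {
    "query": {
        "function_score": {
            "query": {"match": {"category": "hotel"}},
            "functions": [
                {"gauss": {"location": {
                    "origin": {"lat": 59.91, "lon": 10.75},
                    "scale": "50km"
                }}},
                {"gauss": {"price": {"origin": 0, "scale": 100}}}
            ]
        }
    }
}
```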
A simple query in Elasticsearch. There's another cool one, which I'm just going to mention briefly, something called percolators. So, yeah, the coffee-making thing. Percolators is awesome. So, you know, this is a bit of a brain twister
because percolators is the concept of rather than storing a document, you're storing a query in a specific index or a special index that is called a percolation index, and then you actually take a document and then rather than send a query into your index and ask for what documents did I match, you're sending a document into your index and asking what queries did I match.
Now, that is a bit of a brain twister. Why would you use that? I've got one very concrete example. How many of you have written a cron job that runs every night and, you know, does things in batch and, you know, yeah, exactly. Well, one, hopefully a few more.
So this is one perfectly valid example of how you can use percolators. So there's a big company in Norway that had an alerting service for its users where they could save a search and get an alert. So, you know, which is quite a common feature.
You have the same thing if you go on Twitter and you open up the search site on Twitter, you quite often would notice that there, since you did your search, there are 15 new entries. So this works the same way. The alerting feature of this unnamed company was done as a batch job, so for each user, for each search that they've specified,
they ran a batch job during the night that actually looked for all the documents matching each of those queries and then sent out that email. Now, the problem was that they had so many users and so many searches that the batch job took something between 8 and 10 hours.
So real time was probably not very real time in terms of getting an alert. So they actually changed their whole logic into a percolator. So they stored all of their users' queries and then in real time, every time a new document was inserted into the index, it was also sent to the percolator which said, this document matches these 5 searches or these 50 searches.
So they just aggregated that and sent out the email immediately. So all of a sudden they could provide real time alerts in 10 seconds or even 5 seconds, depending. So I think that's a pretty cool feature to apply it to.
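A rough sketch of that 1.x-era percolator flow (the index name, document id, and fields are illustrative):

```python
# 1) A saved search is stored as a document whose body is a query.
#    It would be indexed under the reserved ".percolator" type, e.g.:
#    PUT /news/.percolator/alert-42
saved_search = {"query": {"match": {"content": "elasticsearch"}}}

# 2) Each new article is then percolated instead of queried:
#    GET /news/article/_percolate  with this body.
#    The response lists the ids of the saved searches the document
#    matched (e.g. "alert-42"), so the alert email can go out right away.
percolate_request = {
    "doc": {"title": "Intro", "content": "elasticsearch aggregations"}
}
```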
And then you can also build on that and use it for data enrichment, saying that we've just done this project where we've dealt with the satellite data from vessels, boats on the sea where you're collecting in real time their position and then you want to be able to say what region does this vessel live in or operate in.
So you can actually have a geo search that has a polygon that defines a region, and then every time you get a new point, you immediately know that it's in the Pacific or in the Mediterranean, et cetera,
without the incoming data point itself saying that it's in the Mediterranean. Lightning fast and a really cool way of enriching data in real time. And then lastly, the pipeline aggregation feature is something completely new, long awaited, and really powerful, that we will hopefully have in 2.0.
It's in the master branch of Elastic now, so you can check it out. And the concept of a pipeline aggregator is in essence that if we think back to my aggregation example of my buckets here, let's do this one.
Today in Elasticsearch, we don't have any ability to do anything with our aggregation results in our first request. So we have to issue another request if we want to do something. Let's say you want to calculate the average for each of your top 10 buckets, for instance.
Or you want to do something like moving averages, which is quite common to do when you do statistical analysis of data. So the pipeline aggregator actually allows you to specify an aggregation
that uses value output from your aggregations. So I can actually aggregate on my aggregations after the result is being calculated. So that is something for the future, but a really powerful feature. And I think that's where I will open up for questions.
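To make that concrete, a hedged sketch of what such a pipeline request could look like with the 2.0-era moving_avg aggregation (field names are assumptions); buckets_path points the pipeline aggregation at the sibling whose output it consumes:

```python
# Average rating per month, plus a moving average computed over those
# monthly averages after the histogram buckets have been built.
pipeline_body = {
    "size": 0,
    "aggs": {
        "per_month": {
            "date_histogram": {"field": "published", "interval": "month"},
            "aggs": {
                "avg_rating": {"avg": {"field": "rating"}},
                "rating_trend": {
                    "moving_avg": {"buckets_path": "avg_rating"}
                }
            }
        }
    }
}
```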
And since we're in Norway, obviously no one's going to ask questions. Yeah, one, good. Instead of significant terms, is it possible to use some kind of filtering to remove language noise? Yeah, absolutely. So you could run the terms aggregation that we looked at there, and you can actually analyze your field, like the content field that we were doing it on.
We can actually run it through, for instance, a stop word process, which means that you can specify a list of terms to remove common terms in the language and remove them. So then you would get rid of them. But still, this would still only be a crude counting of the top terms.
So unless you remove a lot of information, you would get some noisy results. And they wouldn't be specific to your actual query, which is exactly what significant terms gives you. Now, 10, 15 years ago, using stop words was very common.
It was a means to an end, because we had somewhat limited computing power, and text analysis was a pretty heavy-duty task. Today, that's not really an issue; we don't have that barrier when it comes to computing power. So today, the recommendation I would give is don't remove stop words,
but you could for a very specific use case like this, because stop words can quite often contain meaningful information. As an example, "not" is a stop word. So "not" is a pretty significant term, carrying the difference between "not bad" and "bad".
You lose that if you remove the stop words. Yeah, absolutely. So I actually did have some time, so I could have done that, but I wanted to actually talk more about numeric analysis, because that's really interesting,
because traditionally, you think of text or search engines as searching unstructured text. But equally, you can use this, or Elasticsearch, as a data store for numeric data as well, which we've done quite a lot, actually. And it's really powerful, because you get all the same benefits of the direct lookup in the inverted index.
You can use your sum, average, standard deviation functions, etc. There's a lot of statistical metrics available in Elasticsearch that we can use. So yeah, it's definitely possible to use all the same aggregations that we did here with numeric data.
Well, it depends on what you're going to use it for. I mean, there are dedicated NoSQL solutions that are focused purely on time series analysis. So you have got these time series databases. And of course, they're optimized for numeric data and time series data.
So I guess the answer is, if you've got a combination, or if you want to actually be able to filter and slice and dice, a lot of people use Excel. And Excel is a great tool. I mean, you can do so much with Excel, but there's a scaling issue, and there's the ability to do more than what Excel provides to you.
So I've seen a lot of cases that started with a pivot table in Excel. All of that can be solved with Elasticsearch, and that's purely numeric data. Two sets of large data, trying to find the best matches between both sides.
Will percolators help us in some way here? Could we store queries from one side, having millions of records? Millions of queries, yeah, absolutely.
Yeah, absolutely. Well, to put it this way, the unnamed company example that I gave, they have millions of queries that are running through a percolator. So it does definitely scale in that sense. And you can think of it, it is a regular index.
It's just that it's got a reserved name, then it's called a percolator index. But it is, in essence, a fully functional Lucene index, or an Elasticsearch index, which consists of many Lucene indexes. So it scales just like Elasticsearch scales. Yeah, it does. But different data sets and joining is probably one of the biggest challenges
when it comes to Elasticsearch or document-oriented databases in this sense, because you don't have a join feature. You do have features in Elasticsearch that allows you to do similar things like joining. So you've got something called a parent-child structure, and you've got something called a nested structure,
which allows you to actually have documents within documents. There are limits to what you can do with that, unfortunately, today. But yeah, the solution is quite often to restructure your data or actually duplicate your data. So you store it actually multiple times in different formats or different structures.
Okay, any other questions? Yeah? Analyzing comments from the site. And usually it comes also with like HTML markup. Is it possible to somehow totally strip it out? Yeah, so Elasticsearch, this is back to the analysis and normalization process
that I haven't talked about today, really. There is something called an HTML strip filter that you can actually apply to your data when you're storing it. It actually removes the markup with pretty good quality. But quite often you would have, you know, non-strict formatting, and that messes it up quite often.
So, you know, from my experience when in my previous company we did the social media analysis and we had to actually build our own stripper to, that's the wrong word. Yeah, sorry about that. We had to build our own algorithm to remove and cleanse the data before indexing it, yeah.
But there is an out-of-the-box solution that works pretty good, yeah. Okay, enough about strippers. Any other questions?
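That out-of-the-box solution is the html_strip character filter, which could be wired into a custom analyzer roughly like this (the analyzer name and the rest of the chain are assumptions):

```python
# Index settings sketch: strip HTML tags before tokenizing, then
# lowercase the tokens. The analyzer is applied to a field via
# that field's mapping.
comment_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "comment_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    }
}
```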
No, no, there's a very simple solution to that, which I can show you quickly, if I understood your question correctly.
So right now when I did my search here, I actually removed all the results and I just did the aggregations. But the results of the data we are looking at here, you see I get the entire document back, which is in Elasticsearch stored in something called the source field, which is a reserved field.
So that's the actual document, and all of the fields in the document. So actually you can specify and say that I only care about the subject, I don't care about all the other fields. So you can easily specify that through something called source, and I can do like that.
I actually just get that specific field back, all of them.
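What that looks like as a request body, sketched with the subject field from this demo (the query itself is illustrative):

```python
# Source filtering: return only the "subject" field of each hit
# instead of the whole stored document.
source_filter_body = {
    "_source": ["subject"],
    "query": {"match": {"content": "htc"}}
}
```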
So you actually just want the values. I think you need to restructure your data, but the solution that I can think of is actually just specifying the fields that you care about and disregard everything else.
This is much faster than actually returning the entire document, obviously, because of payloads, but also because of how it needs to retrieve them from the different shards in the cluster. Because it collects that before it sends it back to you. So there is an overhead there as well, which you can avoid by being very specific on what you return. Okay, my time is up. Thank you everyone. If you have any questions, just come over here.