
Build Real-time Analytic Applications: The Easy Way


Formal Metadata

Title
Build Real-time Analytic Applications: The Easy Way
Number of Parts
56
Author
Sergio Ferragut (Imply)
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Apache Druid is the open source analytics database that enables development of modern data-intensive applications of any size. It provides sub-second response times on streaming and historical data and can scale to deliver real-time analytics with data ingestion at any data flow rate, with lightning-fast queries at any concurrency. Sounds great, right? But any large distributed system can be difficult and time-consuming to deploy and monitor. Deployment requirements change significantly from use case to use case, from dev/test clusters on the laptop to hundreds of nodes in the cloud. Kubernetes has become the de facto standard for making these complicated systems much easier to deploy and operate. In this talk you will learn about Druid's microservice architecture and the benefits of deploying it on Kubernetes. We will walk you through the open source project's Helm chart design and how it can be used to deploy and manage clusters of any size with ease.
Transcript: English (auto-generated)
Hello, so we're going to talk about building real-time analytics applications the easy way. So this is about Apache Druid. Just first show of hands, anybody heard of Apache Druid?
Okay, we have a few, this is good. So first, who am I? I'm Sergio Ferragut, I work for Imply as a senior developer advocate for Apache Druid. This means I promote the Apache Druid project and I help people implement the Apache Druid
project in open source. Imply as a company contributes to the Apache Druid project and we have our own products around Apache Druid. So what we're going to talk about today is an overview of what Apache Druid is. We'll talk a little bit about what Kubernetes is and what Helm charts are because Apache
Druid is hard, as we'll see from the architecture, and it's hard to implement unless you use something to help you orchestrate it, and Kubernetes and Helm charts help. So we'll go into the Apache Druid open source project's Helm chart; the project has a Helm chart of its own. We'll go into an overview of the Helm chart, how you can use it to scale the cluster
up and down, how to use auto scaling for ingestion and the things that are missing from the Helm chart where we want help from other developers to get involved in the project and improve it. So what is Apache Druid? It's a database that is fully scalable both for ingestion and for query.
It deals with both batch and real-time data. It provides ad hoc statistical analysis across any dimensions that you put in the data, low-latency delivery of real-time events, and really fast response times to queries.
So what is a real-time analytics application? This is an example of one, and it has a few characteristics. First of all, very fast queries. That's sort of the key differentiator for Apache Druid.
It does this on fresh data, on streaming real-time data, as well as historic data, and that's its unique characteristic: it has both real-time analytics and historic analytics in the same query. Because it has these fast queries, it helps drive high concurrency
and it's meant to drive a highly interactive conversation with the data. So one thing I'd like to point out about these videos is this is an example of an application. The database behind this application is Apache Druid.
This is actually one of the products that we sell. It's called Imply Pivot, which is a visualization tool that allows you to navigate the data a lot like Tableau, if you're familiar with Tableau. But these videos are not accelerated; they're in real time. So you can see the interaction that it provides, where you can do filtering, sort by a different
dimension, split the data by a different dimension, do drill in, drill out, that kind of operations with that response time, with that speed. So how does it do this? So let's talk about the architecture of Druid.
So at a high level, Apache Druid is deployed in three different kinds of nodes: the query services, which provide access to the cluster for users, where you submit queries and ingestion tasks; the data services, which provide the main scalability
of the cluster, where it provides scale for ingestion and for query, and we'll talk about that in more detail; and the master services, which control the operation of the whole cluster. Additionally, it has a dependency on external storage, what we call deep storage, and
we'll talk a little bit about how that works. This could be any distributed file system like HDFS or object stores like S3. So let's talk about data ingestion, which is one of the main processes that Apache Druid provides. So in ingestion, there's a subset of these microservices that are involved, right?
There's the router on the query service side. There are the middle managers, which manage the ingestion process, and the overlord, which controls the overall ingestion process and tells the middle managers what to do. So how does this work? First it starts at the router, where you use REST APIs or the user interface that
it provides as a web application to submit an ingestion request. The router's job, as its name indicates, is to route the request to the appropriate microservice in the cluster. So in this case, if it's an ingestion, it routes it to the overlord.
The overlord plans the ingestion. This means splitting up the work among all the available middle managers that actually do the ingestion and submits those subtasks to the middle managers. The middle managers then do what they're told, right? They either connect to a subset of partitions in Kafka or Kinesis to do real-time ingestion
or connect to some source of batch files to ingest those batch files. And they produce segments: the little squares that we see moving from the middle managers to deep storage are a particular data structure that Druid uses, a columnar
data structure. So the middle managers ingest the data. If it's real-time, streaming data, they keep it in memory, and we'll go a little bit into why that's important when we get to the query side. And they build these segment files. These are immutable files; once built, each is a columnar structure that's
pre-indexed for all of the dimensions, and they push them into deep storage for long-term persistence. In the batch case, it does the same thing, except that it's not keeping the data in memory. These also scale, right? So if you have more throughput in your events, in your Kafka streams, you can just add more
middle managers. The overlord recognizes that there are more middle managers, and it uses the capacity to do faster ingestion or more throughput. So then the other main process that occurs is the data management process.
What happens here is the coordinator, another one of the master processes, recognizes all of the ingestion that has occurred. So it knows about all of the segment files, these immutable files that exist in deep storage for any given table, and it distributes them across historicals.
The historicals are used for querying the historical data. So it distributes them, creates copies of them across historicals, both for parallelism, so to execute queries in parallel, and for high availability. So you choose how many replicas you want. What's happening here is the queries don't access the data directly from deep storage.
The coordinator tells the historicals to preload the data into their local storage (the historicals have local storage) so that they can query it faster. They also cache segments in memory to provide the performance that Druid queries are known for.
So on the query side: we've loaded the historical segments into the historicals, and the real-time streaming data is in memory within the middle managers. So when a query comes in, again through the router, you submit queries either through the REST API, through a user interface, or through JDBC, and the query is submitted
to one of the brokers. So you can have many brokers. This also provides higher concurrency if you have lots of queries running, so it distributes the queries among the brokers. The broker plans the execution of the query, and it knows also which portion of the data is in real time, so which portion is in memory inside the middle managers, and which
portions of the data are in the historicals. So it submits individual subtasks to each of these middle managers or historicals to resolve the query. And so this is how it delivers on both the real-time portion, the data you're consuming right now,
and the portions of the data that have already been consumed and put into the historicals. So Kubernetes. Why Kubernetes? I'm assuming everybody knows Kubernetes; show of hands, who knows Kubernetes? Well, most,
okay. So Kubernetes orchestrates the operation of a cluster, or of applications and services in general. And deploying on Kubernetes makes managing Druid much easier, and it provides other benefits, which we'll go through in a minute. It provides the acquisition and management of nodes: it acquires, or it has, a set
of resources, different node types with different CPU and memory characteristics, so it knows what resources it has available, it accepts object requests, this is, you know, the deployment of an application, the deployment of a service, the deployment of stateful sets or non-stateful
sets or deployments in general, and it instantiates, based on what you request, it instantiates containers to deliver on those services or to deliver the application functionality in general. It also monitors these containers, so it provides high availability from that perspective.
So if there's a failure on any of the containers, it restarts it and manages the restartability and high availability of the application. These two examples are two different kinds of deployments, right? In the first one, we have a development system where we put all of the containers
in a single node, so it's not a highly available configuration, but you can decide whether to configure, you know, a test dev environment that's a small environment or a larger environment that has, you know, the replication of different processes, providing high availability and across multiple nodes, providing parallelism and more performance.
So why Apache Druid on Kubernetes? So I've mentioned a few of these things, it provides high availability and recovery, it's monitoring each of the containers and restarts them when necessary, it provides anti-affinity, this is the idea that you don't want two containers to run on the same
node if they're of the same type; you want them running on separate nodes and even across different availability zones, as an example, so that if you have a failure, only one thing is failing at a time, right? Or you can have larger replication factors and therefore tolerate losing more than one element at a time.
But anti-affinity gives you that capability of distributing a particular set of containers across different nodes. It also provides persistent storage for the historicals in particular, because like I mentioned, historicals load data to their local cache, and in normal implementations
outside of Kubernetes, what this means is that if you lose one of these nodes, you have to restart it, and if the segment data isn't already in local storage, you have to reload it from S3 or from HDFS. That takes time, so Kubernetes helps there
by giving it persistent storage: it restarts a container and brings it back along with its persistent storage, so the recovery of a historical is way faster. It also provides scalability; you can easily manage increasing or decreasing the scale of a deployment, of the ingestion portion or of the historical portion.
It provides auto-scaling for the ingestion specifically, and we'll talk about that a little more. It provides security, it provides encryption, it provides access control, ingress control and network isolation, and it automates the process of upgrading.
It allows you to do rolling upgrades such that during an upgrade process, the system stays available. Helm charts: everything we've talked about in Kubernetes is sometimes hard, right? Because you have to define all of the objects yourself and you have to define
the characteristics of the service, the characteristics of the deployments and so forth, and those are sometimes complex structures that you need to build. What Helm charts do is predefine those, predefine dependencies, in this case the Apache Druid Helm chart has dependencies on ZooKeeper and on a metadata database, which normally
is Postgres or MySQL, it can be others, so you can configure it to be other databases, but those are the ones that are included in the Helm chart. It defines templates for each of the services, for each of the ones that we've talked about, the historical, the broker, the middle manager, et cetera, these templates are the definition
of the Kubernetes objects that you need to deploy each of these microservices. And finally it provides a values.yaml, this is the list of parameters that control the deployment, and it comes with defaults, with default values. So when you're deploying a particular cluster, all you need to do is override those parameters
that you want to change in your own values file, and it will use all of the defaults plus the overrides that you've provided, right? So you can deploy ten historicals and six middle managers simply by overriding
the replica counts of those two elements; Helm charts make it really easy to deploy things. Your override file would look roughly like the sketch below.
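As an illustration only (the exact key names depend on the chart version; treat them as assumptions to verify against the chart's own values.yaml):

```yaml
# my-values.yaml -- hypothetical override file; key names may differ
# from the actual chart's values.yaml, so check before using.
historical:
  replicaCount: 10     # ten historicals for query parallelism
middleManager:
  replicaCount: 6      # six middle managers for ingestion capacity
```

Everything you don't override falls back to the chart defaults.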
So what I'm talking about next is what those objects are, what those templates look like inside the Helm chart. For the router, broker, overlord, and coordinator, it uses a simple Deployment. This is a stateless configuration of these services, which means it doesn't matter what the names of those containers are; they don't have state, so the storage doesn't
need to be persisted, so it just manages a set of these containers, monitors them, and restarts them. It provides an ingress, which in Druid is really only necessary for the router: all of the services have REST APIs that you can access directly, but in a normal
deployment you're only accessing through the router, so the router is the one you usually enable for ingress. And it provides the service definition, which gives you a logical network name, a logical hostname, that allows you to access the service across the cluster. An ingress override might look like the sketch below.
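Again purely as an illustration (the hostname and the exact key layout are assumptions, not the chart's documented schema):

```yaml
# my-values.yaml (continued) -- hypothetical router ingress override.
router:
  replicaCount: 2          # two routers for high availability
  ingress:
    enabled: true          # expose only the router to users
    hosts:
      - druid.example.com  # placeholder hostname
```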
The middle manager is a little more interesting, right? This is where the ingestion occurs and where the real-time queries occur. It's deployed as a stateful set, because it uses local storage for intermediate files: as it's ingesting in real time, it writes locally to do checkpointing as it's consuming the
data. It also has a pod disruption budget. This is what controls the rolling upgrade, and it tells Kubernetes how many nodes to upgrade at a time, so how many to bring down at a time and replace with a new version.
Usually that's set to one, but you can set it higher if you have higher replica counts for the data. It provides the service as well: the middle manager's ingestion capability as a logical service, a logical name. And in this case, the middle manager
comes with an optional horizontal pod autoscaler. This allows you, and we'll look at the details in a second, but it allows you to configure how small or how big the deployment of this service can get and when it should grow, what thresholds will trigger it to grow, what thresholds will trigger it to shrink.
The historicals are very similar. They also are stateful. In this case, it's particularly important, like I mentioned before, because we have the segment files that are stored locally on the historicals that would need to be reloaded if they're
not there when it restarts. Kubernetes here provides this very fast recovery by giving it the persistent volume to recover quickly. It also uses the same pod disruption budget as the middle managers to do the rolling upgrade,
provides a service, and in this case, it's not a good idea to do automatic autoscaling because in the historical case, it has this local cache of segment files that will be distributed automatically when you grow the platform. If you add more historicals, the coordinator recognizes that there are more historicals
and it redistributes the segments across whatever number of historicals you have. If you were automatically scaling up and down, that would cause additional work moving segments around that's not really necessary. Usually the historical scaling is done manually.
This is an example of what it takes to deploy using the Helm chart. It's actually fairly simple for a simple deployment, for a test/dev deployment. It starts from cloning the repository, the open source repository, and then it's really
just two commands: a Helm dependency update, which brings in the Helm charts for ZooKeeper and for Postgres, and then an install command that deploys the simple cluster, roughly as sketched below.
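A minimal sketch of those steps, assuming the chart still lives under helm/druid in the Apache Druid repository and that the release and namespace names are up to you:

```bash
# Clone the open source repository (the chart path is an assumption;
# check the repo layout for your Druid version).
git clone https://github.com/apache/druid.git
cd druid/helm/druid

# Pull in the dependent charts (ZooKeeper, Postgres).
helm dependency update

# Install a small test/dev cluster using the chart defaults.
helm install druid . --namespace druid --create-namespace
```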
In the simplest case, you can see here, it uses a single historical, a single middle manager, the Postgres, a single coordinator, so it's a test/dev deployment in this case. If you want to do more complex deployments, then you need to configure the values YAML, and I'll walk through a few of these. So the first thing you need to configure is the deep storage: what you're going to use
as the permanent repository for the segment files. Typically, this is S3, in some cases it's HDFS. It also has the option of doing it locally, which means on local disk, but that only works if you're in a test dev environment on your laptop or something of that nature where everything is using the same storage.
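As a rough sketch of an S3 deep storage override: the chart passes Druid runtime properties through a configVars-style block in its values, but the exact keys below are assumptions to verify against the chart you're using.

```yaml
# Hypothetical deep storage override; verify key names against the chart.
configVars:
  druid_extensions_loadList: '["druid-s3-extensions", "postgresql-metadata-storage"]'
  druid_storage_type: s3
  druid_storage_bucket: my-druid-segments   # placeholder bucket name
  druid_storage_baseKey: druid/segments
  # plus S3 credentials, e.g. druid_s3_accessKey / druid_s3_secretKey,
  # or IAM roles if you're on AWS
```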
You also have to define the metadata database that you're going to use: where it stores the definition of the tables, which segments have been loaded to each of these tables, and where they're distributed across the cluster. So in the Helm chart, like I said, you have Postgres or MySQL as the
options for this, but you can actually point this at whatever you want. You can point it at an RDS instance or at some other cloud-provided database, or at other database deployments within your Kubernetes cluster if you want. A sketch of that is below.
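A sketch of pointing the cluster at an external PostgreSQL metadata store instead of the bundled one (the keys are assumptions and the connection values are placeholders):

```yaml
# Hypothetical metadata-store override; verify against the chart's values.yaml.
postgresql:
  enabled: false                       # skip the bundled Postgres
configVars:
  druid_metadata_storage_type: postgresql
  druid_metadata_storage_connector_connectURI: jdbc:postgresql://db.example.com:5432/druid
  druid_metadata_storage_connector_user: druid
  druid_metadata_storage_connector_password: change-me   # use a Secret in practice
```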
The other important thing to do across all of these services is define the resources that you want each of them to have in terms of CPU and memory. This is a bit of a science to figure out, depending on the workload that you want to run; a rough example is sketched below.
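The numbers here are purely illustrative, since the right values depend entirely on your workload, and the key layout is an assumption to check against the chart:

```yaml
# Hypothetical resource requests/limits; size these for your own workload.
historical:
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
    limits:
      memory: 16Gi
middleManager:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      memory: 8Gi
```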
This is where you decide how much CPU you're allotting to each of the middle managers or each of the historicals, and a lot depends on the workload that you're trying to achieve. Higher ingestion throughput from Kafka will require more memory and more CPU in the middle managers; larger sets of data in the historicals will require
more storage and potentially more processing, faster disks or more threads in a particular historical. So you need to think about this, and a great resource for this is the link I provided there, Basic Cluster Tuning, which describes each of the services and how you
want to configure CPU and memory for each of them. In the router's case, at least in a high availability configuration, you'll want a replica count of at least two.
This is where you will want to enable the ingress, because this is really where all access from the users occurs. In the overlord and coordinator, you at least need a replica count of two for high availability, so those are important in real cluster configuration.
In the middle managers, again, you'll want a replica count of two at a minimum, and probably more than two for a real deployment. The way I think about this is: ingestion is what the middle managers do, and if you only have two, you're going to
lose 50% of your capacity if you lose one of them. So you'll probably want more, even if you assign less CPU and less memory to each one, just so that when you lose one, you're not losing a lot of ingestion capacity. You'll also want to configure some other parameters. I didn't go into all of the details here, but in the documentation you can see
which parameters control how much work each of these services is doing. There's a set of parameters like the druid.indexer.runner and druid.indexer.fork.property settings, which control the spawning of other JVMs within the middle
manager that actually do the work of ingesting from a particular set of partitions or for a particular set of files. So you want to size those as well. And all of those are parameters within the Helm chart. Another thing you'll want to do in the high availability case is provide task replicas.
So what happens here in the real-time ingestion is we're bringing in data directly from Kafka and consuming from Kafka. And if you just have one task dedicated to each set of partitions and that one fails, you're going to have to replay.
So this has two effects, because there's data in memory here. When you're querying that real-time data, you're querying a set of rows that you've consumed from Kafka, and if there's a failure, it's going to restart. So if there are queries coming in on that real-time portion and you're doing counts, for example, or sums or whatever, you'll see that the sum is growing over time from
the in-memory portion. And if it fails and it restarts, suddenly the counts go down, the sums go down. So you want to avoid that. And Druid provides a way of avoiding that by defining your ingestion process with replica tasks. This means that you're doing double the work, right?
So you're ingesting from the same set of partitions on two different middle managers. And so the red ones are the same set of data. The green ones there are also the same set of data. And what happens is as soon as one of them completes the timeframe that it's consuming from, the other one will be discarded.
But if there's a failure in between, the other one keeps going and keeps responding to queries. So this is how you really provide high availability of the real-time ingestion; in the ingestion spec it looks roughly like the sketch below.
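This is set in the Kafka supervisor spec you submit to the overlord, not in the Helm values. A minimal sketch, shown as YAML for readability (Druid accepts the JSON equivalent); the topic, datasource, and broker names are placeholders:

```yaml
# Hypothetical Kafka supervisor fragment; submit the full spec (as JSON)
# to the overlord, typically via the router.
type: kafka
spec:
  dataSchema:
    dataSource: events              # placeholder datasource name
  ioConfig:
    topic: events                   # placeholder Kafka topic
    consumerProperties:
      bootstrap.servers: kafka:9092 # placeholder broker list
    taskCount: 4                    # tasks to split the partitions across
    replicas: 2                     # two copies of each task, for HA
    taskDuration: PT1H
```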
Also, if you want to scale ingestion, you can scale it manually: by increasing the number of middle managers, you automatically scale the capacity of ingestion and can therefore deal with larger throughput in the stream. This can also be done automatically by using the auto-scaling definition inside the Helm chart. When you enable it, you define the minimum size and the maximum size, so the minimum and maximum number of middle managers,
and which metrics, which memory or CPU consumption metrics, and the thresholds that will trigger the growth or shrinkage of replicas. So Kubernetes will manage it automatically. It'll increase the number of containers as the workload grows
or decrease it as the workload subsides. Enabling it might look like the sketch below.
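A hedged illustration of the middle manager autoscaler block; the chart exposes a horizontal pod autoscaler for the middle managers, but the exact key names and metric syntax here are assumptions to check against the chart version you deploy:

```yaml
# Hypothetical middle manager autoscaling override.
middleManager:
  autoscaling:
    enabled: true
    minReplicas: 2          # never fewer than two, for availability
    maxReplicas: 8          # cap on ingestion capacity
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 60   # scale out above 60% CPU
```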
In the historical case, again, we want a minimum of two. Same kind of problem around parallelism, right? You usually want more than two, because if you lose one, you'll have 50% of the capacity, same thing as in the middle managers. So you'll want to think about how much capacity you want to lose in the case of a failure. And even though Kubernetes will bring it back fast, there will be a period of time where you have a loss of capacity, so you want to minimize that. This is where you also want to define anti-affinity, because you don't want two historicals running on the same node,
and you'll want to define node selectors. These describe the characteristics of the physical nodes, or virtual nodes, the CPU and memory configurations, that you want the historicals to run on. So it'll select a particular kind of node
among the nodes that are available to Kubernetes. And you definitely want to do that both for the historicals and the middle managers; a sketch is below.
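A hedged sketch of what those placement overrides might look like for the historicals; the label names are placeholders, and the exact structure should be checked against the chart's values.yaml:

```yaml
# Hypothetical historical placement overrides.
historical:
  nodeSelector:
    node-pool: druid-data          # placeholder label for big-memory/SSD nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: druid           # assumed labels; match your chart's labels
              component: historical
          topologyKey: kubernetes.io/hostname   # at most one historical per node
```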
For the broker, I don't know why I said three; you need at least two for high availability. Anti-affinity is also important here, as is the node selector, because the brokers are actually involved in the queries. So you want them to have significant resources if you have more concurrency. This is where you control the concurrency capability, and parts of the queries are actually resolved in the broker as it receives the
intermediate subsets of results from middle managers and historicals. It has to put them together and do final aggregations, so the brokers require some significant power as well. So summarizing, I think I went really fast. What's my time so far? I'm almost done. All right, good.
So summarizing, Apache Druid is a real-time OLAP database. So it provides, you know, slicing and dicing with really fast response times. Kubernetes makes this better because it provides higher availability for a deployment than you would otherwise, which gives us better recovery times,
better recovery points in the case of failures. It also provides the auto-scaling of the ingestion capability and the real-time query capability. So it'll increase and decrease, like we said, based on the workload that you're throwing at it. The Helm install makes it really easy to deploy new clusters.
What we see is people defining different sets of values that are the overrides to the normal parameters or default parameters. So you have things like, you know, a minimum cluster for development, a QA cluster, and you have those predefined so you can deploy as many of these clusters
as you want fairly easily by just doing a Helm install command. And finally, by using Helm diff and Helm upgrade commands, it allows you to change the configuration live, so you can manually increase the resources allotted to each of the containers or the number of containers that you're using.
And you can do that live by using the Helm upgrade capabilities, as well as upgrading the software itself by changing a tag that defines the version of the software. It will automatically go through the rolling upgrade process. So what doesn't it do yet? So Druid has other capabilities
that are not integrated into the Helm chart yet. And this is where we want more people involved in the community. This is a great conference to recruit developers. So I'm hoping that we get some more committers to the project. And one of the things that's missing is the metrics configuration. This is the ability that Druid has to emit metrics
about resource utilization across the board: how much CPU and memory it's using for each query, what the overall resource utilization is across the cluster, so that you can understand what's going on in the cluster and optimize it, increase the resources where you need to, and so forth. The kind of configuration involved is sketched below.
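The underlying Druid runtime properties for this already exist; what's missing is templating them in the chart. A hedged sketch of the kind of override one might contribute (the property names are standard Druid settings, but the configVars mapping and the endpoint are assumptions):

```yaml
# Hypothetical metrics-emitter override; not part of the chart today.
configVars:
  druid_monitoring_monitors: '["org.apache.druid.java.util.metrics.JvmMonitor"]'
  druid_emitter: http
  druid_emitter_http_recipientBaseUrl: http://metrics-collector.example.com/druid  # placeholder endpoint
```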
So these metrics are interesting to have, but they're not part of the Helm chart yet. So adding that to the configuration would be useful. Also the multi-tiering configuration. So historicals can be configured in sets of historicals. They don't have to be like one cluster. You can have like multiple sub-clusters
as you could call them, of historicals that deal with different portions of the data. So the real-time data goes into the middle managers and then into historicals, but you could have things like one week of data on really fast hardware, fast CPUs, lots of memory, SSDs for the local storage,
which will make the queries fast on the one week of latest data, and then have another set of historicals deal with two years of history or whatever goes beyond that. You can define the rules that say one week of data goes in the fast tier and the rest of it in the slow tier,
and Druid will automatically move segments from the fast tier to the slower tier. This is missing from the Helm chart, so this is something that we want to add; the rules themselves look roughly like the sketch below.
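For context, the retention/load rules are posted to the coordinator per datasource. A minimal sketch, shown as YAML for readability (Druid takes JSON); the tier name "hot" is a placeholder that the fast historicals would join via their druid.server.tier setting:

```yaml
# Hypothetical load rules: the last week on a "hot" tier, older data on the default tier.
- type: loadByPeriod
  period: P1W                    # most recent week of data
  tieredReplicants:
    hot: 2                       # two replicas on the fast tier
- type: loadForever
  tieredReplicants:
    _default_tier: 2             # everything else on the slower tier
```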
And finally, like I said, you're invited. You can fork the repo; just go to GitHub to the Apache Druid URL that I have there, make changes, and submit your PRs. And we're not just looking for developers, for people to actually commit changes to the software. We're also looking for users, users who report bugs, people who have requests for new enhancements, or who improve the documentation. We're always looking to make the documentation better,
so it's all about that, right? It's not just the actual development of the software. So I'll just leave you with a bunch of ways of contacting us or interacting with us. We have a Slack channel. We have the Druid User Forum, druidforum.org,
where it's a Q&A environment. It has blog posts. But one that I'd like to point out particularly is our training program, learn.imply.io. If you're interested in learning more about Druid and getting your hands on it, these are hands-on courses that are free right now. So if you go to that website, you'll have a basics course, the Druid Basics,
which covers the basics of configuration, walks you through it and how to use it, and then a more advanced course about modeling and ingestion that's also available there. So anyway, with that, I'll open it up for questions. Thank you. Yes, thank you.
So we are right on time. We have seven minutes for the questions, so that's super. Who starts? Hi, thank you for the talk. I just have two quick questions. What does the query language look like? Is it just standard SQL or?
It is standard SQL. I won't say it covers all of SQL, but yes, it's SQL. All right, and is there a plugin for Grafana? I'm sorry. Is there a plugin for Grafana to support Apache Druid? So one of the really interesting things about this project, it has core extensions that are part of the Druid project
and also community extensions. And yes, there are extensions for Grafana that have been built. The metrics emitters aren't in the Helm chart yet, but there are extensions where you can emit to things like InfluxDB and hook up Grafana to it, for sure. All right, thanks. Yes. Someone else?
Hi, thank you for the talk. I have a couple of questions too. So one, around the coordinator: you said that it takes care of rebalancing the segments across the historical set you have. Do you also support some kind of heat-based rebalancing, or is it primarily based on the segment sizes?
I'm sorry, once more? Do you also support heat-based rebalancing, based upon the heat of segments and the kind of query workload, or is it primarily based on the segment sizes, I suppose? You called it heat, right? Yeah, heat. So, data temperature? Yeah, right. Okay, so that's where multi-tier configurations come in, because what we see is typically
the temperature of analytics is related to the recency of the data. So you can define load rules as part of the historical definitions and have multiple tiers, which allow you to have faster CPUs, more memory, maybe SSDs in one tier,
so you can handle the hotter data on the more expensive hardware and the colder data on less expensive hardware. But typical implementations are really driven by SLAs, right? You're looking at how fast you want X queries to respond, whether that's the hot queries on the more recent data
versus how fast you want it to respond in the more historic data. So it's usually related to that. You define your SLAs and you define how much money you wanna put into each of the tiers, right? To make it cost efficient. Okay, and within the tier it's mostly a size-based equal distribution, I guess.
It's completely configurable, right? You decide how much storage you allot to each of the tiers, and what Druid does is distribute the segments that are available across those tiers. So you have to have enough space for the amount of data that you're dealing with
in real time, or how much data you're dealing with in a week, and how much data you have in two years or however much history you have in the slower tier. Okay, thank you. The second question I had is around the brokers, because scaling brokers can typically be a challenge, since they're doing the heavy lifting of aggregating all the query results that they're getting back.
So do you see, do you have any recommendation on how to scale up and then can it be a single point of failure for the entire service? Like what is the recommendation to do this? So brokers are an important part of the query path, right? So yes, there's scaling up a broker, giving each broker more CPU, more memory
to deal with the types of queries that you're driving through it. It's not really a single point of failure, because there are other brokers that will take over, but a given query will fail, right? If the query that was assigned to one broker
is halfway through when that broker fails, the query will need to be resubmitted in order to grab another broker. All right. All right, thank you. Do we have another question?
Hey, thanks for your talk, it was really good. Thank you. Can you talk about some exciting application, like ways people are using Druid that we might get inspiration from? That might be a longer conversation but yeah, I'll throw a few examples. So Druid was born in the ad tech space, right?
Druid was actually developed within a different company originally, a company called Metamarkets. So ad tech is one of the use cases, right? And it's one of the important use cases because it requires you to understand what's happening right now with your campaign, with your ads, and compare that to the history, right? So any use case where looking at what's happening
in real time and doing comparisons to history is a good use case. And we see use cases in fraud detection and in the finance space. We see Netflix is a user. They monitor their streaming service. So they have events like buffering events.
They measure everything about the user experience, right? And they optimize and constantly react to lag, to buffering events or to other events that are providing a bad experience to their users. So it gives them the ability to react fast.
It's also used in network monitoring services. There's a bunch of use cases. We can have a longer conversation about that. In general, it's any use case that needs fast response times or can use fast response times and deals with both real time and historical data.
Yeah, that would be a really interesting question for our CTO. I don't have any particular ones that surprised me so far, but I'm still learning. We have the time for maybe one quick question.
I guess that's it. So really thank you for such a good presentation. Thank you.