Demystifying the Solr Operator
Formal Metadata
Number of Parts | 69
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/67321 (DOI)
Transcript: English (auto-generated)
00:08
Yeah, thank you for joining our discussion today. We hope it'll be a good one. And just to keep the introductions brief, Houston and I work together at Apple,
00:21
where we advise various teams on deploying and scaling Solr on Kubernetes. We're also both PMC members and try to actively contribute to the project. And Houston is actually the original creator of the Solr operator. So with that, Houston, could you give us a brief introduction
00:43
of what the operator is and the problems it is trying to solve? Yeah, of course. So just to give a brief overview, you can look at the link at the bottom to see one of my previous conference presentations, which is kind of the introduction to the Solr operator.
01:00
But just for a little introduction, for those who might not be as familiar: the Solr operator is basically, I mean, if you've heard of Kubernetes operators before, it is a management tool for managing Solr resources on Kubernetes. So this includes multiple types of resources, like the SolrCloud, the Solr Prometheus exporter,
01:23
and, for now, Solr backups. And basically, without going into any details, what it does is abstract away the need for the user to have a lot of Kubernetes knowledge. And so a lot of the knowledge that is kind of necessary for running Solr on Kubernetes is built into the operator,
01:41
and we're able to provide a very easy Solr kind of DSL in front of it, so that you can say, I want a Solr with these types of properties, give it to Kubernetes, and the Solr operator will go and manage that for you. Okay, great. So let's dive into the discussion.
02:03
And before we get too far into the weeds on the Solr operator, I think our listeners here would like to just understand: why might somebody want to run Solr on Kubernetes? I mean, that's a really good question. So obviously, I'm sure there's no one here that hasn't heard about their organization
02:22
or an organization that they know about that is making a massive switch to Kubernetes, be it on GKE, EKS, Azure, any of these types of platforms. And so I guess I'm not going to answer why someone would move their entire organization there, but for Solr at least, it's a pretty good selling point.
02:44
You want to be able to deploy Solr like you deploy everything else at your company, so that your SREs or the people who are actually managing these deployments don't need really special, really deep knowledge about how Solr works and how Solr deployments work.
03:01
And so if we're able to kind of standardize things, and Kubernetes is just a general way of standardizing deployments, if we're able to allow you to deploy Solr in the standardized way, it makes it a lot easier for you to transfer the knowledge to these people
03:21
on how to run these things. And so you'll be able to have one team running all of your things, which is kind of nice. And also it allows you to separate out your concerns. And so one of the reasons why people, and we'll get into this more later, but one of the reasons why people run massive SolrClouds
03:41
is that it's kind of messy to run all these little tiny SolrClouds independent of each other. And so on Kubernetes, we're able to scale up these small SolrClouds, not in terms of vertical scaling, but horizontal scaling. And so it's a lot easier to manage many SolrClouds.
04:03
And so you're able to kind of separate your concerns and have smaller, more independent SolrClouds that are safer because they're not impacted by, let's say, independent queries to two different collections on the same cloud. But I know you have a lot of experience in this, Tim.
04:20
And so why have you seen organizations moving in this direction? Yeah, yeah, for sure. And you neglected to mention being buzzword compliant. I think that's a big one. And then it's also good for the resume. But no, in all seriousness, the one that really resonates with me first off
04:40
is kind of this concept of eliminating snowflakes in the data center. I think we probably all throughout our careers have had at least one experience with a JAR patch not getting applied correctly, or one of the machines not getting the same configuration. And I think just with Solr being
05:01
a complex distributed system, Kubernetes actually helps with that because you know every pod is running the same exact Docker image with the same config. And that's kind of an important thing for small clusters. But as you scale these things up,
05:20
that becomes absolutely critical. And the other piece I like is that your developers and your testers and your UAT environments, they all run the same software, which is great. And then the other piece for me about running Solr on Kubernetes is it gives sort of a platform for us as Solr developers,
05:43
and I think we'll see this more in the talk, to actually build and implement distributed system kind of solutions on a platform built around best practices. So that's kind of the other piece: we, building Solr and supporting and running Solr, don't have to reinvent the wheel
06:01
on every kind of thing, right? Kubernetes has come up with a lot of these concepts, like ingresses and services and service discovery and all those sorts of things. So those are kind of the things that attracted me to Solr on Kubernetes. Okay, so switching gears a little bit,
06:22
let's assume people are at least interested in trying Solr on Kubernetes. Other than the Solr operator, you could just use a Helm chart. I actually wrote a Helm chart for Solr a long time ago and wrote like a 20-page blog about it.
06:40
But tell us a little bit, Houston, about why use the Solr operator over something like a Helm chart or some other type of config management thing? Yeah, I mean, that's a really good question. So if you look on Artifact Hub or any of those kind of public Helm chart repos, you'll see a couple that are popular and used for Solr. And in general, they're not bad solutions.
07:02
And so they're really good if you want to be able to customize your deployments more. For the Solr operator, it's kind of a set config. You have a CRD that has certain fields that are accepted. And if you want to make changes to that, it's pretty hard.
07:20
You have to go modify the Go program, make a new Docker image, modify your CRDs. At that point, you're not up to date with the official Apache release anymore, which makes it kind of hard to go back. And so if you want to just be able to add little fields that aren't supported by the Solr operator, it's a lot easier to do that in a custom Helm chart. And so you'll just take the Helm chart
07:41
that's available somewhere else, add the fields that you want, and then go and deploy. It's much more straightforward. However, what the Solr operator does buy you is safety. And so with that, we're talking about safe and efficient rolling restarts. So when you're using the Helm chart, you're, I would hope, running Solr through StatefulSets.
08:02
And so the default StatefulSet rolling restart logic, whenever you change a pod template, works well for a lot of applications. It does not work that well for Solr. In particular, there's a lot of things that you can get into, like the health check, the liveness check,
08:21
not taking into account that replicas are still recovering on a node. So you restart another one, which takes out a leader. And also, if you have, let's say, a large Solr cluster, you want to be able to restart independent nodes, like nodes that don't have replicas of the same shards on them, in parallel. And that's something that StatefulSets
08:40
definitely cannot do. And so when you're running the Solr operator, the Solr operator has the ability to take over that rolling restart logic and really use the business logic that we know about SolrCloud in order to make the safest and most efficient rolling restarts. It's also able to do other cool things because it has much more information
09:02
about the live Kubernetes cluster. It's living inside the Kubernetes cluster and it has watches on the API server. And so we're able to do really cool things, like if you're running Solr with ingresses on Kubernetes, you don't really want inter-node traffic
09:21
to go all the way out to your external DNS, to your ingress controller and back to the Solr node for every single request between Solr nodes. You want to be able to say, I know what the IP address of the service that fronts the Solr node is, I'm going to go inject into each of my Solr pods, in the /etc/hosts files,
09:42
that this ingress host maps to this service IP address. And so we can do complete node-to-node traffic, even though the individual nodes have ingresses fronting them. That's something that you very much cannot do in Helm, same with the SolrCloud.
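A minimal sketch of the host-mapping idea just described, written in Python purely for illustration (the operator itself is a Go service). It builds entries in the shape of the Kubernetes pod-spec field `hostAliases`, which adds lines to a pod's /etc/hosts. Whether the operator uses exactly this mechanism is an assumption here, and the host names and IP addresses are hypothetical.

```python
def host_aliases(service_ip_by_ingress_host):
    """Build pod-spec hostAliases entries mapping ingress hosts to service IPs.

    Hosts are grouped by IP so that each address appears once, in the shape
    Kubernetes expects: {"ip": ..., "hostnames": [...]}.
    """
    hosts_by_ip = {}
    for host, ip in service_ip_by_ingress_host.items():
        hosts_by_ip.setdefault(ip, []).append(host)
    return [
        {"ip": ip, "hostnames": sorted(hosts)}
        for ip, hosts in sorted(hosts_by_ip.items())
    ]


# Hypothetical ingress hosts and service IPs, for illustration only.
aliases = host_aliases({
    "example-solrcloud-0.solr.example.com": "10.96.0.11",
    "example-solrcloud-1.solr.example.com": "10.96.0.12",
})
```

With entries like these in every Solr pod, a request addressed to another node's ingress host resolves to the in-cluster service IP instead of leaving the cluster.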
10:00
And so there's kind of a give and take there: you get really cool features using the SolrCloud, but it's less flexible for your unique use case. There's another thing there where you might be thinking, well, the Solr operator is another thing to manage for myself. Helm is just a static thing, I run it and then it's there. So what happens if the Solr operator goes down
10:22
while I'm running my SolrClouds? That is actually not as big of a deal as you might think it is. So basically the Solr operator does do a lot of the same things that Helm does: it templates out the resources that it tells Kubernetes to run. These are the StatefulSets, the ConfigMaps, ingresses, services, all those.
10:40
And so if the Solr operator goes down and stops running, those things still exist in the Kubernetes cluster. And so your SolrCloud won't go down. It's just that, because the Solr operator is in charge of doing rolling restarts, if you want to change something about your SolrCloud, it won't actually update and restart until the Solr operator starts again.
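To make the rolling-restart idea from this part of the discussion concrete, here is a rough Python sketch, not the operator's actual algorithm. It greedily picks nodes that share no shard with one another, so they can restart in parallel, and it considers leader-hosting nodes last. The function name, the input shapes, and the node and shard names are all hypothetical.

```python
def pick_restart_batch(shards_by_node, pending, leader_nodes=frozenset()):
    """Choose pending nodes that can restart together.

    No two chosen nodes host a replica of the same shard, so every shard keeps
    its other replicas online. Nodes hosting leaders are considered last.
    """
    batch, claimed = [], set()
    # Sort non-leader nodes first, then by name for a deterministic order.
    for node in sorted(pending, key=lambda n: (n in leader_nodes, n)):
        shards = shards_by_node.get(node, set())
        if claimed.isdisjoint(shards):
            batch.append(node)
            claimed |= shards
    return batch


# Hypothetical layout: shard names are "collection/shard".
layout = {
    "solr-0": {"a/shard1", "b/shard1"},
    "solr-1": {"a/shard1"},
    "solr-2": {"a/shard2"},
    "solr-3": {"b/shard1", "a/shard2"},
}
first_batch = pick_restart_batch(layout, set(layout), leader_nodes={"solr-0"})
```

A fuller version would also hold off while replicas on other nodes are still recovering, which is the case the default StatefulSet logic was said to miss.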
11:00
And so it is an important thing to make sure it's running, but it's not like it is a part of your critical application path. And so, yeah, I mean, you've written a Helm chart before, you've worked on the Solr operator. Clearly you have some insight
11:20
as to what organizations would want. Yeah, and just to elaborate a little bit on your last point also, the operator gets deployed to Kubernetes just as a Deployment. Typically one pod is fine. What I like about it is it comes up in less than a second. It's a lightweight Go service.
11:44
And so it's very efficient. It's not like this cumbersome agent or something that you have to worry about. And the other thing for me with the operator is a lot of the configuration to get right with Solr ends up being kind of tedious, right? And it lends itself to being programmatically built
12:03
versus trying to templatize that into a Helm chart or whatever. If you haven't worked with Go templating, it has its nuances, but a lot of just the tedious config that Solr needs to get right at scale is really much better done in code.
12:22
At least I found that kind of like when I implemented the support for wiring up TLS certs and things like that. And the other piece, and you touched on this, Houston, is the operator does receive events from Kubernetes, but it also can look at the state of things.
12:40
Like it looks at the state of things, and I'm using the words "looks at" loosely here, but say your TLS cert updates, right? And that's stored in a Kubernetes secret. The operator has a way to basically say, oh, our certs expired or were about to expire, and they got renewed.
13:00
So we're going to go ahead and do this rolling restart that Houston talked about. So I think all of those things just really make the operator a much more flexible and powerful solution. So, you know, now I have my SolrCloud config, the operator's managing it.
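One common way to detect a renewed certificate is to fingerprint the secret's contents and compare that with a fingerprint recorded on the pods. The Python sketch below illustrates that pattern only; it is an assumption that the operator works this way, and the annotation key is hypothetical.

```python
import hashlib

# Hypothetical annotation key, for illustration only.
CERT_ANNOTATION = "example.com/tls-cert-sha256"


def cert_fingerprint(secret_data):
    """Return a stable SHA-256 hex digest over a secret's key/value bytes."""
    digest = hashlib.sha256()
    for key in sorted(secret_data):
        digest.update(key.encode("utf-8"))
        digest.update(b"\0")
        digest.update(secret_data[key])
        digest.update(b"\0")
    return digest.hexdigest()


def needs_restart(pod_annotations, secret_data):
    """True when the pods were started with a different cert than the secret holds."""
    return pod_annotations.get(CERT_ANNOTATION) != cert_fingerprint(secret_data)


old_secret = {"tls.crt": b"old certificate bytes", "tls.key": b"old key bytes"}
new_secret = {"tls.crt": b"renewed certificate bytes", "tls.key": b"new key bytes"}
annotations = {CERT_ANNOTATION: cert_fingerprint(old_secret)}
```

Writing the new fingerprint into the pod template changes the template, and a pod template change is what starts a rolling restart.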
13:23
The next question that comes to my mind, really, Houston, is kind of like, well, does this operator help Solr run faster or any of those kinds of things? I mean, this is a look at the different options for running Solr on Kubernetes. And so I just want to get this out there,
13:41
squashing misconceptions: no matter how you run Solr on Kubernetes, that's not really going to impact how Solr runs nearly as much as how you're running Kubernetes itself. So this includes what servers you're running your Kubernetes nodes on, how full your cluster is,
14:02
like is the scheduling really hard because there's nowhere to put pods? What kind of resources are you giving to each individual pod? And so these are the things that are actually going to affect the runtime. It's much like what resources you give to the server that you're running bare-metal Solr on.
14:21
And so the Solr operator does do some of the improvements that we talked about earlier, like faster inter-node communication when you're using ingresses. It allows you to do the more efficient and faster rolling restarts, which are probably going to lead to less cluster instability
14:42
because the leaders are restarted last. But in general, most of the runtime effect is going to be how you're actually running on your Kubernetes cluster, not what you are using to manage your Solr pods. Right. And unfortunately, the Solr operator
15:01
hasn't been able to solve this age-old problem of cluster sizing and correctly assigning resource requests and limits. So I mean, I just want to emphasize to people on the call that Kubernetes and the operator actually help you manage these clusters as well,
15:20
but the hard work of really sizing the cluster correctly and getting that, that still falls on A, planning, but also testing and running load tests with a realistic load that kind of simulates your production traffic. I will say all this infrastructure
15:40
actually helps set all that up faster. And then if you do need to, say, increase memory available to each pod, or CPU or whatever, of course, this makes it a lot easier, but the actual work of sizing is still there. That's good to know. Okay, so as we start on our Solr operator journey,
16:04
what are some of the best practices that you've seen, Houston, for running the Solr operator and, I guess, best practices that the Solr operator helps support as well? Yeah, and I think this is a great question to start with about the software. So basically, I mean, Kubernetes is all
16:21
about best practices, right? We're trying to establish best practices for running cloud applications in a platform. So that makes sense with the software as well. And so there's two things that we've talked about already, which is basically that it really helps you manage smaller SolrClouds with more isolated collections,
16:43
because I know it's a real big pain. I've run Solr on bare metal on thousands of machines and it is not fun managing that. It is a lot easier to manage a smaller number of SolrClouds that are giant because there's fewer things to look at.
17:03
But with the Solr operator, you have this Solr CRD, the SolrCloud CRD, that lets you spin up a SolrCloud, and then you can monitor it via Kubernetes just to make sure that it's up and running. You always know it's up and running. I mean, it might not be healthy, but you know it's up and running
17:20
and you can point your Prometheus, you can point your log monitoring there, and it's pretty easy to manage a lot of independent SolrClouds at that point. And that allows you to have smaller SolrClouds, which are, one, going to perform better than one large SolrCloud with tons of collections in it.
17:41
And it also lets you separate concerns between collections. So with one collection's load, if they're getting really bad queries and getting lots of OOMs or bad GC, you're able to separate that concern. So the collection that is not performing badly
18:02
isn't impacted by that. It also allows your ops team to run Solr as infrastructure as code. So let's say you have 10 SolrClouds: let's take our 10 Solr CRDs, commit them to a GitHub repo, and we have a way to just redeploy our cluster
18:24
at any time without any differences at all. And so it allows you to really focus in on that infrastructure as code and run Solr as a part of that as well, which is a completely new thing for Solr. And then also... Okay, no, sorry, go ahead. I was going to say that,
18:41
I mean, I know it's a pain to use Solr security, like TLS, Solr authentication, all that kind of stuff. But recently Tim has written into the Solr operator very easy setup of both TLS and basic auth, so that it's basically,
19:02
I mean, it's very much recommended, but it's almost just plug and play, and you're able to very easily start your Solr in the recommended and secure way. Yeah, definitely. I mean, I think it's safe to say at this point, there's really no reason for bringing up SolrCloud on Kubernetes
19:23
with the operator without security enabled. It really is just, as Houston indicated, plug and play to enable basic authentication. We wire up a couple of different users. We ensure that the liveness and readiness checks,
19:41
the probes, still work with an authenticated Solr. And then we wire in some sensible getting-started authorization controls, right? And all this is really just a flag in the CRD that you enable. I've thought about just making that on by default,
20:02
but we're not there yet, but it is... Maybe V1 or V3. Yeah, yeah. And then the TLS, frankly, who even wants to think about that stuff anymore, right? So with that, what we've done is integrate with cert-manager, which happens to be another operator
20:22
that's very popular in the Kubernetes ecosystem. It has CRDs for managing certificates and certificate signing requests and certificate authorities and things like that. And it will go off and renew your certs and all these things. And then the Solr operator can actually use and work with cert-manager
20:41
to wire in the TLS config for Solr. And as I mentioned earlier, it can even respond to the cert being updated by cert-manager and trigger this restart. Yeah, sorry for interrupting earlier. The thing I wanted to emphasize about the CRDs is what I like about them is, first off,
21:04
it's very Solr-specific. So anybody who kind of knows SolrCloud can look at it and really see, almost self-documenting, how their cloud works in Kubernetes. And I think that's really nice because you don't get bogged down in all the Kubernetes resources. As Houston mentioned
21:22
at the beginning of the talk, you don't have to worry about all the nuts and bolts of Kubernetes. It's really focused on how I want my SolrCloud to work. And that's what's in the CRD, right? And that translates on the backend to all these crazy config objects that get created by the operator that Kubernetes understands.
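As a rough illustration of how small such a Solr-specific resource can be, the Python sketch below builds a SolrCloud manifest as a dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The field names (`solr.apache.org/v1beta1`, `solrImage.tag`, `solrSecurityOptions.authenticationType`) are recalled from the Apache Solr Operator's CRD and should be treated as assumptions to verify against the release in use. The helper function, resource name, and image tag are hypothetical.

```python
import json


def solrcloud_manifest(name, replicas, solr_tag, basic_auth=True):
    """Build a SolrCloud custom resource as a plain dict (hypothetical helper)."""
    spec = {"replicas": replicas, "solrImage": {"tag": solr_tag}}
    if basic_auth:
        # Assumed field: the single flag that turns on basic authentication.
        spec["solrSecurityOptions"] = {"authenticationType": "Basic"}
    return {
        "apiVersion": "solr.apache.org/v1beta1",
        "kind": "SolrCloud",
        "metadata": {"name": name},
        "spec": spec,
    }


manifest = solrcloud_manifest("example", replicas=3, solr_tag="8.11.0")
print(json.dumps(manifest, indent=2))
```

Committing one such file per cloud to a Git repository is the infrastructure-as-code workflow mentioned earlier in the discussion.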
21:41
But you and your ops people just work with these SolrCloud CRDs. So this one kind of seems like maybe the answer is yes, but does the Solr operator work with Solr collections in Kubernetes?
22:01
Yeah, we're running low on time. So I'll just briefly answer this one, but I'm happy to answer it more after the presentation. Basically the answer used to be yes, in the Bloomberg version of the Solr operator. When we moved to Apache, we removed the ability to manage Solr collections and Solr collection aliases through Kubernetes, through the Solr operator.
22:20
And it was because we had two different sources of truth. We had the Kubernetes API server and then ZooKeeper for SolrCloud, both managing the same thing, and they didn't always agree with each other. So the solution for that was just to get rid of one of those sources of truth, which was the Kubernetes one. And so basically now you have to manage it all through Solr,
22:41
but it's nice because we just have one API for everything at that point. Everything in Solr is one API, and everything about managing the Solr runtime is through the Kubernetes API. That's right. And then we didn't want, as maintainers of the operator code, to have to duplicate all of the logic in Solr's Collections
23:02
API and things like that in this other area. But the whole two-sources-of-truth issue aside, you just don't want to have to duplicate all the logic. So we think it's a very nice separation of concerns. This one comes up a lot because there's things
23:21
organizations need to put in various Docker images. Maybe there are different OS requirements, what have you. So can you use the Solr operator with custom Solr builds? Yeah, so I would say you can. The Prometheus exporter especially, but for Solr itself,
23:40
you need to make sure it has bin/solr. It basically needs to look a lot like the official Docker image, which for Solr 9 will be nice because we moved all that logic into the Solr repo. And so when you have all of your custom things done to the repo, you can just use a Gradle command to build yourself a completely compatible Docker image.
24:04
And there are tests for that as well. But as for pre-Solr 9, because that's not even released yet, I would say yes, with the asterisk of: make sure that you have the entrypoint be solr-foreground and that you support everything that bin/solr does,
24:22
including all the env vars, all that stuff. Right, right. It's definitely doable. I've managed to work through it. So if I can get through it, I think most ops folks can. And so, as we are running a little short on time, talk a little bit, briefly, about how Solr and the Solr operator can kind of grow together.
24:43
Yeah, and that's kind of the awesome benefit of having the Solr operator be a subproject of Apache Solr now: we're able to move together in the same direction. And so that means growing the Solr operator when Solr has new features and growing Solr to better fit these cloud needs.
25:03
And so, I mean, I'll just give a few examples, but there's a ticket for giving individual Solr nodes Prometheus endpoints instead of just having the Prometheus exporter be one endpoint for all Solr nodes. And this is something that would really help enable better horizontal pod
25:20
autoscaling in Kubernetes. It's possible now, but it'll be a lot easier when each Solr node is able to have Prometheus metrics exported from it individually. There's also other kind of weird configurations that are necessary for running SolrCloud in a real cloud environment.
25:41
And so we're piece by piece trying to make all of these things more configurable by default, not having to know weird esoteric Solr config options, and just kind of making cloud runtimes a first-class option for SolrCloud.
26:01
And this does include making better liveness checks and other things for running in secure modes. Because if you try to run Solr with TLS on Kubernetes now, Tim had to do this really, let's say, it was a good way of doing it. It's awkward because of the way that Solr makes him do it.
26:20
But in the future, we're trying to make Solr kind of understand the way that clouds work, not SolrCloud, but the way cloud providers work, and understand what it needs to do to make it easier. Gotcha. And so just to wrap up, for those that are interested in this,
26:40
how would they get involved in this project? Yeah, and that's great. I think we have very little time left. The answer is communication. Basically we want you trying it out: download it, run it with however your company or you run your Kubernetes clusters. We want to see as many different use cases as possible to make this work for everyone that wants to use it.
27:04
And so, obviously we love PRs, we love documentation PRs, we love new features, but what we really, really want is just people trying it out and telling us how it works for you, what you think about it, what needs to change.
27:20
Right. That's a great point. And then the other piece I'll add is, it's actually a really fun code base and project to work on, especially if you're wanting to learn Golang and get more into Kubernetes. I think it's a fantastic way to just get involved in that broader ecosystem as well. Compared to Solr, it is a tiny repo, so don't be afraid to go and look at it.
27:41
That's right. Well, cool. I think we're out of time with respect to Q&A, but the good thing is, Houston and I are going to now shuttle off over to the Apple chat room that's available in the Buzzwords dashboard there,
28:01
and we will be in there to answer any questions that have come up. So apologies, we ran a little tight on time for questions here, but we'll definitely answer them over there. So with that, thank you so much, Houston. I think this was really useful, and thank you everyone for attending. Yeah, thank you.