Running Apache Spark on K8s: From AWS EMR to K8s
Formal Metadata
Series: Berlin Buzzwords 2022 (talk 40 of 56)
Number of Parts: 56
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67164 (DOI)
Transcript: English (auto-generated)
00:07
Okay guys, welcome to this track. Thank you for coming today to hear about this journey that we have been on during the last year, let's say. We are going to talk about how we migrated the solution we use to run and exploit our Spark jobs, or algorithms.
00:25
The solution that we implemented in the first step used Amazon Web Services tools, and now we have a new, portable, open-source solution based on Kubernetes.
00:40
But let's start by introducing ourselves. Hi, I'm Ramiro Alvarez. I'm a Platform Engineer at Empathy.co. I'm mainly interested in infrastructure, observability and performance. Yes, this is me after KubeCon Europe three weeks ago. Therefore, please stay safe, guys. Don't be like me; stay safe here at Berlin Buzzwords.
01:05
And yeah, we are hiring too, so you can go to our website and find the details there. Good afternoon, I'm Daniel Hernandez. I work as a Data Engineer in the Data Team. What we do in the Data Team is collect all the events provided by the websites
01:24
or by our customers, in order to provide analytics to the merchandisers and improve the final user experience. Here we wanted to remark, and you may already have attended our colleagues' talk about privacy and relevancy, that we don't store any personal data from the users; we only aggregate the events in order to improve the average user experience.
01:50
Okay, and this is me. It's a bit difficult nowadays to explain what I do, but let's say that I try to understand customer and market needs and translate them, or help my colleagues translate them, into real work: evolution of the systems, new features.
02:06
And I'm super boring on social media, so let's move on. Okay, we are going to talk a lot in this talk about Spark jobs, how we scale solutions, and what these algorithms or jobs are.
02:22
But let's start with a bit of context. Most of us here do search. We have had some talks today, so you can imagine the type of data that we collect at Empathy. As Danny said, we collect all the information without personal information: no location information, nothing related to the user.
02:44
The talk this morning gave more details, but we're happy to hear your opinions in discussions about the topic on social media. So yeah, we collect all these events from the search experience, the search journey, and, oversimplifying the solution a bit,
03:04
we push all this data into a data lake. At Empathy we realized some years ago that we needed to exploit this data, obviously. We have several blocks of how we use the data: we can use it for reports, for visualizations, and we have another big group that
03:24
we are going to see a live demo of today, about one of these algorithms or jobs: the Search Intelligence group. In this group we mainly create features based on the wisdom of the crowd, to give the user more guidance in the interface.
03:42
And as you can see in the slide, we use Spark to create these algorithms. But why is a good question. So I want to do the typical raise-your-hands here to see, more or less, how many of you work with Spark in your companies. Okay, nice. I know some of these hands.
04:05
Okay. For you, this is probably not interesting and you are probably going to criticize me for this oversimplification. But let's say that we have this scenario. As I mentioned, we have a data lake with some data, and we have some algorithms that we have implemented in Spark.
04:24
These algorithms can be something super complicated or something super simple. For that, Spark provides us different libraries, like SQL libraries, ML libraries, all these super fancy words that are super difficult for me to pronounce as a Spaniard.
04:40
But let's say that we can create something super simple, like extracting from our data what the most clicked product is. Obviously, this is a job or an algorithm that runs, finishes, and saves some results somewhere: a database, an index.
05:01
So we don't need a server running 24/7 to execute these jobs. But why does Spark help us here? Even if it's a simple operation, this obviously takes some computation time, and Spark helps because we can split this computation across different executor nodes and join the results at the end.
05:25
So let's say that our input data is 10 gigabytes. Each one of these executors will take a portion of this data, so the computation time is reduced. In fancy words, we can work with large datasets with low computational times.
05:44
This also means that we can scale the solution: if we have more input data, we can add more executors to the solution and execute it in the same computational time, more or less. OK, so we have some algorithms and we have a cool technology that allows us to run these algorithms and generate results super fast.
06:10
Now we are going to take a look at the first solution that we implemented to run these jobs, using Amazon Web Services tools. This is the previous solution, running on Amazon EMR, that we were using before moving to Kubernetes.
06:35
We were using native technologies on Amazon Web Services like CloudWatch to schedule the different jobs that we run on a daily basis.
06:44
The Step Function contains all the logic to orchestrate the complete workflow, from launching the cluster to running the jobs. The Lambda functions that we use check the cluster status during the process and also submit the Spark jobs.
07:03
We use the API Gateway to access all this infrastructure from the outside, if we run some job on demand, for instance. And finally, we use S3 to store the final results of the Spark jobs that we run. We had in-house monitoring and logging solutions that were somewhat rudimentary, and we wanted to improve this with a new solution.
07:29
In this slide, you can see the complete process that we used to run a job. We start by uploading the artifact that contains the jobs to S3. And once we have that, we use CloudWatch with a scheduler expression to run the jobs on a daily basis or every 13 hours.
07:51
We configure it using the cron expression, the specification of the job, and the input parameters that the job needs. After that, all the logic that contains all the complexity is located in the Step Functions.
08:06
The Step Function starts by launching a cluster on EMR. This is where we waste most of the time, because the cluster takes about 10 or 15 minutes to provision instances and be up and running.
08:24
After it is up and running, we can run a certain number of jobs that are present in the input configuration; it's a variable number. For each job, we check its status and handle retries if it fails. And if a job succeeds, we concatenate the next one, until the last one.
08:45
When some job fails, or all of them have succeeded, we terminate the cluster. At the bottom of the slide, you can see that there is a JSON expression with all the information that the Step Function needs to handle all this orchestration.
09:01
And it has to store responses from EMR and from the Lambdas, so at the end we end up with a very complex structure that we needed to simplify in order to make our lives easier. We'll talk about the problems that we found with this solution in a moment.
09:24
This is how we specify the different jobs that we want to run. We have the cron expression, the name of the job, the main class that contains the logic of the Spark job, and the different arguments.
09:42
These jobs are what the Step Function takes and runs sequentially. Talking about the problems of this solution, the first one is that EMR is managed by AWS, so we don't have control over the version that is running on the cluster or the dependencies that the cluster includes.
10:03
Since we have a JAR file and not a container, we don't have an isolated environment and the dependencies can come into conflict. Another problem is the auto-scaling. EMR does not base the auto-scaling on resource usage, but on the number of input tasks, the pending tasks from Spark.
10:26
So that's not a great auto-scaling solution, because it can take, as I said before, some minutes to have new instances up and running. The other problem is the cost that you have to pay to run Spark jobs.
10:41
On Amazon it is not only the cost associated with the type of machines that you are using inside the cluster; you also have to pay an extra cost for running EMR. About monitoring and logging, we didn't have good solutions here to make the cluster observable. For example, the Spark history server, which allows you to monitor the different jobs that run,
11:04
dies when the cluster is terminated, so you don't have access to that information after the cluster has finished. Also, this was not a cloud-agnostic solution, so if we wanted to port it, for example, to another cloud like GCP, that was not possible.
11:24
And last but not least, the Step Function that handles this logic and orchestrates the jobs has a lot of complexity. As I mentioned before, here you can see all the logic that the Step Function contains. You don't need to understand it right now, because there is a lot of complexity here.
11:42
But the main points are that the Step Function tries to run a spot cluster; if that fails, then it tries a cluster with on-demand machines. After waiting for the cluster to be up and running, we launch the different jobs that we want to concatenate. All the jobs have retry management.
12:03
And after all the jobs succeed, or one of them fails, we have to terminate the cluster. This kind of logic, the logic of this Step Function, was very hard to manage, and we wanted to move to the Kubernetes solution.
12:20
You can see here that this is kind of a mess. But let me highlight something important among the things that didn't work for us in this solution: portability is important for us. In AWS you have your environment and your teams working on your algorithms, but you find in
12:42
the market some opportunities to exploit these algorithms, these jobs, in other environments. So that is something we considered really important when creating the new solution that Ramiro is going to tell us about. So it looks like we have an elephant in the room. As we were moving to Kubernetes for our microservice architecture, we thought it would be nice to have our Spark applications there too.
13:09
Because we can reduce the cognitive load on our developers and not have one way to deploy the Spark applications and another way to deploy our other microservices.
13:25
Therefore, after a request for comments and some months of work from multiple teams, we can show you what technologies we are using on our journey to Spark on Kubernetes. First of all, you know, Kubernetes: you need it to orchestrate all your microservices.
13:46
Besides, we didn't want to distribute the Spark applications using JARs, as Danny commented before. We would like to handle them as distributed images, increasing the isolation using Docker.
14:07
Besides, we love the GitOps approach. The developer experience is really nice and everyone loves the user interface. We were already using Argo CD for all our microservices.
14:23
Therefore, we embraced it for the Spark applications too. Although you can use vanilla Spark to deploy your Spark application on Kubernetes and create all your Kubernetes resources by yourself, we prefer to keep things easy.
14:43
Therefore, we deploy the Spark Operator, a project started by Google and now open source, which provides a custom resource definition to create and handle a Spark application easily. As we would like to distribute the Kubernetes manifests across multiple clusters, we are using Helm to increase the agility of deploying them.
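To make this concrete, a SparkApplication handled by the operator is just a Kubernetes custom resource. The following is a minimal hypothetical sketch, not Empathy's actual manifest; the image, main class, and resource sizes are invented for illustration:

```yaml
# Hypothetical minimal SparkApplication (spark-on-k8s-operator CRD); values are illustrative.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: most-clicked-products          # made-up job name
  namespace: spark-jobs                # assumed namespace
spec:
  type: Scala
  mode: cluster
  image: registry.example.com/data/spark-jobs:1.0.0   # assumed image repository and tag
  mainClass: com.example.jobs.MostClickedProducts     # assumed main class
  mainApplicationFile: local:///opt/spark/jars/spark-jobs.jar
  sparkVersion: "3.2.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark
  executor:
    instances: 2
    cores: 2
    memory: "4g"
```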
15:13
Besides, when you are running Spark applications, sometimes you would like to concatenate jobs, run one job after another, create a complex graph.
15:28
You know, you can use the vanilla Kubernetes Job definition, but when you are running concatenated jobs, it's not enough. Therefore, we reviewed Argo Workflows.
15:42
It fulfils all our requirements, and it fits nicely with Argo CD. For those who don't know, Argo is an incubating project of the Cloud Native Computing Foundation and has great support from the community. We are running multiple ephemeral Spark applications.
16:06
Therefore, we don't want to have instances running all day, and we don't want to have orphaned or idle resources.
16:22
That would be wasting money, wasting resources, and increasing our carbon footprint. Therefore, we are using the Cluster Autoscaler so that, as we deploy a Spark application, the instances can be provisioned.
16:41
Therefore, we can have the right resources at the right time and save money. And you know what you need when you run all of this: you need metrics, you need dashboards. Probably some of you know these icons. We are using Prometheus to store all our metrics,
17:02
the Pushgateway to send all the metrics from the Spark applications to Prometheus, and Grafana to explore them and create some nice dashboards. But how does everything work together? Let me walk you through a little sample and we can go ahead.
17:21
First of all, we have Git, where all the code is located. From there, we push our Docker image and Helm chart to an image repository. In our case Harbor, but you can use whatever image repository you want, like ECR, GCR, whatever.
17:44
It doesn't matter. Next, you need a Kubernetes cluster. We have the Spark Operator set up in place; therefore, you have the custom resource there and you can create the Spark application easily.
18:00
You define the Spark application in code, and Argo CD is going to sync the changes in your code and deploy them to the cluster. It's going to reconcile the status again and again and again. Then Argo Workflows: as I commented before, you would like to concatenate jobs and create complex graphs.
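The Argo CD sync just described can be declared as an Application resource pointing at the Helm chart in Git. A hypothetical sketch; the repository URL, chart path, and namespaces are assumptions:

```yaml
# Hypothetical Argo CD Application; repository, path and namespaces are invented.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: spark-jobs
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/data/spark-jobs.git   # assumed Git repository
    targetRevision: main
    path: deploy/chart                                     # assumed Helm chart path
    helm:
      valueFiles:
        - values-production.yaml                           # assumed values file
  destination:
    server: https://kubernetes.default.svc
    namespace: spark-jobs
  syncPolicy:
    automated:
      prune: true        # remove resources deleted from Git
      selfHeal: true     # reconcile drift back to the Git state
```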
18:30
For us, this is a simple use case scenario: we are going to execute job one and, after it finishes, we would like to execute job two.
18:42
In the first job, as you can see, there are a driver and multiple executors. We have the Cluster Autoscaler; therefore, as soon as the executors are being deployed, the Cluster Autoscaler is going to provision new instances. So we are going to save money and save resources, because we have the right resources at the right time.
19:09
After it finishes, we can go to job two. You can see that it is the same logic as before. And what do all of these have in common?
19:22
The observability stack. We have the Prometheus metrics there; all our Spark applications are sending their metrics to Prometheus using the Pushgateway. We have the dashboards in Grafana to explore and review the metrics, the performance, whatever.
19:42
Now, I think that it's time for a demo, but first, let me take a selfie, guys. Come on, make some noise. Let's make the people in the other tracks envious. Say cheese! Thank you, guys.
20:00
Thank you, guys. Okay, let's go for the demo. Let's take a look first at an example that we have here. This is a demo client, a demo showcase. We are going to play a bit with this.
20:22
Let's say that I open it. You can see here a lot of things that we commented on in other talks today. But it's important to mention that some of the elements that you see in the interface are generated by these jobs, running in Spark, running in this solution. And we are going to execute one now.
20:41
But let's play a bit with this to understand what we are going to do. You see that I have some suggestions here, and I can search for a blue top, for example. Here, I don't have any more interaction elements for this query. Yes, I can filter.
21:01
For example, if I want to filter by a specific color like blue, I can do it, or by a specific collection, whatever I want. But I don't have anything to refine my query further. And this is something that we are going to change now. We are going to run a Spark job that will generate what we call related tags, and the last talk explained them really well.
21:21
It's a feature that generates tags attached to the query that I'm doing. So if I search for top, in this case, blue can be a related tag of top. This feature helps us a lot to guide the user in the interface and to refine the query.
21:43
So this is something super nice for the users, to feel this guidance in the interface. Oh, yeah, sorry.
22:00
Let me show them the code, Panizo. Yeah, let me show you the code, sorry. You can see in this snippet what we are doing in reality. As you can see, you can set your repository, your tag, and the image location.
22:20
Besides, you can easily set your Spark version, the path of your main application file, and the time-to-live seconds. Also, you know, you don't want to hard-code all your credentials. Therefore, you can set all the Hadoop configuration needed for your OIDC or your provider.
22:49
Besides, we are working with scheduled Spark applications. Therefore, we would like to have a run history limit that allows us to check whether a Spark application failed at some point.
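Put together, the snippet being described (repository and tag, Spark version, main application file, time-to-live, Hadoop/OIDC configuration, and run-history limits) maps onto the operator's ScheduledSparkApplication resource. A hedged sketch with made-up values:

```yaml
# Hypothetical ScheduledSparkApplication; schedule, image and paths are invented.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: query-signals
spec:
  schedule: "0 2 * * *"               # assumed daily cron expression
  concurrencyPolicy: Forbid
  successfulRunHistoryLimit: 3        # keep a few past runs around
  failedRunHistoryLimit: 5            # keep failed runs so we can see what broke
  template:
    type: Scala
    mode: cluster
    image: registry.example.com/data/query-signals:1.4.0   # repository + tag
    mainApplicationFile: local:///opt/spark/jars/query-signals.jar
    sparkVersion: "3.2.1"
    timeToLiveSeconds: 86400          # clean finished resources up after a day
    hadoopConf:
      # assumed OIDC / web-identity credentials provider, so no hard-coded keys
      fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
    # driver and executor sections (cores, memory, instances) as sketched later in the talk
```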
23:05
But you are going to ask me: how can you orchestrate all these jobs with Argo Workflows, or something like that? So this is the way that we configure jobs using Argo Workflows.
23:30
It's very similar to the way we configured them using AWS CloudWatch. This is an example of a job that has two jobs that run sequentially.
23:42
Sorry, there are two workflows. One of them has one job, and the other has three jobs that run sequentially. You have the cron expression for every workflow, and that goes inside the Helm chart that we have deployed. So that's easy to configure.
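A hypothetical sketch of such a workflow definition: an Argo CronWorkflow whose DAG creates one SparkApplication after another. Names, the cron expression, and the success/failure conditions are assumptions, not Empathy's actual chart:

```yaml
# Hypothetical Argo CronWorkflow chaining two Spark jobs; values are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-spark-jobs
spec:
  schedule: "0 3 * * *"               # cron expression, one per workflow
  concurrencyPolicy: Forbid
  workflowSpec:
    entrypoint: main
    templates:
      - name: main
        dag:
          tasks:
            - name: job-one
              template: spark-job
              arguments:
                parameters:
                  - name: job
                    value: job-one
            - name: job-two
              dependencies: [job-one]   # runs only after job-one succeeds
              template: spark-job
              arguments:
                parameters:
                  - name: job
                    value: job-two
      - name: spark-job
        inputs:
          parameters:
            - name: job
        resource:                       # create the SparkApplication and wait on its state
          action: create
          successCondition: status.applicationState.state == COMPLETED
          failureCondition: status.applicationState.state == FAILED
          manifest: |
            apiVersion: sparkoperator.k8s.io/v1beta2
            kind: SparkApplication
            metadata:
              generateName: "{{inputs.parameters.job}}-"
            spec:
              # ...same fields as the SparkApplication sketch shown earlier...
```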
24:01
If we go to the Argo Workflows interface, we can see an example of one of these jobs that ran, for example, some hours ago. We have here five jobs that have been concatenated. We run these sequentially because some of them depend on the others, and we don't have the need to run them immediately.
24:26
So we prefer to have a node pool with a small number of nodes and run them sequentially. Argo Workflows allows you to create any DAG, a directed acyclic graph, to run jobs in parallel if you want.
24:42
We can now run one more job to see how it works. This usually takes about half an hour or 45 minutes, so in the meantime you can check your Telegram, your messages, or so on. Are you happy already? No, just joking.
25:04
Yeah, it's true that this one takes 30 minutes to execute, so we're going to execute a smaller one instead. Believe us, when this first job finishes, the second one is going to start, and it will finish and it will work. We are awesome. But let's execute a specific job, the one I told you about, related to these related tags.
25:29
We are going to run one of these jobs. Here in Argo we can see that we have already set up some templates to easily run these jobs. We come here to the query signals job.
25:41
That is the name that we gave to the job that generates these related tags for us. And here, if we click on submit, we can see that this is a super easy interface. I need to select the entry point, that is, the template. Here you can see that I have my Docker image, and I can put a nice name here for you.
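The submit dialog shown here suggests a WorkflowTemplate with an entrypoint and an input parameter; a hedged sketch, where the `customer` parameter and all names are guesses based on what the speakers do next in the demo:

```yaml
# Hypothetical WorkflowTemplate for on-demand runs; the customer parameter is assumed.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: query-signals
spec:
  entrypoint: query-signals
  arguments:
    parameters:
      - name: customer                # which customer to run the job for
        value: demo-customer
  templates:
    - name: query-signals
      inputs:
        parameters:
          - name: customer
      resource:
        action: create
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            generateName: query-signals-
          spec:
            arguments:
              - "{{inputs.parameters.customer}}"   # passed to the Spark job as an argument
            # ...image, driver and executor sections omitted for brevity...
```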
26:01
And we are going to see this working in real life, with a demo in production. Yeah, we are crazy. Okay, let me open a console. So, I move here to the production environment, in one of our namespaces.
26:24
You can see it, right? Yeah, more or less. Okay. First we are going to check the pods related to this job that we want to run. We can see that we don't have anything here; nothing is running for Spark.
26:41
We don't have servers running, nothing there. But I'm going to click here. I'm going to execute this job only for one of our demo customers. This is a parameter. Okay. Okay, I missed an S.
27:04
Okay, I think it's going to work. Now, if we come back here, we see things, Kubernetes doing things. We have a driver. I'm going to reset this because when things start it goes a bit crazy, but you can see here a clearer picture.
27:22
You can see that we have a driver and we have some executors. In the meantime we can also check in the presentation how this can be expressed in our charts. For example, here we have a driver, as you can see, with its memory and core limits.
27:41
And the number of executors that we use in this case is four. You know that we can scale this solution if we have more data coming in. So, one cool thing that we can do with this setup is check the logs in real time while it is being executed.
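For reference, the driver and executor section being described on the slide (memory and core limits, four executors) lives in the SparkApplication spec roughly like this; the actual production numbers are not given in the talk, so the values below are placeholders:

```yaml
  # Fragment of a SparkApplication spec; the numbers are placeholders, not production values.
  driver:
    cores: 1
    coreLimit: "1200m"       # core limit mentioned on the slide (value assumed)
    memory: "4g"             # memory limit (value assumed)
    serviceAccount: spark
  executor:
    instances: 4             # four executors; scaled up when more data comes in
    cores: 2
    memory: "4g"
```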
28:01
So, if we do here... sorry, I have the mic. kubectl logs -f, and I copy here the name of the driver pod, and I can see the logs in real time. So, this is what's happening in our job. It will probably take only about a minute and a half.
28:21
We are taking into account, I don't know, 10 days of mock data, so it will do some work to discover these related tags. We have a nice article on our engineering blog about how we generate them, so you can take a look. And when this job finishes, it's going to clean up the resources, as Ramiro explained.
28:43
So, everything is going to be super nice. Okay, this is nice: we can check the logs in real time. But we want more granularity in the metrics, which is what we explained with Prometheus and all these things. Yes, one of the problems that we had with the EMR solution was the lack of observability.
29:00
And here we can see the first improvement, which is that the history server is up and running 24/7. In AWS, we had it available only during the cluster lifecycle. Now you can inspect any of the jobs that have run or are running and try to find any issue that you may have with unbalanced tasks and so on.
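One common way to make an always-on history server work with ephemeral jobs (hedged: the bucket and exact settings here are assumptions, not necessarily what Empathy does) is to have every application write its event logs to shared object storage that the history server reads:

```yaml
  # Fragment of a SparkApplication spec; bucket name and settings are assumed.
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://example-data-bucket/spark-events"
  # A long-running Spark history server then points at the same location, e.g.
  # spark.history.fs.logDirectory=s3a://example-data-bucket/spark-events
```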
29:25
And the other improvement in observability was pushing custom metrics from our jobs. So, we have some Grafana dashboards with information about the jobs that have run. For example, for the query signals job that we are running, you have the number of indexed documents that the job outputs,
29:45
the duration of the jobs, or the resource usage while the jobs are running. But how can we configure this in the Helm chart? This is pretty simple. You know, there are some common friction points between the platform engineering team
30:01
and the developers about how to configure Alertmanager, Prometheus, all that stuff. One of our goals as a platform engineering team is to provide a better developer experience and a faster feedback loop.
30:21
Therefore, well, this is the whole configuration snippet, but in the same Spark application definition in the Helm chart, all our developers can set their favorite Slack alert channel. Maybe some of them would like to have some labels, and others different ones.
30:45
Besides, they can deploy their favorite Grafana dashboards in their favorite namespace, as simple as that. But I think that we still have a demo running, don't we?
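As a purely hypothetical illustration of that developer experience — these keys would belong to Empathy's internal chart and are invented here, not real chart values — the per-application Helm values might look something like this:

```yaml
# values.yaml for one Spark application (hypothetical keys for an internal chart)
alerting:
  slackChannel: "#data-spark-alerts"   # each team picks its favorite channel
  labels:
    team: search-intelligence
dashboards:
  enabled: true
  namespace: monitoring                # where the Grafana dashboard is deployed
```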
31:04
Yeah, it probably finished already, hopefully. Let's see. Yeah, okay, the application is finished, and you can see here in the resources that the driver and all these things are finishing. So if we check the pods right now, they are probably completed
31:22
and we don't have executors anymore. So yeah, this ran. But we are going to check the demo again to see if it works. So now, if I search for a single term, and I'm going to remove this filter, like for example top, I'm going to see here, as you can see, the related tags that we just generated.
31:43
So I can refine this query with a feature that gives more guidance. Instead of using the filters, I can easily filter by terms that are trendy or that people are using most. Okay, but I think we have some takeaways.
32:02
Yeah, last minute, don't worry. Yeah, we improved our execution time a lot.
32:21
The improvement was mainly because the cluster auto-scaling provides the instances faster than the usual AWS EMR solution. Besides, another important point is that our developers started to embrace
32:40
this solution, because they are running, each day, more jobs, more Spark applications. Another important point is the cost. You know, this is business. AWS EMR has a license fee; well, EKS has a fee too.
33:00
Anyway, we reduced the cost of our Spark clusters a lot. Now, let me show you the summary of this talk. We increased our portability by deploying on Kubernetes.
33:24
We provide isolated images using Docker. We increase the agility of deploying the Spark applications to multiple clusters using Helm charts, in an easy and idiomatic way, using the Spark Operator.
33:41
We use a GitOps approach, easy for everyone to understand, thanks to Argo CD. We orchestrate multiple jobs easily with Argo Workflows. Besides, the cluster auto-scaling gives us a faster feedback loop
34:01
because it can provide instances faster. And everything comes with an awesome observability stack with very simple configuration, improving the developer experience for our developers a lot. I think that this is a lot.
34:21
Now, Panito, I think you could improve on and summarize this talk. Yeah, improve? I doubt it. So, we have a good solution. Ramiro already summarized all the good points and bad points that we have in the old solution versus the new solution. But let me give you my point of view.
34:42
I developed some code in the data science team that we have for this solution, and I need to say that it's amazing for developers; it's super friendly. Also, as I mentioned in the middle of the talk, it's portable, so we don't mind having another use case, another environment.
35:03
We can just ship our Kubernetes definitions there and everything is going to run nicely. So, yeah, this is everything that we have for you. Remember to check out our media and our engineering blog; we have some interesting things related to this
35:20
and more things there. We hope you enjoyed it. Thank you very much, guys. So we still have a little time for questions.
35:42
Are there any questions from the audience? Yeah, we have a couple of them back there. I'll be right there. First of all, thank you very much for your presentation.
36:02
I have a question regarding whether you have faced new challenges or new difficulties while moving away from EMR to Kubernetes. For example, one of my main concerns is that, first of all, you need to have people with working experience operating a Kubernetes cluster. That means administration, security, networking, et cetera.
36:23
Yeah, thankfully, this is not from zero to 100%. We have two years of experience with Kubernetes, but, you know, it's easier to deploy some microservices.
36:41
The AWS EMR solution was dependent on status. It was an elephant in the room, as I commented before; it was a little bit more, you know. Anyway, in Platform Engineering we have a couple of years of experience administrating clusters, and all our developers have user experience with Kubernetes.
37:05
Therefore, it was pretty easy to handle. Also, if you take a look at the old solution, you see all the pieces that we had there. So, yeah, the complexity of running Kubernetes or managing all these new things is big when you start with it.
37:21
But when you take a look at the solution that we have, I mean, we had the old solution until recently and we needed to check why we used the API Gateway, why we used Lambda, because it was also super complex to understand all the pieces working together, and now I think it's much easier to understand the overall flow.
37:45
Yeah, so again, congratulations on the talk. Two questions. One is: was Kubernetes adopted as a consequence of already having some Kubernetes systems, or was this issue the pretext to adopt Kubernetes?
38:02
And then the second question is: do you do any sort of capacity planning for Spark jobs, for example, to have some minimum or maximum executor values, or do you just hope for Kubernetes to scale? Great question. Thank you. The Spark situation was like this:
38:22
the team had a very huge cognitive load working with AWS EMR. Therefore, we were like, okay, we have to think about a solution to that. We started thinking, analyzing, and reviewing the Spark Operator and some other approaches, like doing it ourselves.
38:46
Well, we had the Kubernetes microservices, a lot of microservices there, but simple ones with Deployments or some StatefulSets, you know, not all the orchestration of a Spark application. Therefore, Kubernetes reached the company first, before Spark on Kubernetes.
39:09
And the second question, a very good question: we have some dedicated node pools to host all our Spark applications.
39:20
As you could see in one of the slides, you can see a driver and some executors. As we would like to have faster provisioning, we decided to put the driver on an on-demand node pool,
39:44
and the executors are on spot instances. Therefore, the node pool is going to auto-scale from zero to whatever is needed and scale down after the job finishes.
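In the SparkApplication spec, that split can be expressed with node selectors (and tolerations if the spot pool is tainted); the label keys and values below are assumptions about how the node pools might be labelled:

```yaml
  # Fragment of a SparkApplication spec; node-pool labels and taints are assumed.
  driver:
    nodeSelector:
      node-pool: spark-on-demand       # driver on on-demand instances for reliability
  executor:
    instances: 4
    nodeSelector:
      node-pool: spark-spot            # executors on cheaper spot instances
    tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
```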
40:00
It's pretty simple: the Cluster Autoscaler works pretty well and the Spark Operator covers all our requirements, with all the visibility, because, you know, you have a Spark Operator, but how are you going to be sure that the Spark Operator is working?
40:20
Because if it's not working, you are... That's it. Brilliant. Thank you. Thank you. Great. Then let's thank the speakers again for the awesome talk. Thank you.