Logging Apache Spark - How we made it easy
Formal Metadata

Title: Logging Apache Spark - How we made it easy
Number of Parts: 56
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67169 (DOI)
Transcript: English (auto-generated)
00:07
So logging Apache Spark, because nothing is definitely as easy as it seems, especially dealing with adapters. So what are we going to talk about, though, in the next 15 minutes or so? We'll start off with looking at Nielsen's data
00:22
architecture, and we'll really see how Spark played a core component in it. We'll then talk about how we used to search logs, or more how we just couldn't do it. A deep dive into the solution that actually provided us with that possibility of searching logs.
00:41
Some main obstacles, because every solution has to have an obstacle. Some pretty Kibana charts. And then to top it all off, just some future add-ons. But first, who am I, the one speaking to you? So I'm Simona. I'm a senior big data engineer at ADOC, based in Israel.
01:01
I've been dealing with data for the past decade now, which feels like a lot, especially with the word decade. I love music. The weirder, the better. I love traveling. And that's it, yeah. This is me in a slide. And so because this is actually very exciting that we're all here on prem, instead
01:22
of making some assumptions, I'm going to make these assumptions interactive. So out of curiosity, how many of you are big data engineers, or would you call yourself a big data developer? OK, that's nice. And are the rest of you software engineers?
01:42
Cool. How many of you would call yourself Spark warriors, experienced Spark developers? Nice. And just novices, beginners? Never heard of Spark? OK. OK. And of you, how many are actually AWS-centric?
02:04
Cool. So we are going to talk about an AWS-based solution. But I just hope that this talk will give you a lot of insights regardless. If you're wondering what this picture is doing here, I was looking for a good picture for the slide. And somehow, I just remembered that there are Japanese etiquette
02:24
commercials in Japan. So this is one here for you, and just another one that I really like. So at least you gained something from this talk. Yeah, but now let's start. So I think the first thing that really pops out
02:40
when you look at this architecture slide is Spark. This architecture is processing more than 60 terabytes of data. Billions of events flow into this architecture daily. And all of these events get consumed, deserialized, enriched, aggregated, and then eventually stored in all of these different data stores all done by Spark.
03:01
So it would be pretty safe to say that just being a big data developer in Nielsen Israel just means that you have to develop and maintain these big, robust Spark jobs and streaming applications. And at the time, we were running Spark on what was or probably still is,
03:22
unless you're using Kubernetes, the most popular platform to run Spark on, which is Amazon EMR. Now, if you're not familiar with Amazon EMR, it just stands for Elastic MapReduce. And essentially, it's just a bunch of servers running together with some predefined software that you can run your Spark jobs on.
03:43
Nice. So before we talk about how we actually used to access logs, I think we really need to talk about when we access logs. And we access logs in one of two cases. So the first one would be when you just launch a new Spark application and you want to follow it up, so you want to access your logs.
04:01
And then the second one would be when you have a production issue. So you might know that the problem is with Spark or you don't, but you definitely need to access your logs. And in both of these cases, time is of the essence. So with this in mind, let's talk about how we actually can access our logs when running Spark on EMR.
04:23
And there are basically three methods to do it. So the first one, and the most straightforward one, is just SSHing to the servers. So again, EMR, just a bunch of servers running together. So this sounds pretty straightforward, right? But as we all know, or most of us know,
04:41
Spark has several components. So we have the driver. We have the executor. And each one of these is actually outputting a log file. So if you want to just SSH the server, you actually first have to pinpoint that component whose logs you want to see, then pinpoint that server that that component is actually running on, SSH that server,
05:03
find that right log file, decompress it, search it, and just hope that you actually found what you were looking for. If you're tired of that, you can always actually access your logs in a centralized way through the YARN UI. The problem with that is that if we think again
05:21
about when we access logs, then we probably access logs after quite some time, right? So if you just pick up your favorite web browser, go into the YARN UI, try to access your logs, your web browser will probably just crash and die. And so at this point, we have that one last way
05:42
of accessing Spark logs when running on EMR. That would be just downloading the log files from S3. So that's a possibility that AWS provides us with. But the problem with this solution is that there's quite a lag between the moment the log line is actually written on the server and the moment it reaches S3.
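The "decompress it, search it" part of the manual SSH workflow boils down to something like the sketch below. The directory layout is an assumption; on a real EMR node it might be something like `/var/log/hadoop-yarn/containers`, but the demo here just uses a throwaway directory standing in for a node's log dir:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the "decompress and search" step of the manual
# SSH workflow. The log directory layout is an assumption, not gospel.

# Grep a pattern across both plain and gzip-rotated log files in a directory.
search_logs() {
  local dir="$1" pattern="$2" f
  for f in "$dir"/*.log; do
    [ -e "$f" ] && grep -H "$pattern" "$f"
  done
  for f in "$dir"/*.log.gz; do
    # Decompress on the fly, then prefix matches with the file name.
    [ -e "$f" ] && gzip -dc "$f" | grep "$pattern" | sed "s|^|$f:|"
  done
  return 0
}

# Demo on a throwaway directory standing in for a node's log dir.
demo=$(mktemp -d)
printf '21/06/01 INFO starting job\n21/06/01 ERROR stage failed\n' > "$demo/container.log"
printf '21/06/02 ERROR executor lost\n' | gzip > "$demo/container.log.gz"
search_logs "$demo" ERROR
```

Multiply this by every driver and executor on every node, and it becomes clear why a centralized solution was needed.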
06:02
And we just wanted a solution that would make log search quick and easy and possible. And that's why we've designed a solution that's based around the ELK stack, but with Beats really being the core component in it. Now, if you're not familiar with Beats,
06:21
Beats are lightweight, open-source shippers. Each Beat is dedicated to a different data type. So what are we going to do here? We're going to just go into our EC2 servers and install two types of Beats. So first thing, we're going to install Filebeat, of course. That's for our log data.
06:42
And then since we're already there, why not also install Metricbeat and just get all of that interesting information from the EMR so we can just search our logs and metrics together? We're going to produce all of that information to Redis, consume it using Logstash, index it in Elasticsearch,
07:02
and then, of course, Kibana for visualization and search. So what we're going to do now is just do a deep dive into that bootstrap action that actually does that installation, because everything else pretty much works out of the box together, especially for log data.
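The glue between Redis and Elasticsearch in the pipeline just described is a small Logstash pipeline. A minimal sketch might look like this; the host names, the Redis key, and the index name are all hypothetical placeholders, not values from the talk:

```shell
#!/usr/bin/env bash
# Write a minimal Logstash pipeline config: consume Filebeat/Metricbeat
# events from a Redis list and index them into Elasticsearch.
# All hosts, keys, and index names below are hypothetical placeholders.
cat > spark-logs.conf <<'EOF'
input {
  redis {
    host      => "redis.internal.example.com"  # placeholder host
    data_type => "list"
    key       => "spark-logs"                  # list the Beats push into
  }
}
output {
  elasticsearch {
    hosts => ["es.internal.example.com:9200"]  # placeholder cluster
    index => "spark-logs-%{+YYYY.MM.dd}"       # daily indices
  }
}
EOF
echo "wrote spark-logs.conf"
```

Because Filebeat, Redis, Logstash, and Elasticsearch all speak to each other out of the box, this consuming side needs little more than the above; the interesting work is on the EMR side, in the bootstrap action.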
07:21
So let's just talk about that bootstrap action. And the first thing that we need to do is just understand what a bootstrap action is. So what you see before you is just taken out of AWS documentation. And these are the main points to keep in mind when thinking about using bootstrap actions. So you can use bootstrap actions
07:42
to just install all of that additional software that you want on your EMR. But it's very, very important to keep in mind that the bootstrap action is actually running before all of these applications and predefined software that you wanted installed on your EMR. So I have here for you today, I hope that's pretty visible,
08:04
the bootstrap actions that I defined at Nielsen. So we have two main ones. That's the one for Metricbeat (I'm really bad with the laser), and the one down there is the one for Filebeat. So you can see we just basically provide the location of the Bash script
08:23
and then some configuration variables that we're going to discuss in just a minute. Cool, so let's just deep dive into that Bash script. The first thing that we need to do is just determine that we are not running on the master node. The reason for that is very simple.
08:40
Spark is not running on the master node. So just installing Filebeat and Metricbeat on the master node would be kind of useless. Luckily for us, AWS does provide us with a lot of information, a lot of metadata about the EC2 servers in the EMR, one of which is: is it a master node or not? So we can just grab it out of the instance JSON file,
09:03
just as we did right here. Cool, so now that we know that we're not running on the master node, the next thing that we're going to do is actually receive all of those configuration variables. And at this point, I'm going to split them into two groups. So I'm going to call the first group for the future configuration variables
09:22
and then the second group I'm going to call "for the legacy." So "for the future" is everything that pops into your mind that can actually make your log search easy in the future. So that includes everything like flow name and team name, but also interesting things like EMR cluster ID,
09:42
which is actually the most recent information that you can gain about your job when it's running on EMR. To understand what the legacy variables mean and why they're here, I just have to take you for a moment to Nielsen. And so what happened at Nielsen Israel is what happens with a lot of companies. And it's the fact that we had at first one data team
10:03
and that one data team grew into three data teams. And each one of these teams, of course, was running their own Spark applications. And of course, they were using their own log4j files. So no standard whatsoever.
10:21
And forcing our big data developers to actually standardize their jobs and change their log4j files was harder than just defining these "for the legacy" configuration variables. That's fine. You don't have to change your log4j file. You can just pass in the multiline regular expression
10:40
for us to be able to parse your logs nonetheless. So that's exactly what we're doing here. So we're not only installing Filebeat and Metricbeat on all of these servers, we're also making this installation tailor-made to that specific job and that specific team.
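The opening of such a bootstrap script might look roughly like this. The instance metadata path and the `isMaster` field match what EMR typically exposes on its nodes, but verify against your EMR release; the flow name, team name, and multiline pattern arguments are hypothetical examples of the "for the future" and "for the legacy" variables:

```shell
#!/usr/bin/env bash
# Sketch of the start of a Filebeat bootstrap action: bail out on the
# master node, then pick up the per-job configuration passed as arguments.
# The metadata path and field name are assumptions to verify on your EMR.

INSTANCE_JSON="${INSTANCE_JSON:-/mnt/var/lib/info/instance.json}"

# EMR writes instance metadata as JSON; check the isMaster flag.
is_master() {
  grep -q '"isMaster"[[:space:]]*:[[:space:]]*true' "$INSTANCE_JSON"
}

main() {
  if is_master; then
    echo "Master node: Spark does not run here, skipping Beat install."
    return 0
  fi
  # "For the future" variables: make log search easy later.
  local flow_name="$1" team_name="$2"
  # "For the legacy" variables: let teams keep their own log4j layout.
  local multiline_pattern="$3"
  echo "Installing Filebeat for flow=$flow_name team=$team_name"
  echo "Multiline pattern: $multiline_pattern"
}

# Hypothetical invocation, as the bootstrap action would pass the args:
# main "enrichment-flow" "data-team-a" '^[0-9]{2}/[0-9]{2}/[0-9]{2}'
```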
11:01
The next thing we're doing looks kind of stupid. So it's a while loop. What we're doing is actually waiting for HDFS to be installed on the server. So if you remember when we talked about bootstrap actions, we said that they run before all of that predefined software. But we actually need some of it for Filebeat to run.
11:22
So we're going to wait for HDFS to be installed on the server. It's a pretty simple way to do it. If you have something that's more sophisticated and you think is right, that's totally fine. But the idea is that we have to wait. Cool. So now all we have to do is just download
11:41
and install Filebeat and then actually configure it. So all Elastic products are just configurable using YAML files. And what we're actually doing is just using a sed command on this template YAML file in order to configure it with those configuration
12:00
variables we received. So a template file would look exactly like this. And then if I can draw your attention to that EMR cluster ID, that's exactly one of these for the future variables. And if I can draw your attention to the very end, the multi-line pattern, that's exactly that for the legacy.
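Putting the last two steps together, waiting for the predefined software and then rendering the template with sed, might look roughly like this. The `{{...}}` placeholder style, the HDFS check, and the example values are all assumptions for illustration:

```shell
#!/usr/bin/env bash
# Sketch of the tail of the bootstrap action: wait until HDFS shows up
# (bootstrap actions run before EMR installs its predefined software),
# then render a filebeat.yml from a template via sed.
# Placeholder style and example values are assumptions.

wait_for_hdfs() {
  # Poll until the hdfs binary appears -- a deliberately simple approach.
  while ! command -v hdfs >/dev/null 2>&1; do
    echo "HDFS not installed yet, sleeping..."
    sleep 10
  done
}

render_config() {
  local template="$1" out="$2" cluster_id="$3" multiline="$4"
  sed -e "s|{{EMR_CLUSTER_ID}}|$cluster_id|g" \
      -e "s|{{MULTILINE_PATTERN}}|$multiline|g" \
      "$template" > "$out"
}

# Demo: render a tiny filebeat.yml-style template with hypothetical values.
tpl=$(mktemp)
cat > "$tpl" <<'EOF'
fields:
  emr_cluster_id: {{EMR_CLUSTER_ID}}
multiline.pattern: '{{MULTILINE_PATTERN}}'
EOF
render_config "$tpl" filebeat.yml "j-HYPOTHETICAL123" '^[0-9]{2}/[0-9]{2}'
cat filebeat.yml
```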
12:21
You can parse it in any way you want. We're going to install Metricbeat in basically the same way. So with this, this architecture is pretty much done. But I did say that we're going to have an obstacle. What would you think that the obstacle is?
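For the Metricbeat side that is installed "in basically the same way," a minimal metricbeat.yml sketch might look like this; the module selection, period, Redis host, and key are hypothetical, not values from the talk:

```shell
#!/usr/bin/env bash
# Minimal metricbeat.yml sketch: collect host metrics from the EMR node
# and ship them to the same Redis the logs pipeline consumes from.
# Module choice, period, hosts, and key are hypothetical placeholders.
cat > metricbeat.yml <<'EOF'
metricbeat.modules:
  - module: system
    metricsets: ["cpu", "memory", "network"]
    period: 10s

output.redis:
  hosts: ["redis.internal.example.com"]  # placeholder host
  key: "spark-metrics"
EOF
echo "wrote metricbeat.yml"
```

With both Beats shipping into the same pipeline, logs and metrics end up searchable side by side in Kibana.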
12:40
Does anyone have any idea? That's fine. I also didn't think that we were going to have an obstacle, but we definitely did. And that was just data engineers. Because I think the data engineers suffer from a lot of the things that just regular software engineers suffer from, which is that misconception that old is good.
13:01
Old is good. I like SSHing to my servers. And change is work. I don't really want to configure bootstrap actions. But eventually, we managed to convince them or force them. And they were searching their logs in Kibana, so that was fine. So some pretty Kibana charts here for you today.
13:21
You can just see the number of warning messages versus the number of error messages. This is really nice. I really like visualizing data. But this is the real star for me. So the thing circled in red, that's our log line. This is what we wanted to search. That's our Spark log line. Everything else you see is a bonus.
13:41
We have EMR cluster ID. We can see the Spark component. We have a lot of that information that was just not accessible to us at all before. And now, we can search our logs and also gain a lot more insights out of them. So, some future add-ons.
14:01
The first thing that I want to mention is that Redis was part of that architecture. And quite quickly, Redis became quite a big bottleneck as we were adding more and more applications and collecting more and more data. So we always thought about just replacing Redis
14:20
with Kafka, which would give us a more robust solution. And then the last thing, as I already said, EMR is just a bunch of EC2 servers running together. And at Nielsen Israel, we had quite a lot of EC2-based services. We were running our own managed Druid, Kafka, and schema registry, all on EC2. So we can just use this architecture
14:41
and use these installation scripts on all of these servers and services as well, to just gain more insights into the things that we're running and managing. And with this, I'm done. I'm not really sure if I have time for questions.
15:03
Thank you very much, Simona. Wonderfully on time. So you have the chance to ask her questions. And yeah, give me a sign and I'll come by with my mic.
15:27
Then thank you. Okay. Okay, thank you.