Demystifying Spark: A Deep Dive into Its Workings
Formal Metadata
Number of parts: 18
License: CC Attribution 4.0 International: You may use and modify the work or its content for any legal purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided you credit the author/rights holder in the manner specified by them.
Identifier: 10.5446/69797 (DOI)
Transcript: English (automatically generated)
00:05
So Neil is a backend engineer with a passion for big data, and he already holds a Master of Science in data science and is doing another degree. He didn't write down which one, so it's actually a secret.
00:22
And yeah, we are looking forward to hearing about Apache Spark's inner workings. So, who of you has worked with Apache Spark in the past? Oh, that looks like about one third of you. Yes, about one third.
00:40
So yeah, you will help us understand how Apache Spark really works. So thanks a lot, and a warm welcome to you. Thank you. My name's Neil Gibbons, and I'm currently doing a Master's in computer science, but I've worked as a backend engineer for the last couple of years.
01:02
And the topic of my talk today is an introduction to the internals of Spark. I think most people are probably familiar with what Spark is, but if you're not, it's an open-source framework for processing large amounts of data.
01:23
And I found that understanding it at a deeper level is a bit difficult, because it offers so much functionality. It can handle streaming workloads, interactive workloads, batch
01:43
workloads, and also very iterative machine learning workloads. So it's quite difficult to know where to start with all of this, and on top of that, there's a massive code base. So it's difficult to pick it all apart, and there aren't actually that many
02:02
resources on the internals of Spark. So that was my motivation for making this talk today. In terms of structure, first of all I'm going to
02:21
speak about Map, Shuffle and Reduce. This is one of the core ideas behind Spark, but a lot of the MapReduce machinery is abstracted away, so thinking about how Spark is doing MapReduce underneath helps you build some
02:42
intuition for it. In the next part of the talk, I want to talk about RDDs, Resilient Distributed Datasets, which are the core abstraction behind Spark. Once you understand how these work, the whole of the Spark system
03:09
becomes a bit clearer. And then at the end, if we have time, I want to go through the Spark scheduler: how it takes this collection of RDD objects and uses it to actually
03:28
do our computation and produce a result. So to start with, I'm going to talk about how Map, Shuffle and Reduce sit behind Spark.
03:42
So the first thing to say, I think, is that Spark is just another in a long line of computation frameworks that break down tasks into Map and Reduce. So what is MapReduce?
04:02
It's basically a way of breaking down a computation task so that it can be parallelized, so it can be done by multiple actors at the same time. That's really key for what they call horizontal scaling.
04:25
And something that I quite liked is that this idea is more general than just running on computers. People haven't called it MapReduce, but they've used the idea for hundreds of years.
04:43
So in a census, you don't have one person going around to every single house and counting the number of people in a country. You sort of split up the task. So you have a person, say, for each village, and then they sort of bring their results together.
05:03
So the person going to each village would be the equivalent of the Map stage. And then bringing your results into one place is like the Shuffle. And then the Reduce task is actually combining the results from all the different villages into one.
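To make those three steps concrete, here is a small sketch of the census example in plain Python. This is only an illustration of the pattern (the village names and per-house numbers are made up), not anything Spark-specific yet:

```python
from collections import defaultdict

# Hypothetical census data: each "map" worker handles one village on their own.
villages = {
    "Alpha": [4, 2, 3],   # people counted per house in village Alpha
    "Beta":  [5, 1],
    "Gamma": [2, 2, 6],
}

# Map: every worker independently turns their village into (village, count) pairs.
mapped = [(village, count) for village, houses in villages.items() for count in houses]

# Shuffle: bring all partial counts for the same key together in one place.
shuffled = defaultdict(list)
for village, count in mapped:
    shuffled[village].append(count)

# Reduce: combine the partial counts from each village into the final totals.
totals = {village: sum(counts) for village, counts in shuffled.items()}
print(totals)   # {'Alpha': 9, 'Beta': 6, 'Gamma': 10}
```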
05:25
Another manual example of MapReduce is creating a book index. They actually did this before they had computers, too: if you had a team of 10 people
05:45
and you wanted to create a book index, you might give 10% of the pages to each person, and they would pick out the important words from their pages.
06:00
Then there's a Shuffle step where they bring all their data together, and a Reduce where everything is combined. Now, to talk about some actual machine implementations of MapReduce,
06:26
a very early one was in the 1890s for the US Census: the Hollerith machine. The Map stage there would be encoding each person's information
06:44
on a punch card, which you could then feed into a machine. The machine would then do the Reduce step for you, combining all of these people's information
07:00
to create the final census. And this is such a powerful idea that it became the core of Google's own implementation of MapReduce,
07:21
which they described in a paper in 2004, and that led to the open-source Hadoop MapReduce. And I think it's worth mentioning that Google developed this in order to produce their search index in a way that could be parallelized.
07:46
Spark was then an evolution of Hadoop MapReduce. There were some limitations with the original MapReduce,
08:04
which is why they implemented Spark. So that's an explanation of the MapReduce idea. Now I want to talk a bit more about the other core idea behind Spark,
08:25
which is resilient distributed datasets: where they come from, the problem they're trying to solve, and how they're actually implemented. I mentioned that what came before Spark was MapReduce, and what they found
08:48
using it in practice is that it was very good for batch workloads, something like producing Google's search index by scraping the open internet;
09:03
it could do that quite well. But for more interactive and iterative workloads, it wasn't as good. An interactive workload would be something like doing exploratory data analysis
09:24
with SQL queries one after another. Similarly, an iterative workload comes up especially in machine learning, where maybe you're waiting for convergence or you have a loop that keeps running.
09:49
MapReduce wasn't so good at handling these workloads. And this was a problem, because around that time, 2008, 2009,
10:05
machine learning was becoming a lot more popular. This is what motivated the AMPLab team at UC Berkeley to have a look at how they could improve MapReduce.
10:23
One of the first things they did was diagnose the problem: what was actually causing the slowness, and what does the system spend most of its time doing? And what it was spending most of its time doing
10:45
was writing the intermediate data in a calculation to disk, to stable storage. The reason it was doing this was fault tolerance. In any distributed system, fault tolerance is really important,
11:05
because you don't want to lose any data. And the strategy they were using to achieve fault tolerance was replication.
11:23
And writing to disk every time you go through a loop, or every time you run a SQL query,
11:40
at every stage in the workload, was very costly. So what the AMPLab team wanted to do was find a way to avoid that. And something to introduce here is that there's a tension between fault tolerance and performance,
12:05
and what they were trying to build is a system that achieves fault tolerance and performance at the same time, the best of both worlds. So the next step for this team
12:26
was to think of other strategies for achieving fault tolerance that could have better performance. One of these strategies was re-computation.
12:42
And it's almost surprising that this was the one that ended up working, because if you think back to creating a book index or a census,
13:01
the time-consuming part there is having to physically go through the pages or the houses; handing your data over to a particular person isn't what takes the time.
13:21
So if you think about it in terms of the manual example, you could say that re-computation seems like a really bad idea. But on a computer,
13:43
the actual counting or computation isn't what's expensive; it's the moving of data that's expensive. The graph I've got here is from one of the papers the AMPLab team produced.
14:02
When you have an iterative algorithm and you use this re-computation strategy, it ends up being a lot faster. For an iterative workload, the first iteration
14:29
is still reading from stable storage, so that one's a bit slower. But for the following iterations, you're not having to read from and write to disk,
14:40
so those ones are a lot quicker. And even when there is a failure, if you just have to recompute one partition of the data, it's not that much slower. The other really good thing about this strategy is that rather than doing the expensive thing for fault tolerance on every single iteration,
15:10
you only have to do the expensive thing, which is re-computation when there actually is a fault. So that would be on iteration number six.
15:24
So now that we have this idea of how we're going to provide fault tolerance and good performance at the same time, achieving it through re-computation when there is a fault,
15:45
the next question is: how do we actually implement this? RDDs, or Resilient Distributed Datasets, the core idea behind Spark, are an implementation of this idea. And the good news for understanding Spark is that they are actually quite simple.
16:06
What do you need for re-computation? You need to be able to follow the chain of computation steps, so you need a link back to your parent.
16:21
You also need a function which can compute a partition based on its parents. And we need to keep track of the partitions of the data, how we've actually split the data apart.
16:44
And that's basically it. That's how we represent our data in Spark, and that's how we achieve fault tolerance with good performance.
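As an illustration only (not Spark's actual classes), a minimal sketch of that idea might look like the toy class below, with a link to the parent, a compute function, and a number of partitions. Real RDDs also carry partitioner and preferred-location information:

```python
# Toy sketch of the RDD idea: lineage + a compute function + partitions.
# Illustration only; not Spark's real implementation.
class ToyRDD:
    def __init__(self, parent, compute, num_partitions):
        self.parent = parent                    # link back to the parent RDD (the lineage)
        self.compute = compute                  # how to derive one partition's data
        self.num_partitions = num_partitions    # how the data has been split apart

    def get_partition(self, i):
        if self.parent is None:
            return self.compute(i)                   # a source RDD produces its own data
        parent_data = self.parent.get_partition(i)   # recompute the parent partition if needed
        return self.compute(parent_data)             # ...then apply this step on top of it

# If a partition is lost, it can be rebuilt by replaying the compute functions from the
# source, instead of keeping replicated copies of intermediate data on stable storage.
numbers = ToyRDD(None, lambda i: list(range(i * 10, (i + 1) * 10)), num_partitions=4)
doubled = ToyRDD(numbers, lambda part: [x * 2 for x in part], num_partitions=4)
print(doubled.get_partition(2))   # recomputed on demand: [40, 42, ..., 58]
```

This sketch only models narrow, one-to-one dependencies; shuffles, where a partition depends on many parent partitions, are where the DAG scheduler later has to draw stage boundaries.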
17:02
And we can do this re-computation when we need to. Then, just briefly to finish out the talk, I want to speak about the Spark scheduler. The question here is: how does it actually take the collection of RDD objects
17:25
and turn that into a result? Here's an example of some Spark code. When you write this code
17:43
and specify transformations like split, map or reduceByKey, Spark doesn't actually materialize the data at that point.
18:02
It builds up a logical plan. Then when you call the collect or count method on the RDD, that's called an action, and that's what finally triggers the computation.
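A minimal PySpark sketch of that behaviour (not the code from the slide): every line except the last just records a transformation, and only the action at the end submits a job.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-evaluation-demo")

rdd = sc.parallelize(range(1_000_000))        # base RDD, nothing computed yet
squares = rdd.map(lambda x: x * x)            # transformation: recorded in the logical plan
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still just extending the plan

# The action below is what finally hands the plan to the DAG scheduler and runs tasks.
print(evens.count())
```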
18:24
The first part of the system, which takes all of the RDD objects, is the DAG scheduler. What it's doing is breaking up the computation task into stages.
18:46
What it does is look for shuffle boundaries, and it basically tries to pipeline as many transformations together into one
19:11
stage as it can. It's also worth mentioning that the DAG scheduler is responsible for the re-computation strategy
19:26
for fault tolerance. Before I come on to what happens after it's split the work into stages: this code here, I realize I'm going through it quite quickly,
19:43
but it's basically taking a dataset like the one you see in the top right-hand corner, a CSV, and trying to rank locations by popularity.
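The exact code on the slide isn't in the transcript, but it was presumably something along these lines (the file path, column layout and "active" flag here are assumptions made for this sketch):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "location-popularity")

logs = sc.textFile("logs.csv")                    # the CSV from the slide (path assumed)
rows = logs.map(lambda line: line.split(","))     # assumed layout: user_id,location,active
active = rows.filter(lambda r: r[2] == "true")    # drop inactive users
pairs = active.map(lambda r: (r[1], 1))           # (location, activity) style tuples
counts = pairs.reduceByKey(lambda a, b: a + b)    # wide transformation: shuffle boundary
ranked = counts.map(lambda kv: (kv[1], kv[0])).sortByKey(ascending=False)  # second shuffle
print(ranked.collect())                           # action: triggers the whole job
```

The filter and the narrow maps can be pipelined into a single stage, while reduceByKey and sortByKey each need a shuffle, which is why the DAG scheduler ends up with three stages here.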
20:02
So we have this Spark code, and the DAG scheduler produces what we have at the bottom: it combines the filter and the map together into one stage, and then it creates two more stages, one for the reduceByKey transformation
20:21
and one for the sortByKey transformation. Once it's got this set of stages and tasks, it passes them on to the task scheduler. The task scheduler basically talks to a cluster manager,
20:43
something like Apache Mesos or YARN, and launches specific tasks. There's a preferred location where it will run each task:
21:01
it takes into account where the data is actually stored, and also which workers are free to take on the particular task. To build on the example from earlier,
21:23
we've got our filter and map stage, and it breaks that down into tasks, one task for each partition of the original dataset. So task one reads partition one of the logs RDD, filters out the inactive users,
21:46
and then creates these (location, activity) tuples. That gives a set of tasks that the scheduler then hands to the cluster manager to run.
22:05
It'll do a similar thing for the other stages, stage two and stage three. The actual task within stage two is reducing all records for the locations assigned to its particular partition.
22:24
And then there's the sortByKey, which sorts records within each partition. And this kind of brings everything together: we started with our code,
22:42
and now we've built up this, which is what the actual execution looks like. That's what the scheduler has done. In this example, let's say we have three workers, and we've got five partitions of the data.
23:02
At the bottom I've put the stages; that was the DAG scheduler that did that for us. Once you've figured out what the stages of your computation are,
23:20
the task scheduler creates a task for each partition. It starts with task one and decides to put it on worker one, just because it sees that it's available.
23:42
Same for task two, which goes on worker two, and task three on worker three. Then all of our workers are taken up, so we have to wait a bit before we can put task four on worker two once that's finished, and then task five on the next worker that frees up.
24:05
A similar process happens for stage two: the task scheduler puts its tasks onto the different workers. But notice how we can't immediately put task one of stage two onto worker one.
24:27
Worker one has to wait, because of the dependency that the DAG scheduler has worked out: we need to completely finish everything within stage one before we can start on stage two.
24:46
I think I'm basically out of time, but just to recap what I spoke about: at a very high level, Spark is just another in a long line of
25:04
methods of breaking down a computation task into these map, shuffle and reduce stages. It's maybe not that obvious that it's doing this behind the scenes, because it's got such a nice user interface and such nice abstractions.
25:27
But that's what it's doing, and when you understand it in that way, it seems a bit less intimidating. The other major idea I was speaking about was
25:41
resilient distributed datasets: where the idea came from and how they were actually implemented. And then I went over the Spark scheduler example as well, and talked about how it takes these RDD objects
26:01
and, from that plan, converts them into computation and calculates a result for you. So yeah, I think that's everything.
26:29
If there is one question, we would have time for one. There's one over there. Oh, hello. Thank you so much for your presentation.
26:43
As you mentioned, Spark relies on its own mechanisms to create a distributed cluster solution, and some companies do not like that much. And we have other options in the market, emerging solutions in Python such as Ray and Dask. So I would like to hear your opinion about them.
27:02
Yeah, I haven't actually worked with alternative cluster managers that much, so I can't really comment. But I think Spark even has its own cluster manager as well. But yeah, I can't comment on that question.
27:25
All right, thank you very much. Okay, thank you. And yeah, enjoy your lunch break. Thank you.