
A guided tour of the BigData technologies zoo


Formal Metadata

Title: A guided tour of the BigData technologies zoo
Number of Parts: 163
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use, adapt, copy, distribute and make the work or content publicly available in unchanged or adapted form for any legal and non-commercial purpose, as long as you credit the author/rights holder in the manner specified by them and share the work or this content, including in adapted form, only under the conditions of this license.

Content Metadata

Abstract: Elephants (named "Hadoop" after a toy), bees and hives, pigs, ELKs, rabbits and one ZooKeeper. You can find them in a zoo, or you are just a software developer trying to make sense out of data using technologies with weird names. This session will be a guided tour of the BigData kingdom. We will explore various challenges one has to face when handling large volumes of data, and learn about various tools with funny names which were built to help in the process. Equipped with real-world examples and use-cases, by the end of this session BigData should stop being a buzzword to you.
Transcript: English (automatically generated)
OK. So hello, everybody. This talk is a very high level overview of big data technologies, commonly referred to today as Hadoop. If you expect anything deeper than a high level overview, that's not this talk. This is a very high level: what is Hadoop, what are
big data technologies, how you can use them, what you can do with them, this kind of thing. I will also try to give you a good perspective of the current state of affairs in the Hadoop ecosystem, but it's a huge ecosystem, so hopefully we'll be able to cover
some of it. My name is Itamar. You can find me as Sinharsh pretty much everywhere: GitHub, Twitter, everywhere. I've been involved mostly in search engines for many years now, but I've been working a lot on big data kind of stuff lately.
Today I work at a company that detects fraudsters in real time. A merchant sends us a transaction, we look at the transaction, and within a couple of hundred milliseconds we give them an answer: whether it's a fraudster or not.
So we look at lots of data. This is a good example of real transactions that somebody made. And it does look very suspicious, right? Somebody buys three MacBook Airs from an IP in Rwanda and ships them to the States. And he did have previous transactions, but they all were
for really small sums. And now the transaction is for $15,000, and it's expedited shipping, et cetera. So lots of dots that don't really connect until you realize the story behind the person, until you realize he's serving in a remote location, and he buys laptops
so he has something to do, to watch movies or something at his post, and stuff like that. And when you connect the dots, suddenly it makes sense. So we do lots of real time big data stream processing. We'll get to that later today as well. We do lots of really cool things with that.
One thing about big data, I'm not really good with buzz words, and whenever somebody says big data, that's what I think. And I'll use the words big data to give you some context over the next two or three minutes, and then I'll stop using those words. But you have to have big data in the title of a talk
to give people an idea what you're going to talk about. But I'll explain in just a second why we are not going to use big data. And the reason for that is because it's a problem of definitions. What is big data? So that's one definition. We may have other definitions.
People can refer to gigabytes or terabytes of data as big data, but that's not really true, because today when we talk about big data, we usually refer to petabytes of data and more. So because of those definition problems, I'm not going to use this term anymore, unless I slip. But we do have a problem.
We do have a problem where we have a lot of data that we want to use, and a lot of data that we need to store, and a lot of data that we want to process. And that data needs some way of working with it. Better yet, it's not only when you have a lot of data, but this big data realm, all of those technologies that
we use to make sense of data and to process it, we basically came up with quite a lot of ways of handling that. So because we have those technologies, we can now enable higher data retention. We are now aware of the need to be highly available.
We now know that we need to do parallel computations on data to provide better responses, faster responses. So out of this realm of processing lots and lots of data, you can also take conclusions and technologies and apply them to your projects, even if you don't have so-called big data.
So that's what we're going to do today. We're going to take a look, an overview, a bird's eye view of this ecosystem. It's called Hadoop, and we'll talk about it in a second. But it not only applies there. It applies pretty much everywhere. But again, you just need to understand the concepts and to
know how to use them. So where are we today? What do we have today? Today, pretty much every project on the planet still uses SQL to some extent and has these old ways of storing data, of working with data, tables, rows,
schemas, et cetera, et cetera, et cetera. So if we're talking about brownfield applications, projects, for sure, even new projects, sometimes it's just easier to bootstrap something using SQL and stuff. But it's very limiting in the way of how we think of data, how we approach a project, how we do things.
So you need to think in terms of schemas. You need to see how your data can fit into those tables, because we're working with tables, because we're working with SQL databases, we are now limited to one machine, because the way SQL works, you basically cannot scale it out.
So you just need to have bigger and bigger and bigger boxes, unless you apply very sophisticated methodologies. And then you're pretty much limited. And then scaling, or growing, basically means you have to spend lots of money on buying a bigger and bigger and bigger machine. This relational mindset really limits us.
It doesn't let us move forward, and that's really what needs to change. Then we get to a point where, in order to be able to grow, we need to delete information, or back it up, or store it sideways, or something like that. And that's where, again, it doesn't make sense anymore.
Because now, when we want to be able to move and deal with lots and lots of data, we are pretty much stuck. So the use cases for when you want to look at those big data technologies, or move to other types of technologies, are where you want to deal with many, many files or
really huge files. So Netflix videos is one good example. How do you store so many large files and ship them to customers? How do you deal with large data sets? How do you compute on them? How do you derive information out of many, many logs, for
example, if you have lots of traffic and you want to understand what's going on there, and you want to try and do an anomaly detection on those logs, for example, how do you do that if the files are very, very big? That's where you start looking at those technologies.
So our agenda for today is basically to look at data when it's at rest. So when I have big files, like the Netflix videos example, and I just need to store them somewhere. I cannot really store them on one PC, on one server. I can't even do this on 10 servers. I need many, many servers.
How do I manage storing so many files, so many big files on a cluster, and still have this highly available and replicated and all that comes with it? So that's data at rest. Once I know how to store that, I need to start thinking about how can I compute on that, so if it's not just videos, if it's really huge data sets.
So another big buzzword today is the Internet of Things. So I have lots of data coming in. How do I store all of the data and still be able to compute on it, even though it's huge, huge data sets? So that's another thing we'll look at. Then comes the question of streams. I not only want to compute on stuff that I have stored, I
want to be able to compute stuff as it comes in. Not only compute on that, maybe I want to react on that data. So we will look at stream processing. We'll look at how I can deal with data as it comes in by treating and handling the streams themselves.
And then we'll briefly talk about how I can build this huge ecosystem, because as we'll see in just a second, there are many, many technologies that need to be pieced together somehow. So I'll give you a couple of words about how we can do this generally. So what's really amazing about all of that is, and it's
this quote, that Grace Hopper is the mother of code. She was one of the best programmers, if you could call it that. She was really amazing and did amazing stuff. She invented or created COBOL. And she made really, really amazing stuff in the area of
computer science. And this quote is quite amazing, because she died in 1992. And back then, there wasn't even an internet as we know it today. And still, she was able to imagine and say this quote. And this quote basically says that hundreds of years ago,
when people wanted to do heavy lifting and do stuff with animals, and the ox couldn't pull it anymore because it was too heavy for it, they didn't try to grow a bigger ox. They just got more oxen. And basically, that's exactly what she says: scale out, not up. And that's something we say today quite often.
And she was able to imagine that and say that long before 1992. I don't know when she said this quote, but she died in 1992. And it's amazing. It's an amazing fact. And that's exactly what lies at the heart of our talk today, because that's exactly what we want to do when we want to
support lots of data and do lots of stuff on lots of data. But then we get to this mess. So we have lots and lots and lots of technologies, each doing its own thing. We need to understand what's going on there. We need to understand what can support what, and how we can plug them together.
This is what this talk is about today. So a few words about Hadoop. Hadoop started as a project. I'll say a couple of words about what the project was. But today, it's not just a project anymore. It still is, but what it really is today is an ecosystem. And we have many, many products, all of those from
the other slide. All of those products are basically part of the Hadoop ecosystem, one way or another. We will create a Hadoop cluster, which I'll discuss in just a second exactly what it is. But this cluster is going to be assembled of many of those other technologies. So Hadoop today is an ecosystem, not
just a software product. So Hadoop was actually a part of a bigger project. It was called Nutch. And Nutch was created to be a search engine, pretty much like Google. It was actually based on two papers Google released. One is GFS, the Google File System: how do you store a lot of data?
And the second one was MapReduce: how do you compute on lots of data? So Google released those two papers. And Doug Cutting and Mike Cafarella went and started implementing that search engine based on those two papers, and they called it Nutch. This project, basically, today we know it as two projects.
One is Lucene. I gave a talk earlier today about that. So the search engine part, how do you search on text, do this stuff? That's Lucene. And the second part treated the actual storage and the actual computation: how do you do this index computation in parallel?
And that was the MapReduce part. So that Nutch project is basically today Hadoop and Lucene. And Hadoop is taking care of those two pieces of storage and parallel compute. Hadoop, right from the get go, was open source. Today it is under the Apache umbrella, completely open
source, and its entire purpose is to allow you to do, again, to support lots of storage and parallel compute, but doing that using commodity hardware, without using any supercomputers. So ever since Nutch was created and the Hadoop
project started rolling, it just got impetus and just went forward, and really amazing things happened. Within about 15 years, maybe less, there were big companies established on top of Hadoop. And today we have big Hadoop providers and big products
and other big companies, which, again, are not working on Hadoop itself, but are working on products that are part of the Hadoop ecosystem. It became a really huge, really fertile ecosystem. So the two things that Hadoop treats are storage and compute.
Storage, for one, is how you store a lot of data, which could be lots of files and could be really huge files. We'll talk a bit in just a second about what that means. It's called HDFS, the Hadoop Distributed File System. And I'll show you in just a bit how it works and what it does exactly.
The second thing Hadoop does is parallel compute. So I have this data somewhere. Now I want to compute on that in parallel. So again, it's based on the Google MapReduce paper. We'll talk about MapReduce, and we'll see where we are today with MapReduce.
So HDFS, it's really about taking a very big file and giving you the ability of storing it on more than one server. That's for one. So you can actually store a huge file that cannot fit on one single server, because you're now using multiple servers to store it.
And the second thing it allows you to do is to actually store lots of files on many, many, many servers as well. It also allows you to replicate these files. Even if one server goes down, you still have access to that file.
So the way it does that, it takes this huge file, it splits it into chunks, and pushes those chunks into the cluster. So in this example here, I have a very big file. I split it into six chunks, and I have four servers that I want to be part of my HDFS cluster. What I do now, I take each of those chunks
and I store them into the cluster, making sure that I have at least two copies of each chunk. And that's exactly what you see here. Each chunk of the file is going to be stored on at least two servers. And that will allow me to failover to the other chunk if something happens.
And still, I have this entire file available to me from this entire cluster. I will also want those chunks to be spread across the cluster. That's for the MapReduce part, which we'll get to in a few minutes. So those nodes are called data nodes, right? Those nodes contain those chunks. They contain the actual data, but they contain the data as chunks.
And that's why they are data nodes. If I want to access the file by a file name, because still it's a file system, even though it's a distributed file system, then I will need to go to the cluster through another type of server, which we call the name node. The name node contains basically the file system table.
And every time I want to access a file, I go to the name node and I ask for that file. And the name node tells me, OK, this file has those chunks; please go to those data nodes and ask each data node for the chunks that it has. And it gives me this entire table. So in HDFS, files are pretty much write only.
Sorry, write once and read only after that. So you write the file once, and then you can read it as many times as you want, but you cannot really change a file. And the name node, we'll talk about this in a bit, but the name node is pretty much my single point of failure. Because the data itself is replicated,
but once my name node crashes or something happens to my name node, I lose all my data, because even though I still have the chunks somewhere, I may not know which chunks make up which file.
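To make the storage side concrete, here is a minimal sketch of writing and reading a file on HDFS through the Java FileSystem API; the name node address and the paths are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster; the name node resolves file names to chunks on data nodes.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Files are write-once: create, write, close.
        Path path = new Path("/data/example.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Reads can happen as many times as you want, from any client.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}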
So you're already using a distributed file system today. If you're using S3, that's a distributed file system; that's basically, again, some sort of implementation of GFS, very similar to HDFS. There are many, many other implementations of this file system concept that I just described. And again, every one of them is making its own trade-offs.
Every one is making its own design decisions to support failover in various scenarios. But once we are talking about distributed file systems, and because we are going to attack the problem of compute, it really matters how you store those files. Because if you store them in a certain way,
it will allow computations of a certain type to be done more efficiently. For example, for some computations it makes sense to store data as rows, because I need the entire row at a time; if I only need a column at a time, then it will make sense for me to store my data
files in a columnar format. So this is why we have many, many types of file formats. And each file format can come with its own compression scheme, et cetera, et cetera. So let's see how compute works on top of HDFS. And for that, we need to use the concept of MapReduce.
So the more appropriate slide may be this one, because many, many times people hear about MapReduce and suddenly get frightened. There's nothing to be afraid of. And we'll have a very nice example of how MapReduce works in the next slide. However, a few words on MapReduce.
So MapReduce, the concept behind MapReduce, is to be able to compute on data, on one piece of data, but in parallel. So we need to realize what we can parallelize, and then how we can combine those parallelized results. To be able to do this efficiently on Hadoop, we need MapReduce to be tightly integrated with HDFS.
And we'll see that that's indeed the case. And MapReduce originally is Java code that gets compiled to JVM bytecode and then submitted to the cluster. And we'll see that in a second. So MapReduce as a concept is, again,
taking a large piece of data. So in this case, it's three documents, but it could be as many documents as you want. And then in this word count example, I want to count the number of words and the number of unique words. If a word appears twice, I want in the end to get the word and a count of two.
So the way I'm going to do this, I'm going to look at the entire data set. But instead of writing one method to compute on the entire data set, I'm going to split the data set into chunks, like we did in HDFS. And then each computing unit is going to work on its own piece of data,
do some work, and that work needs in the end to be able to be combined into one result. So here are three documents. Each document resides on its own parallel computing unit. And each unit is now going to do a map phase. That map is basically going to do an initial computation
that will give me something that I can work with going forward. So this process, in the word count example, is just taking the sentence, taking the text, tokenizing it, and getting to a point where I have a word and a count for that specific document. So in this case, I get three sets of counts.
So each one contains the actual words it has, and for each word a count. If I only have one occurrence of that word in that document, I'm going to get a count of one. And then I'm going to basically collect
all of those results from all of those computations. And I'm going to combine them into something that I can now move to the next phase, which is reduce. And that's where we get the name MapReduce. I first started mapping something from the data that I have to something that I can work with.
And now I'm transferring those intermediate results into the reduce phase, which will do the actual computation. And the reduce phase is going to look at the words as the key, for example. It's like a dictionary with keys. And it will just count how many times they appeared
on those discrete units that did the computation in parallel. So it's Java code. And the Java code is pretty simple to understand. We have two classes: one is map and one is reduce. The map class will define what types of input it gets
and what type of output it emits. And basically those are the lines that interest us. I'm just looking at the text, I'm tokenizing it, and I'm just outputting the word and a number of one. The reduce class is going to just accept all of those tuples.
And again, iterate through all of those tuples and just count the ones, basically. So I have two classes, one doing the map, one doing the reduce, and together I get something that can do parallel computing. Because it's Java and because I need to do something with it and send it somehow to the Hadoop cluster,
I will have this main method, which will define all of the wiring: which is the map class, which is the reduce class, and then I'll call runJob.
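For reference, here is a minimal sketch of that word count as a Hadoop MapReduce job in Java, using the newer org.apache.hadoop.mapreduce API (where the runJob call from the talk becomes waitForCompletion); the input and output paths are assumptions passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: tokenize each line and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the ones emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // The "wiring": which class maps, which class reduces, where input and output live.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}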
So MapReduce is tightly integrated with HDFS, and that's in order to be able, again, to support this computation on top of those large chunks. That's what we have Hadoop for. So this time we're going to look at the same slide that we've seen before, the same architecture, the same cluster. This time, instead of treating it as a cluster for HDFS,
I'm going to talk about it as a cluster for computation, for MapReduce. So again, we have the same file and it's split into chunks and those chunks are stored on those nodes. And those are indeed data nodes, but this time I'm going to treat them as task trackers
because I just created a MapReduce task, a computation that I submit to the cluster. And that task is going to run in parallel on each of those nodes. Each node is going to select the chunks it's going to work on, and then it's going to produce its results
and submit it to the reducer. Now, this reducer is not necessarily going to be on those data nodes. And that's, again, some sort of a single point of failure and we'll touch on that in just a bit. So similarly to what we've seen on HDFS, I have this cluster which holds the data.
In HDFS, I call them data nodes. From this angle, I'm calling them task trackers. I still need something to coordinate all the work. So that's the job tracker. The job tracker is going to accept the job from the Java code that generates it and submits the job; it calls runJob. It will get this MapReduce job
and it will know how to coordinate the execution on those nodes within the Hadoop cluster. If a computation fails for some reason, then the job tracker's job is to reinitiate the task for that specific chunk.
So that's Hadoop in general, right? So we have the storage part. We have the MapReduce, or computation, part. They are, again, pretty tightly coupled. And that's Hadoop in general. Hadoop comes in various flavors. So it's a concept. It's been implemented as one codebase under the Apache umbrella.
But today, it's maintained by several companies using several different design decisions around various stuff, accompanied with additional components or software products that let you do things more easily. For example, the image that you see
is from the Cloudera distribution. This is the Cloudera Manager. It allows you to install the cluster on any cloud environment. It manages that for you. It gives you nice management tools. So every distribution gives you added value. In some respects, the big ones are Cloudera and Hortonworks,
the HDP, Hortonworks Data Platform. There are more implementations. For example, MapR is some sort of a Hadoop implementation, but in C++; but they all pretty much implement the same concept that we just saw. And the whole idea behind this entire Hadoop thing is, again, being able to do all of that,
the storage and compute, on commodity hardware. And then we get to some additional benefits. So we have storage, we have compute, and the way compute works is by creating Java MapReduce tasks, which basically are written in Java
and are submitted to the cluster through the job tracker and computed on the actual data. And this is where it gets really nasty, because once you have really complicated MapReduce tasks, this Java code gets really ugly, because you need to write lots of Java code.
You have to maintain lots of logic. Suddenly, you have dependencies between tasks. Things become really, really nasty. So this is why they started thinking about those MapReduce tasks, those Java jobs, as some sort of an assembly language for the computation on Hadoop,
that maybe we can write some higher level language that will translate into those Java tasks or those MapReduce tasks, which are compiled to JVM bytecode. And this is where, for example, Hive comes from. Hive gives you SQL type of syntax to create MapReduce jobs,
which are basically translated into Java jobs, Java MapReduce jobs, but then you can submit that as if it was a normal MapReduce job. However, you don't write any Java, and it's easier to write and easier to maintain. So Hive is an SQL type of language that lets you write MapReduce jobs.
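As a sketch of what that looks like, here is the word count expressed in HiveQL and submitted through the Hive JDBC driver; the connection string, the table name (docs) and the column name (line) are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCount {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
            // Hive translates this query into MapReduce jobs behind the scenes.
            String query =
                "SELECT word, count(1) AS cnt " +
                "FROM (SELECT explode(split(line, ' ')) AS word FROM docs) words " +
                "GROUP BY word";
            try (ResultSet rs = stmt.executeQuery(query)) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }
}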
Pig is another example for that. It's exactly the same, only this time you don't write SQL kind of syntax. You write scripting kind of syntax. So both this and the previous slide showed you a word count. So while in Hive you would create a table
and you'll do some table manipulation in order to get individual words and then do a group by and count them, here you write something that looks more like a script. You'll tokenize and you will iterate, and then you'll be able to get the word count, and then you will dump the result,
probably into HDFS. So MapReduce is basically a Java job that's being compiled and sent to the cluster for execution, but as things get more and more involved, people want to write better syntax and something more maintainable, and this is why we have, for example,
Pig and Hive for that. The scenario today is a bit different, and we'll talk about it in a second. And those MapReduce jobs will still need a way to be coordinated and scheduled and tested, to see if they actually ran and if something happened, and that's where we get projects like Oozie,
Azkaban, Luigi, those projects basically give you a way to schedule jobs and to rerun them and to see dependencies, et cetera, et cetera, because again, those MapReduce jobs can become really, really complex and really, really heavy. And then we get to the not so nice things about Hadoop,
because as I said, we have many, many single points of failure. For example, on HDFS, we have this name node, which, if something happens to it, we basically lose all our data. So obviously there are workarounds, there are things we can do, but it doesn't look very nice to begin with. And then we have this thing
of the single point of failure of MapReduce. It's not only the single point of failure, which is the job tracker, but also the time that it takes MapReduce jobs to run. So Hadoop is very well known for long MapReduce tasks. People talk about a MapReduce task as something you take and send to the cluster,
you go home to sleep, you come back the next morning, you see if you had results or not. If you had a bug, you'll have to fix it and come back the next morning to see the results. So that's how people know Hadoop. And that's a problem. And it's not really a problem of Hadoop, it's more of a problem of resource management
of Hadoop clusters. And the Hadoop community is pretty much aware of that, and that's where Hadoop is going, and that's where we are today. And what I'll do now is show you quite a few solutions for that. So the big problems with Hadoop today are mostly the resource management kind of stuff.
So the time it takes MapReduce jobs to run is basically due to resource management. And there's also the issue of data locality. Hadoop really runs well when you have the data you need to compute on, when it's local to the actual executing node. Once you get to a point
where your cluster isn't balanced well enough, we get to a point where the executors have to move data between nodes to be able to compute on it. And again, there's the reducer scenario. And that's where you get to a point where jobs take a long time to run. And that's what we want to solve.
How do we solve that? So basically the MapReduce concept is good, it's solid, it really solves the problem. The problem is mostly a resource management kind of issue. So we have two ways here. It's either rewriting or reimagining
the MapReduce concept. So we have the MapReduce concept, it works well, but we just need to find some way to execute it better as a concept. Or we can go and do resource management better. So Hadoop as a project goes into what we call YARN, Yet Another Resource Negotiator.
And YARN is basically the Hadoop 2.0 way of saying, I will manage resources for you and do this much better. So that's YARN, and YARN can be thought of as an operating system for clusters, for the cloud. I will create a cloud, I will create a cluster for you,
but I will do it in a smart way and be able to manage the resources better for you. And that way I'll be able to basically map computations better and have MapReduce tasks run better. So that's one way of solving this problem. The second one is reimagining MapReduce.
Again, the concept is valid; the question is how we treat it. And Apache Spark is a project which, in my opinion, tries to solve this problem in a bit of a different way, and that way is called RDDs: Resilient Distributed Datasets. We are going to look at a dataset,
and instead of finding chunks in it, I'm going to load it into the cluster as if I was in a programming language and my memory was spread across the cluster. So if you look at this code, you will see a couple of things. So RDDs, or Apache Spark, is constructed or assembled from quite a few concepts.
The first concept is the RDDs, or the way of loading data. It can be loading a file, it can be a dataset that I feed into it, but the base RDD, the thing that I loaded, is my data. Once I've loaded my data, I can transform it, I can do several steps on it,
and then once I've transformed it, I can act on that data, I can execute an action. So a transformation would be to filter my dataset and be left with a smaller dataset. Or another transformation would be to map it in a certain way, like in old school MapReduce. I can do distinct, I can sort my data,
so that's a transformation. All transformations are lazy: nothing happens until I do an action. And an action is to actually give me the rows, for example, the first 10 rows of the result of the computation, or to count how many rows or pieces of data match,
or to dump it again into HDFS, or something like that. So those are the concepts behind Apache Spark. And an RDD, the easiest way to think of it, is just a variable that you can load data into and then manipulate, but that variable,
it doesn't reside on one single machine, it resides in the entire cluster. And what you see in the code here is basically it. I'm loading data, I'm putting it into a variable, which, again, is on my entire cluster. And that variable now contains all the rows,
all the lines, from a log file. Then I can define several transformations on that, and once I've defined transformations, I can act on those transformations. One of the biggest selling points of Spark is the caching ability. Because I'm working on variables, I can save intermediary states
and then do a couple of filters, branch my filtering in several ways, and just cache the results after those filters, and then start acting on those various filtered sets that I've made, different transformations, and then get results. So I will not have to compute all the way
every time I do an action, because I cached in the middle, after every transformation.
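Here is a minimal sketch of that flow in Java: load a log file into an RDD, cache a filtered intermediate RDD, branch two more lazy filters off it, and only then trigger actions; the file path and the filter patterns are assumptions.

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogExplorer {
    public static void main(String[] args) {
        // Submitted with spark-submit, which supplies the master for the cluster.
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("log-explorer"));

        // The base RDD: all the lines of a log file, spread across the cluster.
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");

        // A lazy transformation plus cache, so later branches reuse this intermediate state.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

        // Two branches off the cached RDD; still nothing has executed.
        JavaRDD<String> timeouts = errors.filter(line -> line.contains("timeout"));
        JavaRDD<String> outOfMemory = errors.filter(line -> line.contains("OutOfMemory"));

        // Actions finally trigger the computation.
        long timeoutCount = timeouts.count();
        List<String> sample = outOfMemory.take(10);
        System.out.println("timeouts: " + timeoutCount + ", sample OOM lines: " + sample);

        sc.stop();
    }
}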
So Spark performs much, much better than the original MapReduce, and as far as I can tell from my experience with Spark and Hadoop, Spark also outperforms MapReduce on top of YARN. Spark is a project separate from Hadoop. It can run standalone, but it can also run on top of YARN for easier integration. Spark in general works in pretty much the same architecture as the entire Hadoop cluster.
So you also have a driver program, something with which you submit a job to the cluster, and you submit it to the cluster manager of Spark. This cluster manager will then know how to coordinate those worker nodes, and it will submit those tasks to those worker nodes,
and those worker nodes will basically run on a chunk of the data, and again, they will have their cache in order to perform better and save intermediate states. So you can write Spark in Java, and you can write Spark in Scala,
you can write Spark in R, and you can also write with SQL, and those are just implementations of this RDD concept. So you have this base RDD, the Resilient Distributed Dataset, and that RDD is then going to represent your data. Once you have done that, you can treat it as a dataset
using different programming languages, again, R, Scala, Python, for example, or you can treat it in different ways of interpretation, of representation. For example, as if it was a SQL table, and then you can execute SQL statements on that.
You can also treat it as if it was a graph, and then execute graph algorithms on top of that RDD. That leads us to stream processing, because now we have covered the way to store the data, and we have covered how to, to some extent,
how to compute on that data: again, MapReduce, in a way the old way of doing that, on top of YARN, which gives us better performance, or using Spark, which is a bit of a different way of thinking about it. But now I want to show you how we can work on streams of data.
And the easiest way to perhaps start thinking about stream processing is basically doing what we did until now, just batch processing, meaning I have this data, and I will just run my task on this entire dataset, but do this on micro-batches, right? So I have a stream of data coming in,
I will just chunk my stream into small batches, and execute my algorithm on those small batches, and accumulate the results. So that's one way of thinking about it, and that's how, for example, Spark Streaming works. So Spark Streaming is an extension to the Spark project that lets you run basically the same Spark code
that you wrote for the batch processing part, but this time on streams. The problem with that is that it has some latency. By default, it will look at batches of one second, so all the data that you have coming in within one second will be executed on, it will be treated as a batch,
and this Spark thing, the RDD thing, will work on that batch of data. But that one second is a latency that sometimes you don't want to pay. There are also other trade-offs, but the biggest selling point of Spark Streaming is the ability to reuse the code that you wrote for batch processing on top of Spark, using RDDs on HDFS
or any other distributed file system, to work also on streams.
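A minimal sketch of the micro-batch idea with Spark Streaming in Java, assuming a socket source on localhost:9999 and the default-style one-second batch interval:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchCount {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("micro-batch-count");
        // Everything that arrives within one second is bundled into one small batch (an RDD).
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // The stream source: lines read from a socket (an assumption for this sketch).
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // The same RDD-style operations, applied once per micro-batch.
        JavaDStream<Long> counts = lines.count();
        counts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}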
There are other projects that let you do stream processing in a much, much faster way using other concepts. This time, you will not use batching algorithms, you will use other things. There are quite a few such projects; the major three are Apache Storm, Apache Flink, which is kind of the new kid in town, and Apache Samza by LinkedIn. Twitter just announced, a couple of days ago, a new project called Heron, and it's basically Storm reimagined, but the concepts are going to be pretty much the same
to all three, basically, and this is why we're going to look at Storm in just a bit. But the idea is that instead of waiting for batches and computing on batches, we will compute on every single message. So I will listen to a stream, and every message that comes from that stream,
I will act upon it. So Apache Storm basically has four or five important concepts. I have a stream; that stream of data that comes in is a stream of tuples. Those tuples can contain as many items as I want, and basically, I will be computing
on each tuple individually. The way that I'm listening to a stream is via a spout. That spout is going to continuously emit tuples, and that will be my stream. The stream is a stream of tuples. To that spout, I will connect bolts,
and those bolts basically implement my logic. So each and every bolt can do whatever it wants, and those bolts connect to other bolts, and this way, I can basically create a whole topology, and that topology basically defines how I connect my bolts to one another,
and what spouts they connect to. There are more concepts, like Nimbus, how you manage those, the message processing, et cetera, but those core concepts are very simple.
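Here is a minimal sketch of those concepts in Java: a spout that keeps emitting sentence tuples, a bolt that reacts to each tuple, and a topology that wires them together. The component names and the hard-coded sentence are assumptions, and package names may differ between Storm versions.

import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordTopology {

    // The spout: continuously emits tuples; here, the same hard-coded sentence over and over.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("a guided tour of the big data zoo"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // A bolt: reacts to every single tuple as it arrives, emitting one tuple per word.
    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        // The topology: which bolts connect to which spouts, and how tuples are routed.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("splitter", new SplitBolt()).shuffleGrouping("sentences");

        // Run it in-process for the sketch; on a real cluster you would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-topology", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}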
So instead of, again, looking at batch processing, taking big files or chunks of streams and processing them using one piece of code, I will now react to every message that comes into my topology. These two approaches connect through something that we call the Lambda architecture. The Lambda architecture talks about a way to compute both in an offline mode,
meaning batch processing, computing stuff on large data sets offline, and responding to events in real time. And the Lambda architecture basically talks about giving the user a view of their data by combining results from both an offline computation,
which I can do using Spark, or using MapReduce on YARN, or whatever, and that's like the real data and something that's computed really well because I had time to do that, and also integrating results from the real time processing,
which most of the time will also be less accurate. So I have less time to compute things in real time, so I'll do my best effort, or I'll derive some conclusions out of that, and I will just inject it into the view, which is basically built by my offline process. So that's how those two things usually fit together
in real time environments. A few words on machine learning. Those batch processing algorithms, or ways of working, are very, very useful for machine learning kind of algorithms. And this is why we have, by machine learning I mean classification, clustering algorithms, regressions, anomaly detection,
all of those words, and basically we have two really notable projects that implement all of those algorithms for you, so you don't really have to do anything. If you're running on Spark, you have Spark MLlib, the Spark Machine Learning Library, and that library basically implements
all of the algorithms that you would want, K-means and pretty much everything. There is another project called Mahout, and Mahout again lets you implement all of that, and it originated by implementing those algorithms for MapReduce, for the original Hadoop clusters,
but today it also has support for Spark. So I would say the Hadoop wave, the Hadoop ecosystem, is moving toward using Spark more and more, but you still have those projects that give you machine learning abilities on both architectures.
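As a small sketch of what using MLlib looks like, here is k-means clustering over an RDD of points in Java; the sample data and the parameters (k = 2, 20 iterations) are just for illustration.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("kmeans-example"));

        // A tiny, hard-coded dataset; in practice this would be loaded from HDFS.
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
                Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
                Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)));

        // Train k-means with 2 clusters and 20 iterations; the heavy lifting runs on the cluster.
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);

        for (Vector center : model.clusterCenters()) {
            System.out.println("cluster center: " + center);
        }
        sc.stop();
    }
}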
There is more to the ecosystem; just one more prime example, I would say. So if you wanted to facilitate full text search on your data, you would use Solr or Elasticsearch, and you have really tight integration for both into any Hadoop cluster. There is also the ELK stack,
which is Elasticsearch combined with Logstash and Kibana, and lets you do real-time aggregations on data. So you can, for example, plug the ELK stack into Storm bolts and do computations and create real-time dashboards on top of that data. However, when we talk about distributed systems,
we get into lots of really dark corners, because now we have to basically connect all of those really different technologies to talk with one another. Many times they will be aware of each other and know how to do that already, but sometimes they will not. So for example, if I wanted to connect Elasticsearch
to Apache Storm, there is a way to do that, and it's actually a supported way, and Elasticsearch themselves are building it. But if I wanted to connect things in ways that nobody thought about doing before me, or they did but it wasn't really well tested yet, then I'll be in a dark corner.
Furthermore, because we're dealing with distributed systems, we are very, very exposed to various issues. So I don't know how many of you read Aphyr's blog posts about the Jepsen suite. Basically, he was testing databases and distributed systems to see whether or not
they deliver, how they react to network partitions, for example, how much data they lose, how they uphold the things that they promise you as systems that deal with lots of data. Do they lose data?
Do they acknowledge writes without giving you the ability to read them afterwards, et cetera, et cetera? I highly recommend you read them. But in the end, you will end up using many, many systems, RabbitMQ and Cassandra and Redis and Kafka and Apache Flume, for example, to stream data into your system.
So all of those buzzwords, just make sure they connect really well and understand where the data loss risk comes into play. To coordinate all of those technologies, there are a couple of computer systems or software components that let you synchronize
those components and make sure that they are playing well with one another. One of the prime ones is ZooKeeper. I would not recommend jumping into using ZooKeeper right away, but many distributed systems will already require it for you. So for example, Kafka, if you use it as some sort of a queue system,
will require you to have ZooKeeper in your system. What ZooKeeper does is basically keep track of the various nodes of the various technologies and make sure they know about each other. And whenever someone wants to join your cluster, it needs to register with ZooKeeper. The thing is that ZooKeeper is really well written
and really well tested and basically mathematically proven, and it's pretty much failsafe in that respect.
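To give a feel for that registration idea, here is a minimal sketch in Java of a node announcing itself to ZooKeeper with an ephemeral znode that disappears when the node dies; the connection string and the paths are assumptions, and the parent znodes are assumed to already exist.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ClusterRegistration {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ZooKeeper ensemble and wait until the session is established.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // An ephemeral, sequential znode: it exists only while this session is alive,
        // which is how other components discover the live members of the cluster.
        String path = zk.create("/myservice/workers/worker-",
                "host-1:8080".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + path);

        // List the currently registered workers.
        System.out.println("members: " + zk.getChildren("/myservice/workers", false));
    }
}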
What you will also see is a lot of places starting to give you some sort of IDE on top of Hadoop. So for example, with all of those moving pieces, writing MapReduce jobs and connecting them to various data sources, et cetera, there are many products, like Xplenty, for example, which give you the ability to have some sort of an IDE on top of that. So you just drag and drop various components, you connect this algorithm and this task,
et cetera, et cetera, into this, and tell it to compute on this data source and store the result into that data source. And basically you can build really sophisticated clusters just using drag and drop. So we'll see more of those coming up in the near future.
So to sum up everything we've seen: distributed file systems allow us to store a lot of data. And again, we'll need to store it because we want to serve it to people, or because we want to compute on it at a later time. In order to compute on it, we will need to employ the MapReduce concepts.
So we'll have to map all the data that we have in parallel on many nodes and combine the results into a reducer node, or several reducer nodes, in order to get a final result. However, we can do MapReduce in various ways,
and the bottlenecks are going to be in various places depending on the implementation. So again, MapReduce, the old MapReduce has its own bottlenecks in terms of resource management. MapReduce on top of YARN, basically the new Hadoop ecosystem is going to solve many of those.
And Spark is another implementation. The question is, where is it all going? And as far as I can tell, the market really leans towards Spark. So if you are new to Hadoop, check out Spark, because Spark is where things are headed.
You can really see that. Many projects come into play in the Spark world and invest a lot in the Spark world, and less and less in the Hadoop 2.0 or YARN world. We also talked about stream processing and some of the trade-offs that we have. There are quite a few stream processing libraries out there.
I've played with Storm. I can't really vouch for the others, but people really know what they're doing in this Hadoop ecosystem. So if you want to start playing with Hadoop or the big data technologies world, you would probably want to get a VM
that has everything installed, from either Cloudera or Hortonworks. Those two companies, those two distributions, allow you to download one VM. It's quite small, actually. You download this virtual machine and you just run it locally in your virtual machine provider,
VirtualBox or VMware or whatever, and you can start playing with it. You'll have everything you need installed locally inside a VM, and you can get a sense of what happens. I know Hortonworks also provides guides that let you download data sets and tell you how to interact with them using Pig, using Hive, et cetera.
It also exists on Azure, if you want to spin up a new Hadoop cluster, and there's also a way, if you have an AWS account, to spin up an entire cluster using Mesos via this link.
That's what I have for today. Thank you very much. I hope you got some ideas of how to approach handling lots of data. And we've got a couple of minutes for questions if you guys have any. Yep. With Elasticsearch,
so the question was about Elasticsearch, mostly when you want to use Elasticsearch versus when you want to use Hadoop. So Elasticsearch doesn't really do MapReduce. It uses a different concept,
more of aggregations and multilevel aggregations. It's not really MapReduce, but Elasticsearch really solves the problem of real-time aggregations. Once you want to employ more sophisticated techniques that you can do in MapReduce but you cannot do with what Elasticsearch provides you with, then you will go with Hadoop. That's one. The second one is that it's more of an offering
for people who are already using Hadoop to start and try to use Elasticsearch. So it's their way into the Hadoop world. So it's not necessarily that you want to use Elasticsearch instead of Hadoop. It's rather more a question of when you get something from Elasticsearch and the ELK stack
that you don't have in the Hadoop ecosystem. And that's, for one, very strong full-text search capabilities, for which I would definitely recommend using Elasticsearch and not Solr, although both Hortonworks and Cloudera ship with Solr out of the box, and the ability to have real-time dashboards, for example,
with which you can do really nice things using Storm, for example, with Kibana dashboards, et cetera. OK, so Elasticsearch is not really MapReduce; it covers very specific cases of MapReduce. That's what it solves. All right, thank you guys.