Large scale data analysis made easy - Apache Hadoop
Formal Metadata

Title: Large scale data analysis made easy - Apache Hadoop
Title of Series: FOSDEM 2010
Number of Parts: 97
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/45719 (DOI)
Transcript: English (auto-generated)
00:14
Thank you for the warm welcome, and welcome to my talk on Apache Hadoop. It will be a talk on large-scale data processing,
00:22
which is a very hot topic currently. But first of all, I would like to explain who I am. My name is Isabel, just as announced. I am the organizer of the Berlin Hadoop get-together and I am co-founder of Apache Mahout.
00:46
What is Apache Mahout, yet another name? Mahout is a library that intends to implement machine learning algorithms that scale: scale in terms of community, have a vibrant community, have
01:02
very lively mailing lists; scale in terms of having a commercially friendly license; and of course scale in terms of scaling to large data sets, to huge amounts of data to train on. In order to reach the last goal, most of our algorithms currently are based on
01:23
Apache Hadoop and implemented on top of this framework. And at daytime, I'm a software developer in Berlin. Now I would like to know a little more about you. Whoever has seen a talk by me and knows that I'm doing Hadoop get-togethers in Berlin knows that usually I take this
01:43
microphone, give it to the audience and ask you stupid questions. But as there are so many people here, I'll do it a little differently this time. Please, would you raise your hand if you know the term Hadoop? That's awesome. How many of you are actually Hadoop users?
02:04
Okay, good. Next question: how many nodes does your cluster have? Ten or more? 100 or more? 1,000 or more? Okay, good.
02:21
So, some more buzzwords: who knows about ZooKeeper? Okay, quite some people. Anyone aware of Hive? Some more. How about HBase? You should know about that one, there was an interesting talk in the NoSQL devroom this morning.
02:42
Anyone aware of Pig? I mean, not the little animal, but the project. Lucene? I want to see all your hands. Okay, so now: any Solr users? Great.
03:01
Anyone know about Mahout before I told you about it? Anyone using it? No one? Yes, there's someone, I want to talk to you after my talk. Okay, what am I going to talk about? First of all we have a chapter on collecting and storing data.
03:22
The next chapter will be on analyzing data. I will give you a short tour of Hadoop, tell you what's coming up, tell you a little bit about the history. And last but not least there are a few slides on the Hadoop ecosystem, because whoever raised their hands
03:41
during the questions knows that there is not just the Hadoop core project, but there are many satellite projects that make working with the framework easier. Okay, collecting and storing data. If we go the traditional way and have a look at where data is
04:06
collected, we may come up with an example like that: we have a shop, we have products in the shop, and we want to collect information on how many products do we have, which price does each product have,
04:21
how many products did we sell, information on our customers, where do they live, and so on and so forth. The solution that comes to mind is regular relational databases like MySQL, Postgres or Oracle: store the data in a relational model, analyze the data, maybe put it into a data warehouse,
04:42
maybe run OLAP queries over it. But what if the data that I have is not really relational data? What if it's maybe a bunch of log files, say transaction logs from your regular web shop, or say something like query logs if you're running a very successful search engine?
05:03
You may want to track which queries users are actually searching for to improve your system. And what if those log files scale to the point where they don't fit on your regular hard disk anymore? So you end up with data that
05:21
cannot be stored on a single machine, and you end up with data that cannot efficiently be processed in a serial way. The logical consequence would be to use multiple machines to process your data, maybe to build a little cluster,
05:41
distribute computations, just use a distributed file system and go with that. There are a few challenges when doing it this way. First challenge: you are all computer users, you know that single machines tend to fail. The MacBook I'm using here to give this presentation
06:02
had a hard disk failure 12 hours before I gave my talk at the Hadoop user group UK last year, so I was pretty happy to have a backup of my presentation. If you have not only a single machine but, like, a data center with multiple machines, each machine you add increases the probability of any of those machines failing.
06:24
So what you want is a framework that gives you built-in backup, built-in replication and built-in failover. Now you need someone to write programs to analyze the data. If you have a look at typical software developers:
06:44
usually, if they come out of university and are not as brilliant as FOSDEM visitors, they have never dealt with large amounts of data, so they don't know how to handle petabytes of data and don't know all the intricacies that come into play when writing parallel
07:03
programs. And usually for a project you don't have time to actually make software production-ready. And with production-ready here I mean something like: it has defined failure modes, it has failover if a machine crashes, it has defined error codes if something goes wrong, and so on and so forth.
07:24
So you want something that is easy to use, basically something like parallel programming on rails. And if you're thinking about using an open-source framework, you want something where bugs are regularly fixed, where new features are added, and where your patches are integrated into the system.
07:48
So you want something with a vibrant, lively development community. And last but not least, the guy between you, the developer, and
08:00
your customer is an operations guy, and I think I may promise you that he's going to yell at you if the system isn't easy to administrate. And he'll probably also start yelling if for every single little application that you write you're using a different framework. So you need something that is easy to administrate, and
08:22
you need something that is kind of a single system that maps to quite a lot of the tasks that you want to solve. That is when you may want to have a look at Hadoop. It's easy distributed programming. For me as a developer, it makes it easy to write distributed applications without a very, very deep
08:48
background in parallel programming, distributed cluster programming or HPC. It's well known in industry and research. It's used by companies like Yahoo, it's used by Facebook,
09:03
it's used by the New York Times, by Last.fm and many more, and it scales well beyond 1,000 nodes. So where does this funny little project come from? What is the history behind it? Well, you may have heard of the
09:23
proprietary implementation at Google; it was done in 2003. In about the same year a paper was published by Google on the distributed file system, GFS, and another year later the MapReduce paper came out. It didn't take very long until Doug Cutting, the
09:44
original author of Lucene, reported that Nutch, which is an internet-scale search engine, makes use of MapReduce. It didn't take long again for the module to grow, so much so that it ended up as an extra project
10:02
beside Nutch. In 2007 Yahoo reported running a first Hadoop cluster with about one thousand nodes, and, like, two years ago, it finally became its own top-level project at Apache. Last summer, just to show you that the framework really works,
10:24
Yahoo won the petabyte sorting benchmark with a Hadoop cluster. So what are the assumptions underneath the framework that you should be aware of if you're writing Hadoop applications?
10:41
First assumption, as I already mentioned, is that the data does not fit on a single node. What comes out of that is that we want to use commodity hardware. So we don't want to use the PC that sits underneath the desk of your secretary;
11:00
it's still kind of beefy, strong hardware, but it's not dedicated, it's not dedicated hardware. What comes out of that is that failure happens. The idea is to distribute the file system and to build replication into the file system. Built-in replication means that every file that is stored in the file system is replicated, by default,
11:26
two times, so it's available three times, and you have automatic failover in case of failure. Second assumption is that you have so much data that it's pretty expensive to move the data from,
11:41
you know, where it's stored to where it should be processed. So the idea is to turn the whole model around, move the computation to where the data is, and keep computation local to data. Third assumption is that a disk seek is
12:01
very expensive compared to continuously scanning files. So the APIs that you have available in Hadoop focus on making scanning data very easy, but they don't make it easy to write applications that need random access to your data, so you need to reformulate your algorithm such that you can stream over the data.
12:26
If you go to the website hadoop.apache.org and download the package, basically what you end up with is two kinds of components: one is HDFS, the distributed file system, and the second is the MapReduce engine.
12:44
We'll have a look at each of these in the coming slides. First of all, the distributed file system. If you install that on your little cluster, what you end up with is one node, called the name node, holding the
13:03
file metadata. That is, each file basically is split into separate blocks, and the name node keeps information on which node each of these blocks is stored. Besides that, you have several worker nodes, called data nodes, that actually store the data.
13:24
So basically you could compare the name node to holding sort of the inode table of your cluster. What this means is: okay, our name node stores file metadata,
13:42
it stores that metadata in memory, and it stores a mapping from file blocks to actual nodes. If you store that in memory, this means that the size of your cluster depends on how much main memory you give to your name node,
14:02
and it depends on how large you make each file block. If you make the blocks large, you can store a lot of data, because you don't have so many blocks per file. If you are writing a program against HDFS, what does writing a file look like?
14:21
Let's assume you write an HDFS client that runs on the client node. Logically, the first thing it does is it goes to the name node and tells the name node: I want to create a file. And the name node tells it: okay, you can go, this file should be stored on this data node. After that the client goes to the data node and stores its data.
14:43
And as I mentioned earlier, the system has replication built in, so the data node goes and pipelines the data to, sort of, its slaves. After replication is complete and all is written to the system, your method call will return.
15:05
The replica placement strategy that is used is basically a trade-off between the bandwidth that you have between nodes and distributing your file evenly across the cluster in order to minimize failure spreading.
15:22
You don't want all three replicas on one hard disk, obviously, but maybe you don't want them in one rack either. So if you have a look at an example: we may have our client on the left hand side. To optimize bandwidth this client may write to its own
15:44
data node, this one replicates to a different rack, and on this rack it's again replicated to a different data node. A file read looks similar: you have your HDFS client, it talks to the name node and tells the name node:
16:02
I want file X. The name node tells my client: okay, this file is distributed across these data nodes. It goes to the data nodes, reads the blocks and gets the information. So now you know how to store the data, you know how to read it back, you know how to interact with the file system on a sort of coding level.
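To make that concrete, here is a minimal sketch of what such an HDFS client could look like in Java, using the org.apache.hadoop.fs.FileSystem API; the path and file contents are made up for illustration, and the configuration is assumed to point at your cluster's name node:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up fs.default.name, e.g. hdfs://namenode:9000
        FileSystem fs = FileSystem.get(conf);            // talks to the name node behind the scenes

        Path file = new Path("/user/demo/hello.txt");    // hypothetical path, just for illustration
        FSDataOutputStream out = fs.create(file);        // name node picks data nodes; replication happens below
        out.writeBytes("hello hdfs\n");
        out.close();                                     // returns once the replication pipeline has acknowledged

        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());               // blocks are fetched from the data nodes that hold them
        in.close();
      }
    }

The create() and open() calls hide the whole name node / data node conversation described above.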
16:24
But what you really want is to write programs that analyze your data, in order for you to deduce information from it. So that's where the MapReduce engine comes into play. To explain what MapReduce is all about: how many of you have written MapReduce programs? I should see...
16:46
Yeah, quite a fair amount, okay. Okay, one example. Take this little XML file. It doesn't look very pretty; it's just a snippet of the RSS
17:00
URLs that are in my RSS feed reader. The goal of this task would be to read the file, extract the host names of each blog, and extract the top 10 host names of blogs that I read. If I were to do this on a standard Linux machine with regular tools, I would do something like that and come up with a list of:
17:25
okay, there are like ten RSS feeds from archives, there are six RSS feeds from Google, and so on and so forth. If you have a closer look at how that's done, it would probably look something like that:
17:40
you define a pattern that kind of looks like a host name (no guarantees that this is the right regular expression for a host name, it's just for the example), you grep over the file, you then sort by host name, and finally count how many unique host names you have in this list.
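On the slide the speaker does this with shell tools (a grep, a sort, a unique count); just as an illustrative single-machine equivalent, and not the command from the talk, the same counting step could be sketched in Java like this:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HostCountLocal {
      public static void main(String[] args) throws Exception {
        // Rough host-name pattern -- as in the talk, no guarantee it is the right one.
        Pattern host = Pattern.compile("http://([^/\"<]+)");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        for (String line; (line = in.readLine()) != null; ) {
          Matcher m = host.matcher(line);
          while (m.find()) {
            String h = m.group(1);
            counts.put(h, counts.containsKey(h) ? counts.get(h) + 1 : 1);
          }
        }
        in.close();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {  // print count and host name;
          System.out.println(e.getValue() + "\t" + e.getKey());   // sorting and taking the top 10 is left to the shell
        }
      }
    }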
18:01
If you map this over to MapReduce, what you end up with is a map step for grepping over the files, you have a reduce step for counting, and what the framework does is the shuffle phase. Basically your map function would look something like: read a block of data.
18:24
Remember that in this case this is not the kind of feed URL file that is just a few megabytes in size, but maybe one terabyte, one petabyte, so it may be distributed across the cluster. So our map function is run exactly on those nodes holding the correct fractions of our data.
18:45
The map function then extracts key-value pairs, where the keys are the host names, while the value may be, for instance, how often did I see this host name in the current block of data. If I write the reduce function, I have the guarantee that I see all
19:06
key-value pairs of one key for one call of the reduce function. So summing up is very easy: I just iterate over all the values and put out the key and the sum.
19:21
If you have a closer look at what this may look like, it's something like that: I read the data from HDFS, I have multiple map tasks running all over the cluster, each of these map tasks outputs key-value pairs, in this case
19:41
host name and count. This intermediate output is shuffled and grouped by key, and in the end I have reduce tasks that compute the final results. If you have a look at the Java API, it may look something like this.
20:01
If you are used to sort of the old API of Hadoop 0.18: this one is the new one, it's a lot cleaner and more compact. In the map function, again, you get a key-value mapping; in this case it may be something like file name and content. You iterate over the content,
20:24
extract host names, and the context object gives you a way of emitting key-value pairs. What the context object gives you as well is a way of emitting sort of counter values. Counters can be used, for instance, to
20:42
provide the framework with the number of bad records that you have encountered, so that it just skips the record and keeps a statistic of how many bad records were in the file. But the context also gives you a way of telling the framework that you're actually progressing,
21:01
because what Hadoop does is: if your job is running for too long, it cannot really decide, are you in an endless loop or are you really doing work? So you can set up a timeout after which the job is killed. So, like: if I don't see progress updates for that many minutes, seconds, whatever, kill the job.
21:25
Which of course is bad if you have a long-running mathematical job, so you want to tell the framework: oh yeah, hello, I'm still alive, I'm not dead, and I'm doing sensible work, so please don't kill me. The reduce job then simply sums up the values and outputs the result.
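A minimal sketch of what such a mapper and reducer could look like with the new (org.apache.hadoop.mapreduce) API. It assumes the standard line-oriented text input (byte offset as key, one line as value) rather than the file-name/content pairing described loosely in the talk, and the host-name pattern and class names are made up for illustration:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map step: called once per input record; emits (host name, 1) pairs.
    public class HostCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      private static final Pattern HOST = Pattern.compile("http://([^/\"<]+)"); // rough pattern, illustration only
      private final LongWritable one = new LongWritable(1);
      private final Text host = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        Matcher m = HOST.matcher(line.toString());
        while (m.find()) {
          host.set(m.group(1));
          context.write(host, one);                          // emit a key-value pair
        }
        context.getCounter("feeds", "lines").increment(1);   // a counter, as mentioned in the talk
        context.progress();                                  // tell the framework we are still alive
      }
    }

    // Reduce step: sees all values for one key in a single call and sums them up.
    class HostCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text host, Iterable<LongWritable> counts, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
          sum += c.get();
        }
        context.write(host, new LongWritable(sum));
      }
    }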
21:43
If you have a look at our little picture: for MapReduce we now have a special node called the JobTracker.
22:00
That is the node that we connect to to submit MapReduce jobs, and on each slave node we have TaskTrackers that really run the map tasks and reduce tasks. What does a MapReduce job really look like? If I write my client application on my client node, this application contacts the JobTracker and tells it: hey, I've got a job for you to run.
22:23
The JobTracker then has a look at where the data really is located and contacts that machine. On that machine there is a TaskTracker; this TaskTracker is responsible for local scheduling. What the TaskTracker does is it starts a JVM on this node,
22:47
runs the map task or the reduce task in this JVM, and returns its output. Why does it run in a separate JVM? Well, as I told you, the framework should be robust, and being robust means that client jobs that are run
23:02
shouldn't crash the whole framework, and they shouldn't even crash a TaskTracker. If I have a client job that crashes my client JVM, I want it to be in a separate JVM. This also means that if I run MapReduce jobs that are very, very short, there is quite a bit of overhead.
23:21
But the assumption is that I really have so much data on my node that the overhead of starting such a JVM doesn't count so much. And of course, again, you have not only one TaskTracker but multiple of them.
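For completeness, submitting such a job from the client node could look roughly like this; again a sketch under the same assumptions, reusing the hypothetical HostCountMapper and HostCountReducer from above, with placeholder input and output paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HostCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up the cluster settings from the classpath
        Job job = new Job(conf, "host count");         // new (org.apache.hadoop.mapreduce) API
        job.setJarByClass(HostCountDriver.class);
        job.setMapperClass(HostCountMapper.class);
        job.setReducerClass(HostCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory with the feed files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit to the JobTracker and wait
      }
    }

waitForCompletion(true) hands the job to the JobTracker and blocks until the TaskTrackers have finished it.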
23:42
So if you were to start your own Hadoop hacking, what do you need in terms of hardware? In terms of software it's easy: you go to the Apache Hadoop website and download the Hadoop distribution, or you go to Cloudera for their Hadoop distribution, or you go to Debian; people have been packaging up Hadoop for Debian.
24:07
There are people working on it at the moment, so they have Hadoop in the pipeline, so it may not be long before you can just type apt-get install hadoop and get your system up and running. Well, what do you need?
24:23
Probably this is the dream of everyone, right? Running it in a data center on thousands of nodes, just happily hacking along. If you don't have a data center, you may use other people's data centers. Anyone not aware of EC2?
24:41
Okay, so you just rent machine time from Amazon; you can get your own Hadoop cluster up and running on EC2. There are a lot of how-tos, and there are also AMIs that make that very easy. If you just want to play around, there's MapReduce as a service at
25:03
Amazon. It doesn't use HDFS but its own backend, but it's pretty nice to get started and play around with MapReduce in the first place. However, be careful, because Elastic MapReduce uses the sort of old MapReduce API and not yet the new one.
25:22
If you don't want to go into cloud computing, you can set up your own little Hadoop cluster. Thanks to Tilo for installing the Hadoop cluster, and thanks to packet and mask from the CCC Berlin for providing some of the hardware.
25:40
But you really don't need large servers. Actually, you can start playing around with just a tiny little laptop like the one over here. You can run Hadoop in sort of single-node mode or in pseudo-distributed mode and get your programs to run locally.
26:04
This also is very handy if you want to debug programs. If you want to develop new stuff, you probably want to try it out on your development machine; you don't want to go to your cluster, fire up the debugger, and debug in a distributed way. You want to do this locally on your machine, probably even within the IDE.
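A minimal sketch of a configuration for exactly that local, in-process mode (assuming the Hadoop 0.20-era configuration key names; pass it to the same job driver you would use against the cluster, nothing else has to change):

    import org.apache.hadoop.conf.Configuration;

    public class LocalModeConfig {
      // Returns a configuration that runs jobs entirely in one local JVM.
      public static Configuration localConf() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");    // read and write the local file system instead of HDFS
        conf.set("mapred.job.tracker", "local");    // run map and reduce tasks in-process (LocalJobRunner)
        return conf;
      }
    }

With these settings the map and reduce tasks run inside one JVM against the local file system, so you can set breakpoints in your mapper from the IDE.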
26:20
So what is up next? Next up, in the 0.21 release,
26:42
there will finally be append in HDFS. So far you can write files and close them, and that's it; the goal is to facilitate opening files again and appending to the end of them. It's a pretty long story, getting append into HDFS. And there will be more advanced task schedulers.
27:00
In 0.22 there will be more security. Currently there are sort of user rights, but there is no real, hard security on data on the cluster, so the basic assumption is that it runs in your data center and that's safe. There will be Avro-based RPC,
27:23
so that RPC is compatible across Hadoop versions. Anyone ever heard of Avro? Hey, great. Avro basically is an RPC and serialization library; it is a subproject of Hadoop and written by Doug.
27:41
There will be symbolic links, and there will be federated name nodes, no more single name node. So who's using Hadoop? We may learn a little bit about it in the next talk by Facebook. Hadoop is used by Yahoo for lots of analysis,
28:02
it's used by Last.fm for coming up with recommendations, it's used by the New York Times for scaling and converting images, it's used by search engines like DeepDyve for text analysis, and
28:21
it's used by several search engines that are based on Nutch or on Katta. If you have a look at Hadoop at a broader sort of scale, there's not only the core project, but several people working on
28:45
other projects that make it easier to handle Hadoop, that make it easier to write MapReduce jobs, that make it better for data storage. Let's first have a look at higher-level languages.
29:04
Some of you may know some of the logos. Just to motivate why you need higher-level languages, I'll give you an example. I'll take the example from Pig. I am taking it from Pig because I saw the presentation on Pig one year ago,
29:22
and they had a very, very great example motivating why you need something like a higher-level language. So suppose you have some user data in a file, you have website data in another file, and you want to find out the top five most visited pages by users aged 15 to 25.
29:43
Sounds like a pretty reasonable task to do, doesn't it? So you need something to load your users, load pages, you need something to filter users by age, you want to join both on name, you want to group on the URL visited, you want to count the clicks on each URL,
30:02
you want to order by click count, and you want to take the top five. If you were to do this in Java code, it would look something like that. You're not supposed to be able to read it up there in the last rows; you're not even supposed to be able to read it over here in the front. If you're doing it with Pig, it looks something like that,
30:22
and I hope that even the guys over there in the back can read it. So what you can do with this is write one-off jobs for data analysis pretty easily. Of course you pay some overhead when running these jobs, but then again, you don't have to write
30:41
loads and loads of Java code, and it's easier to understand, of course. There are some projects making it easier to distribute storage. As I explained, HDFS is not optimized for random access; it's optimized for continuously-read files.
31:04
So what if you want sort of random access, and what if you have semi-structured data? Then you may want to go for HBase, or you want to go for Cassandra, or you want to go for Hypertable, which are all three based on HDFS.
31:22
There are a few libraries built on top of Hadoop, and there are a few libraries that make handling your data easier. You may store your files in a plain-text format, but that obviously isn't very efficient; that is neither space-efficient nor
31:42
time-efficient in terms of parsing time. What you want is sort of a binary format, but you want a binary format that you can upgrade easily. That's where something like Avro or Thrift or Google protocol buffers comes into play.
32:02
So a lot of innovation happens around Hadoop. As I mentioned already, there's a project building machine learning algorithms on top of Hadoop. We have clustering algorithms for grouping items by similarity, we have classification algorithms for
32:24
classifying new incoming items into predefined categories. A well-known example is spam email classification: we have two categories, mails that I want to keep and mails that I want to throw away, and I want to learn a classifier that separates
32:41
data points into these classes. That's what you can do with Mahout. And of course you can do recommendation mining: if you go to Amazon and you buy a book, usually Amazon tells you, people who bought this book also bought these and such books. That's something you can implement with Mahout.
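As a flavour of how that looks on the API side, here is a minimal single-machine sketch using Mahout's Taste recommender classes; the ratings file name, user ID and neighbourhood size are made up for illustration, and the Hadoop-based, distributed recommenders follow the same idea:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class BookRecommender {
      public static void main(String[] args) throws Exception {
        // ratings.csv: one "userID,itemID,preference" line per rating (hypothetical file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        Recommender recommender = new GenericUserBasedRecommender(
            model, new NearestNUserNeighborhood(10, similarity, model), similarity);
        // Top three recommendations for user 42 -- "people like you also liked ...".
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " " + item.getValue());
        }
      }
    }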
33:01
And of course there are also search engines that are making use of Hadoop to distribute indexing. So, I've done a lot of advertisement for Hadoop; just three final slides of advertisement. Why should you go with the project?
33:20
First of all, it's proven code. It works in practice, it works in production, and you don't need to reinvent the wheel. Next, it's an Apache project, and there are mailing lists that are very lively, very lively discussions on the mailing lists, people who are willing to help you.
33:42
So come to the mailing lists if you are a Hadoop user and provide input. By the way, if you're a Mahout user, the same applies to you. Last but not least, it's very well possible to become part of the community. If you have a look at the graph here at the bottom, that's just the
34:02
growth of emails on the Hadoop mailing lists. So the community is still growing steadily, and there are people from inside as well as from outside Yahoo, people working full-time, people working in their free time on Hadoop.
34:22
So, final advertisement: if you're using Hadoop, come to the mailing lists, talk to us. One advertisement in my own interest: I'm doing the Hadoop get-together in Berlin. The next one is on March 10th, if you need an excuse to visit Berlin. There will be three talks
34:43
after 5 p.m., there will be beer after the event, and there will be lots of interesting discussions and lots of interesting people there. And there's another Hadoop event in Berlin; it looks like this year is the Hadoop year for Berlin. There will be a Berlin Buzzwords event on June 6 and 7,
35:06
with talks on the topics of storing data, searching, indexing and scaling to large amounts of data very welcome. We welcome talks on Apache Hadoop, on NoSQL databases, on distributed computing, on business intelligence
35:25
applications, on search applications, on scaling search indexes and on cloud computing in general. You already notice that, when I'm telling you all those names and words and tokens, it should be easy to play buzzword bingo at that conference.
35:46
So that's my final invitation to come to the mailing lists. Thank you.
36:09
Hi, I was wondering, what is being done to make HDFS mountable as a regular file system? How...
36:20
How can you mount HDFS as a regular file system, does that work yet? Okay, and does that work nicely, or is it, like...
36:41
Hello, maybe I missed this before, but you were mentioning that Amazon Elastic MapReduce has an old MapReduce API. So what's the difference, and what's the advantage of the new one? Yeah, the new one is more compact,
37:05
it isn't so verbose, and the signatures of your map and reduce functions have changed a little. So it's kind of easy to port your jobs; it's not a very fundamental change, but you have to adapt to the new API.
37:22
I'm trying to follow up on that first question. I'm afraid I didn't hear your answer to it all that clearly, but what I'm wondering is:
37:43
suppose I wanted to implement an stdio-like layer over it and be able to just save to Hadoop from my word processor, my image processor, whatever I'm using. How far are we from being able to support that?
38:11
Can you just hand the microphone over? From a POSIX file system you're pretty much far away, because you can't
38:21
reopen files and then append, so you just can't; you can write once and then close it and maybe read it again. But the append stuff, that's coming. Well, it doesn't work now, but if you're interested to work on that,
38:43
maybe you should think about being part of the project and not building a stdio layer on top of that, because, I mean, it'll be way more efficient, wouldn't it? I was looking at HDFS with a view to doing that, and then I thought I'd ask the people who know before investing time in it.
39:11
Hello, hi, simple question: you said the system is designed with failure assumed, but there seems to be a key dependency on the name node. What happens if the name node fails?
39:24
You should have planned for failover of the name node with, well, with standard measures. So currently I'm not aware of anything in the framework itself that provides failover of the name node, but it's sort of easy to
39:43
implement it and design it into your cluster. You mentioned Amazon's Hadoop on demand; I'm guessing they have their own API for that, and I don't mean the actual Hadoop API, I mean their
40:03
customer-facing API. You use the Hadoop API and submit your jobs, so there's no extra API; Amazon didn't implement its own API. Take, for instance, Mahout and run it on Elastic MapReduce; that has been done about one year ago.
40:24
The only thing that you have to take care of is that the Amazon API is based on Hadoop 0.18, and that API differs from what is currently available. There's got to be a web-facing API in front of that, right? Yes.
40:40
Is that Amazon-specific? Are there plans for an open implementation of that API, the web-facing API, a Hadoop-on-demand web API? Is that of any interest?
41:09
Excuse me, yeah, I have a question: does the Hadoop framework support backing up my data store in a consistent way and restoring it in any way? It does have backup
41:23
built in in terms of replication, so each file you store to Hadoop is replicated to three disks, and in case one of those hard disks fails, in case one of those nodes fails, replication is started again to get up to the target level.
41:41
To just take a backup from, maybe, today 12 o'clock, and to restore it completely? No. For our applications there are legal requirements to be able to restore at a certain point in time,
42:05
and we looked at Hadoop and didn't find a way to do a point-in-time backup and restore. I mean, you can always read data from the cluster, no one's stopping you from doing that. But it's not that I can freeze my cluster at a certain point in time;
42:22
I can freeze a machine, but I didn't find a way to freeze the entire cluster. I think this is up to you to implement. Okay, so this has to be done in the application? This is a job of the application and not of the cluster, from your design.
42:40
Thanks a lot. So my question is about the fact that Hadoop scales on multiple nodes, but how does it handle multiple processors on each node? Of course,
43:01
currently I'm not aware of any sort of optimization along the lines of "the data I'm outputting in this reducer should be post-processed on the same node, because the node has multiple processors and they can do this in parallel"; as far as I know this
43:22
optimization isn't in it, but of course the framework makes use of multiple cores if you have multiple cores available on each node. It's perfectly reasonable. In which language is Hadoop written, is it Java or C?
43:47
It's written in Java, but you can run C++ programs as well, and there's the streaming and pipes API, so you can use any scripting language of your choice. So if you really want to, you can write your data analysis jobs in Python, PHP, whatever you want.
44:03
So if I want to run it on an embedded system, it won't be possible? A small router that has Hadoop inside, with, let's say, 32 megabytes of RAM, won't be possible.
44:39
If you just use Hadoop as a file system, do you have any comparative benchmarks with Lustre?
44:46
Do you have any comparative benchmarks with Lustre, the Lustre file system? If there are any more questions, feel free to contact me after the talk.
45:02
I'm happy to answer any questions, and the guy over there who raised his hand on Mahout, please come to me and talk to me, I want to know more about your use case. Any more questions?
45:29
Could you tell me what the memory requirements for the HDFS nodes are? That depends on how much data you want to store in your cluster, and it depends on the size of your blocks.
45:42
I don't have exact numbers. Is it a memory-intensive task, or... Well, okay, you said that for the name server it's required to have a large amount of RAM, but what about the nodes? You mean for the name node?
46:06
No, not the name node. The name node is a single entity that needs a lot of RAM. Yeah, and what about the nodes which contain the data? They don't need to be...
46:28
So we have some time left.
46:44
Take your chance. Just an announcement, because I saw somebody is searching for Debian packages at Cloudera for Hadoop: Hadoop is in the pipe to enter Debian unstable in the next weeks.
47:07
So if you want the package in Debian mainline, that's the guy to talk to; could you please get up again? So, people really need to help this guy, because I would like to have a Debian package, to be able to type apt-get install hadoop on my regular Debian machine without Cloudera
47:26
repositories, without external repositories, just in Debian main. Could you just give him the mic? I need somebody to help me out with the C stuff and all the bindings; I've done the Java stuff, this is packaged.