hanythingondemand - Hadoop clusters on HPC clusters
Formal Metadata
Title: hanythingondemand - Hadoop clusters on HPC clusters
Title of Series: FOSDEM 2016
Part Number: 109
Number of Parts: 110
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/30934 (DOI)
Transcript: English (auto-generated)
00:05
All right, let's get it going. So our next speaker is Ewan Higgs, who will be talking about running Hadoop on HPC infrastructure, which is interesting because I actually spent some time on a project doing it the other way around,
00:22
running HPC applications on existing infrastructure. Ewan will be talking about something else completely. So here's Ewan. I'm Ewan. Hi. I'm the big data coordinator for HPC at Ghent University,
00:41
and I work on a project called hanythingondemand. We call it HOD for short, sometimes "hod". So today I'm going to talk about why we made it. We're going to stay in the shallow end; I'm not going to do any deep dives into the internals. I'll show you a few things, like how you might use it
01:04
if you'd like to do that. And I'll show you some use cases; we have some use cases from actual research. And if there's time at the end, we're going to talk about a few community things, to see who we have in the room.
01:21
Is anyone here using HPC systems? Quite a lot. And is anyone administering them? And using but not administering? All right. So for the HPC administrators, this
01:42
could be a really useful tool for you if you've got a lot of people in your organization who are using HPC codes: GROMACS, Quantum ESPRESSO, and so on. But then you get a few users like us who say, I really want to use Hadoop. So why not just build a Hadoop cluster?
02:01
Because you get a new cluster maybe every year or so, so let's go in that direction, because everyone's doing that. And so we've made hanythingondemand. The code is on GitHub; you can take a look. We have quite a bit of documentation, which I think is really good.
02:21
Kenneth helped me with writing that and getting the documentation to a point where even undergrads could follow it, or our graduate students at least. So the basic overview is: to create a cluster, you're on your HPC system, with whatever batch server
02:40
you happen to use. You run HOD create after you've loaded the HOD module using the module system. The job goes into the queue, and once it spawns, you can connect to it like any Hadoop cluster, and from there you can use the yarn jar job submission system.
03:06
If you have a very long queue and you don't want to wait around for the job submission, you can do interactive analysis. You can also do HOD batch, where you give it a script and it submits and runs just like any normal qsub
03:22
that you would do. Except this time you're using Hadoop. Of course, if you have a bunch of Hadoop clusters running at the same time, you would use HOD list. There's quite a few more commands, but these are the ones you probably would use every day if you're using it.
03:42
The command line is pretty simple. You want a cluster of four nodes, HOD create. Give it a label so that you can find it in the future. You can tell it which distribution of Hadoop you'd like to use. For instance, here we've got Cloudera. Once it's running, you just connect. Give it a label name that you're going to connect to.
04:00
If you use Vagrant, like Nicholas was talking about, it's extremely similar: you just connect. And then once you're running, you can just run your job; for instance, here's a basic word count. If you wanted to do the batch,
04:22
so for instance, if you wanted to run it on 16 nodes, you just run it with -n 16 here and give it a script, and it's almost the same thing, just running on many nodes. And you can pass the arguments in as environment variables in your batch script, so you can always reuse the same thing.
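To make that concrete, here is a rough sketch of that workflow as a shell session. The subcommands (create, connect, batch, list) come from the talk, but the exact flag spellings, module name, distribution name, and example jar are illustrative reconstructions, not verified against a particular HOD release, so check hod --help on your system.

```bash
# load the HOD module via the module system (module name may differ per site)
module load hanythingondemand

# ask for a 4-node Hadoop cluster, label it, and pick a distribution
hod create -n 4 --label wordcount --dist Hadoop-2.3.0-cdh5.0.0

# list your HOD clusters (queued or running)
hod list

# once the job has spawned, connect by label...
hod connect wordcount

# ...and use it like any Hadoop cluster, e.g. a basic word count
yarn jar hadoop-mapreduce-examples.jar wordcount input/ output/

# non-interactive alternative: run a script on 16 nodes, like a normal qsub
hod batch -n 16 --label sweep --dist Hadoop-2.3.0-cdh5.0.0 --script=run_wordcount.sh
```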
04:44
We also support IPython notebook. We had a successful classroom activity with some graduate students who were using IPython notebook to do their analysis. Using systems that had 64 gigabyte memory per node,
05:02
so for their machine learning analysis, they were able to do this on our system instead of having to use a laptop, or Amazon, or some other cloud system that they would really have to find the funding for as graduate students.
05:23
So to get to the point of why we built this: why not just build a big data system? And by big data system here, I mean that a lot of people conflate a Hadoop cluster, or the Hadoop stack, with big data. Why not just build one of these,
05:41
or why not just go to the cloud? And the main difference here is of course that the traditional HPC system is going to be a whole bunch of worker nodes talking to a few nodes of a parallel file system. For instance, we use GPFS,
06:01
but you could be using Lustre or Panasas; I'm sure there are plenty of others. Whereas with HDFS, one of the main insights was that once you scale to a certain point and you've got so many nodes, you're going to start saturating your bandwidth to the storage nodes, and then you're stuck,
06:24
so you should have local storage, which is where HDFS comes in, and for very big systems that's a really useful idea. However, where we fit in within European HPC is that we are covering tier two and tier one.
06:42
So the way European HPC works is you've got four tiers of HPC. Tier three you can think of as your desktop or your laptop, where you're running your scientific programs or codes. Then your university might have a cluster.
07:00
This is what we would call tier two. Within Flanders, we have a tier one, which is the national center for HPC. We have one in Ghent, but I think towards the end of the year, it's being deprecated in favor of one that is being built in Leuven now. And then at tier zero, you have a few machines
07:23
at supercomputer centers around Europe, so that's probably a couple in France, and some in Italy, Spain and Germany. I don't think Belgium will get one. So here's an idea of the size of the clusters we run,
07:43
and we're not ever going to get thousands of nodes, as you see in the MapReduce papers where Google is running with thousands of nodes. Before I joined, someone had the bright idea of naming the clusters after Pokémon. Quite a few of them have been deprecated.
08:00
We used to have Gengar, Gastly, Haunter, but these are the ones that we have running now, the tier two ones. And probably, to me, the most important one for working with big data is Phanpy, which was chosen to be an elephant because that seemed fitting.
08:23
Notably, it has 512 gigabytes per machine, and it's only 16 nodes. With this amount of memory, a lot of the problems with IO interaction actually go away, especially if you're using IPython with a Spark backend.
08:42
We are very happy with the performance, basically, compared to what we were using before. Another one that we got in last year was Golett, which is 64 gigabytes per node, which is also really useful. This year, we're going to be installing Swalot,
09:03
which is 128 gigabytes per node. It's not specifically a high-memory machine, but it has a lot of memory and a lot more nodes than Phanpy, which is only 16. We also have the tier one, which is much more important, the tier two, sorry, the tier one in Leuven,
09:22
which I don't think has a Pokémon name, because they're more serious than us. This is going to be a really good machine for HPC. I'm not so sure it will be suitable for running Hadoop on 580 nodes, but that's not what we're trying to do anyway.
09:42
Just for comparison of the node counts, in case you forgot: Swalot only has 128, and the tier one has almost 600. So, I think most people... well, actually, we have a lot of HPC people here, so this might be a very useful slide. You might not be familiar with the Hadoop stack,
10:00
but basically, you have HDFS as your file system, and on top of that you have YARN running, and YARN is a job scheduler that can talk to all sorts of different application containers, I guess is the name. We've taken out HDFS and replaced it with GPFS,
10:23
and we keep YARN as a scheduler, but it's also running under PBS, which is another job scheduler. I'll spend a bit of time talking about disk locality, because if you take out HDFS, you'd think you'd run into problems. Well, if you're running more iterative jobs
10:41
over and over again, like with Spark, the performance improvement coming from Spark is because once you've loaded the data in, you can keep it in memory. Having all the local disk is less and less important. And I'm sure it's the case that if you have an 8 or 16 gigabit connection
11:03
to your parallel file system, you could saturate that really easily with a lot of nodes if you have to pull in a lot of data. However, in practice: in 2012, for instance, Cloudera came out with a report on the job sizes that they see, and sure, there are a lot of jobs that are extremely large
11:22
and would not be a big use case for our system, but we could probably fit 80% of our jobs, well, maybe 90% of our jobs if we include Phanpy. That's not really fair, because Phanpy is a modern machine and the technology moves that quickly, but these days Phanpy could fit most jobs in a single node anyway.
11:41
So we really feel that this is something... well, when we were working on HOD, we really felt that this is an approach that can be successful, as long as we understand the limitations once things get really, really big. So that's why we built it, and it lets us run
12:02
Hadoop clusters on HPC. How does that work? As a user, you'd be on your laptop, you'd SSH into a login node, and from there you'd submit a job to the cluster. Did you say to speak up earlier? Oh, I'm sorry. Is it when I turn around, or just...
12:21
I'm sorry. I actually feel like I'm yelling, so... Sure. So the way HOD works is that it's essentially a cluster within a cluster. It will start the YARN cluster,
12:40
the purple box, within the actual PBS cluster, PBS being the job scheduler, or job controller, whatever you call it. It will just take the nodes and turn them into a cluster. Once you've done that, you can access web services by making an SSH tunnel,
13:00
and then you can set up your proxy, and then you have access to all the web things. This is how we do the IPython notebook, for instance, and it's how you could use any of the web front-end systems to monitor your Hadoop cluster.
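A minimal sketch of what that tunnel-plus-proxy setup can look like from a user's laptop; the hostnames and port numbers are made up for illustration, not taken from the talk.

```bash
# open a SOCKS proxy through the HPC login node (hostname is illustrative)
ssh -D 10000 login.hpc.example.org

# then point your browser's SOCKS proxy at localhost:10000 and browse to the
# web UIs (YARN ResourceManager, IPython notebook, ...) on the cluster nodes.

# alternative: forward one specific port, e.g. a notebook on a compute node
ssh -L 8888:node2345.example.org:8888 login.hpc.example.org
# ...then open http://localhost:8888 locally
```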
13:21
For instance, Ambari is another thing that I haven't implemented yet on HOD, but I think it would be a very useful way for people to monitor their HOD clusters. So you might remember that when I ran HOD create, I gave it a dist. You might say, well, what's a dist? A dist is a directory
13:41
which contains a hod.conf, which is like a manifest, and alongside it there are a few other scripts, which are the service configurations. So here is the one for the IPython notebook: we have the hod.conf, which is the manifest,
14:00
and then we have these other services, which will be started, and then we have just a utility script for the IPython notebook itself. The hod.conf lets you define some modules. So in this case, we're using... is the purple or the pink on black easier to see?
14:20
Neither is okay? So for instance, here we have IPython 3.2.3, built with the Intel toolchain, and it comes with Python 2.7.10. Users can clone this and just add in whatever they like,
14:42
and if I'm not fast enough in getting IPython 4.1 out, which is available but for which I haven't provided a distribution yet, users could just take this and replace the modules. If it's not installed on your system, they could also install it themselves using EasyBuild, for instance,
15:01
which is a project for installing software for users. You could do that, and you're basically good to go.
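Purely as an illustration of what such a dist might contain: the file names, section headers, and key names below are assumptions reconstructed from the description above, not copied from a real HOD dist, so treat this as a sketch.

```bash
# sketch of a dist directory for the IPython notebook (names illustrative):
#
#   ipython-notebook/
#     hod.conf              <- the manifest: modules to load, services to start
#     ipython-notebook.conf <- service definition for the notebook itself
#     nodemanager.conf, ... <- other service configurations
#     start-notebook.sh     <- small utility script used by the notebook service
#
# a hod.conf manifest could look roughly like this (key names assumed):
mkdir -p ipython-notebook
cat > ipython-notebook/hod.conf <<'EOF'
[Config]
# modules made available through the module system / EasyBuild
modules=IPython/3.2.3-intel-2015b-Python-2.7.10,Hadoop/2.6.0-cdh5.4.5
# services to start on the cluster, each described by its own .conf file
services=resourcemanager.conf,nodemanager.conf,ipython-notebook.conf
EOF
```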
15:24
The node-manager.conf, for instance, is one of the services that comes up. It was inspired by systemd-style configurations, but it's not the same; I wasn't dogmatic about it. So for instance, here it uses EasyBuild, so $EBROOTHADOOP, and then it goes into the script and just starts the NodeManager. Pretty simple. You can also configure the environment as it runs, and you can tell it whether it's going to run on every node
15:41
or just on the master node or various other nodes, and it gives the service a name. Pretty straightforward, I hope. I think users can understand this as well, so if you're administering a system, you can say: well, you can update it yourself if you really need to.
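Again only a sketch: a systemd-inspired service file for the NodeManager could look along these lines. The section and key names (Unit, Service, Environment, RunsOn, ExecStart) and the start command are assumptions based on the description; EasyBuild does expose the install prefix as $EBROOTHADOOP, and $localworkdir is the per-node working directory mentioned later in the talk.

```bash
cat > ipython-notebook/nodemanager.conf <<'EOF'
[Unit]
Name=nodemanager
# run this service on every node of the HOD cluster (vs. only on the master)
RunsOn=all

[Service]
# EasyBuild exposes the Hadoop install prefix as $EBROOTHADOOP
ExecStart=$EBROOTHADOOP/sbin/yarn-daemon.sh start nodemanager
ExecStop=$EBROOTHADOOP/sbin/yarn-daemon.sh stop nodemanager

[Environment]
# extra environment for the service as it runs
YARN_LOG_DIR=$localworkdir/logs
EOF
```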
16:01
Now, what you might have noticed, especially when you think about Nicholas's talk (I didn't see where he went... but there he is), is that I haven't put in any of the Hadoop configuration settings. But obviously the defaults are terrible. Well, I have auto-generated configs.
16:20
This links really well with Nicholas's talk, if you were here earlier. I went through all the blog posts, I did some profiling, I looked at real workloads, and I automatically generate configs now. I was trepidatious about putting this feature in, because it's an unbounded problem,
16:43
and I will never get everything correct for every job; that's not possible. So if you want to change something in the Hadoop config, all you have to do, instead of using XML (because I refused to put XML in; well, at first I put XML in, but then Colin convinced me otherwise),
17:02
is just write yarn-dot-whatever-property, and you can give it some config overrides, and these will override the generated values. There's nothing complicated; it just does a straight replacement. But if you override, say, one memory setting, you really need to start getting out your spreadsheet
17:21
and configure all the other memory settings as well. We're going to speed up. But it understands the yarn-site XML properties. It understands the Spark defaults, so if you give it Spark defaults, instead of XML it will use the space-delimited format, which is what Spark expects. And it also understands the log4j properties, which you need in order to control the logging.
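To give a feel for the three formats mentioned (plain yarn-site key=value pairs, Spark's space-delimited defaults, and log4j properties), here is an illustrative set of overrides. The property names are standard Hadoop, Spark, and log4j ones, but where exactly these lines live (a file in the dist, or an option to hod) depends on your HOD setup, so treat the placement as an assumption.

```
# yarn-site overrides: key=value instead of XML; merged into the generated
# yarn-site.xml by straight replacement. Bumping one memory knob usually
# means revisiting the related ones too.
yarn.nodemanager.resource.memory-mb=57344
yarn.scheduler.maximum-allocation-mb=57344

# spark-defaults overrides: Spark's own space-delimited format
spark.executor.memory 48g
spark.driver.memory 4g

# log4j properties, to control logging
log4j.rootLogger=INFO,console
```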
17:42
So, user stories. We've got a guy in Ghent named Brice De Cap. He uses hanythingondemand for his profiling. He also profiles on an Intel cluster and on Amazon Web Services; he's also funded by Intel. In his paper, he put in the results for the Intel Big Data cluster,
18:03
where it ran reasonably well; it's comparable to Amazon. He also ran it on our system, and he did a parameter sweep; ours was the fastest one. Of course, that's a cherry-picked result. It ran in about 5,000 seconds, which is about half the time.
18:21
This is also not fair, because that one has 64 gigabytes of memory per node, and this is Phanpy, which has 512 gigabytes per node. So it's completely not fair. But it's not IO bound, and it's not using HDFS, which I think is the important point. So that's the big thing. We also had
18:42
Professor Dirk Van den Poel, who had a graduate school class using this: 64 users, huddled around five per node, because we couldn't spare that many nodes for the reservation, and they were able to do their graduate school project
19:02
for his big data analysis course, and in the end it was pretty successful. I've only got a few minutes left, so I'll quickly go through the code, some limitations (because I try to be honest), and community. So, it's Python 2.7.
19:21
I have about 80% code coverage, which I think is pretty good, since it's a really new project trying to stitch together HPC stuff and Hadoop stuff, and we have continuous-integration builds that make sure it keeps getting tested on the HPC. Limitations: it's PBS only. So for anyone here using Slurm or Sun Grid Engine,
19:42
or Univa Grid Engine: unfortunately, it's not supported at the moment. That's something I'm interested in having, but we use PBS exclusively, so I can't really implement it for you, because it's not really a good use of our time,
20:00
unless we can come to an arrangement. But I'm happy to work with people to make that happen, because that would really get penetration into more HPC sites, and then we could have it used more and more. The other limitation is in some of the job control.
20:22
Dealing with asynchronous jobs coming up and down is based on sleeps, and it's something that could be better. Am I close to time? Oh, sorry. Time's up. So, in the end: would you like it at your site?
20:41
Is there anything you need? If you want to start with Grid Engine and such, we can talk about it. And so, just so you remember what I've told you (tell them what you're going to tell them, tell them, and tell them again): HOD lets you run Hadoop on HPC, the auto-generated configs are being used for actual research,
21:05
and HPC can make pretty good big data clusters for certain definitions of big data. And yeah, please check it out. And there's questions.
21:28
Is this GPFS? Yes. Okay. So you have no problem with the interconnect anymore? Well, this is an interesting topic. So as Nicholas mentioned, Ohio State has a version of Hadoop that uses RDMA for the fetcher.
21:44
Seagate released an open-source version of Hadoop that takes the fetcher and just uses symlinks, well, not symlinks, hard links, in the file system anyway, so the fetcher doesn't have to do anything, which is nice, but it's a bit touchy. And IBM also has something for GPFS,
22:03
which is an adapter, and I've tested this, but it's also very flaky, so be warned; it's available as a distribution in HOD. But it looks like I've not answered the right question. My question is a little bit different. Normally you need a single interface,
22:21
an IP interface; if you're running an HDFS client, you have only one contact point. And if you are running InfiniBand, you normally have two interfaces, an Ethernet one and an InfiniBand one. But you have only one point that you have to bind to, so how do you solve this problem?
22:42
Yes, so in the configurations... let's see if I can find that again. So, does this one have it? Well, okay, so there are little environment variables
23:00
you can use, like $localWorkDir, which is a local working directory for the node. There's also the host name and the host data name. The host data name is the IB interface, and the host name is... well, this is the other thing I should mention,
23:24
another limitation, I guess. The security model in our HPC system is that you can only SSH to a compute node if you're running a job on that node. And I hope I've done it correctly,
23:42
but we configured it so that it only listens on the local host. So there's a mask for the server to say: only listen on the local host, for any port. You can fake the user name. Sorry? You can fake the user name, and then you are able to connect, if you are supplying that
24:02
to the next node. And then you have a problem with the principals if you're running two different setups and so forth. OK, well, OK, I was under the impression that we have the IP mask set up so that it will only accept connections from the local host. So you need to have an SSH tunnel
24:21
to the machine and use that proxy in order to access the web services. If you're on the local host, then that's not protected. This is something I want to get working correctly. It's been on the table for us.
24:41
And we've been thinking about how to do it correctly. So I'm happy to pick your brain and have you tell me more about it. OK, any more questions? Would you be interested in extending this to run on the cloud?
25:01
Yes and no. I think that there are a lot of systems that provision Hadoop on the cloud, so I'm not sure that would be so useful to people, because that already exists. Running Hadoop on HPC also exists: there's myHadoop, by Glenn Lockwood.
25:21
He was at San Diego and is at NERSC now. But you basically end up rolling your own anyway, because they don't support things like the classroom activities we're using it for. But yeah, I think it's a good idea, and if there are PRs coming in, that would be useful.
25:40
But I think that since it already exists, rewriting it would probably be a fun project, but I don't think there would be too many users. OK, and with that, we're out of time. Thank you again.