
Facilitating HPC job debugging through job scripts archival


Formal Metadata

Title
Facilitating HPC job debugging through job scripts archival
Title of Series
Number of Parts
490
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
SArchive is a lightweight tool to facilitate debugging HPC job issues by providing support teams with the exact version of the job script that is run in the HPC job, archived either on the filesystem, in Elasticsearch, or by producing it to a Kafka topic.

HPC schedulers usually keep a version of the user's job script in their spool directory for the lifetime of the job, i.e., from job submission until the job has run to completion, whether successfully or not. However, once the job has completed, the job script and associated files are removed to avoid stacking up a large number of files. HPC systems typically run several million jobs, if not many more, over their lifetime; it is not feasible to keep them all in the spool directory.

When a job fails, user support teams are often asked to help figure out the cause of the failure. On these occasions, it often helps to have the exact job script available. Since a typical scheduler setup will make changes to every submitted script through, e.g., a submission filter, simply obtaining what the user submitted requires jumping through an extra hoop: running the given script through the filter(s) again. Furthermore, users may have tweaked, changed, or removed the job script, which may add to the difficulty of debugging the issue at hand.

SArchive aims to address this problem by providing user support teams with an exact copy of the script that was run, along with the exact additional files that are used by the scheduler, e.g., to set up the environment in which the job runs. It can be argued that making a backup copy is actually the job of the scheduler itself, but we decided to use a tool outside the scheduler. This has the advantages that (i) one need not have access to the scheduler's source code (not all schedulers are open source) and (ii) sites running multiple schedulers need not make changes to each of them, but only to SArchive, which should be a fairly limited effort, if any at all.

SArchive is currently tailored towards the Slurm scheduler (hence the name), but it also supports the Torque resource manager. Adding support for other schedulers should be fairly straightforward; pull requests are welcome :)

Currently, SArchive provides three archival options: storing archived files inside a file hierarchy, shipping them to Elasticsearch, or producing them to a Kafka topic. File archival is pretty feature complete; the code for shipping to Elasticsearch and Kafka is still under development and only covers what is needed in our (HPC-UGent) specific setup, which may evolve.
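To make the file archival option concrete, here is a minimal sketch in Rust (the language SArchive itself is written in) of copying a job's script and environment file into a date-subdivided archive hierarchy. This is not SArchive's actual code: the file names, paths, and the chrono-based date formatting are assumptions made purely for illustration.

```rust
// Illustrative sketch only, not SArchive's actual implementation.
// Assumes the `chrono` crate as a dependency.
use chrono::Utc;
use std::fs;
use std::path::Path;

/// Copy a job's script and environment file from its spool directory into a
/// YYYY/MM/DD hierarchy under `archive_root`. The file names ("script",
/// "environment") follow the usual Slurm spool layout; verify them on your
/// own installation.
fn archive_job(job_dir: &Path, archive_root: &Path) -> std::io::Result<()> {
    // Subdivide the archive by date so a single directory never grows too large.
    let day_dir = archive_root.join(Utc::now().format("%Y/%m/%d").to_string());
    fs::create_dir_all(&day_dir)?;

    // e.g. "job.1234567", used as a prefix for the archived files.
    let job_name = job_dir
        .file_name()
        .map(|n| n.to_string_lossy().into_owned())
        .unwrap_or_else(|| "unknown_job".to_string());

    for file in ["script", "environment"] {
        let src = job_dir.join(file);
        if src.is_file() {
            fs::copy(&src, day_dir.join(format!("{}_{}", job_name, file)))?;
        }
    }
    Ok(())
}
```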
Transcript: English (auto-generated)
All right, so next we have Andy talking about debugging HPC scripts. Thank you. I'm looking forward to that. Hi, welcome everybody. So first a few words about myself. My name is Andy Georges, and I work as an HPC sysadmin at Ghent University.
In that role, I don't do user support very often. I only do what gets sent to me, and I try to limit that. So the topic of my talk is about a tool for user support, but it's not something that I'm using directly.
However, I am responsible within the team for all things that get logged and sent to some central logging system. I am responsible for the scheduler. So at Ghent University, we're using Slurm. So my task is to deploy it, update it,
keep an eye on it, see what goes wrong, if the nodes are still online, what happens to them, etc. So in that role, I was looking at jobs, and at our site, we run a lot of jobs, not a gazillion jobs, but over the course of the lifetime of our clusters, several million jobs get run. Some are short, some are longer.
And to give you an idea, these jobs can sit in our queue for like two minutes, if there are resources free, or up to a few weeks, if the clusters are very busy. Once the jobs start, they occasionally die in an unexpected manner, unexpected to the user, that is, and
after a while, the user checks their jobs, wants to see what happened to them, and sees, oh, this job should still be running, it crashed, I don't know why. So then they contact user support, which is sitting over there, and they try to avoid that happening again, because obviously their jobs are important for their research,
they need the results to get the papers out on time, preferably, and obviously, anything that crashes is not their fault. Now I say this with a bit of irony, because sometimes it really isn't, but sometimes it is. So the key problem I'm trying to address here is to figure out what was running in the job
in the environment that the job was submitted under. Now, if a user contacts us, then we often ask, well, can you provide us with a copy of the job script? But sometimes they no longer have it, for whatever reason, or they can't find it.
They may also have changed it. I mean, the job was submitted three weeks ago, they kept working, and they changed the job script to submit a new job or do something else; they don't have a copy of it, and it's not under version control. Even if it was under version control, they may not recall exactly which version was submitted. I mean, it was three weeks ago,
do you remember which version you submitted three weeks ago to some batch system? Probably not. The other option is that they have a very clear idea of what was submitted, so they provide you with a job script, and afterwards it turns out that that was not exactly what was submitted. So they may provide you with the wrong script; user support then checks what
might have gone wrong, they don't find anything, but still the thing crashed. And it's important to note that in all of these cases, the user is acting in good faith, because they need the results for their job,
so they try to be as helpful as possible to us to find out what went wrong. However, the user is not the only actor. When the job is submitted, there is, on several systems, at our site, at other sites, a submission script that takes the job script and then manipulates it, or checks if all the directives are okay, checks if the
requirements that the user needs, like resources, are okay, if anything should be adjusted, there may be other filters or other things happening to it before it ends up in the scheduler, and then when the scheduler gets it, it usually keeps a copy.
So somewhere, when the job is submitted, certainly under Slurm, the job script ends up in a spool directory where it gets read again at the time the job is actually started. Then the scheduler just goes to that job script, reads it in, passes it to the nodes that need it, and the job is started.
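To give an idea of where such a spool copy lives, the sketch below (illustrative only, not part of SArchive) builds the path where slurmctld typically keeps a job's script under its state save location, assuming the usual layout with ten hash directories that the talk mentions later. The directory names and the hashing rule are assumptions; check them against your own Slurm installation.

```rust
use std::path::PathBuf;

/// Illustrative only: the path where slurmctld typically keeps a job's script
/// while the job exists, assuming the common StateSaveLocation layout with
/// hash.0 .. hash.9 keyed on the last digit of the job id. Verify this
/// against your own Slurm configuration before relying on it.
fn slurm_job_script_path(state_save_location: &str, job_id: u64) -> PathBuf {
    PathBuf::from(state_save_location)
        .join(format!("hash.{}", job_id % 10))
        .join(format!("job.{}", job_id))
        .join("script")
}

// Example (hypothetical path):
// slurm_job_script_path("/var/spool/slurmctld", 1234567)
// -> /var/spool/slurmctld/hash.7/job.1234567/script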
Now the job script in that spool directory sits there from the moment it is submitted, until it is finished or the job crashes. So either way, once the job is finished, the job script will be cleaned up. I mean, if you have a gazillion job scripts that are pushed through,
you can't keep them all because then the directory would be unusable. So the first obvious solution is that we patch the scheduler. There are several cases in which this might be feasible. If the scheduler is free and open source software, you can write a patch.
You can try to figure out where the job script is saved to the spool directory and save it also to another location. And the scheduler should never touch that second location again, because otherwise it would also delete that copy. But then either you have to maintain that patch forever: when new versions of the scheduler come out, you have to check that everything still works.
Or you can try to get it upstream, but that's not always easy. Upstream maintainers may not agree with the way in which you are doing things, or with the fact that this is even needed. They might argue that saving a duplicate copy is actually not the scheduler's task. I mean, the original copy is there,
so why should they save another? And it makes for more work to be done on each job submission. If you do this, then the next release you get, you will need to test it again. Maybe adjust the patch because they changed the original code that you were patching. So this adds to your own workload.
Moreover, not all job schedulers are free and open source, so we're using Slurm, which is, but other sites might be using LSF, which is not. Or Grid Engine, which is also not open source. So this largely depends on the vendor, and it also depends on the vendor if you have a single scheduler or if you have multiple schedulers. If you buy a new machine, then the vendor may say, look, we offer you this machine, but it comes with Scheduler X.
If it's not free and open source, you might need to pay them to add this feature. Once they get the money, other sites may have to do the same because it's free money. But even if your scheduler is free and open source, then
if you get a new machine, then you might end up with a different scheduler. So you can't actually reuse what you already did. So the takeaway here is that, in my view, the scheduler may not be the best place to obtain a backup copy of such a job script. So for that, I wrote SArchive. The S stands for Slurm, pretty obviously, but SArchive
also supports other schedulers. So right now, we support Torque out of the box. And adding support for other schedulers is fairly easy. So we separate the front end, which reads the job script and puts it in some intermediate structure, and the back end,
which does the actual archiving. So if you have something else besides Slurm, you run LSF, or Grid Engine, or PBS Pro, or something else that I don't know about, then you can also use this tool as long as your scheduler puts the job script and other files in a
specific spool directory that you can then read. So what it does, we monitor all the directories in the spool. So Slurm typically has 10 hash directories under which it puts job scripts to avoid making one single directory too big. When we receive a change in the directory, we check if it's a change that corresponds to a new job being there.
So the front end knows how to pick up that data and reads it in. We push it into a queue so we can process things as fast as possible, and then the back end will take it out of the queue and do whatever it needs to do to archive your job script.
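The split described here (a front end that notices new job directories, a queue in the middle, and a back end that archives) can be sketched roughly as below. SArchive itself reacts to filesystem events rather than polling, and this is not its actual code; the spool path and the one-second polling interval are assumptions, chosen to keep the sketch standard-library only.

```rust
use std::collections::HashSet;
use std::fs;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Hypothetical spool directory to watch (one of Slurm's hash directories).
    let spool = PathBuf::from("/var/spool/slurmctld/hash.0");
    let (tx, rx) = mpsc::channel::<PathBuf>();

    // "Back end": drain the queue and archive each job directory it receives.
    let _archiver = thread::spawn(move || {
        for job_dir in rx {
            // A real back end would copy the script and environment file,
            // or ship their contents to Elasticsearch or Kafka.
            println!("archiving {}", job_dir.display());
        }
    });

    // "Front end": detect newly appearing job directories and enqueue them.
    // SArchive uses filesystem notifications; polling keeps this sketch simple.
    let mut seen: HashSet<PathBuf> = HashSet::new();
    loop {
        for entry in fs::read_dir(&spool)? {
            let path = entry?.path();
            if path.is_dir() && seen.insert(path.clone()) {
                // Ignore send errors if the archiver thread has gone away.
                let _ = tx.send(path);
            }
        }
        thread::sleep(Duration::from_secs(1));
    }
}
```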
So we support several back ends. The first one is the one we originally wrote. It saves your job script and the other files it needs; for Slurm, for example, there is also an environment file. So two files will be saved into a file hierarchy, which you can optionally subdivide by year, month, and day, depending on how many
job scripts you expect. This might be handy. We can also send stuff to Elasticsearch or to Kafka. And it's important to note that the last two were only implemented with the features we needed. For example, for Kafka, you can specify the broker and
the topic to which you want to produce, but at this point, there's no support for SSL or anything else. So this is still fairly limited, but it does work.
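For the Kafka back end described here (just a broker and a topic, nothing more), a rough idea of producing an archived job script to a topic might look like the following, using the rdkafka and tokio crates. The broker address, topic name, and message layout are assumptions; this is not SArchive's own Kafka code, and the rdkafka API may differ between versions.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Hypothetical broker and topic, comparable to what the talk describes
    // (a broker and a topic, no SSL or other options).
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", "kafka.example.org:9092")
        .create()
        .expect("failed to create Kafka producer");

    // Pretend this is the job script picked up from the spool directory.
    let job_id = "1234567";
    let payload = "#!/bin/bash\n#SBATCH --time=1:00:00\n./run_simulation\n";

    // Produce the script to the topic, keyed by job id, and wait for delivery.
    let delivery = producer
        .send(
            FutureRecord::to("sarchive-job-scripts")
                .key(job_id)
                .payload(payload),
            Duration::from_secs(5),
        )
        .await;

    println!("delivery result: {:?}", delivery);
}
```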
We're in maintenance right now, so I can't show you anything live, but this is an overview of the job scripts that were submitted between January 10th, around 16:49, and January 11th at the same time, from all of our clusters. This information was pushed via SArchive through Kafka, and then it was read out by Logstash, which pushed it to Elasticsearch. So now we also have a nice overview
in Elasticsearch of the job scripts, where you can have full text search, etc., to find your things. So these are the links; everything is on GitHub. There's also a crate. Note that this might be behind master here, because sometimes I have dependencies that haven't been published.
You're free to use it, to fork it, to add to it, open issues, etc. And that's it. Thank you very much. Thank you, Andy. We have time for one or two questions.
Yeah, if I recall, there was a talk, I think at HPCKP18, about using machine learning for predicting characteristics of the job, like how it fails. Most of the time it's wall time and other stuff.
You know, resources, memory. I'm not sure if you had a chance to take a look, but I think it would be interesting. You mean to also add this information into the same sort of collection of data, and then compare the job script contents with whatever happened? Yeah, exactly.
So it's more like a dashboard alerting, like, you know, which jobs are failing or may fail because of wall time. If I recall, I think the tool is called PredictIt, from a company called UC. I haven't heard of it, but I could take a look. It was interesting because we were also trying to tackle this exact same problem.
You know, wall time is one, but certainly memory, CPU, and other stuff, I was curious to see if you had any kind of thing. No, not at this time, but it sounds interesting. Thanks. All right, we are out of time, unfortunately. Next talk is coming up in a couple of minutes.