
Simplifying the creation of Slurm client environments


Formal Metadata

Title
Simplifying the creation of Slurm client environments
Subtitle
A Straw for your Slurm beverage
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Slurm is the most widely used batch scheduler for HPC systems. The Open Source Software community is very active in the development surrounding the Slurm ecosystem, contributing CLI tools for accounting, monitoring, and notebooks, among others. A lot of these client environments are nowadays created in containers, which have become a ubiquitous part of running applications. However, this way of working brings new challenges in HPC environments, especially when using Slurm. Slurm requires careful management of shared cluster secrets and cluster-wide configuration files that need to be in sync in order to work efficiently and securely. This talk proposes a novel and simple tool called straw, which allows the creation of secret-less and config-less Slurm client environments, thereby simplifying the creation of (containerised) environments by removing the burden of maintaining config files, sensitive munge secrets, and additional daemons. This talk will first provide an introduction to Slurm, followed by a description (mostly drawing from personal experience) of common patterns and pitfalls when creating containers that interact with Slurm clusters for different purposes (monitoring, notebooks, etc.). Next, I will introduce Straw, explaining why it was needed and why, despite its simplicity (it mostly just fetches a bunch of config files), it is able to perform a task that regular Slurm tools can't, therefore simplifying Slurm client environments. Finally, I will conclude by showing a simple example of how the tool can be used, and how it compares to the usual scenarios in which config files, extra daemons, and secrets need to be carefully managed. If time allows, I might detail some of the weaknesses of this approach: the fact that the Slurm protocol isn't really documented, and therefore this tool relies on "reverse-engineering" (as much as one can call it reverse engineering when no documentation exists but the code is available) to keep up with new Slurm releases.
Transcript: English (auto-generated)
OK, next talk is Pablo, who is going to explain to us how to set up Slurm client environments
more easily.
So just a brief introduction to Slurm in case anybody is not familiar with it.
Slurm is basically both a resource manager and a job scheduler. Meaning Slurm will manage your allocations: it will track which machines are in use, which jobs they belong to, and which users own which CPUs and which nodes, etc.
And it's also the job scheduler, meaning it will, when users submit jobs, you have your happy users over there, or hopefully happy users, and they want to run on your cluster, so they make a job submission, usually writing a script that launches some workloads.
And they will basically interact with Slurm, and Slurm will manage all these job submissions. You won't just have them one by one, you will have hundreds or even thousands of jobs that are scheduled to run on your infrastructure, and Slurm will manage the queues and the priorities and the accounting, etc. So basically it's a batch manager, but there's both the resource managing and the scheduling of the jobs.
Digging a bit deeper into how Slurm works, because this is relevant for this talk, there are basically two main components, two daemons that are the most relevant, and those are the controller, which is called the slurmctld, and then the daemons that run on the worker nodes at the bottom, which is the slurmd daemon. And then you have other daemons like the slurmdbd, slurmrestd and slurmscriptd.
Those are not relevant for this talk, I will mostly focus on the part on the left here. So users and client tools basically interact with the controller over the Slurm protocol. Nowadays there's also slurmrestd, so you can interact over REST with some scripts, but mostly all the userland tools, almost everything in the Slurm ecosystem, just talks to the slurmctld, and this controller holds the source of truth for Slurm, so it knows which resources are allocated where, it knows which jobs exist, it knows who the users are, etc.
The controller talks to the slurmd daemons on the nodes, and the slurmd daemons are in charge of launching the jobs, including the cleanups, setting up the cgroups for the jobs, whatever you have. Now, what's important here is to know that for all of this to work, you need at least two things. You need the Slurm config files, and they need to be in sync across the whole cluster,
so you may have some differences, but mostly it should be the same. There was no audio online? Okay. So, as I was saying, the slurmctld holds the source of truth.
The slurmd daemons are in charge of launching the jobs. And the two important things are that you need the Slurm configuration files. It's mostly the slurm.conf file, but there are other files as well. Those need to be in sync across the whole cluster, and they need to be basically the same.
They should have the same hash, ideally. And then you should also have a shared secret, so that a rogue client cannot just add a worker node to the cluster and start doing malicious things. Usually it's a munge secret, from the daemon called munge, and it's a secret shared across the whole cluster. This fact is very relevant for this talk.
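To make those two requirements a bit more concrete: slurm.conf is a plain key=value text file that both the daemons and the client tools read. A minimal sketch, with purely illustrative cluster, host and node names, might look like this:

    # /etc/slurm/slurm.conf: minimal illustrative sketch, not a complete config
    ClusterName=mycluster
    # where slurmctld runs
    SlurmctldHost=ctl01
    # authenticate via the shared munge secret
    AuthType=auth/munge
    # the compute nodes and a default partition
    NodeName=node[001-100] CPUs=64 RealMemory=256000
    PartitionName=batch Nodes=ALL Default=YES MaxTime=24:00:00 State=UP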
Now, on to containers. So containers are increasingly becoming a super popular tool to run infrastructure, for reproducibility, for automating deployments, and just in general,
they're becoming super ubiquitous in our industry, and I think for good reasons. And there are, I think, very good use cases for using containers with Slurm.
In this talk, I will focus on the use case where you use containers on the user and client side of things. So those tools that will talk to Slurm, to the controller mostly, to do things on the cluster. This could be some automation that you run to do whatever; for instance, you could use it for monitoring purposes,
you could write a tool that does health checks on the cluster, or for accounting, I've used it extensively for accounting as well, but also integration with other services, right? What if you want to connect a Jupyter notebook with Slurm? You will end up with some tools that talk to the controller.
Now, there are basically two scenarios in which you can use containers with Slurm. On the left, we have the local use case. That means, imagine you have a front-end node,
a machine that's configured where users SSH to, and from there they can run the Slurm commands to launch jobs, to track their job usage, etc. It's conventionally called the front-end node for the cluster. So if you just add the Slurm client container on that node, it's very simple, because, as I said,
you need a secret with munge, and you need the config files, and that scenario is very simple because you can just do bind mounts: you can access the munge socket to talk to Slurm, and you might bind mount the Slurm config directory, and you're done, basically. So that's sort of easy.
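As a rough illustration of that local scenario (the image name is hypothetical and the paths are just the typical defaults, so adjust them to your site), the container only needs the config directory and the munge socket from the host:

    # Illustrative sketch of the local scenario: bind-mount the host's Slurm config
    # directory and the munge socket so the containerised client tools can talk to
    # the cluster without running any extra daemons inside the container.
    docker run --rm -it \
      -v /etc/slurm:/etc/slurm:ro \
      -v /run/munge:/run/munge:ro \
      my-slurm-client-image squeue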
However, for the use case on the right, you have the distributed or remote use case, and in that case you may run your Slurm client container in a different service, in a different network, or you may run it on Kubernetes, or somewhere else.
In that case, you obviously can't just do the bind mounts, because you need to give it all those things. So you would have to give it all the Slurm config files, and somehow the munge shared key, so that your external service can talk to your cluster, specifically to the Slurm controller.
Now, this is an excerpt from a Dockerfile. This is the naive approach, this is how I started trying things. Easy, right? You just take the slurm.conf and copy it to the destination, and this will absolutely work. But I was not happy with this approach,
because then you end up managing two copies of your slurm.conf, and when you do configuration management and automation of your infrastructure, I really like having a single source of truth. And managing it this way with containers is very fiddly,
because it's very easy to forget to update it, or something will fail to update automatically. It's just not ideal. I didn't like this approach, but it will work, it will work.
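The slide showed something along these lines, a minimal sketch of the naive approach (base image and paths are illustrative, and the package installation is elided):

    # Sketch of the naive approach: bake a copy of the cluster's config into the image.
    FROM rockylinux:9
    # ... install the Slurm client tools here ...
    # A second copy of slurm.conf now lives in the image and must be kept in sync by hand.
    COPY slurm.conf /etc/slurm/slurm.conf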
And some of you who know Slurm may say: oh, but Pablo, why wouldn't you just use Slurm's configless feature? So, Slurm configless is a feature since Slurm 20 or so that basically allows a client to just pull the config files from Slurm. So the slurmd daemons that run on the worker nodes, when they start, will just grab the Slurm config files, so you can remove the need to even copy the Slurm config file, right?
Well, it's a trick question. Not necessarily, because then you need to run a slurmd daemon in your container, and you also need the munge daemon, and it sounds easy, but it's really not, trust me.
You will need to do a lot of hacks. This is an excerpt from a container image I was creating, and you run into lots of awful, awful things. Like, the slurmd daemon expects the release_agent file to exist in the cgroup, and containers just don't create it.
I tried it on Docker, I tried it on different Kubernetes versions, it just doesn't exist. I don't know why, I couldn't find out why, if anybody knows, please tell me. Googling around, I found it could have been related to some privilege escalation issues. However, if you just remount the cgroups, the file appears, so I'm not sure what's going on there.
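For what it's worth, that remount workaround boils down to something like the following (cgroup v1, purely illustrative; the exact controller and paths depend on the container setup):

    # Illustrative hack only: re-mount a cgroup v1 hierarchy inside the container,
    # after which the release_agent file that slurmd expects shows up.
    umount /sys/fs/cgroup/freezer 2>/dev/null || true
    mount -t cgroup -o freezer cgroup /sys/fs/cgroup/freezer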
Another fun story is that, for instance, if you're using Kubernetes, Kubernetes likes to give you a symlink to your secrets, and munge refuses to take the secret from a symlink for security reasons, which makes sense, so this will not work, so you'll need to put in hacks,
and it's hacks on top of hacks on top of hacks, just to run these two daemons. And yeah, I was not very happy with this approach either. So basically, when you arrive at this situation, you're faced with two options. Either you do the first naive approach,
where you just copy all the stuff into your Slurm container, and you manage a copy of your Slurm config files, but as I said, if you want a single source of truth, this might not be ideal. In that use case you also need munge, of course, and you need to supply the munge key.
Or you can try the configless approach, but then you need to add slurmd to your container, so it can pull your config files via configless, but then you also need munge anyway, and you need to add the munge key to your container somehow. And managing secrets, I mean, if you're running Kubernetes or some other container manager, it might not be a big issue, but you will still need to maintain all these extra daemons with nasty hacks,
and we don't always like having lots of hacks in our infrastructure. There's a third option, by the way, which is trying to go secretless, where you try to use JSON Web Tokens in combination with configless, but it gives lots of issues. It doesn't really work.
I tried it, so I didn't include it here. Just mentioning it in case somebody thought about it. So, Pablo, you talked about the bad and the ugly. What about the good? Is there any good part to this? I'm glad you asked. Yes. What if we had a single-shot CLI tool,
a very simple tool that was just able to authenticate to the controller, either using munge or JSON Web Tokens, which Slurm also supports, and just fetch the config files, and then it's done. That's all you really want to do, right? Because then your tools, the Slurm tools, can work,
because they have their Slurm config files, and just by having the JSON Web Token in your environment, you can talk to the Slurm controller. And, yeah, that's the tool that I wrote. It's a very simple tool. It just does exactly what I described there, and it's open source. You can find it on GitHub.
I uploaded it in the past month. Fun story about this. As I said, I had the idea for this when I was back at CERN. I worked on this a year ago already, but then I somehow lost the source. I don't know what happened. Just before I left CERN, the source was just lost.
I don't know why. I must have deleted it by accident. I don't know what happened. So after I left CERN, I kept in contact with my ex-colleagues, and they were telling me that they wanted to do this integration between the SWAN, which is the... Who here knows SWAN? Anybody? Okay, one, two, three. Yeah, so it's the Jupyter notebook service for CERN,
which also does analytics. And we wanted to connect it to Slurm, and we ran into all these issues, because this is a service that's exposed to the whole internet, so we didn't want to have the munge key for the Slurm cluster in the container, etc. Anyway, so then I left CERN, and then, yeah, my colleagues were telling me, oh, it would have been so useful to have this, and what a pity.
And then, a few months ago, I just didn't like the fact that I had lost the source and all those days I had spent reverse-engineering the Slurm protocol. I just didn't like losing it, so I rewrote it, more properly, in Python,
and just made it public. So if you're interested in making client containers like this, feel free to give it a try. It looks a bit like this. It's very simple. You can choose between munge or JWT (JSON Web Tokens) authentication. If you choose JWT, which is the simplest one, you just need an environment variable with a token, and you can tell it where you want to store the config files, and then you have verbosity as an option. So it's very simple. It has very few dependencies.
And the tool talks several Slurm protocol versions, because with every major release, Slurm changes the protocol version. So you can list them with -l,
and it will show you basically all the versions that it supports. So imagine you have a Slurm JSON Web Token in an environment variable. You can just tell it to do JSON Web Token authentication with the server. It supports multiple controllers, in case you have high availability set up in your Slurm cluster,
so you can specify a list of servers that it will retry until it succeeds. And then you tell it the protocol version of the slurmctld, because it needs to know which protocol it should talk. Protocol version negotiation, I think, doesn't exist in the Slurm protocol, so you have to tell it which version you want it to talk, and that's it.
And then it will just download the Slurm config files, and happy days for your containers.
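Putting that together, a client container could do something along these lines. The option names below are paraphrased from the talk rather than taken from the actual CLI, so treat them as purely illustrative and check the straw README on GitHub for the real flags:

    # Illustrative sketch only: option names are paraphrased from the talk and may
    # not match the real straw CLI.
    export SLURM_JWT="$(cat /secrets/slurm-token)"   # token minted on the cluster side
    straw -l                                         # list supported Slurm protocol versions
    straw --auth jwt \
          --server slurmctld1.example.org,slurmctld2.example.org \
          --protocol 22.05 \
          --dest /etc/slurm
    # After this, the regular client tools (squeue, sbatch, sinfo, PySlurm, ...) find
    # their config files and can authenticate to the controller with the token.
    squeue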
Conclusions. I think I'm ahead of time. So this tool, called Straw, can simplify the cost of creating and maintaining your Slurm client containers. It can also increase security, because you don't need to put the munge key everywhere you're running your client containers; JSON Web Tokens suffice. Caveats, caveats. I think this tool should not exist, because ideally this would be supported upstream.
So if anybody has any influence on SchedMD's Slurm development, yeah, I think it would be nice if we had this built into Slurm. And then the second caveat is that the JSON Web Token needs to be associated with a Slurm user, basically.
So ideally, you would be able to just generate a JSON Web Token for a user that's going to run on the Slurm cluster, and then if the secret for some reason is exposed, you've only exposed the JSON Web Token of a single user.
However, this is a limitation built into Slurm, basically. You cannot pull the Slurm config file over the protocol unless the token belongs to the Slurm user or to root. Still, I think it's an improvement over having your munge key available everywhere. And yeah, feel free to try it out.
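For reference, when JWT authentication is enabled on the cluster, such a token can be minted with scontrol; the username and lifespan below are just examples:

    # On a node of the cluster (with auth/jwt configured), mint a token for a user.
    # This prints a line of the form SLURM_JWT=eyJ..., which you then place in the
    # environment of the client container.
    scontrol token username=myuser lifespan=86400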
That was it. I'm happy to answer any questions you might have. Thank you very much, Pablo. Time for questions. So what kind of clients do need the config file?
Could you do everything over REST nowadays? Is it still necessary to use the config file? Yes, so anything that wants to run srun, sbatch, squeue, sinfo. For instance, if you have the Jupyter notebook plugins, they will just run those commands. Or if you want to run a client that uses PySlurm, for instance,
or any library really, anything that uses libslurm underneath, will automatically read the config files, right? So of course, you can write your own client, handwritten from scratch, that just interacts with the slurmrestd to do stuff.
Yes, but you cannot leverage all the existing user client tools. And the libslurm, PySlurm, et cetera. So if you want to create a Python tool, for instance, that leverages PySlurm, this would be, I think, a good solution.
I think Slurm does have a REST API, but it's considered very insecure. So even the documentation tells you, don't use this. I just don't understand, for a long time now, why everyone needs the config file, right? I mean, why does it need to be in sync? Couldn't they just exchange the information over the protocol now
and just say, this is your Slurm server? Yeah, that's the configless feature. That's the configless feature, essentially. Yeah, but the configless feature just downloads the config. Look, configless, OK. Download the config. I don't need the config beforehand. It's like serverless. There's always a server somewhere. Yeah, exactly.
So that's just how Slurm works. So I'm still a little confused about the Slurm client container. So the container is an application on the actual Slurm client? Because you have to document in the slurm.conf,
you have to sort of say what your clients are so that the scheduler can intelligently decide how to schedule jobs, right? I'm missing something. No, you don't really need to declare all the clients for Slurm. You just need to declare the worker nodes that are a part of it. But you can have any, I mean, it depends on how you've configured it.
You can limit it. You can limit in Slurm which clients are allowed to connect, but you don't have to. But even if you do, you will need this, because even if you authorize a hostname to connect as a client, it will need to have the munge key and the Slurm config files, et cetera.
Does this answer your question? Well, no. So in the slurm.conf, you sort of detail what your partitions are, and you have to kind of tell it what the capabilities are of your clients, of your Slurm clients, right, so that Slurm can decide how to schedule jobs. I'm missing something here. Well, I think you're thinking about the compute nodes. Yeah, I am. Yeah, the NodeName part of the slurm.conf.
Right, so the containers run on the compute nodes. No, the containers would be, let me go back to one of the slides. So you're thinking maybe about the compute nodes, each of which runs a slurmd daemon, and those you have to declare, yes. I think in 2023, by the way, you will be able to dynamically spawn compute nodes,
but that's the future. What I'm talking about is all the users and client tools that connect to the controller to run squeue, sinfo, like when you use Slurm. So if you had some tooling that you automated to gather metrics from Slurm, or,
yeah, a Jupyter notebook service, for instance, that connects to your cluster, that wants to launch jobs, that wants to run sbatch, squeue, whatever, that's in that domain. Yeah, I mean, the newest Warewulf, Warewulf runs containers. I mean, I think the newest version of Warewulf
is set up to run containers on the Slurm clients, right? It's sort of, you're actually launching containers as applications, so that was kind of, that's on the compute nodes. On the compute nodes, yeah. Yeah, yeah, that's the compute nodes. Thank you for your talk. So I have a question.
You are saying that you can pull the configuration with your tool, but there are many config files you can't pull with configless. For example, all the SPANK plugins, or, I think, topology you can pull, but, like I said, SPANK plugins and so on. So how do you manage these kinds of config files
that are not handled by default by Slurm? Right, that's correct. So when you use the configless feature, it will download the slurm.conf, the cgroup.conf, a lot of config files, but it will not download your plugins, your plugin files. But I think those are usually not needed if you're running a client, because those are usually just needed
for the slurmd daemons, right? Even for the worker nodes. Like the epilog, the prolog, you mean all of those plugin scripts, right? The authentication plugins. Those are usually needed by the slurmd daemon, but if you're just writing a client, say you're automating something with PySlurm to interact with it, you don't need those files.
And you can happily run sinfo, srun, sbatch, or squeue. You can happily run all of those commands without those files. Yeah, okay, so if I just summarize, the idea is just to create some front-end nodes, but not really worker nodes. Is that right?
So if you want to use configless to set up a front-end node, you might need those files from somewhere else. But if you're just creating a container to interact with Slurm and send Slurm commands, you don't need them, basically. Because the plugin files are usually the,
yeah, the epilog, the prolog for the slurmd or the slurmctld, and that's not what these Slurm client containers are about. So short answer, you usually don't need them.
Hello, thank you for the talk. I'm wondering, in huge institutions like CERN or EPFL, would you run your own forked or patched Slurm so you could fix maybe the authentication privileges,
or is it just not done because it's... I've never carried any Slurm patches, to be honest. Both at CERN and at EPFL, we just use Slurm out of the box. It works well enough for our use cases. It is true that you could, for instance, do a patch to enable finer granularity for the permissions.
For instance, you could enable any user to pull the config file. That would be a nice patch. We don't do it, yeah. Okay, thank you. We have time for one short question. Hi, thanks. We actually are very interested in this because we have a JupyterHub front end
that actually talks to a Slurm cluster through SSH, because we don't want to install all that stuff, like munge and the full Slurm deployment, into the JupyterHub host. And I'm wondering, how does it actually talk to a Slurm controller? So is the Slurm controller always listening to any host
that will talk to it, or are there any restrictions on who is connecting to the Slurm control daemon? So there's an AllocNodes setting in the slurm.conf, I believe, which will allow you to restrict from which nodes you can allocate resources.
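For reference, AllocNodes is a per-partition parameter in slurm.conf; a small sketch with made-up partition and host names:

    # Only allow allocations/submissions to this partition from the listed login hosts.
    PartitionName=batch Nodes=node[001-100] AllocNodes=login01,login02 Default=YES State=UP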
So you can limit it. However, if you don't have that, I think Slurm will happily accept anything because if you have the shared secret, it's considered good enough. Or a valid JSON web token. Yeah. Thank you. Thank you very much, Pablo.
Thanks.