
Playing with Nix in adverse HPC environments


Formal Metadata

Title
Playing with Nix in adverse HPC environments
Title of Series
Number of Parts
542
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
What happens when you have access to large clusters, but have little control over the software installed on the machines? Unfortunately, this is the scenario that researchers often find in HPC clusters: a very old software stack, a brittle environment and non-cooperative sysadmins. We have been experimenting with Nix to provide an up-to-date system running on top of the already existing software, without root permissions, with the help of user namespaces. In this talk we give a tour of the problems we found and how we solved them:
- Installing and configuring Nix for use by multiple users when we lack a shared /nix store.
- Avoiding library contamination from /usr/lib with an isolated root mount.
- Interactive development while compiling the code inside the isolated environment with a patched nix-portable.
- Adding custom compilers to the stdenv.
- Building packages tuned to a specific CPU with vectorization optimizations in mind.
- Running the benchmarks with SLURM inside the isolated environment with multiple compute nodes.
- Improving MPI fast zero-copy transfers inside user namespaces.
Transcript: English(auto-generated)
Hello, and for the next talk. Thank you, hello.
I hope you can hear me online well; if not, complain in the chat. Okay, so we are going to talk about... No? You can really yell, that's fine. Okay. We are going to talk about HPC, high-performance computing, and Nix,
and how we kind of deal with that. My name is Rodrigo, and my coworker Raul is here. First, a bit of what we do. Essentially, we work on a parallel concurrent task-based runtime,
similar to OpenMP, if you are familiar with it. We also need to work with a compiler based on LLVM to read these pragmas in the code and transform them into function calls to the runtime. In our job, the performance is critical,
so we really need to take care. And in general, we execute the workloads on several hundreds or even thousands of CPUs. Here's a little example of something that we have observed. We have a program here that runs,
and here you can see the CPUs, and this is the time of execution. And we are examining this little point here, because the time here is slightly bigger than what is normal. And we can see that the problem is that the allocator took a bit longer.
So this is just an example. In general, HPC, or high-performance computing, is just a lot of machines connected by fiber optics. They are managed by this SLURM daemon, which allows you to request a certain number of nodes.
We don't have root on any node. In general, there are very old kernels and a very old software stack. So, yeah, we are stuck with that. And in general, the state of the art now is to use LD_LIBRARY_PATH to load other software and change versions. The problem with this technique is that it's not very easy to reproduce.
So the question is, can we benefit from using Nix? In general, we will get up-to-date packages and configuration options for every package, no more LD_LIBRARY_PATH, and we can track everything that we use for an experiment.
The problem is we don't have root, so we cannot install the Nix daemon as we would like to do. So let's take a closer look at what we do and how we do it. In general, we work in these three hats, so to say.
In the development side, we take a program, and we compile it several times until it actually compiles. We kind of need to do this cycle quickly, so we want the compilation time to be very low. So we need to reuse the already built tree to run the build command.
When we are finished, we switch to the experimentation side, and we run this program on the machine. And maybe we need to tinker with the arguments or the configuration file of the program to get some results that we want to examine. And then we also do some visualization of the results,
but we are not going to talk about that in this talk. So we will focus first on the experimentation and later on the development side. So, a bit of what we did: we tried individual installations of the Nix store by using user namespaces.
The problem is that the number of packages grows, so we would like to share the store among several users. So we use an auxiliary machine where we actually have a Nix daemon, and then we can perform the build on that machine, and then use the post-build-hook to execute a script
that copies the output derivation to the actual cluster. The problem is, inside the cluster, nix-store doesn't work. So we wrap the nix-store command in a shell script, and when it's invoked by the auxiliary machine,
it creates a namespace where it mounts the Nix store, and then it runs the real nix-store and receives the derivation, so we can actually copy it over SSH. We also tried to patch the Nix daemon to run inside the machine, but it's a bit complicated, because we cannot even run a user daemon there.
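The nix-store wrapper described here might look roughly like this. This is a minimal sketch, not the authors' actual script: the paths, variable names, and the DRY_RUN guard are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the `nix-store` wrapper placed first in PATH on the
# cluster. When the auxiliary build machine copies a derivation over SSH, the
# remote side invokes `nix-store`; this wrapper re-runs the real nix-store
# inside a user+mount namespace where the store kept in $HOME is bind-mounted
# at its canonical /nix path.
set -e
NIX_ROOT="${NIX_ROOT:-$HOME/nix}"                       # store content in $HOME
REAL_NIX_STORE="${REAL_NIX_STORE:-$NIX_ROOT/nix-store}" # real binary (assumed)

inner="mkdir -p /nix && mount --bind '$NIX_ROOT' /nix && exec '$REAL_NIX_STORE' $*"
cmd="unshare --user --mount --map-root-user sh -c \"$inner\""

if [ "${DRY_RUN:-1}" = 1 ]; then
  echo "$cmd"   # print only: unprivileged user namespaces need kernel support
else
  eval "exec $cmd"
fi
```

With the namespace entered, the real nix-store sees /nix at the path every store entry was built for, so receiving derivations over SSH works unmodified.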
Okay, so let's focus on the experimentation cycle. The first requirement, the most important thing, well, assuming that you already have a program that somehow you built in a sandbox,
we want to execute this program on the machine, and we want to make sure that this program doesn't load anything that is outside the Nix store. So, especially, LD_LIBRARY_PATH may have some path that actually has libraries for your program, so we don't want that,
and also it may use dlopen to load other libraries. So, ideally we want something like nix build with a sandbox that prevents access to /usr or /opt. And it needs to work inside SLURM too. Another requirement that we need is for MPI,
the communication mechanism, to use the process_vm_readv syscall, which only works if the processes are inside the same namespace. So we solve this by running a check that checks if the namespace is already created, and if so, we enter it; otherwise we create another one.
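The join-or-create check might be sketched like this. The pid-file handshake and the option names are assumptions for illustration, not the authors' code:

```shell
#!/bin/sh
# Hypothetical sketch of the check each SLURM-spawned process runs: if another
# rank on this node already created the namespace, join it (so MPI can use
# process_vm_readv between ranks); otherwise create a new one.
set -e
PIDFILE="${PIDFILE:-/tmp/nix-ns-$(id -u).pid}"

if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
  # A live process advertises the namespace: enter its user+mount namespaces.
  cmd="nsenter --user --mount --target $(cat "$PIDFILE") $*"
else
  # First process on this node: create the namespace and advertise our pid.
  echo $$ > "$PIDFILE"
  cmd="unshare --user --mount --map-root-user $*"
fi

if [ "${DRY_RUN:-1}" = 1 ]; then
  echo "$cmd"   # print only; the real script would `exec` it
else
  eval "exec $cmd"
fi
```

Because all ranks on a node end up in the same namespace, process_vm_readv can copy message buffers directly between their address spaces.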
So let's take an overview of how this works in the cluster. We have here the login node and two compute nodes that were given to us for running our program. In general, we have to wait a bit after requesting the nodes, that is fine.
After this moment, we take a shell that is connected to one of the allocated nodes. These are the nodes, and each node in our case has two sockets. So we usually run one process per socket, and we talk to one of them only.
Inside this process, we don't have Nix. So we first load this namespace by using our script, and then we can run other programs like srun, which is the client that will launch the workload
that is inside the Nix store. So we can compile programs and link them to this specific version of Slurm. After that, it requests the Slurm daemon to execute something in parallel, and the Slurm daemon forks, on every node,
one process that will run something, but it's outside the namespace because it's not controlled by us. So we execute our script again to join the namespace if it's found; otherwise we create another one, like in the second compute node. And here we can see that we can communicate
inside the same node, because they are both in the same namespace, and between nodes we use the fiber-optic network. Another requirement is that we need custom packages, and we do that with this technique
where we define a callPackage function that takes priority over the upstream attribute set. So we can shadow software that is provided upstream in Nixpkgs, and our version is used first, so we can hack on those
without disturbing the whole package set. Another thing that we need is to define packages with custom compilers. In general, we use LLVM with a custom runtime, so we use wrapCCWith and inject a little environment variable so we can load our runtime
without needing to recompile the compiler. We also need, unfortunately, proprietary compilers, and we use rpmextract and the autoPatchelfHook to fix the headers so we can run them on Nix too and compile derivations with them.
Now we will talk about the development cycle. Okay, let's move on to the development cycle. In general, the development process consists of getting an application,
adding some new and cool features to it, breaking things, testing and re-testing that is okay, and this interactive workflow requires frequent changes in the source and compilation steps. For this reason, Nix build is not good to work with
because every change in the source will trigger a full copy of the source to the Nix store and a full compilation. With big repositories this is a problem because, for example, in the slide we can see how much time it takes to build LLVM on a 32-core machine we have,
so it's a big machine, and we can see that, although we use ccache, we are talking about different orders of magnitude compared with simply reusing the previous build. Another alternative could be using nix-shell to get our tools to build the application,
but this environment is not isolated from the system, and we can find software that includes hard-coded paths that lead into the system, like in this case with a CMake module file of ROCm, which is the GPU stack from AMD, for those who don't know what it is.
And if we take an application that uses ROCm and configure it and check the log output, we can see that, at the end, the installation selected is the system one, instead of the Nix package we want. An isolated environment will prevent us from this situation,
avoiding the necessity of patching the source to solve this problem. Our solution for these two requirements is to first build an isolated environment with a tool we named nixwrap. nixwrap is a script that uses bubblewrap
to enter a user namespace where the Nix store is available, but not the system directories, like, in this case, /usr. And in this environment we can launch our Nix tools, like, for example, nix build. And this works because inside the namespace,
nix build creates a new sandbox in a nested namespace, so the environment is not affected. And the most powerful feature of it is running nix-shell inside this isolated environment, to get your tools to build your application in an isolated environment,
so you don't have to worry about access to the system. And here is the previous example, LLVM, reusing the build. And finally, if you are using SLURM, you can execute your application by putting nixwrap between the SLURM step that forks the processes and your application.
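A rough sketch of the nixwrap idea with bubblewrap; the exact binds are assumptions, and this is not the authors' actual script:

```shell
#!/bin/sh
# Hypothetical sketch: use bubblewrap to enter a namespace where the Nix store
# and the current directory are visible, but the host's /usr, /opt and /etc
# are not, so builds cannot pick up host libraries or configuration.
set -e
cmd="bwrap \
  --ro-bind /nix /nix \
  --bind $PWD $PWD --chdir $PWD \
  --proc /proc --dev /dev --tmpfs /tmp \
  --unshare-user --die-with-parent \
  $*"

if [ "${DRY_RUN:-1}" = 1 ]; then
  echo "$cmd"   # print only; the real script would exec bwrap directly
else
  exec $cmd
fi
```

Usage would be something like `nixwrap nix-shell` for interactive builds, or `srun nixwrap ./app` so each SLURM-launched process enters the sandbox first.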
after the SLURM step for process and your application. Another requirement for us is since we are in an HPC environment, we want to get the best performance of the applications. And for this reason, we need to build the critical performance software
with CPU optimization flags. Our solution for this situation is to override the compiler wrapper injected flags by overriding the host platform attribute, specifying the architecture and other stuff to the compiler in the standard environment NoCC.
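The platform override might be sketched like this; the CPU name below is just an example, not necessarily the machine from the talk:

```nix
# Sketch: import nixpkgs with a host platform that carries the target CPU, so
# the compiler wrapper adds -march/-mtune to every build it drives.
let
  pkgs = import <nixpkgs> {
    localSystem = {
      system = "x86_64-linux";
      gcc = { arch = "skylake-avx512"; tune = "skylake-avx512"; };
    };
  };
in
  # stdenvNoCC from this package set is then combined with the custom
  # compiler wrapper to build the performance-critical software.
  pkgs.stdenvNoCC
```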
And finally, we create the standard environment we will use to build our software with this compiler wrapper. So, I will talk about the conclusions.
In general, we can actually benefit from using Nix. But obviously, we have some drawbacks. These cycles that I was talking about, we can still do them very fast. So yeah, it's very nice for us. And also, if we have the chance to get something like a Nix daemon
without the root requirement, and still be able to share the Nix store, that would be awesome. Thank you very much. We have five minutes left for questions. If there are questions.
When I was working in an HPC environment, there was always an issue with disk space. How does it work with a dynamic Nix store where people could just, say, dump anything into the store? Can you repeat the question first? Yeah, so the question is how we can manage a Nix store where users can install things,
and whether that can be an issue for disk space. So, in general, right now, we have about 300 gigabytes of storage. For our particular group, we have around two to three thousand gigabytes of space available.
In general, in HPC, people tend to use a lot of the space. But if we share the store, that will be the best solution, instead of every user having their own installation. And also, when someone tells us, please use less space, we run the garbage collector.
Thank you. Yeah. So, you said that the state of the art was people using LD_LIBRARY_PATH on the machines. Did you consider using, or can Nix use, RPATH
instead of using RUNPATH? Because that would get rid of your problems there. And then the other thing you can do is, there's a talk in the HPC devroom about rebinding the path to the .so. OK. But that's a little bigger hammer. OK.
I see your Spack T-shirt from here. OK. So, the question is about using RPATH. RPATH, yeah. Because it takes precedence over LD_LIBRARY_PATH. You don't have to worry about the user being stupid. Yeah. So, the problem is that you can see programs
using dlopen to load their own... They don't... I'm sorry? dlopen respects RPATH too. Ah. OK. I didn't know that. We're doing something wrong. OK. Unless you can do like a define, what does it matter?
OK. So, dlopen is not the only problem, because we also see software trying to read its /etc configuration file somewhere. And we also want to prevent that.
Yeah. In general, we saw that it is safer to prevent the programs from accessing any path than trying to find every single option that the program can use to access one. There was still one eager question over there. Can we find the nixwrap script online?
Yeah. I think I will upload it to the FOSDEM page. Any other questions? Not so much a question but a bit of a shameless plug: the main blocker for having a rootless Nix daemon was merged last week or the week before. So, hopefully that's going to eventually solve the third of your points.
Perfect. Thank you. So, what about proprietary libraries on the system? Are you only envisioning that you would install the libraries, things like MPI, through Nix? Because that's not always possible. Yeah, it's a very good question.
For now, we have been very lucky to be able to work with only proprietary packages that can be put inside Nix. But it may happen that some proprietary thing doesn't work. So, we don't have a solution for now. One more round of applause. Thank you.
Can I just switch it over again?