How the Spack package manager tames the stat storm
Formal Metadata
Number of Parts | 542
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/62041 (DOI)
FOSDEM 2023, 67 / 542
Transcript: English (auto-generated)
00:07
All right, it works, cool. So welcome all to this talk. It's my first time at FOSDEM, so I'm excited. I will be talking about taming the stat storm in Spack.
00:20
So what is the stat storm and why should it be tamed? This term was coined, let's say, by the Guix developers, who happen to be here as well. And for you to be affected by this problem, you need a few ingredients. One is a package manager that installs every package into its own prefix.
00:44
Nix, for instance, or Guix or Spack. You need a loader, like a dynamic loader or an interpreter like Python, that has to locate dependencies at application startup. And you need a typical HPC file system that is slow and shared.
01:03
And with all these ingredients, you get horrible application startup times, and that is what the stat storm is about. First, before we look into the problem more, a little bit about Spack. I guess Todd will also introduce it, but Spack is a flexible package manager,
01:24
primarily targeted at HPC. One of the nice things that attracted me to it was that you don't need root privileges to start using it, and it builds on top of your distro, so it can also integrate with MPI libraries that are already there.
01:41
It supports installing multiple flavors of the same package, and I'm saying flavors because a version does not really describe a specific package. There are usually tons of compile-time toggles, you can swap dependencies in and out, and, well, a version does not uniquely describe a package.
02:03
And it also comes with a very powerful dependency solver, and it is quite easy to contribute to because the package recipes are written in Python. So for example, below here is how you could write part of a recipe to specify a conditional dependency on Python
02:24
under these particular conditions. And if you're used to apt-get install or whatever, you can basically do that with Spack too. You can say spack install fftw, but you can also be more precise.
02:42
That is unique about Spack, I think. You can say spack install fftw and set variants, which are compile-time toggles, such as precision: I want float and double. I want FFTW to be compiled with MPI, and the particular provider for MPI will be MPICH,
03:01
and it should not be version 4 but constrained to version 3. So that is the input you give to Spack, and then it goes for a think and spits out a dependency graph with all the details filled in. That is the concretization process.
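The command line described here can be sketched as follows. This is only an illustration of how the pieces of the spec fit together: the helper `spack_install_argv` is a hypothetical name, not part of Spack, and the snippet only builds the argument list without running anything.

```python
def spack_install_argv(package, variants=(), dependencies=()):
    """Build the argument list for a `spack install` call from spec pieces."""
    argv = ["spack", "install", package]
    # Variants are the compile-time toggles, e.g. "+mpi" or "precision=float,double".
    argv.extend(variants)
    # A leading "^" constrains a dependency, here the MPI provider and its version.
    argv.extend("^" + dep for dep in dependencies)
    return argv


argv = spack_install_argv(
    "fftw",
    variants=["precision=float,double", "+mpi"],
    dependencies=["mpich@3"],
)
print(" ".join(argv))
```

Passing the list to `subprocess.run` would start the actual installation; that step is left out on purpose.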
03:20
That is what it is called in Spack. The result can then be installed, and every package, as I said, will be installed into its own prefix. The directory name of this prefix will contain a hash derived from the dependency graph, and that allows you to have multiple flavors of the same package installed at the same time.
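A minimal sketch of the idea of a hashed prefix, assuming a made-up spec dictionary and SHA-256 over its JSON form. Spack's real hashing scheme and directory layout differ; `prefix_for`, the install root and the version numbers below are illustrative only.

```python
import hashlib
import json


def prefix_for(install_root, spec):
    """Return a per-package prefix whose name embeds a hash of the full spec."""
    # Serialise deterministically so that equal specs always hash equally.
    canonical = json.dumps(spec, sort_keys=True).encode()
    digest = hashlib.sha256(canonical).hexdigest()[:12]
    return f"{install_root}/{spec['name']}-{spec['version']}-{digest}"


# Two flavors of the same package version that differ only in the MPI provider.
fftw_mpich = {
    "name": "fftw",
    "version": "3.3.10",
    "variants": {"mpi": True},
    "dependencies": {"mpich": "3.4.3"},
}
fftw_openmpi = dict(fftw_mpich, dependencies={"openmpi": "4.1.4"})

print(prefix_for("/opt/store", fftw_mpich))
print(prefix_for("/opt/store", fftw_openmpi))
```

Because the dependency information feeds the hash, the two flavors get different directories and can coexist.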
03:43
So that makes Spack an intentionally non-Filesystem-Hierarchy-Standard-compliant package manager: there is no root-level bin directory or lib directory. Everything is in its own prefix. So it can look like this, for instance.
04:04
But then the problem is that packages actually have to be located at runtime. I guess the classical solution in HPC, and it's not unique to Spack, is to use module files, for example. So you log into a system, you do module load fftw,
04:22
and before you know it you have dozens of kilobytes of environment variables set. For binaries, typically LD_LIBRARY_PATH is filled with stuff, and for Python you have PYTHONPATH, etc. This is not necessarily great, because, especially for Spack,
04:41
if you want to use Spack executables and also system executables, and Spack sets LD_LIBRARY_PATH, then your system executables may change behavior. Or if you use two different Spack executables with conflicting dependencies and you have this one global variable, that might also lead to issues.
05:04
So these environment variables are not the way to solve it. Let's focus just on ELF binaries, so executables and libraries on Linux. If you have an executable,
05:23
an ELF executable, in there you find a section that says what interpreter to use, and that is the dynamic loader. That is at least one thing that is mentioned by absolute path, so that is what the kernel finds. The kernel starts the loader.
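To make the "interpreter" entry concrete, here is a small sketch that reads it from an ELF file using only the `struct` module. It assumes a little-endian 64-bit ELF image and does no further validation; `read_interpreter` is a hypothetical helper name.

```python
import struct

ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # little-endian ELF64 file header
ELF64_PHDR = struct.Struct("<IIQQQQQQ")            # one 64-bit program header
PT_INTERP = 3                                      # segment type that holds the loader path


def read_interpreter(data):
    """Return the PT_INTERP path of a little-endian ELF64 image, or None."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    fields = ELF64_HEADER.unpack_from(data, 0)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    for index in range(e_phnum):
        offset = e_phoff + index * e_phentsize
        p_type, _, p_offset, _, _, p_filesz, _, _ = ELF64_PHDR.unpack_from(data, offset)
        if p_type == PT_INTERP:
            # The segment content is the NUL-terminated absolute path of the loader.
            return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None
```

Calling it on the bytes of a dynamically linked executable returns the loader path the kernel will start; statically linked binaries have no such segment, so the function returns `None`.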
05:40
The loader interprets the executable. It needs to find a bunch of libraries listed in the dynamic section. It recursively finds these libraries, and that's basically how the story goes. And we want users to be able to run executables without all the magic, opaque variables.
06:02
And the typical solution, which I think is shared among Nix, Guix and Spack, is to have a compiler wrapper that injects RPATHs. RPATHs are basically binary-local search paths, so that you don't have to use global variables anymore.
06:21
The exact behavior of the loader then kind of depends on what libc you use. So for instance in glibc the RPATHs beat LD_LIBRARY_PATH during the search, which is also something that Spack exploits. That's how Spack can actually run executables on messy HPC systems that do set these variables for other things.
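The precedence can be sketched as a list-building function. The order follows the description in the glibc `ld.so` manual page, where `DT_RPATH` is only honoured when no `DT_RUNPATH` is present; the function name, the default system directories and the `<ld.so.cache>` marker are simplifications made for this sketch.

```python
def glibc_search_order(rpath, runpath, ld_library_path, system_dirs=("/lib", "/usr/lib")):
    """List directories in the order glibc's loader tries them for one object."""
    order = []
    if rpath and not runpath:
        order += rpath            # DT_RPATH is used only when DT_RUNPATH is absent
    order += ld_library_path      # then the LD_LIBRARY_PATH environment variable
    order += runpath              # DT_RUNPATH comes after the environment
    order.append("<ld.so.cache>")  # then the global loader cache
    order += system_dirs          # and finally the default directories
    return order


# An RPATH entry wins over a directory that a module file put in LD_LIBRARY_PATH.
print(glibc_search_order(["/store/fftw/lib"], [], ["/cluster/modules/lib"]))
```

This is why a binary with RPATHs keeps finding its own libraries even when the surrounding environment sets the variable for other software.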
06:43
On musl libc the search behavior is slightly different, but you don't see musl that much in HPC anyway. However, there is a cost to RPATHs, and that is the runtime search. Normally, for system executables, what happens is that if you start the executable,
07:02
the loader will basically just loop over the things that it needs and look them up in a global cache, the loader cache. And this is quick, and nobody complains about startup times, I guess. In Spack you have at least a double loop,
07:21
and in glibc you maybe even have a triple loop. You loop over the needed libraries, the search paths (the RPATHs), and then there are hardware-related subdirectories of the RPATHs where you can maybe find optimized libraries, which is actually kind of redundant in the Spack world because we optimize every package
07:42
for a specific target, so there is no need to loop over subdirectories. Well, in any case there is this triple loop where eventually there is some syscall. And generally that is not a big deal, because in general,
08:01
how many dependencies do you really have? If you look at git, maybe there are three packages involved, so there is not a whole lot of searching going on. But things get really wild if you look at, for instance, Emacs with GTK support. This is not the whole graph;
08:20
it doesn't fit on the slide to show all the dependencies. You can get, I don't know, about 150 libraries with about 700 DT_NEEDED entries in the binaries, so there is a lot of runtime overhead, or startup overhead.
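The triple loop can be mimicked with a toy model that counts file system probes. It is a deliberately naive sketch: it ignores any caching a real loader does, `count_probes` is a hypothetical name, and the 150-library layout below is synthetic rather than a measurement.

```python
def count_probes(needed, rpaths, hwcap_subdirs, exists):
    """Count file system probes a naive RPATH search makes until each library is found."""
    probes = 0
    for lib in needed:                            # loop 1: libraries to locate
        found = False
        for rpath in rpaths:                      # loop 2: RPATH entries
            for sub in (*hwcap_subdirs, ""):      # loop 3: hardware subdirs, then the dir itself
                probes += 1
                candidate = f"{rpath}/{sub}/{lib}" if sub else f"{rpath}/{lib}"
                if exists(candidate):
                    found = True
                    break
            if found:
                break
    return probes


# Synthetic worst case: 150 libraries, each living in its own prefix.
libs = [f"lib{i}.so" for i in range(150)]
rpaths = [f"/store/pkg{i}/lib" for i in range(150)]
installed = {f"/store/pkg{i}/lib/lib{i}.so" for i in range(150)}
print(count_probes(libs, rpaths, ["glibc-hwcaps/x86-64-v3"], installed.__contains__))
```

The count grows roughly with the product of libraries, search paths and subdirectories, which is the point being made about large dependency graphs.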
08:42
If you use strace you can actually see what happens, and you get horrible things like about 5,000 syscalls, of which 4,000 are basically searching for a library in a path where it can't be found.
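A rough way to get such numbers is to save a trace to a file (for example with `strace -f -o trace.txt` followed by the program) and count the failed lookups. The sketch below assumes the usual one-syscall-per-line strace text format; the `summarize` helper and the set of syscall names treated as lookups are choices made for this example.

```python
import re

# A syscall line, optionally prefixed with a pid as written by `strace -f`.
SYSCALL = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(")
LOOKUPS = {"open", "openat", "stat", "newfstatat", "access"}


def summarize(trace_lines):
    """Return (total syscalls, lookups that failed with ENOENT) for strace output."""
    total = misses = 0
    for line in trace_lines:
        match = SYSCALL.match(line)
        if not match:
            continue  # skip signal and exit lines such as "+++ exited with 0 +++"
        total += 1
        if match.group(1) in LOOKUPS and "= -1 ENOENT" in line:
            misses += 1
    return total, misses
```

With a saved trace, `summarize(open("trace.txt"))` gives the two numbers to compare before and after a change.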
09:01
So yeah, there is some overhead to it. And I tried this on a production system with a warm cache, and even there the overhead is very noticeable: a lot of time is spent in system time just to, I don't know, print the version of Emacs,
09:21
which should really just print immediately, of course. So with the dynamic loader in Spack you typically have an overhead that shifts towards loading objects, and not towards relocation, which is what the dynamic loader is usually known to be slow at.
09:42
And then in HPC this is really a problem, because typically you don't start one process but a whole series of processes across different nodes. So yeah, there is a good reason to try and improve this. So an obvious solution would be to just switch to static linking,
10:03
because there is no dynamic loader involved anymore. But generally there is still a use for shared libraries, I would say. For one, you can avoid all the symbol clashes. Especially with these huge dependency graphs,
10:22
like with the Emacs example, the odds that you find some symbol that is defined twice are quite high, and shared libraries have good ways to say: this is my public interface and this is my private interface. And if you have clashes in the private interface, well, there is no problem.
10:41
Also, LD_PRELOAD is still nice to have, I would say, like swapping out malloc just to try whether this will improve my performance, for instance. That would be gone with static linking. And there are some other issues: with dynamic languages, if you have to interface with them,
11:02
you kind of have to use shared libraries anyway. The Guix solution, which has been there for, I don't know, over a year at least, is to patch glibc, and basically they create a package-local cache of libraries,
11:21
so that you basically know that a library name maps to a particular path. It is made package-local instead of global, which I think is quite elegant. But for Spack it is not really usable, because it requires glibc (musl doesn't have a loader cache, for instance),
11:42
and it also requires patching glibc, and we currently don't control glibc, so for Spack it's not really an option. Another solution would be to emulate a loader cache by symlinking. So in our prefix we add a bunch of symlinks
12:02
from the libraries that we probably need to the dependencies where they actually are, and then we can reduce the RPATHs to a single one, so there is a single search path. That is easy, it also works for musl, and according to the musl mailing list it is also the recommended way to emulate this cache.
12:22
But there are some technical issues: you can still have relative RPATHs with $ORIGIN semantics, and they become relative to the symlink and not to the actual library in the prefix directory where it lives, so it may not always work.
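The symlink approach can be sketched in a few lines. This is only an illustration of the idea, not code from Spack or musl: `build_symlink_farm` and the directory layout are assumptions, and the `$ORIGIN` caveat mentioned above is not handled.

```python
import os


def build_symlink_farm(farm_dir, libraries):
    """Create one directory of symlinks so a single RPATH entry finds every library."""
    os.makedirs(farm_dir, exist_ok=True)
    for path in libraries:
        link = os.path.join(farm_dir, os.path.basename(path))
        if os.path.lexists(link):
            os.remove(link)       # replace a stale link from an earlier run
        os.symlink(path, link)    # library name -> real location in its own prefix
    return farm_dir
```

A binary would then carry just `farm_dir` as its only RPATH entry, so the loader probes one directory per library instead of one per dependency prefix.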
12:40
Another solution is shrinkwrap. This one is a bit more recent, and it's currently a pull request to patchelf, a NixOS project. Their idea is basically to replace all the DT_NEEDED entries with absolute paths
13:02
of the transitive closure. So if you run ldd on your executable, you get a bunch of libraries out of that, and all of them go into the DT_NEEDED entries by absolute path, and the dynamic loader will do no search; it will just directly open them. So it's interesting. It is built on top of patchelf.
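Computing that transitive closure is a plain graph traversal. The sketch below assumes the direct dependencies of every object have already been resolved to absolute paths and are given as a dictionary; the function name and the paths in the example are made up, and this is not the shrinkwrap implementation.

```python
from collections import deque


def transitive_closure(root, needed_of):
    """Return every object reachable from root via DT_NEEDED, in breadth-first order."""
    closure, seen = [], {root}
    queue = deque(needed_of.get(root, ()))
    while queue:
        lib = queue.popleft()
        if lib in seen:
            continue  # already recorded through another dependency chain
        seen.add(lib)
        closure.append(lib)
        queue.extend(needed_of.get(lib, ()))
    return closure


deps = {
    "/store/emacs/bin/emacs": ["/store/gtk/lib/libgtk.so", "/store/zlib/lib/libz.so"],
    "/store/gtk/lib/libgtk.so": ["/store/glib/lib/libglib.so", "/store/zlib/lib/libz.so"],
}
print(transitive_closure("/store/emacs/bin/emacs", deps))
```

The resulting list is what would be written into the executable's DT_NEEDED entries, so the loader opens each path directly.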
13:20
Patchelf is also used a lot in Nix. At the same time, patching ELF files that way is kind of tedious, there are bugs every now and then, and there are some side effects; I'll talk about that in a bit. But before we look at the current Spack solution, let's step back a bit.
13:41
A typical issue for a user who is not very familiar with loader internals: they build their software on an HPC system, they submit their jobs, and it doesn't work. The loader cannot find particular libraries that were located during the build but not at runtime,
14:01
or they suddenly end up with the wrong libstdc++ or whatever. That is an issue with the discrepancy between the linker and the loader. The basic example is like this: you create a shared library, you create an executable, you link it to that library,
14:20
this is libf, you run the executable, and oh no, it cannot find the library. Obviously we can understand why that happens, but at the same time it's a bit dumb: you just linked it, so why can it not be found right now? In general, of course, we are probably going to install the library, and maybe it ends up in a slightly different location,
14:41
so we cannot fix the path ahead of time. But if you think about Spack, all the dependencies are pretty much fixed in their location in a prefix, so they're not going to move anywhere. So if linking immediately bound the library path,
15:02
that would be great. One way to do that: if you think about what the linker does, it does a whole lot of things, but one of them is that it copies the shared object name of the library you're linking to into the executable or library that needs it. Then, in the dynamic loader,
15:20
it performs a search for that name, always, except if there is a forward slash, a directory separator, in it; then it opens it directly. So what if you create a library with a shared object name that contains a forward slash? That is the trick, and it is actually quite popular on macOS.
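The loader rule just described fits in a few lines. This is a simplified model, not loader source code: `locate` is a hypothetical helper that returns the chosen path together with the number of failed probes that preceded it.

```python
import os


def locate(needed, search_dirs, exists=os.path.exists):
    """Mimic the loader rule: names with a slash are opened as-is, others are searched."""
    if "/" in needed:
        return needed, 0  # a path is used directly, no search at all
    for failed_probes, directory in enumerate(search_dirs):
        candidate = f"{directory}/{needed}"
        if exists(candidate):
            return candidate, failed_probes
    return None, len(search_dirs)


dirs = ["/store/a/lib", "/store/b/lib", "/store/zlib/lib"]
present = {"/store/zlib/lib/libz.so.1"}
print(locate("libz.so.1", dirs, present.__contains__))              # found after a search
print(locate("/store/zlib/lib/libz.so.1", dirs, present.__contains__))  # opened directly
```

When the shared object name recorded at link time is already a path, every dependency takes the first branch and the search loops disappear.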
15:43
It is just not very popular on Linux. So what you get is that any linker you would use can basically copy a path
16:01
directly into a DT_NEEDED entry. So that raises the question: can you just change shared object names? Generally yes, you can; they're mostly just a cache key anyway, not a very special field in a binary. There is some possibility of introspection
16:22
with dlinfo in C, but it is rarely used. I've only really seen it in Java, where they check what glibc version is used, for instance. But then in Spack we can say: okay, leave that soname alone
16:40
for that specific package. And that leads us to the current trick that Spack uses. We have an opt-in setting in Spack 0.19 that you can enable with this command, and basically what it does is this:
17:02
after something gets installed, we replace all the shared object names with the path where the library itself is located. What you get is not only better performance, because there is no search anymore, but also more stability, or hardening, because whatever you link to
17:22
is also what you get at runtime. There is no discrepancy anymore between the linker and the loader; they will always use the same libraries. It also works outside of Spack: if there are things installed with Spack and people start linking against them, they will automatically always use the Spack libraries, without having to set environment variables
17:41
or set RPATHs themselves. In some cases it does not work, because the trick happens a bit too late. If you build curl, curl links to libcurl, which is intra-package linking, and at that point libcurl's shared object name has not been replaced yet.
18:00
We do that pass post-install, so sometimes there may be some small issues. The last thing I want to say about this is how you replace shared object names. Currently we simply use patchelf. It is generally good, I would say, apart from its issue tracker.
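A post-install step of this kind could look roughly like the sketch below, which uses patchelf's `--set-soname` option to make a library's shared object name equal to its own absolute path. It is not Spack's actual code: the function name is hypothetical, the `run` parameter exists only so the command can be inspected without executing it, and patchelf must be installed for a real run.

```python
import subprocess


def set_soname_to_own_path(library_path, run=subprocess.run):
    """Rewrite a library's DT_SONAME so that it equals the library's absolute path."""
    # patchelf syntax: patchelf --set-soname <new soname> <file to modify>
    cmd = ["patchelf", "--set-soname", library_path, library_path]
    run(cmd, check=True)
    return cmd
```

Anything linked against the library afterwards gets the absolute path copied into its DT_NEEDED entry by the linker, which is what removes the runtime search.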
18:23
The issue tracker has multiple dozens of problems reported, but it generally works. There is one downside, though: it solves the stat storm problem, but at the same time it may
18:41
change the ELF files in non-trivial ways and create new load segments, so you end up with fewer stat calls but more mmap calls, for instance. So if we can avoid patchelf, that would actually be nice. And there is actually another trick.
19:00
Well, it is under consideration, it is an open pull request: basically, reserve some space in the dynamic section of the executables and libraries with a dummy RPATH, and then, in Python, with Spack,
19:20
we just move the shared object name into that placeholder space. Then we can basically update executables and libraries in place, and it doesn't require all the advanced patchelf logic. Okay, so with that solution, do we improve the startup time, or do we improve the Emacs
19:42
time to printing the version? The answer is pretty much yes: the system time goes down quite a bit, so that is good. But we still don't capture glibc. This is what the ldd output looks like: all absolute paths,
20:01
but not glibc; it is still searched for. And now we end up in a rather funny situation where basically everything that the dynamic loader opens or needs is found directly, except that it spends about 400 syscalls looking for glibc.
20:21
And the loader itself is part of glibc, so that feels a bit dumb. In musl libc that is actually not an issue at all, because they are quite smart about it: the loader is also the libc implementation, so they never have to locate libc. And that is also a reason why musl
20:42
binaries may actually start a little bit faster than glibc binaries. But now the last issue: if we make the paths to glibc absolute, or preload it, then we finally reduce the startup time to something reasonable,
21:01
and then the stat storm issue is solved. There are actually zero stat calls, and the openat calls are, well, significantly reduced. So basically, to answer the question: have we solved the stat storm in Spack?
21:21
Mostly. It would be easier if we also controlled libc, but we are not there yet. At the same time, it is definitely possible, and, well, for sure you get the second runtime for free, and if you push a little bit harder,
21:41
we could still make the path to glibc itself absolute, for instance, and then you get the proper performance. So here are some further links. There is also the whole discussion going on in Nix; their issue has been open since 2017, I think, where people reported this issue
22:03
of slow startup times, and lately there has been quite some discussion going on there too. They have the same issue: they want to support not only glibc but also musl, so it is interesting to read up on that too. And I will leave it at that, thank you.
22:29
Any questions for Harmen? Hello.
22:41
Hi, so I have a question. How do you load the prefixes of your software packages? Do you use a module system like Lmod or something like that? So we have multiple ways to actually use the software. You can generate modules. There is also a way,
23:01
which I like a little bit more, where you create an environment, you add all the packages that you need, and then you generate a view, which is a more classical directory structure that you get out of that,
23:21
where everything is merged. Because, for instance, in Lmod with modules you can swap modules on the fly, so that can be used by the user. So I am wondering: if you are using these absolute paths and one of my users decides to do a module swap
23:42
on the Open MPI library or something else, how is that handled? So one thing that you lose if you use absolute paths is the ability to use LD_LIBRARY_PATH, but you can use LD_PRELOAD. And to be honest, I am not sure why LD_PRELOAD is not used that much,
24:02
but LD_PRELOAD has the advantage that you can very specifically say: I want to use this library. Yeah, that is quite hard to set. Yeah, but it is also not very different... It is prevented everywhere. It is also not very different, in my opinion, from using LD_LIBRARY_PATH, but yeah.
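As a small illustration of the LD_PRELOAD route mentioned in the answer, the sketch below prepares an environment for a single child process instead of changing a global variable. The helper name and the library path in the example are hypothetical.

```python
import os


def env_with_preload(library, base_env=None):
    """Return a copy of the environment that preloads one specific library."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # Entries may be separated by colons; put the requested library first.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


env = env_with_preload("/opt/alt/lib/libmpi.so", base_env={"PATH": "/usr/bin"})
print(env["LD_PRELOAD"])
```

Passing the result as `env=` to `subprocess.run` limits the override to that one program.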
24:22
Thank you.