
Graphing tools for scheduler tracing


Formal Metadata

Title: Graphing tools for scheduler tracing
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract:
Understanding scheduler behavior can be important for understanding application performance. In this talk, we present some tools that we have developed to help understand scheduling behavior on highly multicore machines. The tools to be presented enable 1) obtaining a graph showing what tasks are running on what cores, with a variety of coloring schemes, 2) detecting overload situations, and 3) stepping through a recorded execution. All tools rely on traces collected using trace-cmd. The tools make it possible to get an overview of the execution, as well as to study specific execution intervals.
Transcript: English (auto-generated)
Julia is going to be talking about graphing tools for scheduler tracing. Okay. Do you hear me? Okay, so thank you. So I'm going to be talking about graphing tools for scheduler tracing.
So I'll start out, like actually someone started out yesterday, with what is a task scheduler. So for me a task scheduler is like an important part of the Linux kernel. It does two things: it places tasks on cores, when they are either created with fork, when they wake up, or when there's load balancing; and when a core becomes empty, it decides which task should run next.
Basically I'm interested in the Linux kernel files core.c and fair.c, so CFS. I'm not at all interested in the second point; I'm interested in where tasks get placed when they wake up. So that's going to be the entire focus of this talk.
So next question is, how does a scheduler impact application performance? Basically a scheduler is confronted with a bunch of tasks and a bunch of cores, and it needs to make decisions, where is it going to put these tasks on these cores? And so sometimes if you make a bad decision, it can have a short-term impact,
maybe some task has to wait a little bit extra time, but this impact can kind of, there's kind of a domino effect, you do one bad decision and then other bad decisions follow from that. So one issue that one might be concerned with is what's called work conservation.
So we have a machine that has four cores, we have a task that wakes up, T1, where should we put it? So based on the information we have right now, we've just got an empty machine and a random task, maybe we have no idea, we'll just put it on core zero. Now another task wakes up, what should we do with T2?
So kind of intuitively it might be good to put T2 on core one, core two, or core three, because they're not doing anything at the moment. Putting it on core zero would perhaps be a poor choice, because it would have to wait for task one to finish. So that seems completely obvious to a human looking at boxes on the screen, but the scheduler is going to have to hunt around to find those empty cores, and actually CFS is not work conserving. The basic principle of work conservation is that no core should be overloaded while any core is idle; if a task has overloaded a core, it should have been put on an idle core instead. Another issue is locality, so instead of having just four random cores,
we may have a multi-socket machine. We've got cores zero and one which are together on one socket, cores two and three are together on a socket. We have T1 is on core zero, where should we put T2? So we have those three idle cores, but maybe core one would be a better choice
if either T2 has already allocated all of its data on the first socket or if T2 wants to discuss things with T1. If we put it on two or three, things might get slower. So basically you can see that there's a lot of potential for things to go wrong,
so we need to understand what the scheduler is actually doing, but the problem is the scheduler is buried down there in the OS. When you're in the application, you don't really know where your tasks are running. So we want to consider how we can see what the scheduler is doing. Fortunately there are some tools available; the most helpful one, I would say, is trace-cmd.
So trace-cmd allows you to trace all different kinds of kernel events. Basically it's a front end on ftrace, but in particular it lets you trace scheduling events, so that's the part we're interested in. So you can see a trace here, and if you get this trace, it will have basically all the information you need to solve all of your scheduling problems.
On the other hand, it unfortunately has all the information you need to solve all of your scheduling problems. That is, it's ordered, it's a sequential thing, it's ordered according to time. If your application runs for a certain amount of time, you'll end up with a huge file. And you can see even in this little tiny trace that I'm showing,
the activities on different cores are mixed up. We have core 26 and core 62 here. And so in practice, it can get very hard to actually sort out what's going on, who's doing what, and so on. And so the next tool, which is very helpful, this one is called KernelShark. So this gives you a graphical interface that lets you see what's going on on your different cores.
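As an aside, the per-core interleaving in the raw trace text can be untangled a little with a few lines of scripting. This is a minimal illustrative sketch, not one of the tools presented in the talk, and the line layout it parses is an assumption based on typical `trace-cmd report` output:

```python
import re
from collections import defaultdict

# Assumed (typical) report line shape:
#   ua-12345 [026] 1234.567890: sched_switch: prev_comm=ua ... next_comm=...
LINE = re.compile(r"\[(\d+)\]\s+([\d.]+):\s+(\w+):")

def events_by_core(report_lines):
    """Group (timestamp, event-name) pairs by the core they occurred on."""
    per_core = defaultdict(list)
    for line in report_lines:
        m = LINE.search(line)
        if m:
            core, ts, event = int(m.group(1)), float(m.group(2)), m.group(3)
            per_core[core].append((ts, event))
    return per_core

sample = [
    "ua-101 [026] 10.000001: sched_switch: prev_comm=ua next_comm=swapper",
    "ua-102 [062] 10.000005: sched_wakeup: comm=ua pid=103",
]
print(sorted(events_by_core(sample)))  # -> [26, 62]
```

Grouping by the bracketed core number is essentially the first step any per-core visualization has to do before it can draw anything.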
And it also gives you that same textual output at the bottom. You can kind of correlate them to each other, you can zoom in quite easily, and so on. On the other hand, in my personal application where I'm interested in actually very large machines,
KernelShark can be a bit difficult to use. It's great if you want to really zoom in on a specific problem. It's not so great if you actually don't really know where your problem is and you want to see somehow an overview with everything at once. Here I'm only showing two cores. You can see that the display is kind of spread out.
It's going to be hard to get 128 cores to fit on your screen and be reasonably understandable. So what we would like is some way of understanding what's going on on big machines. So the thing I put the emphasis on previously is that we want to see what's going on on all the cores at once.
Something that I've also found extremely useful in practice is to be able to collect traces, collect these images, share them with my colleagues, put them in papers, put them in slides, and so on.
So I found it useful to collect lots of traces, compare them, store them, look at them later, and so on. On the other hand, I have at least for the moment completely abandoned this nice feature of KernelShark, which is that you can zoom in or zoom out and find out exactly what you want to see at what time.
My proposed approach that I'm going to present in this talk is completely uninteractive. So you run a command, you get a picture, you look at your picture, and you run another command, you get another picture, and you look at that picture. So actually in the last few years, I've made lots and lots of tools that start out with trace-cmd input
and visualize it in different ways. Sort of the ones that have stood the test of time are the ones I'm going to present, which are dat2graph and running-waiting. Their names are not super imaginative, perhaps. Dat2graph takes a .dat file, so that's what trace-cmd produces, and it makes a graph for you.
So basically it's going to show you we have the x-axis and the y-axis. The x-axis is the time, and then on the y-axis we have our cores, and we see what's running on each core at each time. So kind of like what KernelShark shows you, but in much more compressed format. And running-waiting is just a line graph. It shows you how many tasks are running at a particular time
and how many tasks are waiting on a run queue and are not able to run. So we'll see how that's used. So the rest of this talk I'm going to present these two tools, and I'm going to be motivated by this patch that I submitted a few years ago. I'm not going to discuss the patch in detail now.
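What dat2graph draws can be approximated by folding time-ordered context switches into per-core execution intervals. The sketch below is an illustration of that idea under an assumed event shape, not the actual dat2graph implementation:

```python
def execution_intervals(switch_events):
    """switch_events: time-ordered (time, core, next_task) tuples, one per
    context switch.  Returns (core, task, start, end) intervals, i.e. the
    horizontal bars a per-core timeline graph would draw."""
    current = {}    # core -> (task currently running, when it started)
    intervals = []
    for t, core, next_task in switch_events:
        if core in current:
            task, start = current[core]
            intervals.append((core, task, start, t))
        current[core] = (next_task, t)
    return intervals

events = [(0.0, 0, "ua-1"), (0.0, 1, "ua-2"), (1.5, 0, "kworker"), (1.6, 0, "ua-1")]
print(execution_intervals(events))
# -> [(0, 'ua-1', 0.0, 1.5), (0, 'kworker', 1.5, 1.6)]
```

Each returned tuple corresponds to one bar on the per-core timeline; a short kworker interval is exactly the kind of "little black line" discussed later in the talk.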
We'll see it later after we've seen all the examples. The application I'm going to be interested in is part of the NAS parallel benchmarks. These are a bunch of...you can read what it says. It's small kernels having to do with HPC kind of things. We're going to focus on the UA benchmark.
It does something. What's important for our purposes is that it has n tasks, and they're running on n cores. And so they kind of run...at least superficially they seem to run all the time. You would expect that they would just choose their cores, stay on their cores, and run on those cores forever.
So you would expect this benchmark to be completely uninteresting from a scheduling point of view. So if we take this benchmark and run it a few times, so I ran it 10 times and sorted the runs by increasing run time, you can see that something is going on, because there's kind of these runs on the left-hand side here
which start out around 20 seconds, and there's a definite gap here. I mean, it gets a bit longer, a small amount, but there's a definite gap here, and then it jumps up to closer to 30 seconds. So maybe we have 40% overhead between the fastest one and the slowest one.
It's only 10 runs, but it's quite a lot of variation for a benchmark that we expect to just run like this and not do anything interesting at all. So we can ask why there is so much variation. So now we can actually look and see what's going on at the scheduling level. So these are the graphs. As I said, we have the time on the x-axis,
and we have what's going on on the different cores on the y-axis. It says socket order because, on my machine, the core numbers actually go kind of round-robin between the different sockets, but I have organized it so that we have the first socket at the bottom, the second socket kind of in the middle, and so on.
It's not very important at this point, though. So I don't know. We have a graph, and we see what it's doing. So this is the fastest one. It looks kind of like what we expected. The thing's not moving around. Not much is happening. This is a much slower run. So this previous one was 22 seconds.
This next one is 28 seconds, so that's kind of a big overhead. And here we can see that things are not going as well at all because, in particular, over here in this region, we have these white spaces. And white spaces means that nothing is happening on that core.
So there could be two reasons why nothing is happening. One of them is that there's nothing to do. So maybe one of these tasks has gotten way ahead of the other one, and so it needs to sleep and to wait for the others to finish what they want to do. The more unpleasant reason that nothing is happening is because several of these tasks can be stuck on the same core,
and they're going to have to bounce back and forth between each other, and actually we have a work conservation problem; some of the cores are idle. So we can see which case we're in by looking at the running-waiting graph. So here, again, this time we have the number of tasks on the y-axis,
but we have n tasks on n cores, so it's the same. At the top we have a dotted line, which is the number of cores on the machine. And then the green line is the number of tasks that are running, so it's kind of like all the tasks are running all the time, but not exactly; there are times when only very few tasks are running down here.
And then over here, this is the place where we had the gaps on the other graph. And here, often we have almost all the tasks running, but not quite. And we have these red lines here, and red lines mean tasks that are waiting.
So we're in an overload situation: some tasks have been placed on the same core as each other, and so they have to wait for each other to finish. So this is kind of more of a problem for this kind of application. So basically we have two problems: we have tasks that are moving around, and we have some cores that are overloaded, so the tasks don't get to run as much as they ought to.
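The running-waiting curves can be reconstructed by replaying wakeup and sleep events against per-core run-queue counters. This is a hedged sketch with an assumed event shape, not the real running-waiting implementation:

```python
def running_waiting(events, ncores):
    """Replay scheduler events to count running vs. waiting tasks.

    events: (time, kind, core) tuples, where kind is 'wakeup' (a task
    becomes runnable on that core's queue) or 'sleep' (the task running
    on that core blocks).  Each core runs at most one task at a time;
    any extra runnable tasks on its queue are waiting."""
    queue = [0] * ncores          # runnable tasks per core
    samples = []
    for t, kind, core in sorted(events):
        if kind == "wakeup":
            queue[core] += 1
        elif kind == "sleep" and queue[core] > 0:
            queue[core] -= 1
        running = sum(1 for q in queue if q > 0)      # one runs per busy core
        waiting = sum(q - 1 for q in queue if q > 1)  # the rest must wait
        samples.append((t, running, waiting))
    return samples

# Two tasks stacked on core 0 of a 2-core machine: one runs, one waits,
# even though core 1 is idle, which is the overload the red lines indicate.
print(running_waiting([(0.0, "wakeup", 0), (0.1, "wakeup", 0)], 2)[-1])
# -> (0.1, 1, 1)
```

A nonzero waiting count while some queue is empty is precisely a work-conservation violation: a core is overloaded while another core is idle.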
So now what we're going to do is we're going to zoom into some of these situations and see what the problem could be. So here's the first one of these situations. If you look over here, basically around three seconds, at this point that I've circled,
you can see we have an orange task and then a blue task. And so something is happening to cause one core to change to another one. And if you look up a bit more, there's some other situations where that happens kind of all in the same area. So we can look into that in more detail if we zoom in a bit.
So in here I have the command line that you have to write. This socket order option is to get the cores ordered in a certain way. Min and max say we want to go from three seconds to 3.2 seconds. Target UA is going to give our application special colors,
and other things that happen are going to be black. So then we can see other, if there's some other strange things that are happening on the machine. So now that we have zoomed in at this level, we can see that things actually are not as nice as they looked when we were in the zoomed out situation. So here we have like everybody, almost everybody has stopped for a little gap here.
And then here, this is basically the fourth socket. There's a lot of unfortunate things happening up here. So we can zoom in a bit more. So now I've zoomed in just on this big vertical line here. And when we zoom in a bit more,
then we start to see that there are some other things going on on our machine. So there are the colored lines, and then we have some little black lines. So we can try to find out what the little black lines are. So dat2graph has another option, color-by-command, where the colors are chosen not by the PID, but by what the command is. So mostly we have our command, UA, which is blue, but we have some daemons here. These are kind of inevitable; the kernel needs to do some things. And so basically, if we jump back here,
we can see that if we look, for example, in this place, our task is running along, a daemon comes, and then it interrupts our task. So our task is not going to be working, but at least our task is staying in the same place. And so nothing extremely horrible happens, but these things get a bit unsynchronized; some of them get slowed down and so on.
So that's one kind of slowdown that we can have. But in principle, it shouldn't have a huge long-term impact. So now we can move a bit further off to the right. We can see there are some more of these little black things here.
Here we have an orange task, here we have a black line, and here we have another orange task up here that happens at sort of the same time; it's a little bit off to the right, it doesn't start at exactly the same time. So what is happening here is load balancing. And so the kernel thinks, okay, there are two things going on here.
We should find one of these many idle cores up here and use one of them to put this task. But that's actually quite a poor choice in this case because basically in this application, we have all the sockets being filled up with all of their tasks. And so by doing this load balancing, we have put an extra task up there on the fourth socket.
And that's something we will come to regret later, one might say, even though it seems okay for the moment. So what this leads to, as I just said, is a cascade of migrations. We put something on that core.
Someone else is going to wake up for that core. It will have to go somewhere else. And that other place, someone's going to wake up for that, and so on. So then the third situation, this is actually in the same position. We see another situation over here. Here's another case where we are changing from one task to another one.
But this time there's no little black dot which is motivating this change somehow. Nothing strange seems to be happening. It just seems to be happening spontaneously by itself. So we can look again at the running-waiting graph to see what's happening. It's not super easy to see. But basically what's happening is we have a green line
which is below the number of cores, and we have a red line that's just above it. And again, we have an overload situation. So there's one thread, which is actually this orange one here: the blue and the orange tasks here are sharing the same core. And so they're going to have to bounce back and forth between each other.
So we can try to look and see how did we end up with this situation. So this here, this is a graph that I made by hand more or less. This is just focusing on the two specific cores that we're interested in. Here we have this orange task. It's running along. It prefers to be on this core number 111.
It then goes to sleep. And then after some time we move along over here. At this point it wakes up. And we want to figure out what is waking it up; actually it's the task on core number 68 that is going to wake it up. And so we need to find a place to put it.
So the obvious place to put it would be on core 111. That's where it was before. The important thing is that core is actually not doing anything. But that's not what happens. What happens is it gets placed on core number 68. It gets placed on the core of the parent as opposed to the core where it was running before.
So this seems very surprising. We expect that we prefer to put tasks where they can run immediately. Why does it, for no particular reason, get moved off? So the key to the situation is this mysterious little dot here.
It's a kworker that woke up and took advantage of this empty space so it could run immediately. At the time, this is like Linux 5.10 that all of these graphs come from, basically there's a decision whenever a task wakes up:
should it go on the socket where it was before, or should it go on the socket of the waker? And those are different sockets in this case. And the issue is that when you make that decision, you take into account the load average. And the load average is collected over time,
and then the old information gets decreased a bit over time. And so because we have this kworker here, the load average is not zero. And so this core is seen as being not completely idle, even though it is completely idle. And so when that situation arises,
then there's some kind of competition between the waker and the place where you were before. And for some reason, this core number 111 loses this competition in this situation, and so the kernel thinks that this core down here would be a better place for it, which in this case it definitely is not.
So that's where this comes in. There's a little patch. All it does is check whether the core where the task was previously is completely idle; if so, it just uses that core instead of considering other possibilities. So if we apply that patch, then here we have the pink lines here.
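The logic of the fix, as described in the talk, can be sketched as follows. The function name, the decayed-load model, and the tie-breaking rule here are illustrative assumptions, not the actual kernel code:

```python
def choose_core(prev_core, waker_core, prev_queue_len, decayed_load):
    """Sketch of the wakeup placement choice described in the talk.

    Without the fix, the previous core and the waker's core compete based
    on (decayed) load averages, so a kworker that ran briefly can leave a
    stale nonzero load on a core that is in fact idle, and the idle
    previous core can lose the comparison.  The fix short-circuits that:
    if the previous core is completely idle right now, just use it."""
    if prev_queue_len == 0:                  # the patch's check
        return prev_core
    # pre-patch style fallback: pick the side with the lower decayed load
    return min((decayed_load[prev_core], prev_core),
               (decayed_load[waker_core], waker_core))[1]

# Core 111 is idle, but a kworker left decayed load 0.2 behind; the load
# comparison alone would pick core 68, while the patched logic picks 111.
print(choose_core(111, 68, 0, {111: 0.2, 68: 0.0}))  # -> 111
```

The point is only the first test: a stale decayed load can make a genuinely idle previous core lose the load comparison, and checking for actual idleness first avoids that.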
So now we still have a slight increase. We still have our task moving around. It's not going to solve all the problems, but we don't have this big jump, which happens when the overload situation is introduced. And we can see the impact on another completely different application.
So this is a Java program. It's part of the DaCapo benchmark suite. And this patch causes tasks to kind of have a better chance of remaining where they were before. And on this benchmark, what happens after we have the patch is that all the tasks manage to stay on the same socket
because there's actually not that many of them that run at a time, and they fit there nicely. Previously they were tending to kind of move over the entire machine, and we have here more than 20 seconds between the fastest and slowest. Here we have a much more uniform running time, and obviously the running time is also much faster.
So it had multiple benefits. So in conclusion, if you want to understand what your scheduler is really doing, you have to actually look and see what your scheduler is really doing. Just seeing that the number, now it's faster, now it's slower, something like that, it doesn't really give you a good understanding of what's going on.
Different perspectives, we found, provide different kinds of information. The running-waiting graph is actually very coarse-grained, but it can sometimes show you that the problem is exactly in this region, because there's overload in this region. So we have our two tools: dat2graph, what's going on at what time,
and running-waiting, how much is happening at each point in time. In future work: these graphs are a little bit slow to generate because, at the moment, for technical reasons, we go through PostScript and then to PDF. So it would be nice to be able to generate them more quickly, to be a bit more interactive looking. And also, as I mentioned in the beginning,
I've made lots of tools. If these tools could become a bit more configurable, then maybe I wouldn't have to restart the implementation each time, and they would be more useful for other people. So everything is publicly available. So thank you.
We have time for one or two questions. Thanks for the talk. I have two questions, basically. Do you have a solution to visualize the latencies due to cache misses, for example,
after a migration? And the second one is, do you have a way to visualize when tasks are synchronizing on a mutex, for example, which can also bring some latencies? So no, we haven't done anything for cache misses. It could be interesting.
I mean, I have another tool which deals with events, and I think there's some way to add events to dat2graph, so then maybe you can see when different locking operations are happening. I mean, I definitely think that's useful. I don't think the support is as great as one might like at the moment,
but it's definitely an important thing. We have one more question. Hello, Julia. I was wondering, is there a way to show the CPU state at the time
you are plotting, because your graph is making the assumption that the CPU frequency or whatever is stable over time. It would be very interesting to know the physical state of the processor at the time we are plotting, because maybe the task is running on a faster...
The CPU frequency is higher on one core than on the other, and being able to visualize that this application is running on a fast or a slow CPU could be very interesting to know. Actually the tool does that, but the unfortunate thing, I didn't talk about it because you have to actually go and add a line,
a trace_printk line to your kernel to actually print out that information, because it doesn't exist anywhere in the kernel. So that's the only issue, but actually, once you print it out in the proper format, the tool does everything, and it can show you the frequencies, so you can see different colors for how fast each core is going.
You can also see the merged thing, where you have the frequency in one line, and you have the activity in the other line. Sorry, we're out of time. Thank you for the talk, thank you for the questions. Sorry we can't take all questions, but I'm sure you can find Julia later.