Automatic CPU and NUMA pinning
Formal Metadata
Title: Automatic CPU and NUMA pinning
Series: FOSDEM 2022
Number of parts: 287
License: CC Attribution 2.0 Belgium: You may use, alter, reproduce, distribute, and make the work or its content publicly available for any legal purpose, in altered or unaltered form, provided you credit the author/rights holder in the manner they specify.
Identifier: 10.5446/57013 (DOI)
FOSDEM 2022, talk 180 of 287
Transcript: English (auto-generated)
00:06
Hello everyone, I'm Van Vattenberg from VEDAT, and I will speak today about automatic CPU and NUMA pinning. It's a new feature in oVirt.
00:21
Three years ago at FOSDEM we introduced in oVirt a new VM type: high performance. It was useful for CPU-intensive workloads, especially SAP HANA VMs. This VM type automatically configured some VM properties for you.
00:47
Such as making the VM headless and dropping the USB controller, and more. But it wasn't complete: you still needed some manual modifications to get the real benefit of the high-performance type in terms of CPU.
01:14
A little bit about the CPU and its topology. You have the CPU, which is basically split into sockets — in this example, two sockets.
01:26
Within the sockets you have the cores, which are the processors. Each one of them can be split into threads. We don't deal with dies in oVirt. As for NUMA: NUMA is Non-Uniform Memory Access.
01:44
Each NUMA node has separate CPUs, a memory controller and memory, and I/O controllers and devices. Distance between nodes is measured in terms of locality.
02:00
Usually each NUMA node has one socket, and each CPU and core is basically assigned local memory to use. So if you configure the pinning right, the VM uses specific, local memory
02:22
that is physically close to the physical CPUs, and in terms of performance it's faster. Here's how we configure CPU pinning in oVirt.
02:41
So, it's a string that we specify in the VM edit configuration. It's pretty difficult to understand and difficult to write. You can limit virtual CPUs to one or more physical CPUs.
03:00
It basically reduces the movement of the vCPUs across physical processors. Here's an example of a CPU pinning string. As you can see, it's not very easy to read. It means, for example, that virtual CPU 0 is assigned to physical CPU 3, virtual CPU 2 is assigned to physical CPUs 1 or 2, and so on.
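To make the string format described here concrete, below is a small parser sketch. It assumes the oVirt-style syntax of "vcpu#pcpuset" pairs joined by underscores, with ranges allowed in the pCPU set (e.g. "0#3_2#1-2"); the function name is mine, and the real syntax also has extras (such as "^" exclusions) not handled here.

```python
# Sketch of a parser for an oVirt-style CPU pinning string such as "0#3_2#1-2",
# meaning: vCPU 0 pinned to pCPU 3, vCPU 2 pinned to pCPUs 1-2.
# Assumed syntax: "<vcpu>#<pcpuset>" pairs joined by "_", where a pcpuset may
# contain single CPUs, ranges ("1-2"), and comma-separated parts.

def parse_pinning(pinning: str) -> dict[int, set[int]]:
    mapping: dict[int, set[int]] = {}
    for pair in pinning.split("_"):
        vcpu, pcpus = pair.split("#")
        cpus: set[int] = set()
        for part in pcpus.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(part))
        mapping[int(vcpu)] = cpus
    return mapping
```

For the example in the talk, `parse_pinning("0#3_2#1-2")` yields vCPU 0 → {3} and vCPU 2 → {1, 2}.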
03:31
And there are limitations to using this method. It's a static configuration: once you edit it on the VM, it requires pinning the VM to the host.
03:46
These CPUs are shared. This means the physical CPU is not exclusive to the VM or virtual CPU you pin: other VMs and processes can run on the same physical CPUs.
04:06
And configuring meaningful pinning for a number of VMs on a host is a tedious task. As you can see, with a VM with many virtual CPUs and a host with many physical CPUs,
04:23
when you wish to pin it, maybe for one VM it's fine, but when you're doing it for multiple VMs it starts to be hard to do. For the high-performance VM there is a manual procedure, basically guidelines
04:43
for SAP HANA VMs, that defines the pinning. Here is an example of the manual pinning. You select a host and you get its topology. Once you have the CPU topology and the NUMA topology,
05:02
you change the VM CPU topology to fit this host topology. For example, if you had a host with one socket, three cores, and two threads, then you set your VM with one socket, two cores — which is one core less — and two threads.
05:26
This is the resize process, and the dropped core is basically left to the host itself. For a high-performance VM you also pin the I/O thread and the emulator, usually to the first core.
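The resize rule described here — copy the host's sockets and threads, drop one core per socket for the host — can be sketched as follows. The `Topology` type and function name are mine, not oVirt's; this is a minimal sketch of the stated guideline, not the engine's actual implementation.

```python
# Sketch of the manual "resize" step: grow the VM topology to match the host
# (same sockets and threads per core), but drop one core per socket so the
# host's own processes keep a core.

from dataclasses import dataclass

@dataclass
class Topology:
    sockets: int
    cores: int    # cores per socket
    threads: int  # threads per core

def resize(host: Topology) -> Topology:
    # Keep at least one core so very small hosts still yield a usable VM.
    return Topology(host.sockets, max(host.cores - 1, 1), host.threads)
```

With the talk's example of a host with one socket, three cores, and two threads, this yields a VM with one socket, two cores, and two threads.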
05:47
This is the idea behind it. You change the virtual NUMA to fit the host's physical NUMA in terms of node count, and then you run the script on the desired host.
06:04
It generates the CPU pinning string for you based on the NUMA nodes and the host topology. So it pins according to the socket, of course. And because it's a script, it only supports some topologies, not all of them.
06:27
Then you need to copy the output of the script — the CPU pinning string — into the VM configuration, and manually pin the virtual NUMA to the physical NUMA accordingly.
06:42
In oVirt 4.4 we introduced a new feature: CPU and NUMA auto-pinning. We assigned the CPUs based on the host topology. We had one policy, resize and pin, that resizes the VM topology and the VM NUMA nodes
07:05
based on the previous manual procedure for SAP HANA. It took effect on VM edit, which means that when you set it in the configuration and click OK, all the static fields — the CPU pinning, NUMA, and so on — are set.
07:25
And it did not change on VM start; the configuration stayed as it was set in the VM's static configuration. OK, a bit about the CPU pinning policy. So we introduced a new configuration: the VM CPU pin policy.
07:47
The resize and pin option does, as I said, the manual procedure automatically. We extended the support; for example, the script only supported a full-thread topology on the host,
08:04
and now we support a one-thread topology as well. In the future, we plan to make it generic for any number of threads. And we have the same limitation that you need to pin the VM to one or more hosts.
08:28
As well, we introduced another policy, called pin. The pin policy did not change the topology — it didn't do the resize part. It means you keep the CPU topology the VM was configured with,
08:46
and the algorithm runs against the host and basically gets you the best pinning it can with the current CPU topology.
09:01
Finally, one major flaw is that we could use the same physical CPUs on the host for multiple VMs. For example, when you run two VMs, and the host has two sockets, and your high-performance VMs were supposed to use one socket each,
09:23
both would use the same socket, which is not good, leaving the second socket free without any high-performance VM. At the moment, we are discussing an alternative solution to add this policy back.
09:44
It could use the feature of dedicated CPUs, which I will talk about a bit later, or use shared CPUs as well, like resize and pin, changing the algorithm
10:01
to decide which physical CPUs are free to use and not reuse the same ones. Here is the view in the UI, where it can be easily configured: you just edit the VM, go into the Resource Allocation tab, and set the CPU pinning policy to resize and pin NUMA.
10:29
Also, the API is pretty simple: you just need to provide the CPU pinning policy set to the desired policy.
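The API call mentioned here might look like the sketch below. This is hedged: the `cpu_pinning_policy` element and the `resize_and_pin_numa` value follow my reading of the oVirt 4.5 REST API model, and the helper function is mine — check the official API reference before relying on the exact names.

```python
# Hedged sketch of updating a VM's pinning policy via the oVirt REST API.
# The endpoint path and the "cpu_pinning_policy" element are assumptions based
# on the oVirt 4.5 API model; verify against the official API reference.

def build_update_body(policy: str) -> str:
    """Build the XML body for a PUT to /ovirt-engine/api/vms/{vm_id}."""
    return f"<vm><cpu_pinning_policy>{policy}</cpu_pinning_policy></vm>"

# Example payload requesting the resize-and-pin-NUMA policy:
body = build_update_body("resize_and_pin_numa")
```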
10:43
Here is an example of how we do the resizing part. In this example we have a host, which you will see in the next slide, with two sockets, three cores, and one thread.
11:02
We have the initial VM with one socket, one core, one thread. We just increase the VM topology to have two sockets, as the host has, and two cores instead of three — we drop one.
11:23
And one thread as well. We also set the VM with two virtual NUMA nodes, like the host has. Here is how the pinning itself is done afterwards.
11:45
After we have increased and resized the topology, we now pin it. As you can see, we leave cores 0 and 3 — the first core in each socket — to the host, and we pin each core to a physical core accordingly.
12:01
Core 0 in the VM goes to core 1, core 1 to core 2 — socket 0 to socket 0, basically. And the same for the second socket, socket 1. We also pin the NUMA: virtual NUMA 0 to physical NUMA 0.
12:20
The CPU pinning, in this simple example, is 0 to 1, 1 to 2, and so on. This is a high-level example, a pretty simple one. Once you add threads and so on, it becomes a bit more complicated and pretty long.
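The pinning step just described, for the single-threaded two-socket example, can be sketched like this. The function and its output format are my illustration of the described behavior (first core of each socket skipped, remaining cores pinned in order), not the engine's actual code.

```python
# Sketch of the pinning step for the example above: a host with two sockets
# and three cores per socket (pCPUs 0-2 and 3-5). The first core of each
# socket is left to the host; the remaining cores are pinned in order,
# producing an oVirt-style pinning string.

def pin_vcpus(host_sockets: int, host_cores: int) -> str:
    """Return a pinning string such as '0#1_1#2_2#4_3#5'."""
    pairs = []
    vcpu = 0
    for socket in range(host_sockets):
        first_pcpu = socket * host_cores
        # Skip the socket's first core, pin the rest in order.
        for pcpu in range(first_pcpu + 1, first_pcpu + host_cores):
            pairs.append(f"{vcpu}#{pcpu}")
            vcpu += 1
    return "_".join(pairs)
```

For the talk's example, `pin_vcpus(2, 3)` gives "0#1_1#2_2#4_3#5": vCPU 0 to pCPU 1, vCPU 1 to pCPU 2, and the second socket's vCPUs to pCPUs 4 and 5.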
12:45
This ensures that the virtual CPUs use a virtual NUMA node that is pinned to the right physical CPUs and the right physical NUMA node,
13:00
basically getting closer to bare metal in terms of CPU, because the virtual side maps to the physical side and stays in the same local place — that's the idea. And while doing so, we also fixed the incorrect splitting of the vCPUs into the vNUMA nodes.
13:27
oVirt generates the CPU set for each NUMA node behind the scenes, and with the previous algorithm a core could be divided into two different NUMA nodes.
13:41
That causes a problem: within the guest, you could get a CPU topology that is not the one you actually set in the VM configuration. And that's not what we want in terms of performance. As I said earlier, in terms of NUMA, you want the threads to stay in the same core,
14:11
and the same core to stay on the same NUMA node. Here is an example of how it was done by the previous algorithm.
14:21
You can see that we had an eight-CPU VM with one socket, four cores, and two threads. And we had three virtual NUMA nodes. So the previous algorithm just took the number of CPUs,
14:42
divided it by the virtual NUMA count — eight divided by three — and once there was a remainder, it just added one virtual CPU to each NUMA node until no remainder was left.
15:06
Now the algorithm tries to build the NUMA nodes correctly, grouping the threads into their cores, to get the same CPU core into the same NUMA node instead of splitting it.
15:25
It is better performance-wise, and also better in terms of not misleading the underlying OS in the guest, and you get the topologies that are expected.
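The two splits contrasted above can be sketched for the eight-vCPU example (one socket, four cores, two threads, three vNUMA nodes). The old division spreads the remainder one vCPU at a time, which can cut a core's two threads across nodes; the new one distributes whole cores. The round-robin distribution strategy in `grouped_split` is my simplification — the key invariant from the talk is only that a core is never split.

```python
# Sketch contrasting the old and new vCPU-to-vNUMA splits described above,
# for an 8-vCPU VM with 2 threads per core and 3 virtual NUMA nodes.

def naive_split(vcpus: int, numa_nodes: int) -> list[list[int]]:
    """Old behaviour: divide the vCPU count by the vNUMA count and spread the
    remainder one vCPU at a time, so a core's threads can land in two nodes."""
    base, rem = divmod(vcpus, numa_nodes)
    out, start = [], 0
    for node in range(numa_nodes):
        size = base + (1 if node < rem else 0)
        out.append(list(range(start, start + size)))
        start += size
    return out

def grouped_split(vcpus: int, threads: int, numa_nodes: int) -> list[list[int]]:
    """New behaviour: distribute whole cores (groups of `threads` vCPUs),
    so a core is never split across vNUMA nodes."""
    cores = [list(range(c, c + threads)) for c in range(0, vcpus, threads)]
    out: list[list[int]] = [[] for _ in range(numa_nodes)]
    for i, core in enumerate(cores):
        out[i % numa_nodes].extend(core)  # simple round-robin over nodes
    return out
```

With 8 vCPUs and 3 nodes, `naive_split` yields [[0, 1, 2], [3, 4, 5], [6, 7]] — the core holding threads 2 and 3 is split between nodes 0 and 1 — while `grouped_split(8, 2, 3)` keeps every thread pair together.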
15:42
And in oVirt 4.5, a new feature is coming, called dedicated CPUs, and all of this pinning work was required for it. The new policy makes the CPU pinning exclusive: each vCPU gets exclusiveness over a physical CPU,
16:04
and other vCPUs won't be able to use it. Which means that each VM's vCPUs get their own physical CPUs, and other VMs won't be able to use
16:21
the same physical CPUs. Other processes and anything else running on the host are still able to use them, but not our VMs. The effort was to make the CPU pinning policy similar to that of OpenStack.
16:44
And based on that, it requires CPU assignment at runtime, and here we get a little bit of a chicken-and-egg problem. The old resize and pin flow was fairly simple:
17:01
you just pin the VM to the desired host and select the policy, and once you click OK, the engine sets the CPU topology, the CPU pinning string, the NUMA pinning — all the pinning itself — into the static configuration of that VM.
17:24
And now, in the new resize and pin flow, we select the policy for the VM and we don't set anything. We run the VM and the engine selects a host for us,
17:47
and only then is the pinning set. So this comes back to the chicken-and-egg problem. We do all the validations and resource handling based on the static configuration, but in this flow,
18:05
as we wish to do it in the run phase, there is no static configuration for that. We don't have anything, and we haven't chosen a host yet. So we had a problem: we need to calculate the intended configuration and save
18:28
it in a special place in order to validate and schedule the VM on a host, and only then do we know what we're currently using.
18:42
Once the VM goes down, we need to reset it. And of course, we drop the limitation here: we no longer need the VM to be pinned to a host. So we basically ended up setting these as dynamic properties that can be changed.
19:02
And we check them in the run phase and calculate what's needed in order to make things work and be aligned with dedicated CPUs. So, what is next and what is left? Of course, the pin policy, which is under discussion.
19:21
And the huge pages configuration, which completes the high-performance configuration. It's a problematic configuration because we don't know the user's requirement for the huge page size, and it requires preparation to have enough huge pages of the right size set on the host.
19:49
Otherwise, running the VM can fail, which we don't want to happen. And one-gigabyte huge pages can affect the migration flow in the convergence part.
20:02
So, currently, we don't do it automatically. Here are links for the dedicated CPUs policy and for the hosts and high-performance VMs. And thank you all — I'm ready for any questions.
20:21
Thank you.
21:10
Can auto-pinned VMs migrate? They may require the same hardware.
21:21
Previously, if you used the auto flow in 4.4, it might be a problem, because it needs the same hardware between those hosts. But now, with the run phase, migration actually can work, because the pinning is recalculated when the VM is starting.
21:54
And I will repeat the questions and answers. Can auto-pinned VMs migrate? Will they be pinned on the destination host?
22:02
And what happens if the NUMA topology doesn't match the destination host? As my presentation is about high-performance workloads: the auto pinning uses shared CPUs,
22:26
and it consumes basically all of the host's physical CPU hardware. So if you use a VM with a heavy, CPU-intensive workload, it will consume your host.
22:44
So running more VMs will cause less effective performance. For SAP HANA, it's basically recommended to run one such VM on the host. I think this is it, mostly.
23:05
You don't even need to pin the VM to a host now with the current implementation.
24:11
Is the auto-pinning feature oVirt-specific? Or will it be available on a plain Linux distro with QEMU or KVM, for instance?
24:21
Yes, it's oVirt-specific. We do all the logic and the algorithm in oVirt, in the manager. So it is specific to oVirt.
24:50
Just to add, of course, you can do it manually when you configure your VM
25:03
and run the commands.
26:21
Okay, I guess there are no more questions and the time is up. So, thank you all for joining and listening. I hope it will be useful for you. And see you all.