
Reaching "EPYC" Virtualization Performance


Formal Metadata

Title
Reaching "EPYC" Virtualization Performance
Subtitle
Case Study: Tuning VMs for Best Performance on AMD EPYC 7002/7004 Processor Series Based Servers
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Virtualization brings many advantages, but what about the overhead it introduces? What about performance? This talk will show how great virtualization performance can be achieved, if proper tuning is applied to all the components of the system: hypervisor, host and guests, for both Xen and KVM. As a case study, we will describe how we tuned our OS in order to be able to reach, inside VMs, close to bare-metal performance, on a server powered by a CPU from the AMD EPYC 7002 (codename "Rome") series. We will, of course, show the benchmarks proving that (run on KVM), even when memory encryption is used.

Virtualization is great because it decouples the software from the hardware on top of which it runs, and this brings benefits in terms of flexibility, security, reliability and cost savings. But what about the overhead that this, unavoidably, introduces? Well, often enough, a virtualized system is really able to fulfill its goals with an acceptable quality of service, efficient exploitation of HW resources, satisfactory user experience, etc., only if all the components are configured properly. This is not entirely new, as bare-metal systems need tuning too, but in a virtualized environment one has to take care of tuning both the host and the guests. And beware that the interactions between all the different components may not always be straightforward, especially on a large server with a complex CPU architecture, such as anything based on the AMD EPYC 7002 (codename "Rome") series of processors.

This talk will go over some of the typical virtualization "tuning tricks" (for both Xen and KVM). Then, as a case study, we will illustrate how we managed to reach, inside Virtual Machines, a performance level that almost matches the one of the host, on a server powered by a CPU from the AMD EPYC 7002 series. In fact, we will show the results of running CPU- and memory-intensive benchmarks (on KVM) with and without the suggested tuning. Last (but not least :-D), we will show the impact that the Secure Encrypted Virtualization (SEV) technology has on performance.
Transcript: English (auto-generated)
Right. Hi everyone, I am Dario. Thanks for staying until now, until the very last session of the day, at least in this Dev Room, and for coming to see this talk about maximizing the performance of virtual machines, or rather of workloads running inside virtual machines.
Which oftentimes gets simplified as: I do the vCPU pinning and then I'm done, am I not? Well, it's not false, that is the bulk of it actually, but there are a couple more things at least, in my opinion.
Let's start by having a very quick look at this class of processors, of CPUs, which are the CPUs from AMD, the so-called EPYC, or EPYC 2 because it's the second generation of this architecture family, the 7002 series. These are
multi-chip modules composed of nine dies, that's how they call these things. In every socket, one of these dies is dedicated to IO and off-chip communication.
So there are nine of these dies in every socket, and one is dedicated to IO, it's the IO die. The other eight are the actual compute dies. Then there is the concept of core complex, CCX, which is basically a set, a
combination of four cores, which also means eight threads, because these processors have hyper-threading. And, we will look into this in a little bit more detail later, each core complex has its own L1 to L3 cache hierarchy.
But yeah, we'll see about this later. Let's also introduce the concept of core complex die, or CCD, which is basically two CCXs, and so eight cores, 16 threads. This is the important thing about CCDs: each CCD
has a dedicated Infinity Fabric (that's the name of the technology) link to the IO die. Right, and so these are processors that can have up to 64 cores, which means 128 threads, and you can have them in two-socket arrangements.
And each socket has eight memory channels towards the memory. So yeah, these are links if you download the slides and navigate them to fetch more information, but you can easily find a lot more information on Google. So I said that this talk is going to be about tuning, virtualization, and some workloads, at least, running inside virtual machines.
I will say a few things, most of which are going to be general enough, but we will use a case study throughout the talk. And so I will speak about this effort, this work that we did together, us, SUSE, with AMD as a partner,
on coming up with a set of tuning advice for optimizing the performance of one of our SUSE
products, SUSE Linux Enterprise Server 15 SP1, which is pretty much the same as openSUSE Leap 15.1, on this class of AMD processors. So I will use this as a case study, that's important to say and to remember.
Right, so this is another way to look at that, one specific instance of the series of processors that I introduced before, the 7742. It's a big one, the biggest setup, the one that I said has 64 cores, which means 128 threads, and it comes in a two-socket configuration.
And this is the one that we used for this guide and the one that I'm going to refer to for this talk.
A close-up on one CCX, this is what I was saying. So each CCX has its own L3 cache, and of course each core also has a dedicated L1 and L2, as usual. But the fact that there is an L3 cache per CCX and, for example, not per NUMA node, it's a little bit...
Well, it's not weird, but it's something specific about these architectures, something which is at least very different than many other architectures that you find, at least in the x86 world.
And, yeah, tuning the performance basically means that, if you really want to try to achieve, inside VMs, performance that matches the one you would get on bare metal, then it also means static partitioning. You cannot avoid doing at least some of that.
And does it still make sense to speak about virtualization then, if we have to partition resources statically? Well, yes, according to me at least, especially on such a large platform because you can still use it for server consolidation because it's so huge that you can put a lot of VMs on it.
And then you have the argument about flexibility and high availability and other stuff. So what resources are we talking about partitioning? Well, all the relevant resources, CPU, memory, and IO, this talk will be focusing on CPU and memory.
IO, we'll leave it for another one. So the first kind of partitioning is going to be between host and guest or guests, meaning that you most of the time want to leave some of the resources, namely some CPUs and some memory
to the host because you have to connect with SSH or whatever to the host to do monitoring or management. And then oftentimes, depending on the configuration, but most of the time the host has to carry out some activity
on behalf of, or to help, let's say, the VMs, for example for doing IO, running the QEMU I/O threads, whatever. Recommendation: well, it depends on what your actual goals are.
One good rule of thumb is to leave at least one core per socket to host activities. Also, on this particular architecture, it will be better if you manage not to break, for example, a CCX (what we said before), because otherwise you will have the VMs, or some of the
VMs, and the host sharing L3 caches, which is generally not something that you want for good performance. If possible, you should also try to not break a CCD, but then that would mean leaving eight cores / 16 threads for the host, which you may or may not want to do.
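As a rough illustration (not the actual configuration from the talk; the VM name and host CPU numbers are placeholders), keeping QEMU's emulator and I/O threads on the CPUs reserved for the host can be expressed in a libvirt domain definition along these lines:

    <domain type='kvm'>
      <name>guest01</name>
      <vcpu placement='static'>8</vcpu>
      <iothreads>1</iothreads>
      <cputune>
        <!-- Hypothetical host CPU IDs: one core (both threads) per socket
             has been set aside for host activities, and the QEMU emulator
             and I/O threads are kept there, away from the vCPUs. -->
        <emulatorpin cpuset='0,128,64,192'/>
        <iothreadpin iothread='1' cpuset='0,128,64,192'/>
      </cputune>
      <!-- ... rest of the domain definition ... -->
    </domain>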
And how much memory to leave to the host, it really depends, let's say 50 gigabytes and be done with it. So another thing, huge pages, so whether or not to use huge pages and how to use them.
Typically, and this is one of the general things, this is really general about virtualization, not really specific about this platform. If possible, you always want to use huge pages for the virtual machine, but you don't want to use them in the transparent huge pages way. Let's say you want to pre-allocate the huge pages at boot time of the host and then use them for the backing of the memory of the VMs.
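A minimal sketch of the libvirt side of this (the guest name and memory size are placeholders; the 1 GiB pages themselves are typically pre-allocated on the host at boot, e.g. with kernel parameters like default_hugepagesz=1G hugepagesz=1G hugepages=<N>):

    <domain type='kvm'>
      <name>guest01</name>
      <memory unit='GiB'>32</memory>
      <memoryBacking>
        <!-- Back all of the guest's memory with pre-allocated 1 GiB huge pages,
             instead of relying on transparent huge pages. -->
        <hugepages>
          <page size='1' unit='G'/>
        </hugepages>
      </memoryBacking>
      <!-- ... rest of the domain definition ... -->
    </domain>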
And you don't want to have automatic NUMA balancing at the host level, because you are going to do static partitioning anyway.
In the guest, it depends, it depends on the workload that you run in the guest. It's not different from tuning a workload on bare metal from this point of view. Once you have tuned the host, then inside the VM you just treat the problem like you would on a bare-metal machine similar to the VM that you are focusing on. And one word about power management at the host level: of course, again, it depends.
In general, it's good to do at least some benchmarks limiting, for example, the deep sleep states and using 'performance' as the CPU frequency governor, because it will help you get a first set of results which are consistent and don't vary too much.
Then it depends whether this is okay for you and for your actual goals to keep these settings or if saving a little bit more of power is important.
And if it is, you have to reassess the tuning and rerun the benchmarks and so on and so forth, with the proper power management configuration that you want to have, let's say, in production. Then, as I said, pinning the vCPUs: we want to do that. And we do that, for example in libvirt, like this.
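The XML from the slide is not reproduced in this transcript; a minimal sketch of what such pinning looks like in libvirt is below (vCPU count and host CPU IDs are hypothetical, and would be chosen so that, as discussed next, they all stay within one CCD, or at least one CCX):

    <domain type='kvm'>
      <name>guest01</name>
      <vcpu placement='static'>4</vcpu>
      <cputune>
        <!-- Pin each vCPU 1:1 to a host thread; all of these host CPUs
             should belong to the same CCD/CCX. -->
        <vcpupin vcpu='0' cpuset='8'/>
        <vcpupin vcpu='1' cpuset='9'/>
        <vcpupin vcpu='2' cpuset='10'/>
        <vcpupin vcpu='3' cpuset='11'/>
      </cputune>
      <!-- ... rest of the domain definition ... -->
    </domain>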
And, if possible, I was already touching on this before, you want to pin the vCPUs of the VMs in such a way that you pin to the CCDs. Because in that way, you won't have two different VMs which have to share the bandwidth of the Infinity Fabric link from the CCD to the IO die.
This means that, if you do that, you will be able to configure up to either 14 or 16 VMs like that.
It depends on how many CPUs you leave to the host on an EPYC 2 platform like the one I showed at the beginning. And if it's not possible to pin at the CCD level, then you may consider pinning at the CCX level. Because, again, the VMs will share the bandwidth of the Infinity Fabric link to the IO die, but at least they don't share the L3 caches.
And at worst, at least pin to cores, and don't make VMs share cores, execute on sibling hyper-threads and also share L1 and L2 caches, unless you really want to ask for big trouble.
Memory placement, similar to vCPUs, but even simpler probably. Because if the VM that you want to use is big enough to take both the NUMA nodes, then you put half of the memory of the VM in one NUMA node and the other half on the other.
And then you also, I guess, yeah, I have this in the next slide. Sorry. And then in the other case, which is when the VM is not large enough to span both the NUMA nodes and it fits in just one of the NUMA nodes, then you put all its memory in that NUMA node, as simple as that.
And then, enlightenment; that's what I wanted to try to say before. If the VM spans both of the NUMA nodes, then yes, you put one half of its memory on one node and the other half on the other.
But you also have to provide to the VM a suitable and meaningful virtual topology, a virtual NUMA topology actually. If it doesn't span both nodes, you're fine: you just enforce that the memory stays on one NUMA node.
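Putting the two cases together, a sketch of the libvirt configuration for a big VM that spans both NUMA nodes might look like this (all names, sizes and CPU ranges are placeholders): half of the guest memory is bound to each host node, and a matching two-node virtual topology, together with an explicit CPU topology and model, is exposed to the guest.

    <domain type='kvm'>
      <name>guest01</name>
      <memory unit='GiB'>32</memory>
      <vcpu placement='static'>16</vcpu>
      <numatune>
        <!-- Bind each virtual NUMA cell to one host NUMA node. -->
        <memnode cellid='0' mode='strict' nodeset='0'/>
        <memnode cellid='1' mode='strict' nodeset='1'/>
      </numatune>
      <cpu mode='custom' match='exact'>
        <model fallback='forbid'>EPYC</model>
        <topology sockets='2' cores='4' threads='2'/>
        <!-- Virtual NUMA topology: two cells, each with half of the vCPUs
             and half of the memory. -->
        <numa>
          <cell id='0' cpus='0-7' memory='16' unit='GiB'/>
          <cell id='1' cpus='8-15' memory='16' unit='GiB'/>
        </numa>
      </cpu>
      <!-- ... rest of the domain definition ... -->
    </domain>

For the smaller case (a VM fitting in one node), a single <memory mode='strict' nodeset='0'/> element inside <numatune> is enough, and no virtual NUMA cells are needed.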
But you still have to provide, in both cases, a meaningful CPU topology, so virtual sockets, threads, cores, stuff like that, and also a good, let's say, CPU model. What does "good" mean? We will see in a few slides. Yeah, then Secure Encrypted Virtualization: AMD, in this series of processors, also provides
a feature which basically allows you to encrypt the memory of the virtual machines. And it's transparent to the VMs. It's very efficient. It's very cool. There are instructions to set it up; I'm not going to cover this in detail. And security, so the hardware vulnerabilities, which are well known these days.
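Just as a pointer, since the setup instructions are not covered in the talk: in libvirt a SEV guest is configured with a launchSecurity element, roughly as sketched below. The cbitpos and reducedPhysBits values here are illustrative; the real ones are reported by the host (for example via virsh domcapabilities), and the guest additionally needs suitable firmware and devices, which is out of scope here.

    <domain type='kvm'>
      <name>sev-guest01</name>
      <launchSecurity type='sev'>
        <!-- Illustrative values; take the real ones from the host's
             domain capabilities. Policy 0x0003 disallows debugging
             and key sharing. -->
        <cbitpos>47</cbitpos>
        <reducedPhysBits>1</reducedPhysBits>
        <policy>0x0003</policy>
      </launchSecurity>
      <!-- ... rest of the domain definition ... -->
    </domain>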
The good thing about this processor is that AMD processors in general, and this one in particular, are only vulnerable to a subset of them. And in particular, to the nastiest ones for virtualization this one is not vulnerable, so we are happy about that.
Benchmarks: the benchmarks that I ran. I said I wanted to focus on CPU and memory, so I will show results of running Stream, which is a memory benchmark. What I will show right now are the results of running Stream on bare metal, and then inside one or more VMs, so that we can compare the results.
And you can configure Stream. In our case, we used OpenMP for parallelization of the Stream jobs, and so we used a different number of threads, which was either 16 or 32 on bare metal.
In general, the rule of thumb, again, is to use as many threads as there are memory channels, but this is not information that is easily available via software, that you can easily figure out programmatically. So you can kind of approximate it by using one thread per LLC (per L3 cache, so per CCX). And this applies to both the host and the virtual machine.
So what do I have here? Here I have in purple bars the results of running Stream on bare metal, and then in green, Stream runs inside the VM without any kind of tuning, so the performance doesn't match, and you see it very well.
Then I applied a little bit of tuning, so the VM had a virtual topology, but there wasn't any pinning of CPUs or memory, and that is the light blue bar; again the performance doesn't match. And then, magic, you apply the tuning that I described, and you see in the last bar that now
the performance on bare metal and inside VMs, inside just one VM, basically matches. So that's what we wanted. And this is when running Stream just in single-thread mode. The same when using 32 threads for Stream. As you can see,
we are able to reach very good performance, because inside the VM we achieve pretty much the same level as on the host.
Here I use two VMs instead of one, so there are a lot of elements in the plot. The ones you want to focus on: again, the first one is bare metal; the red and black ones, since I'm now using two VMs, are the scores, the results, the performance that you get from Stream when run inside these two VMs.
So it's okay that it's lower, that it's less than bare metal, because now you have basically partitioned the CPUs in two, and you have assigned each part to a different VM.
And the important part, the nice part, is that, as you can see, the performance of the VM is quite consistent, because they are basically performing the same in all the four Stream operations. And then this last part is basically the sum of these two, which, again, pretty much matches bare metal, and so we are happy again.
Now, I mentioned Secure Encrypted Virtualization. I said that the memory of the VM is encrypted. What's the overhead that comes with that? As a matter of fact, at least for this benchmark, it's very, very low. On paper, you find that
it always stays within 3%. In these cases, at least as far as I can measure, it stayed within 1%. And yeah, another benchmark: this one is NPB, the NAS Parallel Benchmarks. It's a very CPU-intensive benchmark this time, which is what I said, memory and CPU.
It also uses a parallelization framework, it's Open MPI this time, not OpenMP. And this time lower is better; before, I probably forgot to say, higher was better. And yeah, the same stuff. Basically, first bar bare metal, last bar VM with tuning, and we want them to match and
to be very similar, and that's actually the case with tuning applied in all the various variants of the NAS PB benchmark. And again, I also benchmarked with encrypted virtualization enabled or disabled with this CPU
-intensive benchmark, and again, less than 1% performance impact, which is very good. Now, the CPU model. It is QEMU that builds the virtual CPU for the virtual machine:
what flags it has, how it is presented, what kind of virtual CPU QEMU presents to your virtual machine. In theory, if you want to achieve the best possible performance, you find in various pieces of documentation that you should use this thing, host passthrough.
But it depends, for example, on the version of your software, in this case QEMU or libvirt. As we said, this effort was about doing this tuning on these particular distributions, and as a matter of fact, the distribution went out before
the EPYC 7002 series processors were available. And so, if you use host passthrough, it turns out that in this particular case it doesn't do a good job, and the detail of why is here: the threads basically are not exposed correctly.
As a matter of fact, there is a CPU model called EPYC, which is there because it's the one that represents the previous generation of EPYC processors. And if you use that one (sorry, I pasted basically the same snippet, that's a typo, let's
call it like that; this value should have been two), it provides the VM a better virtual topology. And in fact, this is what happens if you use host passthrough: it's this one, so again, lower is better,
so tuning applied, but using host passthrough as the CPU model is very, very bad, because we want it to be here; using EPYC, it's here. Of course, if you use a more up-to-date distribution, a newer version of openSUSE or SLE or whatever other distribution, or just the code from upstream, you will find the EPYC 2 CPU model there, and you can use it.
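In libvirt terms, the two configurations being compared here look roughly like the sketch below (the topology values are placeholders; on newer QEMU/libvirt stacks a Rome-specific EPYC-Rome model is also available):

    <!-- What the documentation usually suggests: pass the host CPU through. -->
    <cpu mode='host-passthrough' check='none'/>

    <!-- What worked better in this particular case: the named EPYC model,
         plus an explicit topology so that the SMT threads are exposed correctly. -->
    <cpu mode='custom' match='exact'>
      <model fallback='forbid'>EPYC</model>
      <topology sockets='2' cores='4' threads='2'/>
    </cpu>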
I put this part in here because I wanted to stress the fact that yes, there is all this tuning advice, but you really should always double-check, because host passthrough was the natural choice, and it wasn't performing well.
Now, I have other Stream benchmarks, but I'd rather try to leave some time for questions, so, yeah, let's see. Basically, the conclusions are that achieving very good performance, even performance that actually matches that of the host,
inside either one or more VMs, is possible, at least for certain workloads, and it happens mostly via resource partitioning. If you use KVM, QEMU and libvirt in that particular product, even better if you use them
from upstream, you have all the tools and the capabilities to achieve this very good resource partitioning. We at SUSE also support Xen, and you can do pretty much the same with Xen. However, the performance won't be as good as this, because
Xen still lacks the capability of properly exposing the virtual topology to the guest. And the EPYC 2 platform turned out to be quite a good platform from this point of view, because these processors offer great scalability, they offer memory encryption with exceptionally low overhead, as we saw,
and because they are only affected by a subset of the vulnerability flaws related to speculative execution. So, with that, yeah, in the slides you will find a little bit more information about myself, and while taking questions, let me, as we
did this morning, say farewell one more time to my very good friend Lars Kurth, with this picture taken at FOSDEM a few years ago. Yeah, but really, questions.
Yeah. I see three hands, I guess. Sure. Sorry.
Ah, perfect. Yeah, I always forget about that. The question is whether any of the benchmarks that I showed were run in a scenario where a VM was spanning multiple NUMA nodes. So, when I showed these results, the ones for, no, these ones, one VM: okay, if you use one
VM, one very big VM, then, yeah, I have another, one slide that I didn't show, but let's use it for that. This was the VM that was using that benchmark.
So, it was spanning both of the NUMA nodes. It basically had pretty much as many vCPUs as there are pCPUs, with the exception of the ones that I decided to leave to the host. But it was spanning both the NUMA nodes, and so it had a virtual topology exposed to it.
The question now is about what was the huge page size chosen. It was one gigabyte. The other questions, let's go there.
So, this is, the question was about, since I said that if possible it's better to configure a VM so that it stays inside a CCD, inside a CCX, if it goes outside, stuff like that, whether I have numbers for that. Not yet. Again, this was in these slides that I decided to skip, but if you
see, this is an ongoing effort, we are running more benchmarks, continuing our evaluation, and so I haven't finished, but I am going over an investigation with multiple VMs, in cases where I actually follow my own recommendations,
and so I don't split CCDs and stuff, but also in cases where I violate them and I put VMs across CCDs. Just as a hint, when you start... this, for example, is a case where 6 VMs were used, and you
see that the absolute level of the performance is the correct one, if you do the math, this is fine,
but the performance is also actually quite consistent in this case, this case, this case, but you see some strange behavior here. And that's what typically, again, this is an ongoing investigation, so this is just a little bit of speculation, but what we are seeing is that when you start not respecting these recommendations and
pinning VMs in such a way that they share too many resources, then what happens is that you get this not-so-consistent behavior in the results. I have other graphs; here is another example where the recommendations were not really respected,
and you have performances which are not exactly the same in all VMs. Yeah, there were other questions, but I think we are out of time. We are, so I'm happy to answer, I mean.
It's just the last presentation, so. I mean, I can. It's not recorded, but if you want, then. I mean, I'm fine. Go ahead. I'm good with that. Sure. First of all, thank you for your talk, I very much enjoyed it. Thank you. I was wondering, has this been implemented in, for example, OpenStack?
Well, I guess I repeat the question as well. The question was about whether these optimizations are implemented in OpenStack or similar software. I have no idea. I have never played with OpenStack, and I don't plan to in the foreseeable future, to be honest. I am aware of very few efforts and very few capabilities similar to the ones that you are describing.
So doing resource partitioning and optimization at this level automatically, either in OpenStack or in many other software.
There are solutions, but achieving this level of detail in the tuning is quite hard, because of various reasons, after all. I mean, I don't know myself, but then it's a matter of what interface you present to the user for letting him or her achieve this.
After all, it turns out to be rather similar to the XML itself, because it's a very detailed level. So I'm not saying it's not possible. I would really hope that the situation was better, but I'm not aware of anything that reaches this level of detail.
Yeah, go ahead. Before that, any questions you have left? OK, bottles and stuff. Thank you. Bye. My question would be about the scenario you're explaining on the right side, where you have uneven behavior.
Which behavior, sorry, the last part? So that, for example, the operating system itself, at boot, tries to swap things around automatically?
Yes. I haven't monitored that part, but the fact is that, at least
according to me and to my experience running similar evaluations on other platforms, consistency of results like this is something which is quite good and that you don't find very often. But apparently, as soon as you mix things in a not necessarily super ideal way, these very nice properties start to fade away.
So, yeah, I haven't checked whether what you said also was happening in these cases.
But, yeah, I have a scenario with 30 VMs where I am violating the recommendations by using, basically, too many threads for Stream, if you count all of them running inside all the VMs. And if you look at the actual throughput that is achieved, it's actually quite good, but it's all unbalanced.
If you sum all of them up, it matches or even overcomes what you achieve on the host. But then it's all like that, ups and downs.