strace: fight for performance
Formal Metadata
Title: strace: fight for performance
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/47404 (DOI)
FOSDEM 2020 — 200 / 490
Transcript: English (auto-generated)
00:07
So, hello everyone, my name is Eugene Syromyatnikov, and I'm an strace developer. Today I'll present a little overview of some performance-impacting changes in strace.
00:25
So, as you all know, strace is a syscall tracer, and this talk focuses especially on the part mentioned prominently in strace's man page: that a traced process runs slowly.
00:42
Well, why is it so? It's because of the way strace traces processes. strace utilizes the ptrace infrastructure for tracing, and ptrace is a pretty old, generic debugging interface that provides a set of requests that allow manipulating tracees in pretty much whatever way possible.
01:08
It was originally conceived for debuggers, but it has since been repurposed for various tracers and jailers and all kinds of tooling for any kind of process manipulation and tracing.
01:24
Well, one aspect of ptrace is that all ptrace operations are synchronous and are performed on a stopped process. Another one is that the ptrace API actually abuses the standard Unix parent-child signal
01:40
link, together with the wait family of syscalls, in order to deliver ptrace event notifications. This mechanism is used to notify about all kinds of events, so the way strace works is that it waits in a loop.
02:00
When a wait event appears, it tries to figure out what this event is and performs various operations depending on what type of event it was: whether it was a syscall stop, whether it was a signal delivery to
02:20
a tracee, whether it was an actual wait event like a process exiting or being killed by a signal, and so on. What strace actually does during that is retrieve the syscall arguments or
02:44
its return code, and then, based on the syscall number, it runs the decoder, which also performs decoding-specific work such as reads from the tracee's memory.
03:00
After all that is done, the tracee is resumed, and this happens twice per syscall: on entering and on exiting. Since ptrace is a rather old mechanism, it actually has a lot of different requests
03:24
implemented for the same tasks, namely reading registers and memory. The worst part of that is reading registers. Originally there was the PTRACE_PEEKUSER interface, which allows reading one machine word at a time per call.
03:43
Then some new ptrace requests appeared, like PTRACE_GETREGS, which allows getting the architecture's registers, and after that things like PTRACE_GETFPREGS and other per-register-set requests for different kinds of architectures, until it was
04:05
decided that this solution is not scalable. So yet another ptrace request appeared, PTRACE_GETREGSET, and its NT_PRSTATUS register set returns the general-purpose registers, where the
04:28
syscall arguments usually reside, apart from x86, MIPS, and Itanium. But nevertheless, strace originally used the PTRACE_PEEKUSER interface, and at some point
04:46
Denys Vlasenko, who used to be a prominent ptrace developer, suggested switching to PTRACE_GETREGSET, and this change alone improved strace's performance by several
05:04
tens of percent, simply by issuing a smaller number of ptrace requests.
05:20
There were also some additional changes that reduce the number of ptrace requests issued for specific tasks. For example, we don't need to get the syscall number on exiting, apart from the cases when we want to make sure that we actually exited the syscall, which is a different topic that is also solved
05:43
by PTRACE_GET_SYSCALL_INFO, so that we don't have to issue a separate register-fetching call, and so on. And another reduction in the number of ptrace requests is that,
06:00
while originally the registers were read one by one, after switching to PTRACE_GETREGSET a separate PTRACE_PEEKUSER for retrieving the instruction pointer is no longer needed; so that was another cleanup by
06:23
Denys. He actually contributed a lot of performance improvements at the time, but he is no longer as active as he was before. Another aspect, as I said, is the fact that when the decoder starts and performs its job, it actually reads from
06:44
the tracee's memory, and the original way of doing so is PTRACE_PEEKDATA requests. But then, in Linux 3.2, a new set of syscalls appeared, originally created for
07:06
various multi-process applications, such as MPI applications, that allows avoiding double copying of data — that is, communicating and transferring data
07:22
between processes without a double copy; because when you use some standard way of copying data between processes, like shared memory or a pipe, you actually perform this copying twice. These new process_vm_readv and process_vm_writev syscalls allowed
07:43
removing this double copying, and they can also be used by ptrace-based tracers, where they actually improved the performance of decoding syscalls significantly.
08:02
In some cases it's tens of percent — cases when you want to read a lot of data, for example large getdents syscalls, where you read a whole bunch of data, several
08:21
kilobytes of data. getdents is actually quite peculiar, because you can't figure out the number of entries present until you go through all of them; that's why you have to retrieve all the data.
08:40
And for read and write requests, like the example shown previously, and for dumping read and write data, it's also very useful, and it sped strace up in these respects
09:01
significantly. And yet another step, which was done seven years later, is actually caching these requests, because in some cases, for example when lists and arrays are read, the way it is done is that the decoder function issues
09:26
a read from the tracee for a specific array element, and then for the next one, and then for the next one, and they usually hit the same page of the tracee's memory. With caching, it retrieves the whole page once and then performs
09:45
local reads without issuing additional syscalls. So interacting with the tracee is not the only aspect of strace that may make it slow. Another aspect is that
10:06
it performs a lot of internal work itself, and some of the algorithms used there are not entirely optimal. For example, we have an array of
10:24
trace control blocks — descriptors of the tracees we are tracing — and most of the time this array is quite small, because we usually don't trace a lot of processes at a time, maybe tens or hundreds. But when you hit thousands of them, the
10:44
lookup, which was originally implemented as a linear search, starts to matter: when you trace on the order of thousands of processes and switch between them all the time, it can start consuming a significant amount of CPU time. So,
11:04
by implementing a faster lookup, this single bottleneck was removed, which also improves the performance several times in this rather rare but still real use case.
11:24
The next one is not about performance itself, but about the way strace interacts with the Linux kernel and the way the kernel works. There was a bug report at the end of 2008, opened by some Red Hat partners, titled
11:42
"some threads stop when strace with the -f (follow-fork) option is executed on a multi-threaded process", and the reproducer was actually quite simple, as it usually is: the main process creates
12:05
several threads in a loop and then exits, and all the threads do is issue some cheap syscall, for example getuid or chdir or something like that. The important thing is that they do that in a loop and
12:25
the syscall is rather cheap. What is shown here is actually a slightly modified output of the test case currently used in strace's test suite, but originally
12:43
it was a test program that was committed along with the original fix submission. As you can see, when you run this program with two, three, or four threads it doesn't affect much, but when you run it with five threads it slows down significantly —
13:05
all it does is create five threads and then exit, killing all the threads along with it — and with six threads it's already about three minutes, and I couldn't wait for the seven-thread variant of this program to finish. So it's about the way the kernel
13:27
notifies strace about new events and the way strace actually handles them: it handled them just one by one, first come, first served, and after resuming those two, three, four threads, the kernel notified
13:46
about those threads again and again and again, because they had new syscalls, and that way there was no progress on the other threads, and they became starved as a result. But the upstreaming of the patch didn't go as well as planned:
14:08
the issue was reproduced and patched rather quickly, in a matter of two months, again by Denys Vlasenko, but the patch was kind of intrusive, and there were several disagreements
14:24
with the maintainer at the time, so the patch was reverted, and after that there was no significant push from the developer to actually upstream it.
14:41
And the fix for this behavior became RHEL-only, and it stayed like that for almost 10 years: originally the bug was reported against RHEL 5, and then the fix was forward-ported to RHEL 6, RHEL 7, and finally RHEL 8.
15:06
When the time came to forward-port this patch to RHEL 7, there was again some discussion about whether we should upstream it and so on, but it led to nothing. And when the time came to forward-port it yet again, for
15:24
strace 4.24 in 2018 — I had become a maintainer by that time — I decided to try to upstream it yet again, and it took, as usual, nine months,
15:42
and the patch was included in strace 5.0, released in March 2019. As you can see, I decided to have some fun with this patch by including references to all
16:01
the old bug reports that were created for this issue, and crediting all the people who participated in fixing it, because the strace
16:21
maintainership in RHEL changed hands throughout the years, and there were quite a few people who actually had to figure out how strace's tracing code worked by then and how the patch should be reimplemented for the new version. And actually I forgot to include
16:43
one person; I found that out when I was preparing this talk. As you can now see, strace traces several hundreds of tracees, each of them performing syscalls all the time — the test was done on an eight-core machine —
17:07
and it handles 600 thread creations in about 20 seconds, which is quite an improvement over the original. But strace still uses the ptrace interface, and it is still slow:
17:33
if all the process does is issue syscalls, it is pretty slow, as you can see.
17:42
As the infamous dd example suggests, it can be hundreds of times slower, but in some more real-world examples it's not so drastic. As a real-
18:00
world example I used another talk, presented at FOSDEM two years ago by Philippe Ombredanne, regarding the way he traces the sources for each binary and each artifact; in his measurements the slowdown was not very significant, and in my
18:27
measurements it was about two to three times, depending on the version of strace. But we still have this enormous overhead on dd. Actually, can we go another way — can we
18:44
slow the original dd command down a bit? Actually we can; well, not we the strace developers, but some researchers who study side-channel attacks. Their mitigations slowed it down by around three to four times, and now strace is only
19:09
30 to 40 times slower, which is a significant improvement. Actually, I've cheated here, because I've used a Sandy Bridge machine in this example, and
19:22
it's actually the most impacted CPU family, because it has the most of the architectural side-channel mitigations backported, and it is pretty old, so it doesn't have things like Skylake's STIBP and IBRS or Haswell's address-space-ID improvements.
19:49
So it was impacted the worst, but other CPU families are impacted as well, and as you can see, in comparison with MIPS, ARM, PPC, or s390, it's actually now in line
20:04
with the slowdown present on other architectures — because originally there was a pretty much hand-crafted, fine-tuned, optimized syscall entry routine with a separate fast path for when the process is not ptraced or
20:29
traced by seccomp, and now it is all merged: it executes the same code regardless of whether ptrace is in the way or not. But there are other factors as well,
20:46
I think additional barriers and such. Still, it's significant: 40 times, and we have to stop for each syscall even if we don't need to trace it.
21:05
But actually we don't have to, because since strace 5.3 we have this nice seccomp-BPF feature implemented — Paul will tell you more about this — but in short,
21:23
it generates a custom BPF program that implements the filter that is set on the command line, and if you don't trace a lot of syscalls and your BPF program isn't too big, then you get a not-so-significant slowdown, on the order of tens of percent, and
21:45
the same slowdown you will see in some real-world examples.
22:05
So what about future plans? There are no significant plans regarding strace's performance as of now — probably some refinements regarding caching of the retrieved data, and improvements in seccomp-BPF, like trying to enable it by default in the cases where we can, via some refinements in the way the program is generated. But
22:29
we are not able to make fundamental improvements there without some drastic changes being implemented in the kernel. For example, if we could use eBPF
22:44
in seccomp programs, along with maps, we could significantly simplify and shorten the way the seccomp-BPF filter is now implemented; and if some blocking mechanism were implemented, for example in perf, that allows stopping the process when the buffer overflows, then again
23:06
we could port strace to it and again get a significant performance improvement in a much wider scope of use cases. But none of that is done yet; there have been some discussions,
23:22
but there's no actual work done, and without that there's nothing more to improve in strace itself. So that's about it, and if you have any questions or comments, I can answer them. No questions? Great, thank you.