
strace: fight for performance

Formal Metadata

Title: strace: fight for performance
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Abstract
The talk gives an overview of various optimisations implemented in strace over the past several years. While most of them are quite trivial (like caching of frequently-used data or avoiding syscalls whenever possible), some of them are a bit more tricky (like usage of seccomp BPF programs for avoiding excessive ptrace stops) and/or target more specific use cases (like the infamous thread queueing patch[1], which was carried as a RHEL downstream patch for almost 10 years).
Transcript: English (auto-generated)
So, hello everyone, my name is Eugene Syromyatnikov, and I'm an strace developer. Today I'll present a little overview of some performance-impacting changes in strace.
As you all know, strace is a syscall tracer, and this talk focuses on the part mentioned prominently in strace's man page: a traced process runs slowly.
Well, why is that? It's because of the way strace traces processes. strace uses the ptrace infrastructure for tracing, and ptrace is a pretty old, generic debugging interface that provides a set of requests for manipulating tracees in more or less any way possible.
It was originally conceived for debuggers, but it has since been repurposed for tracers, jailers, and all kinds of other process-manipulation and tracing tools.
One aspect of ptrace is that all ptrace operations are synchronous and are performed on a stopped process. Another is that the ptrace API reuses the standard Unix parent-child signalling, reaped via the wait family of syscalls, to deliver ptrace event notifications of all kinds.
So the way strace works is that it waits in a loop; when a wait event arrives, it figures out what the event is and acts accordingly: whether it was a syscall stop, a signal delivery to a tracee, or a genuine wait event such as a process exiting or being killed by a signal, and so on.
For a syscall stop, strace retrieves the syscall arguments or its return code and then, based on the syscall number, runs the decoder, which also performs decoding-specific reads from the tracee's memory. After all that is done, the tracee is resumed, and this happens twice per syscall: on entering and on exiting.
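To make that loop concrete, here is a minimal sketch of a ptrace-based tracer in C (my own illustration, not strace's actual code): it traces a single child and gets one stop on syscall entry and one on exit, reading registers at each stop.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/user.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);   /* ask to be traced by the parent */
        execvp(argv[1], argv + 1);         /* the exec stops us under ptrace */
        _exit(1);
    }

    int status;
    waitpid(pid, &status, 0);              /* initial stop after the exec */

    for (;;) {
        /* Resume the tracee until its next syscall entry or exit. */
        if (ptrace(PTRACE_SYSCALL, pid, 0, 0) == -1)
            break;
        if (waitpid(pid, &status, 0) == -1 || WIFEXITED(status))
            break;                         /* tracee has exited */

        /* One ptrace stop on syscall entry and one on exit: this is where
         * strace fetches registers and reads tracee memory for decoding.
         * (Signal-delivery stops are not handled separately in this sketch.) */
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, 0, &regs);
#ifdef __x86_64__
        fprintf(stderr, "syscall stop, nr/retval register: %lld\n",
                (long long) regs.orig_rax);
#endif
    }
    return 0;
}
```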
Since ptrace is quite an old mechanism, it has accumulated a lot of different requests for the same tasks, namely reading registers and memory. The messier of the two is reading registers. Originally there was the PTRACE_PEEKUSER interface, which reads one machine word per call. Then new ptrace requests appeared, like PTRACE_GETREGS, which fetches the architecture's registers in one go, followed by PTRACE_GETFPREGS and a family of further architecture-specific register-set requests, and at some point it was decided that this approach does not scale. So yet another ptrace request appeared: PTRACE_GETREGSET, where the NT_PRSTATUS regset returns the general-purpose registers, which is where the syscall arguments usually live (with some quirks on x86, MIPS, and Itanium).
Nevertheless, strace originally used the PTRACE_PEEKUSER interface, and at some point Denys Vlasenko, who used to be a prominent strace developer, suggested switching to PTRACE_GETREGSET. This change alone improved performance by several tens of percent, simply by issuing fewer ptrace requests.
There were some additional changes that reduce the number of ptrace requests needed for specific tasks: for example, we don't need to fetch the syscall number on syscall exit, apart from the cases when we want to make sure that we really are exiting a syscall (a separate topic, which is also addressed by PTRACE_GET_SYSCALL_INFO), and so on. Another reduction comes from the fact that registers used to be read one by one: after switching to PTRACE_GETREGSET, a separate PTRACE_PEEKUSER for retrieving the instruction pointer is no longer needed, so that was another cleanup by Denys. He actually contributed a lot of performance improvements at the time, but he's no longer as active as he was before.
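As a rough illustration of the request strace switched to (a sketch, not strace's code), fetching the whole general-purpose register set takes a single PTRACE_GETREGSET call with the NT_PRSTATUS regset:

```c
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/user.h>
#include <elf.h>

/* Fetch all general-purpose registers of a stopped tracee in one request,
 * instead of one PTRACE_PEEKUSER call per machine word. */
static long get_gp_regs(pid_t pid, struct user_regs_struct *regs)
{
    struct iovec iov = {
        .iov_base = regs,
        .iov_len  = sizeof(*regs),   /* the kernel updates this to the size filled */
    };
    return ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov);
}
```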
Another aspect, as I mentioned, is that when the decoder runs, it reads from the tracee's memory. The original way of doing so is the PTRACE_PEEKDATA request, but in Linux 3.2 a new pair of syscalls appeared, originally created for multi-process applications such as MPI applications, that allows transferring data between processes without double copying. With the standard ways of passing data between processes, like shared memory or pipes, you effectively perform the copy twice; the new process_vm_readv and process_vm_writev syscalls remove this double copy. They can also be used by ptrace-based tracers, and this improved the performance of syscall decoding significantly, by tens of percent in some cases, especially when a lot of data has to be read, like large getdents calls that return several kilobytes of data (getdents is peculiar because you can't figure out the number of entries until you have walked through all of them, so you have to retrieve all the data). For read and write calls, and for dumping read/write data, it is also very useful and sped strace up significantly in this respect.
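A minimal sketch of such a read helper (the name read_tracee_mem is mine, not strace's) might look like this; a single process_vm_readv call copies an arbitrary-sized block directly from the tracee's address space:

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Copy `len` bytes from `remote_addr` in the tracee `pid` into `buf`.
 * Returns the number of bytes copied, or -1 on error. */
static ssize_t read_tracee_mem(pid_t pid, const void *remote_addr,
                               void *buf, size_t len)
{
    struct iovec local  = { .iov_base = buf, .iov_len = len };
    struct iovec remote = { .iov_base = (void *) remote_addr, .iov_len = len };

    /* One syscall moves the data directly between the two address spaces,
     * with no word-by-word ptrace traffic and no intermediate copy. */
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```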
Yet another step, done seven years later, is caching these reads. In some cases, for example when lists and arrays are decoded, the decoder reads from the tracee one array element at a time, then the next one, and the next, and these reads usually hit the same page of the tracee's memory. With caching, the whole page is retrieved once and subsequent reads are served locally, without issuing additional syscalls.
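A possible shape of such a cache, assuming the read_tracee_mem helper sketched above and a fixed 4 KiB page size, could be the following; this is an illustration of the idea, not strace's actual implementation:

```c
#include <string.h>
#include <sys/types.h>

#define PAGE_SZ 4096UL

static struct {
    pid_t pid;
    unsigned long page_addr;     /* page-aligned tracee address */
    char data[PAGE_SZ];
    int valid;
} cache;

/* Serve small reads from the last fetched page; fall back to a direct
 * read for anything that crosses a page boundary. */
static ssize_t cached_read(pid_t pid, unsigned long addr, void *buf, size_t len)
{
    unsigned long page = addr & ~(PAGE_SZ - 1);

    if (len > PAGE_SZ - (addr - page))
        return read_tracee_mem(pid, (const void *) addr, buf, len);

    if (!cache.valid || cache.pid != pid || cache.page_addr != page) {
        if (read_tracee_mem(pid, (const void *) page, cache.data, PAGE_SZ) < 0)
            return -1;
        cache.pid = pid;
        cache.page_addr = page;
        cache.valid = 1;
    }
    memcpy(buf, cache.data + (addr - page), len);
    return (ssize_t) len;
}
```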
Interacting with the tracee is not the only thing that may make strace slow; it also performs a fair amount of internal work itself, and some of the algorithms used there are not entirely optimal. For example, there is an array of trace control blocks, the descriptors of the tracees being traced. Most of the time this array is quite small, because we usually don't trace many processes at once, maybe tens or hundreds, but when you hit thousands of them, the originally implemented linear search starts to hurt: when you trace on the order of thousands of processes and strace switches between them all the time, the lookup can start consuming a significant amount of CPU time. Replacing it with a faster lookup removed this particular bottleneck and improved performance several times in this rather rare, but still real, use case.
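One generic way to get rid of the linear scan is a small hash table keyed by PID; the sketch below is only an illustration of the idea (and pid2tcb is used as an illustrative name), not necessarily the data structure strace ended up using:

```c
#include <stddef.h>
#include <sys/types.h>

struct tcb { pid_t pid; /* ... per-tracee state ... */ };

#define TAB_SIZE 4096                 /* power of two, well above the tracee count */
static struct tcb *tab[TAB_SIZE];

/* Look up the trace control block for `pid` with open addressing and
 * linear probing; an empty slot means the pid is not traced.
 * (Insertion on attach and removal on detach are omitted here.) */
static struct tcb *pid2tcb(pid_t pid)
{
    size_t i = (size_t) pid & (TAB_SIZE - 1);

    while (tab[i] && tab[i]->pid != pid)
        i = (i + 1) & (TAB_SIZE - 1);
    return tab[i];
}
```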
The next topic is not so much about strace's own performance as about the way strace interacts with the Linux kernel and the way the kernel works. At the end of 2008 a bug report was opened by a Red Hat partner, titled something like "some threads stop when strace with the -f (follow-fork) option is run on a multi-threaded process". The reproducer was quite simple, as it usually is: the main process creates several threads in a loop and then exits, and all the threads do is issue some cheap syscall, for example getuid or chdir, in a loop. The important things are that they do it in a loop and that the syscall is rather cheap.
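A hedged sketch of that kind of reproducer (not the exact program from the bug report or from strace's test suite) could look like this; build it with -pthread and compare `strace -f` wall-clock times for different thread counts:

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Each thread does nothing but issue a cheap syscall in a tight loop. */
static void *spin(void *arg)
{
    (void) arg;
    for (;;)
        getuid();
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 5;
    pthread_t tid;

    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid, NULL, spin, NULL);

    /* Let the threads spin for a moment, then exit, taking them all down.
     * Without the queueing fix, strace -f time balloons with the thread count. */
    sleep(1);
    return 0;
}
```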
What happens is shown in what is pretty much the output of the test case currently used in strace's test suite, a modified version of the test program committed along with the original bug report. As you can see, running this program with two, three, or four threads doesn't change much, but with five threads it slows down dramatically, even though all it does is create five threads and then exit, killing all the threads with it. With six threads it already takes about three minutes, and I couldn't wait for the seven-thread variant to finish.
The reason is the way the kernel notifies strace about new events and the way strace handles them: it handles them one by one, first come first served, and after the first two, three, four threads are resumed, the kernel notifies about those same threads again and again, because they keep making new syscalls. As a result there was no progress on the other threads, and they became starved.
The upstreaming of the fix, however, did not go as well as planned.
The issue was triaged and patched rather quickly, in a matter of two months, again by Denys Vlasenko, but the patch was fairly intrusive and there were several disagreements with the strace maintainer at the time, so the patch was reverted, and afterwards there was no significant push from the developer to actually get it upstream.
So the fix became RHEL-only, and it stayed like that for almost 10 years. Originally the bug was reported against RHEL 5, then the patch was forward-ported to RHEL 6, RHEL 7, and finally RHEL 8. When the time came to forward-port it to RHEL 7 there was again some discussion about whether it should be upstreamed, but it led to nothing. And when the time came to forward-port it yet again, onto strace 4.24 in 2018, I (having become a maintainer by then) decided to try to upstream it once more, and, as usual, it took about nine months.
The patch was included in strace 5.0, released in March 2019. As you can see, I decided to have some fun with it: the commit references all the old bug reports filed for this issue and credits all the people who participated in fixing it over the years, because the downstream maintainership changed hands several times in RHEL, and quite a few people had to figure out how strace's tracing loop works and how the patch should be reimplemented for each new version. I actually forgot to include one person; I only found that out while preparing this talk.
Now, as you can see, strace can trace several hundreds of tracees, each of them issuing syscalls all the time; the test was done on an eight-core machine, and it handles the creation of 600 threads in about 20 seconds, which is quite an improvement over the original behaviour.
But strace still uses the ptrace interface, and it is still slow.
If all the traced process does is issue syscalls, it is very slow indeed: as the infamous dd example suggests, the slowdown can be hundreds of times. In more real-world examples it is not so drastic. As a real-world example I used another talk, presented at FOSDEM two years ago by Philippe Ombredanne, about tracing the sources of each binary and each build artifact: in his measurements the overhead was not very significant, and in my measurements it was around two to three times, depending on the version of strace. But on the dd case we still have this enormous overhead.
Can we go the other way and slow down the original dd command a bit instead? Actually we can, although it wasn't us, the strace developers, but rather the researchers of side-channel attacks: the mitigations slowed plain dd down by around three to four times, so strace is now only 30 to 40 times slower, which is again a significant improvement.
I have to admit I cheated here, because I used a Sandy Bridge machine in this example, and it is just about the most impacted CPU family: it has most of the side-channel mitigations applied, and it is old enough to lack later improvements such as Skylake's STIBP and IBRS support and Haswell's address-space-ID improvements. So it was impacted the worst, but other CPU families are impacted as well.
As you can see, in comparison with MIPS, ARM, PPC, or s390, x86 is now actually in line with the slowdown seen on other architectures. That is because there used to be a heavily handcrafted, fine-tuned, optimized syscall entry routine with a separate fast path for processes that are not ptraced or traced by seccomp; now it is all merged, and the same code is executed regardless of whether ptrace is in the way or not. There are other factors too, I think, such as additional barriers and the like. Still, 40 times is significant, and we have to stop on every syscall even if we don't need to trace it.
But actually we don't: since strace 5.3 we have a nice seccomp-BPF feature implemented (Paul will tell more about this). In short, it generates a custom BPF program that implements the filter given on the command line, and if you don't trace a lot of syscalls and your BPF program isn't too big, you get a not-so-significant slowdown, on the order of tens of percent, and that is the kind of slowdown you will see in real-world examples as well.
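For illustration, here is a hand-written filter in the spirit of what the seccomp-BPF feature generates automatically from the trace set (a sketch, not strace's generated program): in this example only write(2) causes a ptrace stop, and everything else runs at full speed. A real filter would also check seccomp_data.arch before trusting the syscall number.

```c
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static int install_filter(void)
{
    struct sock_filter prog[] = {
        /* Load the syscall number from struct seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is write(2), hand control to the tracer (a ptrace stop). */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
        /* Every other syscall proceeds without stopping. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short) (sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };

    /* Required so that an unprivileged process may install the filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```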
So, what about future plans? There are no big plans regarding strace's performance as of now: probably some refinements to the caching of retrieved data, and some improvements to the seccomp-BPF support, like trying to enable it by default in the cases where we can, and refining the way the program is generated. But we are not able to make fundamental improvements without some drastic changes in the kernel. For example, if we could use eBPF in seccomp programs, along with maps, we could significantly simplify and shorten the seccomp BPF filter as it is currently generated. And if some blocking mechanism were implemented, for example in perf, that allows stopping the process when the buffer overflows, then we could port strace to it and again get a significant performance improvement in a much wider range of use cases. But none of that is done yet; there have been some discussions, but no actual work, and on that front there is nothing to improve in strace itself.
That's about it; if you have any questions or comments, I can answer them. No questions? Great, thank you.