Introduction to low-level profiling and tracing

Formal Metadata

Title
Introduction to low-level profiling and tracing
Subtitle
User space, Kernel, and hardware profiling and tracing with ptrace, perf, SystemTap, and BCC
Title of Series
Number of Parts
118
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Python has built-in tracing and profiling facilities in the form of callback hooks in the sys module. The settrace and setprofile callbacks have several drawbacks. They slow down the Python interpreter considerably and only allow tracing of Python code. Modern OSes and CPUs come with a variety of APIs for efficient and low-level tracing down to system calls, kernel space code, and hardware events. Some tools even create code that runs in kernel space. This talk is an introduction to and comparison of various low and high level tools for profiling and tracing as well as visualization tools like flame graphs. It covers ptrace, perf, SystemTap, and BCC/eBPF. Ptrace based commands like strace are easy to use but slow. Perf allows lightweight profiling of hardware events and CPU instructions. SystemTap is a powerful toolkit plus DSL to instrument probe points inside the kernel as well as static SystemTap/DTrace markers in libs and languages like Java, PHP, and Python. CPython comes with a set of instrumentations for SystemTap. BCC is a collection of tools that run as JIT optimized eBPF code in kernel space. The talk is an introduction to basic concepts of low-level tracing and profiling on Linux. The main goal is to show the potential of the tools.
Transcript: English (auto-generated)
Good morning, everybody. I have to say, you're missing five other excellent talks, so thanks for coming to my talk. This is a very heavily contested slot at the moment. Hi, I'm Christian Heimes. I'm going to talk to you about profiling and tracing with different tooling.
One reason why I started to look into the topic was an issue I faced a couple of years ago: every five minutes there was a blocking. The whole system didn't respond for a couple of seconds, we couldn't figure that out for a long time, and the bosses were kind of angry because we lost a bit of money. That's really annoying.
And with the tooling I will show you, you can actually figure that out. There was a similar talk yesterday, which I missed because I was giving another talk with Steve Dower, by Christoph Herr, who looked into the question: is it a problem with the GIL, or a problem with me, when the system is not responding?
One of the final motivations for this talk came when I ported it from a talk at a general web developer conference, with examples in Python and PHP, to pure Python using requests. I figured out there's a big block, almost 30%, wasted on something very, very interesting.
So this is a flame graph of a requests call. At the end we'll see what that thing actually is, this block, and why you should use requests.Session if you do a lot of requests in a loop. An alternative title for this talk is actually two and a half use cases for tracing tools
because you can use the tracing tools not only for debugging and profiling; there's also something very cool for anybody working in quality engineering or testing, which I will show later on. I learned that just half a year ago at another conference. So, who am I? Hi. I'm from Hamburg, a Python core developer
using Python for almost 18 years now. I work mostly on security in Python, and I'm making money by working for Red Hat on a security engineering software stack called FreeIPA. So, the agenda and goal for this talk is to explain what this picture is all about.
This is not all of them, but a big fraction of the different profiling, tracing, and performance investigation tools for Linux, for different parts of the stack, from what the CPU is doing on a very, very low level up to what the application is doing on a very, very high level.
After a short introduction, I will introduce you to user-space tracing tools based on the ptrace syscall, and the second half will be about kernel tracing and hardware tracing tools that can go down to the very, very low level. Then a summary, and I may have, I don't know, maybe five minutes of questions and answers.
Some special thanks. A lot of the demos I'm going to show are based on tooling and blog posts by Brendan Gregg. So if you're interested in profiling and tracing, go to Brendan Gregg's website. He's fantastic. Victor Stinner has also been investing lots of time
to optimize Python and has some very helpful explanations of how to do proper benchmarking, because benchmarking is super hard and super hard to get correct. Dmitry Levin, the maintainer of strace: I met him at a conference a while ago and learned some of the cool new tricks. And another engineer at Red Hat showed me another couple of tricks
for tracing for system engineers. So this is a combination of some talks I saw before and some tooling I've been using for a while. Introduction. So, some terminology. Most people think of debugging as identifying and removing bugs. If you do debugging as an engineer,
it's very costly because you have to invest quite a bit of time. You can't do that easily on a production system because you slow down the production system a lot. So you need to build an additional fake production system or a staging system with data, and you can't actually see many of the same things
because you don't see the same kind of traffic patterns. So there's a better thing, tracing, which I will show here. It's more like observing and monitoring. If you do it the right way (that's why there's a small star), it's fast. It doesn't slow down your production system much. There's still a small impact, but it's mostly okay
and better than having a big performance issue in production. And once you do tracing and get data, as a byproduct you can also do some kind of profiling, data analysis, and visualization of what's going wrong. Methodology: there are different kinds of tracing. The simplest one you may know
is application-level tracing. You build a debug build, or you have some kind of tooling that writes log files somewhere, like the MySQL slow query log. In Python, at a high level, there's sys.settrace: you can add a callback that gets called when something's going on, et cetera, et cetera. There is also user-space tracing
that goes a bit deeper, on the C level. You can load something into your process space with LD_PRELOAD, or you can use ptrace. Even deeper is kernel-space tracing, which uses the kernel to investigate what the operating system does at a lower level
or how user space communicates with kernel space and hardware. And finally there's actual hardware tracing. Any modern CPU has special capabilities for hardware performance counting and power management unit controls. You can see what the different cache levels are doing and similar things.
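As an illustration of the application-level hook mentioned above, here is a minimal sketch of a `sys.settrace` callback that records every Python-level function call. The `fib` function and the `make_call_tracer` helper are made up for this example; only `sys.settrace` and `sys.gettrace` come from the standard library.

```python
import sys


def make_call_tracer(log):
    """Return a trace function that records the name of each Python call."""
    def tracer(frame, event, arg):
        if event == "call":
            log.append(frame.f_code.co_name)
        # Returning None means: no line-by-line tracing inside this frame.
        return None
    return tracer


def fib(n):
    # Hypothetical workload, chosen only because it calls itself a few times.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


calls = []
previous = sys.gettrace()          # remember any tracer that is already set
sys.settrace(make_call_tracer(calls))
try:
    fib(3)
finally:
    sys.settrace(previous)         # always restore the earlier state

print(len(calls), "Python calls traced")
```

Because the callback runs inside the interpreter for every call, this kind of hook slows Python down noticeably and cannot see C code, which is the gap the lower-level tools in the rest of the talk try to fill.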
To do tracing you often have to take some special steps. You have to install some tooling. Often you need special permissions, so you can't do some of the tracing as a regular user, and sometimes not even as an admin: you have to disable some protections, for example disable Secure Boot to inject kernel modules, because low-level tracing
is often like live-patching your kernel, which is a bit scary but also very cool. You just write stuff that runs in the kernel with unlimited access to all hardware, yeah. But for tracing, profiling, and understanding what's going on, a big issue is that statistics are very hard.
So you should know at least a bit of statistics. You may know the first phrase. The second one is German, "Wer misst, misst Mist": who measures, measures manure. So, understanding statistics: some things you should learn about if you're interested are the different kinds
of averages, what the difference is between mean and median, and for profiling it's often useful to get percentiles. So how good are 95% of all requests, or 99%, because you often don't care that much about outliers, or you specifically want to know when there's a big outlier. There are different kinds of errors you can have, like observational errors,
or random errors that modify your output. There are also different kinds of biases: if you're looking very hard at some area and have some opinions about it, you try to confirm your own opinion, that's the human factor, while you may be looking in the wrong direction. There are also very misleading ways to present data
that can actually fool you or fool other people. Vatican City has something like 2.27 popes per square kilometer, which is a correct result but totally misleading. If you use a sampling profiler and don't profile every instance
but do regular sampling, you can get sampling errors. The Nyquist-Shannon theorem is a fun thing if you're working with images, any other kind of sampling, or any kind of electronics. And there are multiple papers on the topic. I love that one, it's very fun to read: Producing wrong data
without doing anything obviously wrong. This is a very fun paper about stuff that went wrong because people were looking the wrong way. And computers are very, very noisy. This is one second of my laptop doing nothing, and there's, well, lots going on in the background, although it's actually doing nothing.
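To make the point about mean, median, and percentiles concrete, here is a small sketch using the standard `statistics` module. The values in `latencies_ms` are hypothetical and chosen only to contain one big outlier; they are not measurements from the talk.

```python
import statistics

# Hypothetical request latencies in milliseconds, with one large outlier.
latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 12, 250]

mean = statistics.mean(latencies_ms)
median = statistics.median(latencies_ms)

# quantiles(n=100) returns the 99 cut points between percentiles 1..99.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p95, p99 = cuts[94], cuts[98]

print(f"mean={mean} median={median} p95={p95:.1f} p99={p99:.1f}")
```

With data like this the mean is pulled far away from what a typical request sees, while the median stays close to it and the high percentiles show how bad the tail is. Which number matters depends on whether the outlier is noise or the very thing being hunted.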
CPUs, power states, et cetera. Victor Stinner's blog explains how you have to configure your kernel and operating system and reboot to make sure that you remove some of the noise for some of the CPUs and get rid of IRQ handlers, balancers, and whatever. Another thing: we had a very fun bug a while ago
where, depending on your environment, the length of your host name or how many environment variables you have in your system changes the way memory is allocated. If you go over a certain threshold, you may suddenly go from just a couple of mmap calls to a lot of mapping and unmapping of memory pages,
which can change your performance a lot. It took us ages to figure that out. So let's profile, with a very easy case. The first thing I'm going to show is reading a file. The simple thing you can do in Python is just to take the time, do something,
take the time again. Well, there are multiple issues with that kind of approach. You're missing a fun talk about how a day is 24 hours, plus or minus one. If you have time zone switches, or a clock source that's changing, this goes wrong: for profiling you have to use a clock source that does not change with your surroundings.
The clock source I'm using here would be a bad one: if you do profiling across a DST switch, you get an extra hour or even negative results. So that's one of the caveats you should take care of.
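A minimal sketch of the "take the time, do something, take the time again" approach with a monotonic clock. It assumes `time.perf_counter` as the clock source, which is not necessarily what the slide used; the helper names `timed` and `read_file` and the throwaway temporary file are invented for illustration.

```python
import os
import tempfile
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds) from a monotonic clock."""
    start = time.perf_counter()    # not affected by DST or wall-clock changes
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def read_file(path):
    """Read a whole file as bytes."""
    with open(path, "rb") as fh:
        return fh.read()


# Hypothetical input: a temporary file instead of a real system file.
payload = b"x" * 65536
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    data, seconds = timed(read_file, tmp.name)
finally:
    os.unlink(tmp.name)

print(f"read {len(data)} bytes in {seconds:.6f} s")
```

A single measurement like this is still subject to all the noise described earlier, so in practice it would be repeated many times and summarized with percentiles rather than trusted on its own.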
Some of the examples I'm using: the next one is just a very simple shell thing, just doing cat. Then I'm using Python to open and read from a file. And for some of the more advanced things I'm using an HTTPS connection with requests, because requests is a tool that most of you know. It does a lot of things, so it does
network operations, file operations, it does something with SSL, OpenSSL, TLS encryption, and a couple of other things. So it shows lots of different areas where you can get very surprising results. So, section one: ptrace. Ptrace is a very old feature
from the old Unix days, about 1985. It's a way to do user-space tracing, and it's used by lots of tooling you may know. So if you use the GNU debugger to debug a C program,
if you use strace or ltrace, if you do code coverage of any kind of C library, with a lot of these tools you will more or less be using ptrace. Ptrace is useful but slow, useful in the sense that it's mostly easy to use. One example here: I'm using ltrace, the library call tracer, to see,
is it big enough? Oh, perfect. If I do a request, I want to see all function calls starting with SSL_CTX_ something, in any library that's loaded. That's the at sign; after that comes the library.
That shows me the different function calls I'm getting and in which libraries, so the Python ssl module, the internal _ssl, and there are a couple of calls where you see which memory addresses are passed, and the return results on the right side. Not very helpful in itself, but if, for example, you want to investigate
whether your program is calling a specific function at all, it's rather nice for that. You can also do something like count how many memory allocations you do. I'm running two processes in two different shells, so I get the PID, then I use ltrace to trace malloc, realloc, and free in any library.
I attach it to that PID, I run the requests call, and I get how many allocation and free calls there are. But one downside of ptrace is that it's slow. That one took, instead of half a second on a very slow network,
more like three or four seconds. The overhead is extremely high because ltrace has to jump back and forth between kernel space and user space a lot. A similar tool is strace, with which you can investigate system calls. It was developed by Paul Kranenburg in 1991
and is now maintained by Dmitry Levin. The logo of strace is an ostrich. Why an ostrich? Well, if you know Dutch or maybe German, it's Strauss, that's the name of the bird. That's in German, or in the Netherlands, in Dutch.
So that's Strauss. And with strace, you do system call tracing. So what's system call tracing? Oh, it doesn't load? No, it doesn't load. Oh, is it back in my... Okay, that's supposed to be a circle
of the different rings of a CPU. You see a bit of the circles around. The way modern operating systems work is that all processes run in user space, and user space is not allowed to do anything directly with hardware. Even memory is virtualized, and access to any kind of hardware is
abstracted by the kernel. You have to tell the kernel: please open that file for me, or please send something for me on the network device. And this is done by a syscall. So you call a feature in the kernel. The kernel does some verification, then talks to the hardware for you, and goes back.
This is called a syscall. And any time you do a syscall, you have to do a context switch. The kernel has to save the user-space state and set up its kernel-space state on the CPU, do something, and go back, and that takes a lot of time, which you can't see in here. Well, it worked yesterday. So, stracing:
one thing is to open a file. You want to cat this /etc/os-release file and see which files are actually opened by cat along the way. And you see, like, nothing. Well, that's bad. With strace we're tracing open, and open is the system call to open a file.
Still, there's no result. It's peculiar. So let's just look at all syscalls made. And you see, oh, it doesn't call open. It calls something called openat. The kernel does not have just one syscall for a specific task. It often has a family of syscalls to do related things.
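The openat idea, opening a path relative to an already open directory, is also visible from Python. The following is a sketch under the assumption of a Linux system, where `os.open` with the `dir_fd` parameter is expected to end up in the openat syscall; the helper name `read_via_dir_fd` and the temporary `os-release` stand-in file are invented for the example.

```python
import os
import tempfile


def read_via_dir_fd(directory, name):
    """Open `name` relative to a directory descriptor, openat-style."""
    dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
        try:
            return os.read(fd, 4096)
        finally:
            os.close(fd)
    finally:
        os.close(dir_fd)


# Hypothetical stand-in for /etc/os-release in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "os-release"), "w") as fh:
        fh.write("NAME=Example\n")
    data = read_via_dir_fd(tmp, "os-release")

print(data)
```

Running a script like this under strace with a filter on openat is one way to check which member of the open family actually gets used on a given system.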
And glibc decided a while ago to move away from the old open syscall, because it's not available on all operating systems and CPU architectures, and use openat instead. So one thing you can do if you want to strace open calls is to use this regular expression, or, even easier, there are multiple families
of things I will explain in a minute. The regular expression approach is ultimately a bit bad. Take stat: the stat call gets the status of a file, like file size and permissions, and there is a plethora of different stat syscalls, which may or may not be available, or may or may not give you the correct results.
glibc does the correct thing, but you would need to actually track them all. An easier way to do that comes next, but first, let's look at how this traced openat call works. If I trace openat, you see multiple calls, and the result on the right side is the low-level file descriptor,
which in this example is always 3, because the program opens the file, reads something, closes it, and then the kernel reuses the same file descriptor number. Or I can do something else: another new feature is -P (capital P) to trace all activity on one file. You see it's actually doing a readlink because it's a symlink, then you see some stat calls,
it reads something, and finally closes the file. So this is a rather nifty tool to see what operations a process does. Again, there are multiple helper classes: if you want anything to do with a file, you can use %file; %desc is anything with a file descriptor; there are sockets and file operations,
network operations have a different class you can use, and there are multiple other helpers to get more output and have strace give you a more detailed analysis of what's going on. For example, tracing all file access while running a program. This is a requests call that tries to load
some CA certs that are not present. I use that feature a lot to investigate why something doesn't work although I expect it to; that was a configuration mistake I had on one of my systems. Another thing is to see network activity. If you do a network call, the first thing you always do is a DNS lookup.
You see it opens an AF_INET, SOCK_DGRAM socket, that's a UDP socket, does something on port 53, looks up the host name, and gets back an IP address; next it connects to that IP address and does a request. And with the right options, you actually see what's going on in the internal data structures.
These tracing tools are also a nice way to learn more about how operating systems, glibc, and the kernel work. A cool feature for any kind of tester is something called syscall tampering. You can actually modify, play around with, and disturb how syscalls work. For example, I inject an error
into the socket syscall, which opens a new socket, and say, okay, EMFILE is the error number, so return an error, and then if I do a request, okay, the DNS lookup doesn't work, because the first socket call during the DNS lookup is just intercepted
and gets an error. I can also do something like, okay, I don't want to intercept just the first one, I want to intercept the second and any following one, so that's the when=2+, and now the DNS lookup works, but the first connection fails,
because I can't open a file: there are too many open file descriptors, that's the EMFILE error number. Or perhaps you want to slow down some operations, reading and writing a file, or some kind of network operations. You can slow that down with strace too: you can add a delay,
either at the entry or at the exit of a system call, and that slows down copying from /dev/zero to /dev/null from 3.2 gigabytes a second to just about 10 megabytes a second, just by slowing it down a bit. Another thing is, for example,
you want to analyze a program that removes temp files, so let's just disable the unlink call, unlink is the internal name to remove a name from a file handle on disk, so that's remove in Unix speak, and yeah. So you see that it's injected,
and the file's still there, but the program doesn't get an error, so it doesn't fail from the perspective of the program, but it doesn't do anything, so it's just return volume zero, and yeah, files still there. So, verdict, I'll use Ltrace, especially Strace a lot, because it's easy to use for small, simple tests,
it's powerful, and usually does need extra privileges, but on the other hand, it's slow, and especially Ltrace does not work with any kind of modern binaries when they're compiled with special flags, so if you have a bind now thing, then there's missing some information in the header, and Ltrace can analyze and see what's happening.
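The syscall tampering options described above follow a regular pattern on the strace command line (`-e inject=SYSCALL:ACTION[:when=EXPR]`). The sketch below only assembles such command lines in Python; the helper names `inject_expr` and `strace_argv`, the delay value, and the traced command are assumptions for illustration, and the exact commands used in the talk are not known.

```python
def inject_expr(syscall, *, error=None, retval=None,
                delay_enter=None, delay_exit=None, when=None):
    """Build the value for one `strace -e inject=...` option."""
    if error is not None and retval is not None:
        raise ValueError("error and retval are mutually exclusive")
    actions = []
    if error is not None:
        actions.append(f"error={error}")              # fail with this errno
    if retval is not None:
        actions.append(f"retval={retval}")            # fake a successful return
    if delay_enter is not None:
        actions.append(f"delay_enter={delay_enter}")  # microseconds, on entry
    if delay_exit is not None:
        actions.append(f"delay_exit={delay_exit}")    # microseconds, on exit
    if not actions:
        raise ValueError("nothing to inject")
    if when is not None:
        actions.append(f"when={when}")                # e.g. "2+" = 2nd call onward
    return ":".join([f"inject={syscall}"] + actions)


def strace_argv(command, *exprs):
    """Return an argv list that runs `command` under strace with the given -e options."""
    argv = ["strace", "-f"]
    for expr in exprs:
        argv += ["-e", expr]
    return argv + list(command)


# The three cases from the talk, with a hypothetical script name and delay.
print(strace_argv(["python3", "fetch.py"],
                  inject_expr("socket", error="EMFILE", when="2+")))
print(inject_expr("write", delay_enter=100))
print(inject_expr("unlink", retval=0))
```

The resulting list can be passed to `subprocess.run` on a machine that has strace installed.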
So that was high-level tracing. Now let's go a little bit lower to actually see what my operating system is doing, what my kernel is doing, what my hardware is doing. There are several tracing capabilities inside the kernel for different kinds of tasks. A lot of it is kernel tracing: you can see what the file system is doing and what your hardware drivers are doing. With the CPU tracing capabilities you can see what your CPU cache is doing and what your memory management unit is doing when handling memory. There's also a way to do user space tracing from kernel space. ptrace is very slow, because any time you do something it has to copy values to user space and back, which means copying lots of data. With efficient user space tracing in kernel space, you can do lots of pre-filtering in the kernel, store the results in an efficient ring buffer, and then have another process drain the ring buffer with the pre-filtered or pre-aggregated results to a file. That's very, very efficient. And, as I mentioned before, it's a fun way to learn more about how the kernel actually works.
The different data sources are kernel probes and user space probes (kprobes and uprobes) and different events: the kernel defines several events, the CPU has several events, and some chips on your motherboard may emit events that are handled by the kernel and offered to you. Then there are the statically defined user space probes, USDT, which I will explain later on with some examples in Python.
On kprobes and uprobes: with kernel probes you can see almost everything happening inside the kernel, and with user space probes also almost everything. The things you can't intercept are static or inlined C functions. They're optimized away by the compiler and no longer available, but the rest you can probe. Then performance counters: this is a small, small, small part of the performance counters you actually have.
I think the listing is usually like 20 pages or so on my screen, and I have a big screen, so it's a lot. These are almost 1,900 different events I have on my system. Kernel trace events you can see here: for example, I'm listening to cfg80211, which covers 802.11, the standard for wireless network cards, and I get the base station frames and base station packets, so I see what my wireless network card is doing when connecting to a new base station. You see different frequencies, you see activity, you see that in the end the physical hardware connects to band one on frequency 5180, plus the MAC address and other stuff. These are two of the events you can see.
If you want to know what your system is doing, it's a fun tool. Now the advanced tools. I will not cover all of these tools, so I'm going quickly: ftrace, because ftrace is the foundation of lots of the other tooling that uses function tracing in the kernel; the perf tool to handle perf events; BCC and the extended Berkeley Packet Filter (eBPF) tools, which are a new way to write kernel programs, since these days the kernel has a virtual machine with a JIT in which you can run eBPF programs; and SystemTap, the last tool I'll explain. There are several more tools, like LTTng, which is a cool tool developed by the University of Montreal. And there are DTrace and SystemTap: if you look into the Python documentation on instrumentation, SystemTap is one way to use DTrace-style probes on Linux. DTrace was originally developed by Sun.
ftrace, the function tracer: you can do function tracing on a system that has just a kernel and BusyBox, because you only need very low-level shell commands to do that, and the rest you can do with a virtual file system. Just one example: I had an issue on a system that stored data on NFS, and I wanted to see which kernel calls a simple Python program makes. It opens a file and reads something, and I have a function graph attached to every NFS kernel function. This is the call stack inside the kernel that the kernel goes through to read something from an NFS store. A different representation: because I noticed that getting permissions from NFS was very slow, I asked for anything NFS-permission related and the call stack for it. You see here that it actually does two different calls. The first does a check using newstat, and the second opens the actual file. At the end you see do_sys_open and do_syscall_64. That's the entry point where user space calls into kernel space, and then the kernel does permission checks to see whether you're actually allowed to open the file. The issue we were having was that the metadata cache for permissions didn't cache the permissions correctly.
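The "virtual file system" interface mentioned above is tracefs, usually mounted at `/sys/kernel/tracing` (or `/sys/kernel/debug/tracing` on older systems). As a rough sketch, the function-graph setup for NFS functions comes down to a handful of file writes. The glob `nfs*` and the helper names are assumptions; the speaker's exact filter is not shown in the transcript, and applying the plan to the real tracefs requires root.

```python
from pathlib import Path

TRACEFS = Path("/sys/kernel/tracing")  # assumption: the usual tracefs mount point


def function_graph_plan(func_glob):
    """Return the ordered (file, value) writes that enable the function_graph tracer."""
    return [
        ("tracing_on", "0"),                   # pause tracing while reconfiguring
        ("set_ftrace_filter", func_glob),      # only trace functions matching the glob
        ("current_tracer", "function_graph"),  # select the call-graph tracer
        ("tracing_on", "1"),                   # resume tracing
    ]


def apply_plan(plan, root=TRACEFS):
    """Write each value to its control file under `root`."""
    for name, value in plan:
        (root / name).write_text(value + "\n")


if __name__ == "__main__":
    # Print the equivalent shell commands instead of touching the kernel.
    for name, value in function_graph_plan("nfs*"):
        print(f"echo '{value}' > {TRACEFS / name}")
```

After the plan is applied, the recorded call graph can be read from the `trace` file in the same directory.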
Perf counting on Linux: as I mentioned before, there are different ways to do that. In most cases you can do it as an unprivileged user, and the tooling also has high-level support to look into Python, Java, Node.js, PHP, and whatever else you're using.
One thing that's very fun: what does my CPU do when I compile CPython on my system? Here I'm using perf stat and calling the command make -j to do a parallel build of CPython. You see how many context switches I got, how many CPU instructions it took to compile CPython on my laptop, and how often the level one and level two caches were utilized. It's a very nice way to check your algorithms: if you're into data science, you want an algorithm that uses the CPU caches very efficiently, and here you see whether you have an issue there. Or you see how well compiling CPython utilizes the CPUs. These days CPUs have hyper-threading, which basically presents one physical CPU core as two virtual CPU cores. In theory you can reach a utilization of two, which would mean that both virtual cores are perfectly used. 1.7 is a very good ratio. You can also do user space probing. A while ago I used ltrace as the example.
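The utilization figure discussed above is the one `perf stat` prints next to its task-clock counter: CPU time consumed divided by wall-clock time. A minimal sketch of that arithmetic, plus the instructions-per-cycle ratio that perf stat also reports, follows. The input numbers are hypothetical and are not the measurements from the talk.

```python
def cpus_utilized(task_clock_ms, elapsed_s):
    """CPU time consumed (ms) divided by wall-clock time, as in perf stat."""
    return task_clock_ms / (elapsed_s * 1000.0)


def instructions_per_cycle(instructions, cycles):
    """Average number of instructions retired per CPU cycle."""
    return instructions / cycles


# Hypothetical figures: 17 s of CPU time spent during 10 s of wall-clock time.
print(f"{cpus_utilized(17_000.0, 10.0):.2f} CPUs utilized")
print(f"{instructions_per_cycle(3_000_000_000, 2_000_000_000):.2f} insn per cycle")
```

A value close to the number of available hardware threads means the build kept them busy; a low instructions-per-cycle figure can hint at cache misses or stalls.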
perf looks a bit different. First you have to define the probe points you want, and then you can get statistics about how often the different calls are made. The plain run, without any tracing, took a bit more than half a second on my slow network. The ltrace run takes between three and four seconds, and with perf it's only a bit more than one millisecond slower than the original, so it's much more efficient. To get perf results, I'll now use another approach to get a call graph from a requests call. First you have to record what you're doing, then you do the reporting and annotating, and finally you can pipe it through a script and create something called a flame graph, the graph I showed before. If you do that by simply calling Python, you get something like this, which is a bit hard to read.
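Between the recording step and the final picture, the flame graph scripts work on an intermediate "folded" format: one line per unique call stack, frames joined by semicolons, followed by the number of samples. The sketch below reproduces that folding step in Python on made-up samples; the frame names are illustrative and the function name `fold_stacks` is an assumption, not part of perf or the FlameGraph scripts.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks (root frame first) into folded lines.

    Each output line has the form "frame1;frame2;...;frameN count".
    """
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical samples: each list is one sampled call stack, outermost frame first.
samples = [
    ["python", "main", "get"],
    ["python", "main", "get"],
    ["python", "main", "parse"],
]

for line in fold_stacks(samples):
    print(line)
```

The width of a box in the flame graph is proportional to the sample count of every folded line that passes through that frame.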
It contains the things happening in requests, but also all the operations happening while starting Python and shutting Python down. It's only counting user space time, so it does not see any time spent in kernel space, like doing network transactions.
So we want to look closer and just see what requests does internally. I use a similar approach as before: get the PID, run perf on that PID, do just the one request, press Control-C to stop, and generate the flame graph. Now you see this graph, and here again, the box.
If you look at the very bottom, you see something called X509_STORE_load_locations. That call is wasting almost 40% of the time locating, loading, and parsing the root CA certs used to validate the request.
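One way to avoid paying that CA-loading cost on every request is to build the TLS context once and hand the same object to every connection. The sketch below uses only the standard library so it stays self-contained; with requests, reusing a single `requests.Session` is the usual way to get a similar effect. The function names and the timeout value are assumptions for illustration.

```python
import functools
import http.client
import ssl


@functools.lru_cache(maxsize=None)
def shared_context():
    """Create the default TLS context once; later calls return the same object."""
    return ssl.create_default_context()


def fetch_status(host, path="/"):
    """Issue one HTTPS GET that reuses the shared context and return the status code."""
    conn = http.client.HTTPSConnection(host, context=shared_context(), timeout=10)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Calling `fetch_status` in a loop then sets up the trust store only on the first call instead of on every iteration.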
If you make an SSL connection, you have to load the root CAs, the trust anchors, and that loading takes a very, very long time because it does a lot of internal operations. If you call requests.get in a loop, you always have to do the same operations: load the CA certs from disk, put them into special memory structures, validate them, et cetera. So the correct way is to use a session, or to reuse an SSL context, so that the root CAs are loaded only once, and you're fine. That was something I only figured out about a week ago, when I updated the slides from my original web developer conference talk, which used cURL and PHP examples, to Python. Another advanced tool for looking deeper into the kernel, and one that is a bit easier to use than raw function tracing, is BCC, the BPF Compiler Collection, which I like a lot.
It's a way to write Python mixed with C code, which then generates eBPF programs, loads them into your kernel, and performs some kind of operation. BCC is a collection of a lot of tools. This is a slightly older overview listing the different tools. Everything you see around the edge, except for the C, Java, Node, PHP, Python, Ruby box, is a tool that is already available and ready to use in the example directory of BCC. For example, one tool is ext4slower, which shows you which processes on your computer take a long time doing something on the ext4 file system. For me, there's bash doing something, and there's balooctl: I'm using KDE, and Baloo is the index database that indexes the files on your file system. There's also mozStorage, the part of Mozilla Firefox that stores cookies. These are some of the processes that take a lot of time on the ext4 file system in my home directory. Or you can get TCP connections, but filtered by user. With tcpdump you can get all TCP connections, but you don't see the user ID. The tcpconnect program can filter connections by user ID or by other things that are available in kernel space but not available to tools like tcpdump.
Or how about we break all SSL encryption with sslsniff? I make a requests call with Accept-Encoding: identity, because I don't want gzip compression, which makes the output hard to read. I run the program and get clear-text results. This tool uses user space probes to hook into OpenSSL before SSL_write encrypts the data and after SSL_read decrypts the data, and then it dumps all the traffic. It works for all processes on the computer. Here it only attaches to one PID, but in theory you could run it as root on your system and it would dump all TLS/SSL-encrypted traffic to a file.
Or how about you want to see which files are opened? You can run this small bpftrace script that attaches to all sys_enter_openat calls and just dumps the command name, the PID, and the file being opened. It will print lots and lots of output on your screen: all file activity on your computer.
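The script on the slide is not reproduced in the transcript, but a widely used bpftrace one-liner for the same purpose attaches to the `syscalls:sys_enter_openat` tracepoint. The sketch below wraps it in a Python argv builder; the helper name is an assumption, newer bpftrace releases also accept `args.filename` in place of `args->filename`, and running the command needs root.

```python
# bpftrace program text: print command name, PID, and file name for every openat().
OPENAT_PROGRAM = (
    "tracepoint:syscalls:sys_enter_openat "
    '{ printf("%s %d %s\\n", comm, pid, str(args->filename)); }'
)


def bpftrace_argv(program):
    """Return an argv list that runs a bpftrace program given on the command line."""
    return ["bpftrace", "-e", program]


if __name__ == "__main__":
    # Print the command instead of running it, since it needs root and bpftrace.
    print(bpftrace_argv(OPENAT_PROGRAM))
```

Passing the list to `subprocess.run` as root streams one line per opened file until interrupted.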
Or how about you want to know how memory allocations are handled? This is a histogram of the sizes of all memory allocations for a requests call. On the left side you see the ranges, so one byte, or two to four bytes, which was allocated 661 times, et cetera.
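Histograms like this group values into power-of-two ranges so that a handful of rows covers everything from single bytes to megabytes. A minimal sketch of that bucketing follows; the bucket boundaries used here (1, 2 to 3, 4 to 7, and so on) are an assumption and may differ slightly from the ranges printed on the slide, and the allocation sizes are made up.

```python
from collections import Counter


def log2_bucket(size):
    """Return the inclusive (low, high) power-of-two range that contains `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    low = 1 << (size.bit_length() - 1)  # largest power of two <= size
    return low, 2 * low - 1


def size_histogram(sizes):
    """Count how many sizes fall into each power-of-two bucket, smallest bucket first."""
    counts = Counter(log2_bucket(size) for size in sizes)
    return sorted(counts.items())


# Hypothetical allocation sizes in bytes.
for (low, high), count in size_histogram([1, 2, 3, 4, 7, 8, 100, 120]):
    print(f"{low:>6} -> {high:<6} : {count}")
```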
These are very cool tools for looking very quickly into your process so you get an idea of what may be going wrong. The most powerful tool I currently know of is SystemTap. SystemTap is a way to write kernel modules that do system interception and profiling, including with user-space-defined probes. USDTs are a way for a program to tell SystemTap where there's something that could be intercepted. Python offers multiple user space probes, like function entry, function return, gc done to see what the garbage collector is doing, imports, and line activity, and the last one, which I added for 3.8, is for auditing hooks.
PHP has even more, and if you look into Java, they're going a bit overboard: 521 at the time I wrote the initial version of this talk, about half a year ago, and probably more now. If you do tracing with SystemTap, the stap command, there's one problem. I keep my laptop very secure, which kind of prohibited me from using SystemTap, because SystemTap creates a kernel module and tries to load it into the kernel. If you run your kernel with Secure Boot, you're not allowed to inject random unsigned kernel modules. So the first thing you have to do is reboot your computer and disable Secure Boot, or figure out how to sign kernel modules on your laptop with a MOK key. So let's trace Python. Let's see how you can write your first stap program to trace what's happening inside Python.
We have to define a probe point and attach the probe point to a process, which is actually not necessarily a process name; sometimes it's a library name. Then we use a marker. In the marker the separator is always a double underscore, a dunder, for some reason, although the entry point uses single underscores. The probe gets multiple arguments. The first one is really a string pointer, so you have to use user_string to convert it to the module name. We take the current time and the thread identifier and store them in a kind of hash map, then print out some stuff and increment the depth.
And the other way around: there's a second probe on import__find__load__done, which gets the module name but also information on whether the import was successful or not. It takes the time again and prints some stuff. To run your first stap program you need root permissions to inject the kernel module, for good reasons. So I'm using the import stap program and run it against Python executing just a pass statement, and we see the different imports, including nested imports. You see that the encodings module loads codecs, or that the site module tries to load sitecustomize and usercustomize, along with the timestamps and when each one is done.
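For comparison only, a rough in-process analogue of the import probe can be written with the audit hooks mentioned earlier (Python 3.8 and later). This is not SystemTap and does not use the USDT markers: it runs inside the traced interpreter, it sees only the start of an import and not its completion, and the hook cannot be removed once it is installed. The names `import_log` and `_record_imports` are assumptions for illustration.

```python
import sys
import time

import_log = []  # (monotonic timestamp, module name) in the order imports start


def _record_imports(event, args):
    """Audit hook: remember every "import" event and ignore everything else."""
    if event == "import":
        import_log.append((time.monotonic(), args[0]))


# Audit hooks stay installed for the rest of the interpreter's lifetime.
sys.addaudithook(_record_imports)

if __name__ == "__main__":
    import colorsys  # any module that is not imported yet triggers the hook

    for stamp, name in import_log:
        print(f"{stamp:.6f} import {name}")
```

Modules that are already present in `sys.modules` do not go through the import machinery again, so they do not show up in the log.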
In the talk yesterday about GIL or no GIL, Eric, who I think is sitting over there somewhere, added additional user space probes, USDTs, to investigate when the GIL is acquired, when it is released, and when somebody tries to get the GIL. This is something I may add to the next version of Python, so you can see when there's any kind of contention on the GIL. It's a cool thing.
So, the verdict on kernel space and user space tracing. You can get a lot of detailed information, which is also one of the issues: you get so much information that you may get overwhelmed. It's very fast and mostly efficient; the overhead is something like 10%, 15%, or 20%, depending on what you're doing. You can get extremely detailed information about what your hardware or software is doing.
There's a wide variety of pre-built tooling, but the learning curve can be very steep. I've been playing around with it for a year now and I'm still not very good at it. Also, if you do it the wrong way, for example by enabling dumping of all call stacks for all kernel functions, you turn your big server into a very, very slow computer, or even slower than that one. So, in my own opinion (and I picked the Swiss Army knife before I knew I was going to EuroPython, so this is from my first version): strace and bpftrace are very nice tools for very quick hacks and quick approaches. BCC is very cool if you can use the pre-built tools. Writing new tools is not that hard: you need to know a bit of Python and a bit of C, but the mix is very wild. You use something like Jinja templates with C code inside Python code to generate new code, so it's a bit like Cython, but for kernel modules. perf is great if you want to go very, very low level. I may have time to show you a video of what we did in CPython to investigate an issue in the long add operation that adds two numbers.
I think I've used up enough time. SystemTap is very interesting if you use user-space-defined probes. There's also a cool approach to replace some of the C code with BPF code, so that you no longer have to run scary kernel modules inside your kernel and instead just run a BPF program inside the JIT of the kernel, which is a bit saner and safer. ftrace is useful either if you're just booting up a system that has no user space tooling yet, or for older kernels. And again, BPF is really the future.
You want to learn more? Brendan Gregg's website is the starting point for everything related to tracing. He's just fantastic. There's BPF and the IO Visor project. There are multiple books on SystemTap that are really great. There was a talk at PyBay 2016 by Eben Freeman, who went into more detail on how to extract Python stack traces from kernel space and goes a bit deeper on that topic. I can take about three minutes of questions, and while I'm taking questions, I'll just show you the video. Okay, where's my mouse cursor? So here I'm compiling Python, running perf on that, and going deep into which CPU instructions were executed. So run. And, any questions?
Any questions? Anybody has any questions? Thank you, Christian. Any questions, please come to the microphones in front. Okay, no questions? Oh, there's somebody coming.
Okay, perfect, hello. Hi, thanks for your talk. When I was using native code, it was very interesting to use strace and similar tools, but for interpreted languages I found it a bit more difficult, because you see all the code that the interpreter itself is executing, and I found it hard to associate my code with the traces.
Yes, that's one issue when you cross language boundaries, like from native C code to interpreted code. I used to mix Java and Python in the same process, or .NET and Python in the same process, and it gets harder. There are ways to extract stack traces, backtraces of calls, using this tooling. GDB has a lot of very elaborate scripting to extract information. And this is something like a call to action, if you're interested in doing more with SystemTap. Eben Freeman wrote some proof-of-concept tooling. Instagram and Facebook are the ones that originally contributed the first implementation of these probes to Python, and they have some tooling, but it's not yet open source. I hope that I can convince somebody at Instagram to release some of the tooling they use to optimize the Instagram web services, on Python 3.7, I think. That would be cool. Again, yes, it's a big issue. That's why it needs extra work.
I agree. Okay, thanks. Okay, so one very quick question, maybe. Okay, hi. Can we use these tools in a Docker environment, and can we prepare the Docker environment to trigger them remotely, for example?
Docker: yes and no. It depends on which tools you have. strace should work if you don't need to attach it to a different PID. As for some of the other tooling, you probably don't want to allow your Docker daemon to modify your kernel. But if you have access to the base system, the system that runs your container environment, you can run the tools there as a privileged user, because containers are just processes in a different namespace on the same computer. That would work. The low-level kernel tracing tools probably won't work inside a container: for security reasons, Docker or any other container platform restricts syscalls, especially the ptrace attach and kernel-module-loading syscalls. But again, if you use the base system, that would work.
Okay, thanks. You're welcome. If you have any more questions, I will be here. We have, yeah, sorry. Any other week. So, yeah. So, thank you again.