
Postmodern strace


Formal Metadata

Title
Postmodern strace
Number of Parts
490
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
strace is a diagnostic, debugging and instructional utility for Linux. It is used to monitor and tamper with interactions between processes and the Linux kernel, which include system calls, signal deliveries, and changes of process state. In this talk the maintainer of strace will describe new features implemented since FOSDEM 2018. Several interesting features were implemented within the strace project since FOSDEM 2018, including:
- seccomp-assisted system call filtering
- system call return status filtering
- PTRACE_GET_SYSCALL_INFO API support
- new options: -DD, -DDD, -X, -z, -Z
In this talk the maintainer of strace will describe these new features and demonstrate what kinds of problems they help to solve.
Transcript: English(auto-generated)
Hello everybody, thank you for coming. My name is Dmitry Levin. I am the chief software architect at BaseALT, where we make a GNU/Linux operating system, but I'm also the maintainer
of strace, for slightly more than the last 10 years. So today I'll be talking about postmodern strace. What is postmodern strace? I talked about modern strace last time, so I understood that I can't call it modern strace any longer if I'm talking
about very recent features. So where does traditional strace end and modern strace begin, and when does modern strace end? Well, modern strace never ends, so when does it turn into postmodern?
It's kind of subjective, so my definition is very simple. The strace from before I started maintaining it is traditional, and all the rest is modern. So here it ends, and, well, postmodern is all the new features since the last talk at FOSDEM. So I'll be
covering mostly what has changed over the last two years, but I'll briefly remind you about the traditional features, just to refresh them in your memory. So strace is
basically a Linux system call tracer. For several years now it has also been able not just to trace but to tamper with system calls. It has a lot of options to control its behavior
in different ways: whether it prints instruction pointers, whether it prints timestamps or not, how it prints strings, which system calls are printed and in which way, what's abbreviated and what's not. There are also options to control which signals are printed.
It can also dump the data that goes through descriptors. It can write its output in different ways, so you can, for example, redirect it into a pipe or collect output
for each process separately. There are a lot of features that control which syscalls are printed. It can also print statistics on system call invocations. It can attach to already existing processes. It can follow forks, or not follow forks, depending on whether you specify
the option. Well, that was traditional. Quite a few options were also added over the last 10 years: you can print a lot of details about descriptors, like which paths are associated with them, or what socket information is behind them when these descriptors
are sockets. It can print a stack trace of user function calls. You can filter system calls by path names. We've finally got support for regular expressions for filtering system calls,
so you can specify which syscalls are printed using regular expressions, and so on. There are more ways to control how statistics are printed and how processes are traced, so you can, for example, attach to many processes. You can run strace as a detached
process, and so on and so on. And there's also this big feature which changed strace. I mean, it changed not just strace, but the way people look at it. It's system call tampering. So you can not just trace system calls, but also inject various things,
starting with return codes. You can also inject signals and delays, but all this was more or less covered in the previous talk. So in the last two years, we got PTRACE_GET_SYSCALL_INFO support. It went both into the kernel and into strace. We got system
call return status filtering. We have seccomp-assisted syscall filtering nowadays. A lot of new system calls in the kernel are also supported,
and we have more and more elaborate system call parsers. We also finally have long options. Yeah, we had no choice; we will soon see why. And finally, a bit more than a year ago, we changed our BSD-style license to a copyleft license.
So let's start with the first feature. Well, the story itself started a very, very long time ago. I think it was 2001 when this new architecture, x86_64, appeared. The way
it was added to the Linux kernel, obviously, was to support both 64-bit and 32-bit processes, for obvious reasons, because the main feature of this architecture compared to
its competitor was that it could run legacy code. In the early years there was a lot of legacy code and very little native 64-bit code. But the way it was implemented in the kernel allowed not just mixing instructions, but also mixing system call invocations. So you could
actually invoke from native code both native 64-bit system calls and legacy 32-bit syscalls. And this was very poorly documented, if at all. It was very surprising to
many people, and it wasn't really exposed in the kernel API. So what user space tools and debuggers could do was fetch the system call number, fetch the register that describes the bitness of the process, and then just make a wild guess,
and say, well, if the process is 64-bit, then probably the syscall is also 64-bit, right? That's mostly the case, and if it's a 32-bit process, then the syscall is definitely 32-bit,
and all the logic depended on this wild guess. And it mostly works, because in most cases that's exactly what happens, but sometimes it's not the case. Back in 2008, there was a bug report against strace in the Debian bug tracker. Here is a very simple example that looks,
as you can see, very similar; it's somewhat simplified compared to the one in that bug report. The program does a very simple thing. It just prints a line of output,
then it invokes a 32-bit system call, and then it prints another line of output. But this 32-bit syscall is actually a fork. So what happens is that there are two processes, and each of them prints the line. So if you compile, link, and run this program,
you will see output similar to this. Maybe the numbers will differ, but all the rest will be just very simple. But if you run this very simple program under strace,
you will see something very strange. You will see this line being printed, then suddenly a process attaches, and then you see this ridiculous open system call with very, very odd, very impossible, I would say, arguments.
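A likely explanation for that open, sketched under stated assumptions: in the i386 syscall table fork has number 2, while in the x86_64 table number 2 is open. Decoding a 32-bit syscall against the 64-bit table (the wild guess described above) therefore turns fork into open, with whatever happens to sit in the 64-bit argument registers shown as its arguments. The sketch below uses tiny excerpts of the two tables and the values of AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 from <linux/audit.h>; the helper names syscall_name, syscall_nr and guess_name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values of AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 from <linux/audit.h>. */
#define ARCH_X86_64 0xc000003eU
#define ARCH_I386   0x40000003U

/* Small excerpts of the real syscall tables, indexed by syscall number. */
static const char *const x86_64_names[] = { "read", "write", "open", "close" };
static const char *const i386_names[] = {
    "restart_syscall", "exit", "fork", "read", "write", "open", "close"
};

/* Name of syscall nr in the table of the given architecture, or NULL. */
const char *syscall_name(uint32_t arch, unsigned long nr)
{
    const char *const *table;
    size_t size;

    if (arch == ARCH_X86_64) {
        table = x86_64_names;
        size = sizeof(x86_64_names) / sizeof(x86_64_names[0]);
    } else if (arch == ARCH_I386) {
        table = i386_names;
        size = sizeof(i386_names) / sizeof(i386_names[0]);
    } else {
        return NULL;
    }
    return nr < size ? table[nr] : NULL;
}

/* Reverse lookup: number of a syscall name for an architecture, or -1. */
int syscall_nr(uint32_t arch, const char *name)
{
    for (unsigned long nr = 0;; nr++) {
        const char *candidate = syscall_name(arch, nr);
        if (candidate == NULL)
            return -1;
        if (strcmp(candidate, name) == 0)
            return (int) nr;
    }
}

/* The old heuristic: assume the syscall has the same bitness as the process. */
const char *guess_name(int process_is_64bit, unsigned long nr)
{
    return syscall_name(process_is_64bit ? ARCH_X86_64 : ARCH_I386, nr);
}
```

For the kaleidoscope program, the process is 64-bit and the number is 2, so guess_name(1, 2) yields "open", while decoding by the real architecture, syscall_name(ARCH_I386, 2), yields "fork".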
All you can say about this is: what? And all the rest looks very usual and regular, which makes the whole picture completely ridiculous: this ridiculous open among all the nice,
expected system calls. If you run this program several times, you will see that these odd open flags are different every time. You will never see the same combination, or probably never see
the same combination of flags, because nowadays, thanks to kernel address randomization, all these registers contain garbage that changes. And this reminds me of a toy I had in my childhood,
a kaleidoscope. You turn it slightly, and you see a different nice picture. So you can use this simple program as a kaleidoscope if you like. So this problem was
approached several times, but until 2018 there was no progress. And finally, well, thanks to the two people who contributed this API to the kernel (there are two authors),
we got it; it took us almost nine months to get this into the kernel, and I don't remember how many iterations, but it was a two-digit number of iterations.
So finally we have it in the kernel, for all architectures that support tracehook, which is all the architectures strace supports, or almost all, I would say, and even some that strace does not support get it for free. The API looks like this. There is a structure you can
request from the kernel. It contains this crucial architecture field, and otherwise it looks similar to seccomp_data. So you can obtain in one go the architecture, the syscall number, the syscall arguments, and also the instruction pointer and the stack pointer, and this makes tracers that use this API
reliable in this respect, that is, with respect to the original problem. So the same program, if your kernel is fresh enough and strace is fresh enough, now looks as expected.
The process attaches, you see a proper fork call instead of this ridiculous open, and everything looks good. So I think other tracers and debuggers that have something to do with system calls should switch to this API. By the way, it also lets you find out what kind
of ptrace stop the current stop is. Until then, the kernel provided no way to find out, so tracers used to assume that stops alternate: first you enter the syscall, then you exit the syscall,
but that's not always the case. So you can actually use this nice API to find out which ptrace stop you are dealing with. Okay, so it was a very major feature for strace.
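A minimal sketch of using this API from C, assuming Linux 5.3 or later. The struct mirrors struct ptrace_syscall_info from <linux/ptrace.h> field by field, and the request number 0x420e and the op values are copied from that header under hypothetical local names, so the code also builds against older headers. fetch_syscall_info and describe_syscall_info are illustrative helpers, not strace functions: the op field answers which kind of stop this is, and the arch field says which syscall table nr belongs to.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* PTRACE_GET_SYSCALL_INFO and PTRACE_SYSCALL_INFO_* values from <linux/ptrace.h>. */
#define GET_SYSCALL_INFO_REQ 0x420e
#define INFO_NONE    0
#define INFO_ENTRY   1
#define INFO_EXIT    2
#define INFO_SECCOMP 3

/* Mirror of struct ptrace_syscall_info; the layout must match the kernel's. */
struct syscall_info {
    uint8_t op;
    uint8_t pad[3];
    uint32_t arch;
    uint64_t instruction_pointer;
    uint64_t stack_pointer;
    union {
        struct { uint64_t nr; uint64_t args[6]; } entry;
        struct { int64_t rval; uint8_t is_error; } exit;
        struct { uint64_t nr; uint64_t args[6]; uint32_t ret_data; } seccomp;
    };
};

/* Fetch the syscall info of a tracee in a ptrace stop; -1 on error. */
long fetch_syscall_info(pid_t pid, struct syscall_info *info)
{
    memset(info, 0, sizeof(*info));
    /* addr carries the size of the buffer, data points to the buffer. */
    return ptrace(GET_SYSCALL_INFO_REQ, pid, (void *) sizeof(*info), info);
}

/* Describe the stop based on op instead of assuming entry and exit alternate. */
int describe_syscall_info(const struct syscall_info *info, char *buf, size_t size)
{
    switch (info->op) {
    case INFO_ENTRY:
        return snprintf(buf, size, "entry arch=%#x nr=%llu",
                        (unsigned int) info->arch,
                        (unsigned long long) info->entry.nr);
    case INFO_EXIT:
        return snprintf(buf, size, "exit rval=%lld%s",
                        (long long) info->exit.rval,
                        info->exit.is_error ? " (error)" : "");
    case INFO_SECCOMP:
        return snprintf(buf, size, "seccomp arch=%#x nr=%llu",
                        (unsigned int) info->arch,
                        (unsigned long long) info->seccomp.nr);
    default:
        return snprintf(buf, size, "not a syscall stop");
    }
}
```

In a tracer loop, one would call fetch_syscall_info(pid, &info) after each syscall stop and dispatch on info.op, decoding info.entry.nr against info.arch rather than against the bitness of the process.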
And yeah, as I said, other tracers are of course welcome to use this. Let's talk about
system call filtering. There is a new option to filter system calls by return status. It had a very unusual history for strace. It was actually introduced in 2002, but it was broken from the beginning, and it was never announced. You couldn't find out it
existed unless you accidentally typed it or looked into the source code, because it was broken. What it did was print the beginning of a system call, and when the call failed, it just didn't print the ending. That wasn't useful. But now you can filter system calls by return status. So you can
print only those system calls that succeeded, or only those that failed. In this very simple example, you can see the difference. If you run a very simple program like cat
with a modified LD_LIBRARY_PATH, it makes the dynamic linker look in different places. I wonder whether you expected the dynamic linker to look into so many different places. But well, you can see the difference.
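A rough sketch of why filtering by return status also gives aggregation for free, under the assumption that a tracer works roughly this way: whether a call succeeded is known only at syscall exit, so the decoded entry has to be held back and printed together with the result when the call returns. That is also why concurrent calls show up in completion order, as in the nanosleep example that follows. The names struct pending, handle_entry, handle_exit and STATUS_* are hypothetical, the output format is simplified, and this is not strace's actual code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical status classes, loosely modeled on successful-only / failed-only. */
enum { STATUS_SUCCESSFUL = 1, STATUS_FAILED = 2 };

/* Decoded text of a syscall entry, kept per thread until the exit is seen. */
struct pending {
    char entry[128];
};

/* Syscall entry: remember the decoded text, print nothing yet. */
void handle_entry(struct pending *p, const char *decoded)
{
    snprintf(p->entry, sizeof(p->entry), "%s", decoded);
}

/*
 * Syscall exit: now the status is known. Returns 1 and writes the complete
 * line (entry and result together) to out if the status is wanted, or 0 if
 * the call is filtered out.
 */
int handle_exit(const struct pending *p, long rval, int is_error, int wanted,
                char *out, size_t size)
{
    int status = is_error ? STATUS_FAILED : STATUS_SUCCESSFUL;

    if (!(wanted & status))
        return 0;
    if (is_error) /* the kernel reports errors as -errno */
        snprintf(out, size, "%s = -1 (errno %ld)", p->entry, -rval);
    else
        snprintf(out, size, "%s = %ld", p->entry, rval);
    return 1;
}
```

Because nothing is printed at entry, there are no separate unfinished and resumed halves to stitch together afterwards.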
As a very useful side effect of this option, you get aggregation for free. For example, if you trace several processes that are running asynchronously, you will see a lot of this unfinished and resumed stuff, and sometimes it's not very convenient. We used to implement
special aggregators to collect this data, so it would look like this. But now you can use this option to aggregate as well. The only caveat, I would say, is that
it might change the order of invocations. In this example, it looks as if the nanosleep syscalls were invoked sequentially, which is definitely not the case. They were invoked
simultaneously, but because they were printed at the moment these system calls finished, it doesn't look the way you are used to. But when you're aggregating, it doesn't really matter in which order they are printed. There is also another option
with a funny story connected to it. When I tried to come up with something useful
as an example, I started invoking all the programs I had. And I found a few programs that did not report their arguments correctly
when they couldn't find them; I just invoked the programs with a non-existent file. I found two such programs and fixed them. But you get the idea of when this could be useful.
For example, when a program doesn't print what's going on, you can trace it and have a look. When you are filtering system calls and don't want to print all the rest, you probably want the system calls you are not printing to execute faster.
And now we have a very nice feature, which we planned for several years but couldn't get done until we had two GSoC (Google Summer of Code) students. In the year before last, a student
made a prototype. And last year, we had a student who is going to talk about this feature very soon, I hope, so he will describe how it works. But from the user's perspective, it looks like strace no longer slows everything down by two orders of magnitude
on the system calls that are not traced. This is a famous example, because it's a modification of the example BPF people use to show how slow strace is.
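A simplified sketch of the mechanism, assuming the usual seccomp plus ptrace combination rather than strace's actual filter generator: a classic BPF program returns SECCOMP_RET_TRACE for the syscalls being traced, which produces a ptrace stop when the tracer has set PTRACE_O_TRACESECCOMP, and SECCOMP_RET_ALLOW for everything else, which runs with no stop at all. The names build_trace_filter and install_trace_filter are hypothetical. The arch argument is the native AUDIT_ARCH_* value (for example AUDIT_ARCH_X86_64), and a syscall of a foreign architecture, such as a 32-bit call from 64-bit code, is conservatively always traced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/*
 * Build a seccomp BPF program: RET_TRACE for the listed syscall numbers,
 * RET_ALLOW for the rest. Returns the instruction count, or 0 if it won't fit.
 */
size_t build_trace_filter(uint32_t arch, const uint32_t *nrs, size_t n,
                          struct sock_filter *out, size_t cap)
{
    size_t len = 0;

    /* Jump offsets are 8-bit, so this simple layout handles at most 255 numbers. */
    if (n > 255 || cap < n + 6)
        return 0;

    /* Foreign architecture: skip the syscall number checks and trace it. */
    out[len++] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                               offsetof(struct seccomp_data, arch));
    out[len++] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, arch, 1, 0);
    out[len++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE);

    out[len++] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                               offsetof(struct seccomp_data, nr));
    for (size_t i = 0; i < n; i++)
        /* On a match, jump forward over the rest to the final RET_TRACE. */
        out[len++] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                                   nrs[i], (uint8_t) (n - i), 0);
    out[len++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    out[len++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE);
    return len;
}

/* Install the program in the calling process; children inherit it across fork. */
int install_trace_filter(struct sock_filter *filter, unsigned short len)
{
    struct sock_fprog prog = { .len = len, .filter = filter };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

In this setup the traced child would install the filter before calling execve. Because the filter is inherited by every process the child forks, the tracer has to follow forks, which matches the answer given in the Q&A at the end of the talk.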
And now we use the BPF stuff to show how fast strace is. You can see that seccomp-bpf itself slows things down by about 10%,
which is nothing compared to what all these ptrace stops do to the speed of running programs. You can see this is a long option. It was actually the first option for which we couldn't find a good short analog. We had quite a few options, not as many as the ls program has,
but quite a few, and some of them are not obvious. I think it was -n in our prototype, but we couldn't find an explanation of why it should be called -n. So we decided that it's time to introduce long options, and now we have started
adding aliases for not-so-obvious names. So --seccomp-bpf was the first one. Another option which should probably have a long-option analog is the option
named -k. I don't know why it's called -k. It prints a stack of user calls at the time of each system call invocation. It's a very useful thing, because you can see the logic behind the program; if you don't know what's going on, you can just apply this.
It will produce a lot of output, but it makes strace somewhat more of a debugger than a tracer. In this example, you can see why, for example, cat closes its stdout. From the names of these functions, you can see that it does some kind of
atexit handling, and it closes stdout to ensure that everything is written; otherwise, it should return a non-zero exit code. Another option: you can run strace detached from the traced processes
in different ways. Having strace as their parent is not always desirable, for example, if these processes interact with their parents and want to know their PID numbers. So you can run strace
this way and be more transparent. There is also a relatively new option that controls how all these symbolic constants are printed. You can print them as usual,
translating these numbers into symbolic names, or you can print both symbolic names and raw numbers, or just raw numbers. This has various useful implications. You can debug programs that you suspect of passing arguments to system calls in a wrong way, which is not very surprising,
because different architectures have different system call ABIs, with a different number and order of system call arguments. And this can also be used,
I think it's used in the syzkaller project. We also added support for all the new system calls added to the Linux kernel, and nowadays they have started adding system calls again. There are a bunch of new system calls that work with mount points. Well, I can't describe
them all; there are too many. You probably won't find them in the man pages, but we support them. We also have a lot of very sophisticated system call parsers.
And I'll show you an example which looks very monstrous, but you will get the idea of how sophisticated system call parsers can be. We support decoding of the netlink protocol. You can see this output. This is the output for a very simple routing table. And here you can see
what's going on behind it. You see, the netlink protocol is very structured. It has structures, substructures, sub-substructures, and everything is printed. The coloring is mine.
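To give a rough idea of the nesting such a decoder has to walk, here is a sketch that iterates over a buffer of netlink messages and, inside each RTM_NEWROUTE message, over its route attributes, using the standard NLMSG_* and RTA_* macros from <linux/netlink.h> and <linux/rtnetlink.h>. Where strace decodes and prints every structure and attribute, this sketch only counts the attributes; walk_route_dump is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/*
 * Walk a buffer of netlink messages, as received from a NETLINK_ROUTE socket,
 * and store the number of route attributes of each message in attr_counts.
 * Stops at NLMSG_DONE, at malformed data, or after max_msgs messages.
 * Returns the number of messages seen.
 */
int walk_route_dump(const void *buf, size_t len, int *attr_counts, int max_msgs)
{
    int msgs = 0;
    int remaining = (int) len;

    for (const struct nlmsghdr *nh = buf; NLMSG_OK(nh, remaining);
         nh = NLMSG_NEXT(nh, remaining)) {
        if (nh->nlmsg_type == NLMSG_DONE || msgs == max_msgs)
            break;
        attr_counts[msgs] = 0;
        if (nh->nlmsg_type == RTM_NEWROUTE) {
            /* Message payload: struct rtmsg followed by a list of attributes. */
            const struct rtmsg *rtm = NLMSG_DATA(nh);
            int attrlen = (int) RTM_PAYLOAD(nh);

            for (const struct rtattr *rta = RTM_RTA(rtm); RTA_OK(rta, attrlen);
                 rta = RTA_NEXT(rta, attrlen))
                attr_counts[msgs]++;
        }
        msgs++;
    }
    return msgs;
}
```

A full decoder would recurse at each level: nested attributes use the same RTA_OK and RTA_NEXT walk on their own payload, which is where the sub-substructures in the output come from.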
All the rest is made by strace. And last but not least, in December 2018 we changed the license. strace had been released since the very beginning under a BSD-style license, at the request of Paul Kranenburg. I don't know the man.
It was before my time. When we added support for the PTRACE_GET_SYSCALL_INFO API, it was kind of a crucial point. Most contributors to strace didn't want to contribute under a
permissive license any longer. So we decided to switch to a copyleft license. The test suite is released under GPLv2+, and all the rest under a license that allows us to release this
as a library someday in the future, if we manage to make a library out of strace. So this is more or less what I wanted to talk about. If you have some questions
or ideas or something related to strace to discuss, we have some time. Yes? Should I repeat the question? Please be concise. Should I pick, yeah, from the back
to the front? Yes. So the question was why this system call filtering is not the default,
and why you have to type this very long option to use this feature. First of all,
you can abbreviate long options, so you don't have to type that much. I think two letters are usually enough; if not, type three letters. Some shells allow expansion of
program arguments, so I don't think this is a problem. Well, there are two important points about this way of filtering. First, it generates and attaches a BPF program. I'm not going to dive into the details, but it creates this program and attaches it,
and you can't get rid of it unless you're privileged. This implies that you have to follow forks: you have to follow all processes that are forked by the process you are tracing. And this kind of changes behavior, and one of the important points of strace is that strace
is backwards compatible. So we can't enable following forks by default, because people are not used to this. And if you specify this option and do not specify follow forks, it tells you that it is enabling follow forks. So this is one point. Another point is that
unless you are privileged, that is, unless strace runs as a privileged program, you can't attach a BPF program to another process. You can attach to a process using PTRACE_SEIZE or PTRACE_ATTACH, but you can't attach a BPF program to another process. You can only attach a BPF program to
yourself. So one of the important features of strace, tracing an already existing process, wouldn't work with this unless you're privileged. But if you're privileged, you can use a lot of kernel tracing facilities nowadays, so it's not really a big deal, although they don't have such elaborate decoders. Yeah.
Yes, please, another question. You mentioned on the last slide that the coloring was your own. Would you consider adding color output to strace? So the question was: on this slide the coloring was my own, and would I consider coloring in strace?
This is kind of a difficult question, because we actually had a plan to generate structured output from strace. And if you generate, for example, JSON output, you can apply already existing software to do all this fancy stuff like coloring.
So we decided we would make structured output first, and then other people would do whatever coloring they like. But as you can see, there is no structured output yet, and I have to do all the coloring myself. Yes, please. Sorry? Me? I think I can.
So the question was whether I can pretty-print this. I think this is pretty enough. So what was your question? Whether I can print this in blocks so that it's easier to read.
Yes, that's getting closer and closer to our idea of structured output. So you can see
why we decided to go the simple way, but it was not so simple. Is it over? Yes. Okay, thank you.