Linux kernel debugging for sysadmins
Formal Metadata
Number of Parts | 95
License | CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/32303 (DOI)
FrOSCon 2017, 44 / 95
Transcript: English (auto-generated)
00:09
Yes, so once again, welcome to the next lecture here at FrOSCon 12. I'm pretty sure it's going to be great, because it's Minto Joseph talking about Linux kernel debugging for sysadmins,
00:24
which is quite an interesting topic from my point of view. Just one more thing before we start: if you liked the talk, or even if you disliked it, which I don't believe you will, please give us some feedback. Log on to the frab, the program
00:41
software on froscon.de, and give us some feedback, because it really helps us to organize this event. So, enough talk from me. Please give a warm hand for Minto Joseph. Hello.
01:05
So, how many sysadmins are here? Good. How many kernel hackers? I saw one. Okay, you are going to answer the Q&A.
01:21
So this talk is based on my experiments and on experience from my sysadmin job, as somebody who has been managing systems for quite a long time.
01:41
I hope this will help sysadmins who are currently on the verge of trying to do something in the kernel to do more. I currently work with a Canadian company called Pythian.
02:02
So I will start my talk. This is the agenda, what I'm trying to cover: some basic investigation methodologies, some common issues and some tools. Pretty simple.
02:20
And why should sysadmins do kernel debugging? Of course, to learn more about the systems we manage, to debug efficiently when we face an issue, and for root cause analysis. This is becoming more and more
02:41
important nowadays, with more focus on SRE and DevOps kinds of workflows that integrate post-mortem analysis into the workflow. So this is the thing I'm going to cover. You all must have seen some user complaining, or maybe something
03:04
in the shape of a Nagios alert, saying that the system is not responding. The whole talk is based on this. I have split the investigation into two parts:
03:22
before reboot and after reboot. Why reboot? Because many times you end up rebooting your system if you face a kernel issue. Not every time, but many times. Let's talk about that.
03:41
I'm just going to talk about stuff which we generally do. You basically identify whether the claim is actually right, whether it is a system issue or a service issue. A user might be complaining, "My app is not working, the system is not working"; that doesn't mean that the system is down. You do the basic
04:05
checking, you do the basic magic with your telnet and ping toolset, and see whether the issue is a system issue or a service issue. Then you will check
04:22
what you see on the screen. If you have any kind of vendor-provided console, you will check it. If you have a VM, you will check the VM console. KVM:
04:40
has anybody heard of KVM? Somebody with a gray beard probably knows it: it's the switch we use to switch between different monitors in data centers. You check the screen and see what the issue is, if possible. Network connectivity: definitely you will try to
05:00
see how the connectivity is: whether there is a network connectivity issue within the system, using ethtool or ifconfig and the logs, or whether there is any connectivity issue to the system. And if we check all of this stuff and we identify that it is a system issue,
05:24
what do we do next? There could be cases where the system is totally stuck. You are able to connect to the system, but it's not responding at all. You may see something on the screen, or maybe nothing on the screen,
05:43
but it is not responding to anything. You may or may not be able to connect over the network. What would you do? You can try SysRq. So why SysRq?
06:01
SysRq is a kernel mechanism which allows you to send keys, generally called magic keys, and it makes the system dump useful information. It lets the kernel do useful stuff
06:22
like syncing your file systems or even panicking your kernel. Why that is useful, I will come to later. But let's have a quick look at SysRq. So, SysRq. How about now?
06:52
Sorry, a bit more? Yeah, okay. Okay, so this is the SysRq kernel
07:05
sysctl parameter. You can enable it, and once you enable it, if you have access to a keyboard, you can type Alt+SysRq and the letter key.
07:20
Another way to send a SysRq request is echoing the value: for example, echo m to /proc/sysrq-trigger will dump the memory information to your syslog.
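The echo described above can be sketched in Python. This is only an illustration, not something shown in the talk: the helper name `send_sysrq` and the `trigger_path` parameter are made up here (the parameter exists so the function can be tried against an ordinary file), while `/proc/sysrq-trigger` and the command letters come from the kernel's SysRq documentation. Writing to the real trigger file requires root.

```python
from pathlib import Path

# A few SysRq command letters from the kernel's SysRq documentation.
SYSRQ_KEYS = {
    "m": "dump memory information",
    "t": "dump task (thread) states",
    "s": "sync mounted file systems",
    "c": "crash the system (captured by kdump if it is configured)",
}


def send_sysrq(key, trigger_path="/proc/sysrq-trigger"):
    """Write one SysRq command letter to the trigger file (needs root)."""
    if key not in SYSRQ_KEYS:
        raise ValueError(f"unsupported SysRq key: {key!r}")
    Path(trigger_path).write_text(key)
    return SYSRQ_KEYS[key]
```

Calling `send_sysrq("m")` as root would correspond to the `echo m` from the demo; the output then appears in the kernel log, not on standard output.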
07:43
Let's have a look. You can see the memory information is dumped here. What can we do with this, how do we make sense of it? I will talk about that later.
08:02
Don't be alarmed by the wall of text. We can also dump the thread state information of all the processes. Let's give it a couple of seconds. Okay, you can see
08:21
the process trace information. You can see the process name and the state of the process: whether the process is sleeping, or in uninterruptible sleep, which is D state, or runnable, which is R state. Then you can see the code path
08:42
which the process was executing. What to do with the code path? Generally, if I want to understand what is going on here, I'll just look at the kernel functions. These are the kernel functions. I will maybe go into the kernel tree, and
09:03
so I have already run ctags here. A lot of people... Oh yeah, it's a different window. How about this, is it good? So I have already run ctags here. It basically creates tag information
09:22
so that a text editor like Vim can look things up. For example, I'm just looking for vfs_write, so I can go right into the code. If I want to see other instances of
09:41
the code, I can easily look them up. A lot of people use cscope; I don't use it because I'm not a kernel developer, so I don't have to search too much code. Okay, so that is SysRq.
10:01
How this can be more useful would be finding the arguments which are passed by one function to another, that will be interesting, or finding the arguments for a function. We'll come to that later. Let's continue with what we do after the reboot.
10:23
We all check the syslog, of course: your /var/log/messages, your /var/log/kern.log. Then you check sysstat. I hope everybody has sysstat installed on their systems. Even if you have your fancy metrics system, definitely try to install sysstat; include that in your
10:46
AMIs, your images. I'm pretty sure most people know it, but I'll just show it. sysstat includes the sar command, with which we can have a look at the system load information.
11:05
Sorry. Our system load information is here, and you can see whether the load was in user space or system space, I/O wait, all this stuff. These are things which we generally do.
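The user, system and I/O wait split that sar reports is derived from the CPU counters in `/proc/stat`. As a rough sketch of that arithmetic (the function name `cpu_percentages` is invented for this example, and sar itself does more than this), two samples of the aggregate `cpu` line can be compared like so:

```python
def cpu_percentages(before, after):
    """Compare two aggregate 'cpu' lines from /proc/stat and return
    each counter's share of the elapsed CPU time, in percent."""
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    first = [int(x) for x in before.split()[1:9]]
    second = [int(x) for x in after.split()[1:9]]
    delta = [b - a for a, b in zip(first, second)]
    total = sum(delta)
    if total <= 0:
        raise ValueError("the two samples must be taken at different times")
    return {name: 100.0 * d / total for name, d in zip(names, delta)}
```

In practice the two lines would be read from `/proc/stat` a few seconds apart; a high `iowait` share points at storage, a high `system` share at time spent in the kernel.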
11:22
You can check memory stuff like this. If anybody has more questions, you can ask after the talk or when we have the Q&A. I'm pretty sure that many of you have interfaced with some OS vendors and have provided a vmcore at one time or another
11:45
as a sysadmin. A vmcore is a memory dump of your Linux system. We'll talk about vmcore later. Now let's just see a few things like panics and
12:03
kernel-related issues which you must have seen in your work life. So this is a normal panic. It's a very old panic; I chose it for a reason, because it's pretty straightforward and easy to explain. So this is a
12:20
"kernel BUG at", and you can see the file name in the source code and the line number. These panics generally happen when there is a condition called BUG_ON in the kernel source, and if your code ends up there, this panic hits.
12:46
I'm just showing this to get you a bit more familiar with the structure of a panic, so that the next time you look at one you can make more sense of it. So this was the CPU
13:03
which was executing at that time, and these are the modules loaded in the system at the time of the issue. You can see whether a particular module has any kind of flags, like proprietary or force-loaded.
13:22
You can see the PID of the process which was running at the time. But a process cannot panic a kernel, period. A process's behavior can trigger a panic, but a process from user space cannot panic a kernel. If that happens, there is a problem,
13:45
and the problem is in the kernel. Then, yeah, you can see the kernel version, and you can see whether or not the kernel was tainted by any proprietary module or the like. And this is the most important thing if you are starting with this, if you just want to google it:
14:05
don't google these, just google this. This is the instruction pointer: this was the function which was being executed when the panic happened. On 64-bit you will find RIP; on 32-bit you will see EIP.
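Picking the function name out of the instruction pointer line can be automated. The sketch below is an illustration only: `faulting_function` is a hypothetical helper, and it assumes the common `symbol+0xoffset/0xsize` notation on the RIP or EIP line. The exact layout of that line differs between kernel versions, so the pattern may need adjusting.

```python
import re

# symbol+0xoffset/0xsize, for example sysrq_handle_crash+0x16/0x20
SYMBOL = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def faulting_function(oops_text):
    """Return the function named on the first RIP/EIP line, or None."""
    for line in oops_text.splitlines():
        if "RIP" in line or "EIP" in line:
            match = SYMBOL.search(line)
            if match:
                return match.group(1)
    return None
```

The returned name, together with the "kernel BUG at" file and line, is usually the most productive search term.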
14:22
I will skip the whole register part here and go to the call trace. This is the call trace, like the call trace which you saw in the SysRq thread state output. You can see that there's a system call function; most, or probably all, of the system call related
14:45
kernel functions are named like this, sys_times: sys_ followed by the system call name. So in this case two processes were trying to do something and there was a race condition which caused the issue.
15:02
We'll not get into how to debug a kernel panic. That's a very big topic, and I'm not a hundred percent qualified to do that, but I can give you pointers on how to do it. I have some specific knowledge about some specific subsystems, but definitely, if you have any questions,
15:22
I can try to point you to people or to resources. So this is a soft lockup. But first, why would a kernel panic happen? A kernel panic happens when the kernel thinks that, at this point in time, it
15:42
cannot properly recover the system. The kernel might think, "If I continue, this might cause data loss." That is when the kernel panics. That is the standard
16:03
definition. There can be a lot of different logic below that, but this is the baseline. So next is the soft lockup. A soft lockup doesn't always make the system unusable. A soft lockup usually happens when the kernel tries to evict a process from the CPU
16:24
but is not able to, so the process has been running continuously for 10 seconds. Here you can see the instruction pointer is EIP, because it's a 32-bit system. The pattern is the same: you can see the CPU, you can see
16:45
the register information, you can see the call trace. One more thing you can see here is that some of this code is coming from one specific module. That is also useful information.
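To see at a glance which modules appear in a call trace, the bracketed module name at the end of each frame can be counted. This is a sketch under the assumption that frames look like `symbol+0xoff/0xsize [module]`; the helper name `modules_in_trace` and the module name `examplemod` in the usage note are invented for illustration.

```python
import re
from collections import Counter

# A frame such as: [<f8a3c1d0>] example_poll+0x50/0x200 [examplemod]
FRAME = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")


def modules_in_trace(trace_text):
    """Count call-trace frames per module; core-kernel frames are skipped."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = FRAME.search(line)
        if match and match.group(2):
            counts[match.group(2)] += 1
    return counts
```

If most frames of a soft lockup trace belong to one module, say `examplemod`, that module's driver or its vendor is a sensible first place to look.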
17:06
And I'll show you an example of a hung task. A hung task happens when a process is in D state for more than 120 seconds. What is D state?
17:23
D state is uninterruptible sleep. That basically happens when a process is waiting on I/O, in most cases. There can be other cases as well, but mostly it's I/O.
17:40
So here the process has been waiting in D state for more than 120 seconds. In some cases this can be expected behavior: your process might be supposed to stay in D state for a long time. There can be corner cases; in that case you can just disable the hung task check.
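Before a hung task message ever shows up, processes stuck in D state can be listed from `/proc`. The following is a minimal sketch, not a tool from the talk: `tasks_in_state` and its `proc_root` parameter are hypothetical names, and the parsing relies on the `/proc/<pid>/stat` layout described in proc(5), where the state letter follows the parenthesised command name.

```python
import os


def tasks_in_state(state="D", proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in the given state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                stat = stat_file.read()
        except OSError:
            continue  # the process exited while we were scanning
        # comm sits in parentheses and may itself contain spaces or ')'
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        fields = stat[stat.rindex(")") + 1:].split()
        if fields and fields[0] == state:
            found.append((int(entry), comm))
    return sorted(found)
```

The 120 second threshold mentioned above corresponds to the `kernel.hung_task_timeout_secs` sysctl; setting it to 0 is the usual way to disable the check.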
18:04
A hung task doesn't necessarily make the system unusable, but in many cases it can. Here also you can see the process, the process state and the call trace. You can try to
18:21
read through the kernel code and functions one by one and see how the code flows. Another one is out of memory. I'm pretty sure that all sysadmins have seen an out-of-memory error. Here I have chosen this specific
18:44
message because it is from a very old kernel, and I don't have to explain much. Nowadays I would have to explain NUMA, all the new cgroup-related changes, all that stuff. But you can see
19:03
it is dumping the same information which is dumped by the SysRq memory dump. Of course it will have different values, but the pattern is the same.
19:22
You can see the number of active pages and the number of inactive pages. Active and inactive pages basically mean this: the kernel uses lists called LRU, least recently used, to identify whether a particular page is currently used or not.
19:46
If the kernel thinks a page is active, it will be on the active list, and if the kernel wants to free a particular page, it will first move it to the inactive list before it frees it. So you have the active and inactive pages,
20:02
and you have the dirty pages. Dirty pages are pages in memory which hold changed information that is not yet written back. Writeback is, if I remember correctly, the pages which are in flight, which are currently being written back.
20:21
This "unstable" is NFS-specific stuff: NFS has these unstable pages, which still need to be written back. Then of course the free memory, and slab. Slab is basically a kind of cache of objects which are
20:41
predefined in the kernel so that the allocations will be contiguous. It usually holds the dentry-level information, and it holds all the kmalloc objects. You can check the
21:02
slabtop command on your system to see slab information. Mapped is the mapped pages: basically any pages which have a file backing them. And then the page table entries: they hold the information on where the pages are
21:23
in memory. Then there is a list which is trying to show you where all the pages are allocated. Then you will see the different zones: you see DMA, Normal and HighMem. This is a 32-bit system.
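On a running system, most of the counters described above (free, active, inactive, dirty, writeback, slab, mapped, page tables) can be read from `/proc/meminfo` without waiting for an OOM message. A small sketch follows; `meminfo_summary` and the `FIELDS` selection are choices made for this example, and the values are reported in kB rather than in pages as in the OOM dump.

```python
FIELDS = ("MemFree", "Active", "Inactive", "Dirty",
          "Writeback", "Slab", "Mapped", "PageTables")


def meminfo_summary(text, fields=FIELDS):
    """Pick selected counters (in kB) out of /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in fields and rest.split():
            values[name] = int(rest.split()[0])
    return values
```

Typical use would be `meminfo_summary(open("/proc/meminfo").read())`, sampled periodically so the trend before an incident is available afterwards.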
21:45
That's why you see all the DMA, Normal and HighMem stuff. So, one thing: why did this system run out of memory? It does have a lot of memory. It does have plenty of
22:00
free pages, it has a considerable amount of memory. Any idea why this system would face out of memory? When does a system face out of memory? Yeah? Sorry?
22:20
Yeah, yeah, it is, normally when the system runs out of memory, right. So, yeah, and also... Yeah, so the question is... From me? Okay, so you want me to repeat my question or his question?
22:43
The answer? Okay. Okay, so let's continue. I will tell you why this system specifically ran out of memory: because of the zones.
23:00
What are these zones? Traditionally we have the DMA zone, Normal and HighMem. It's a bit of a complicated topic, but a long time back some DMA devices could only access up to, I think,
23:24
16, yeah, 16 MB of RAM. So for those devices the DMA zone was introduced. The kernel mostly works in the Normal zone, and HighMem is used
23:42
for mapping the rest of the memory, because a 32-bit system can theoretically only address 4 GB of RAM. It's a bit more complicated than that, but basically we have different zones like
24:01
DMA, Normal and HighMem, and each zone has a few values: free, min, low and high. Now the thing is, when free goes below min, that is when an out of memory happens; in the end, that is the specific reason any out of memory happens.
24:25
When free goes below low, the kernel will very actively try to reclaim memory using pdflush, or whatever kernel daemons the current kernel is running. It will try to reclaim memory
24:42
until free reaches high. So in this case, even though HighMem had enough free memory, the Normal zone didn't have enough free memory. This was a historical problem with 32-bit systems.
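The per-zone free, min, low and high values are also exposed at runtime in `/proc/zoneinfo`, counted in pages. The sketch below parses that text and applies the simplified rule from the talk; the function names `zone_watermarks` and `pressure` are invented here, and the real reclaim and OOM logic in the kernel is considerably more involved than this three-way classification.

```python
import re


def zone_watermarks(zoneinfo_text):
    """Parse /proc/zoneinfo-style text into {zone: {free, min, low, high}}."""
    zones, current = {}, None
    for line in zoneinfo_text.splitlines():
        header = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)", line)
        if header:
            key = f"node{header.group(1)}/{header.group(2)}"
            current = zones.setdefault(key, {})
            continue
        if current is None:
            continue
        field = re.match(r"\s*(?:pages\s+)?(free|min|low|high)\s+(\d+)\s*$", line)
        if field and field.group(1) not in current:
            current[field.group(1)] = int(field.group(2))
    return zones


def pressure(watermarks):
    """Classify one zone using the simplified rule described in the talk."""
    if watermarks["free"] < watermarks["min"]:
        return "below min"
    if watermarks["free"] < watermarks["low"]:
        return "below low"
    return "ok"
```

Run over a 32-bit system like the one in the example, this would show a Normal zone "below min" next to a HighMem zone that is "ok", which is exactly the situation that looks like an OOM with plenty of free memory.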
25:04
Nowadays we don't have to worry about it. I just used the example from a 32-bit system so that I can explain that there is something called zones in our memory, in virtual memory. Then we have the buddy allocator. The buddy allocator output basically shows you,
25:24
in each zone, how much contiguous memory is available. If most of the free memory is in 4 KB chunks, the chance of the system memory being in a highly fragmented state is high. So when
25:45
more of the memory is here, in the larger chunks, that means the memory is less fragmented. That is one more thing. Some applications need contiguous memory.
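The same per-order view is available at runtime in `/proc/buddyinfo`, where each column is the number of free blocks of 2^order pages. As a rough fragmentation indicator one can compute how much of a zone's free memory sits in the smallest blocks. This is only a sketch: `buddy_summary` is a made-up helper and the 4 KB page size is an assumption that holds on common x86 systems.

```python
def buddy_summary(line, page_kb=4):
    """Summarise one /proc/buddyinfo line.

    Returns the zone name, the free KiB held at each order, and the
    fraction of free memory that sits in order-0 (single page) blocks.
    """
    parts = line.split()
    zone = parts[3]                      # "Node", "0,", "zone", <name>, counts...
    counts = [int(x) for x in parts[4:]]
    free_kb = [n * (2 ** order) * page_kb for order, n in enumerate(counts)]
    total = sum(free_kb)
    share_order0 = free_kb[0] / total if total else 0.0
    return zone, free_kb, share_order0
```

A share close to 1.0 means nearly all free memory is in single pages, so large contiguous allocations are likely to fail even though free memory looks sufficient.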
26:01
Very large contiguous memory: when they try to allocate a contiguous piece of memory, they can hit page allocation failures and stuff like that. And, yeah, you have the swap information, and then you have the process which was killed. For choosing a process to kill, normally a function called badness is used. It uses different
26:29
logic, a set of rules, to assign points to the different processes in the system. For example, if a process is
26:41
niced, it will be given less priority to kill; if the process is run by root, the chance of it getting killed is lower. So the badness value is based on that. And in the current kernel we have the
27:00
oom_score and oom_adj parameters, which can tune this behavior for each PID. For example, if you don't want your MySQL to be killed, you can pass a value to the oom_score,
27:20
the /proc/<PID>/oom_score, and you can set it to zero, I guess, or minus 17 or something. Then it will make sure that whenever the OOM kill happens, your favorite process will not get killed.
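As a sketch of that tuning: per proc(5), the writable knobs are `/proc/<pid>/oom_score_adj` (range -1000 to 1000, where -1000 exempts the process) and, on old kernels, the legacy `/proc/<pid>/oom_adj` (where -17 disables OOM killing for the process); `oom_score` itself is the read-only result. The helper name `protect_from_oom` and its `proc_root` parameter are hypothetical, and lowering the value requires root or CAP_SYS_RESOURCE.

```python
import os


def protect_from_oom(pid, proc_root="/proc"):
    """Ask the OOM killer to skip one process; returns (path, value written)."""
    new_path = os.path.join(proc_root, str(pid), "oom_score_adj")
    old_path = os.path.join(proc_root, str(pid), "oom_adj")
    if os.path.exists(new_path):
        path, value = new_path, "-1000"  # current interface
    else:
        path, value = old_path, "-17"    # legacy interface on old kernels
    with open(path, "w") as knob:
        knob.write(value)
    return path, value
```

Exempting a large database process this way only moves the problem: if memory really runs out, the OOM killer will pick something else, so it is worth deciding in advance which processes are acceptable victims.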
27:42
Okay, then we have normal hangs. "Hang" is a very bad term, actually, a very abstract term, but here I have just listed hardware issues. There can be machine check exceptions, or error detection and correction, EDAC, stuff.
28:05
If you find something like machine check events, you can have a look at /var/log/mcelog and probably contact the vendor if it is a bare-metal system. If it is a hypervisor,
28:20
if it's a VM, check the hypervisor and talk with the hypervisor vendor. It doesn't happen many times; if it happens in a virtual machine, it's probably a bug. Then there can be CPU, memory or I/O utilization which can cause a hang. These,
28:42
OOM or hung task, are also kind of resource allocation issues. But you can also use tools like sar to identify issues which are caused by high utilization of resources.
29:00
Let's get into vmcore a bit. For dumping a vmcore we need the crashkernel parameter in GRUB. Traditionally, a long time back, only netdump and diskdump were available. netdump was dumping the memory
29:22
over the network and diskdump was dumping to the local disk. Currently kdump is capable of dumping anywhere. kdump is configured in /etc/kdump.conf. kdump can only dump when there is a panic: kdump dumps
29:47
the memory of the system when there is a panic. So if you specifically want to debug one of the previous issues in detail, if you don't understand it from the
30:01
screenshots, you can intentionally panic the kernel. I'm not saying you should set this parameter all the time. I'm just suggesting that if you have a recurring issue, an ongoing issue that you want to investigate more deeply, you can set this parameter so that a soft lockup, an OOM or a hung task will
30:24
panic the system, so that kdump will dump a vmcore. You can also do Alt+SysRq+C. Previously I dumped the system thread information; if I do a "c" instead, this will panic the system. I'm not going to do that now.
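The parameter shown on the slide is not captured in the transcript. The standard kernel sysctls for this purpose are `kernel.softlockup_panic`, `kernel.hung_task_panic` and `vm.panic_on_oom`; treating those as the ones meant here is an assumption. The read-only sketch below (with the invented helper `panic_settings`) reports how a system is currently configured, without changing anything.

```python
import os

# Assumed to be the knobs meant in the talk; all three are standard sysctls.
PANIC_SYSCTLS = (
    "kernel.softlockup_panic",
    "kernel.hung_task_panic",
    "vm.panic_on_oom",
)


def panic_settings(sys_root="/proc/sys"):
    """Return the current value of each panic-related sysctl, or None."""
    result = {}
    for name in PANIC_SYSCTLS:
        path = os.path.join(sys_root, *name.split("."))
        try:
            with open(path) as knob:
                result[name] = knob.read().strip()
        except OSError:
            result[name] = None  # not available on this kernel
    return result
```

A value of "1" means the corresponding event panics the system, which is only useful for debugging when kdump is set up to capture the resulting vmcore.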
30:46
And that will dump a vmcore. By default it will dump in /var/crash. So... Okay, so let's have a look. What's the time?
31:04
Okay, so let's have a quick look at a vmcore. I have a vmcore here, which I have dumped from the system, and
31:24
for analyzing a vmcore I need a command-line tool called crash. crash is basically a wrapper around the GDB tool, which you probably know. And this vmlinux I have extracted from the
31:42
kernel-debuginfo package, which does have the debug symbols, unlike the vmlinuz file in your /boot. So you can see the GDB information. It will give you the basic information on
32:02
the kernel and when it crashed (this was a long time back) and the uptime of the system at the time of the panic. You can see that the panic was caused by a SysRq dump. You can see the
32:21
kernel release information and the hostname. You can see the processes which were running on the system at the time of the issue. You can see the PID information, you can see the task, you can see the state of the different processes. RU is runnable processes,
32:44
IN is sleeping. If there is a UN, that is uninterruptible sleep. If I want to see the trace of one particular process, I can just do bt and the PID.
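When the process list from crash is long, it can help to save the `ps` output to a file and tally the state column. The sketch below assumes the usual column order of that output (PID, PPID, CPU, TASK, ST, ...), with a leading `>` marking tasks that were active on a CPU; the helper name `count_states` is invented, and the layout should be checked against the crash version in use.

```python
from collections import Counter


def count_states(ps_output):
    """Count tasks per state (RU, IN, UN, ...) in saved crash 'ps' output."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.replace(">", " ").split()
        # Data rows start with a numeric PID; the header row does not.
        if len(fields) >= 5 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts
```

A large number of UN tasks in a vmcore taken during a hang usually points at storage or at a lock that everything is waiting on, which is a good cue for running `bt` on a few of them.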
33:01
Another interesting thing would be the memory state at the time of the issue. You can see what was happening in the system when it was hung or in a panic state.
33:21
So this was the process which actually panicked the system; this was the context. You can see that it is bash, because I ran the command from bash: I did echo c to /proc/sysrq-trigger from bash.
33:40
You can see that there's a vfs_write in the proc file system. Now, what is VFS? VFS is an interface between the different file systems and the kernel. So here I was trying to demonstrate how to find an argument
34:07
which is passed. You can see the stack information like this. Sorry. So this will dump the complete stack of this call trace. You can see that
34:38
after sys_write
34:41
it's calling vfs_write, and there is stack information here. Some of this doesn't make sense, these look invalid. So let's see what vfs_write takes. This is the code; I can just go to the source as well. You can see that it has a
35:04
struct file as its first argument. So I'm going to use struct file and pass it that memory address.
35:22
Then you can see whatever there is in the structure. Here you can see the UID, the PID; it's probably wrong, all this information. What's interesting here is the dentry. The dentry is the
35:46
storage place of your directory structure, for instance. That's just one job of the dentry. So let's check struct dentry and see what is in there.
36:03
Typo. Oh, you can see the name. I was just looking for this. You can see that the dentry does have the information of
36:22
We obviously know that I ran the sysrq-trigger, so it made more sense to dissect this particular thread, this particular process. You can see that this was the name here, and in d_iname we can also see the name of the
36:42
file that has been accessed. Another thing you can do is look at the task details. For example, here is some yum stuff. There is a command called task; you can see the task-related information,
37:02
the timestamp. You can do a lot with this, and all the task-related information is here as well. Sorry.
37:22
Okay, yeah, this is basically what I was planning to cover. I think I was super fast. Any questions? Oh yeah, let's see: we have the help, and we have runq.
37:56
So runq shows you which processes were running on each CPU
38:04
at that time. Well, not which processes were running, but which processes were in the runnable state, and the current process as well. And then what else do we have? What else am I familiar with?
38:22
Yeah, I think the VM information. And of course the mount: the VFS mount information is also there, if you want to see that. Sorry.
38:44
So I probably have to get it from a stack, not from here. I have to check; I think it is vfs_mount. I don't remember exactly, but
39:09
yeah, we have this mount, so you can parse through the mount information as well. Stuff like that.
39:21
Or ipcs is there, for your shared memory information. Any other questions? Yeah, can you please repeat this? So they come from
39:55
Yeah, yeah, it's
40:06
I don't understand. Okay, probably you got your answer. Okay, any other questions?
40:28
If yeah
40:54
Yeah, so you are talking about the case where you have a UI. One thing I forgot to mention: if you want to run any of these SysRq
41:04
commands, you probably need to switch to one of the terminals, with Ctrl+Alt+F1 for example. So one thing you can probably do... okay, so the question is: you have
41:23
a UI, and if you face a hang, how do you recover? If you are having this issue, and it is an ongoing issue, and you suspect that it is a large number of D-state processes or a lot of I/O, and you want to
41:42
find out what is happening, or specifically if it is a load average issue, there is a tool called hangwatch. What hangwatch does: you just install it and it will monitor the load average. Load average is a very tricky subject. A high load average does not necessarily mean there is a problem. Load average is a
42:07
calculation based on the runnable processes and the processes in uninterruptible sleep, so basically R-state and D-state processes. So if your load average is, for example, above 10, hangwatch will detect that and
42:21
it will automatically run a SysRq, whichever one you want; you can configure that in hangwatch. That is one possibility. Another thing is that the kernel itself has other methods to deal with this. You have probably seen the NMI watchdog.
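The load average idea can be sketched in Python as follows. This is not how hangwatch is actually implemented; it only illustrates the principle under some assumptions: the function names are made up, the threshold of 10 is taken from the example in the talk, and the script deliberately does a dry run instead of writing to /proc/sysrq-trigger.

```python
import os

def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def count_r_and_d(proc_root="/proc"):
    """Count processes in R or D state by reading /proc/<pid>/stat.

    Only the main thread of each process is looked at, so this is an
    approximation of what the kernel counts for the load average.
    """
    counts = {"R": 0, "D": 0}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The state letter follows the closing parenthesis of the command name.
        after_comm = stat.rsplit(")", 1)[-1].split()
        if after_comm and after_comm[0] in counts:
            counts[after_comm[0]] += 1
    return counts

def should_trigger(load_1min, threshold=10.0):
    """Decide whether the load is high enough to collect SysRq data."""
    return load_1min > threshold

if __name__ == "__main__":
    try:
        with open("/proc/loadavg") as f:
            load_1min, _, _ = parse_loadavg(f.read())
        print("load:", load_1min, "states:", count_r_and_d())
        if should_trigger(load_1min):
            # Dry run: a real watcher would write a SysRq key here as root.
            print("would write to /proc/sysrq-trigger now")
    except OSError as err:
        print("this sketch needs a Linux /proc:", err)
```

Keeping the decision in a separate function makes it easy to test the threshold logic without touching /proc at all.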
42:43
If there is a problem that affects the interrupts (NMI stands for non-maskable interrupt), and the NMI interrupts are not incrementing over time, the kernel itself will dump an NMI watchdog error. And if you have configured that the NMI watchdog should panic the kernel,
43:02
it will go ahead and panic the kernel. So the kernel also has its own mechanism. But specifically, if your problem is with the UI, I think you could probably try to connect through SSH. Can you do that and run your commands? Yeah, you are saying you have a
43:26
GUI system, which you use mostly for the UI, like GNOME or something.
44:04
We visualize the first So
44:25
Are you able to refresh the GNOME terminal? I think you can just press Alt+F2 and type r, which will refresh the UI if it is GNOME. Do you know? Okay, nothing is moving.
45:15
When you can, after you
45:25
recover the system, you can check the logs and see what was happening. I mean, you can find something from the logs, and based on that you can continue the investigation as well. There is no one way to do things; you just need to improvise based on what you have.
45:47
Anything else? Thanks for answering the questions. Anything else? Good.