Demonstrations of exitsnoop.

This Linux tool traces all process terminations and their reasons. It
- is implemented using BPF, which requires CAP_SYS_ADMIN and
  should therefore be invoked with sudo
- traces the sched_process_exit tracepoint in kernel/exit.c
- includes processes by root and all users
- includes processes in containers
- includes processes that become zombies

The following example shows the termination of the 'sleep' and 'bash' commands
when run in a loop that is interrupted with Ctrl-C from the terminal:

# ./exitsnoop.py > exitlog &
[1] 18997
# for((i=65;i<100;i+=5)); do bash -c "sleep 1.$i;exit $i"; done
^C
# fg
./exitsnoop.py > exitlog
^C
# cat exitlog
PCOMM            PID    PPID   TID    AGE(s)  EXIT_CODE
sleep            19004  19003  19004  1.65    0
bash             19003  17656  19003  1.65    code 65
sleep            19007  19006  19007  1.70    0
bash             19006  17656  19006  1.70    code 70
sleep            19010  19009  19010  1.75    0
bash             19009  17656  19009  1.75    code 75
sleep            19014  19013  19014  0.23    signal 2 (INT)
bash             19013  17656  19013  0.23    signal 2 (INT)

#

The output shows the process/command name (PCOMM), the PID,
the process that will be notified (PPID), the thread (TID), the AGE
of the process with hundredths-of-a-second resolution, and the reason for
the process exit (EXIT_CODE).
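
As an illustration of the column layout described above, here is a minimal
Python sketch that splits one output line into its six fields. The names
ExitEvent and parse_line are hypothetical and are not part of exitsnoop. The
sketch assumes the default output (no -t or --label columns) and a command
name without spaces.

```python
from typing import NamedTuple


class ExitEvent(NamedTuple):
    """One line of default exitsnoop output (hypothetical helper type)."""
    pcomm: str
    pid: int
    ppid: int
    tid: int
    age_s: float
    exit_code: str


def parse_line(line: str) -> ExitEvent:
    # Split into at most six fields so that a multi-word reason such as
    # "signal 2 (INT)" stays together in the last field.
    # Assumption: PCOMM itself contains no whitespace.
    pcomm, pid, ppid, tid, age, reason = line.split(None, 5)
    return ExitEvent(pcomm, int(pid), int(ppid), int(tid),
                     float(age), reason.strip())


event = parse_line("sleep 19014 19013 19014 0.23 signal 2 (INT)")
print(event.pcomm, event.pid, event.age_s, event.exit_code)
```

Limiting the split to five separators is what keeps the EXIT_CODE text intact,
since that column is the only one expected to contain spaces.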

A -t option can be used to include a timestamp column; it shows local time
by default. The --utc option shows the time in UTC. The --label
option adds a column indicating the tool that generated the output,
'exit' by default. If other tools follow this format, their outputs
can be merged into a single trace with a simple lexical sort,
increasing in time order, with each line labeled to indicate the event,
e.g. 'exec', 'open', 'exit', etc. Time is displayed with millisecond
resolution. The -x option will show only non-zero exits and fatal
signals, which excludes processes that exit with code 0:

# ./exitsnoop.py -t --utc -x --label= > exitlog &
[1] 18289
# for((i=65;i<100;i+=5)); do bash -c "sleep 1.$i;exit $i"; done
^C
# fg
./exitsnoop.py -t --utc -x --label= > exitlog
^C
# cat exitlog
TIME-UTC     LABEL PCOMM            PID    PPID   TID    AGE(s)  EXIT_CODE
13:20:22.997 exit  bash             18300  17656  18300  1.65    code 65
13:20:24.701 exit  bash             18303  17656  18303  1.70    code 70
13:20:26.456 exit  bash             18306  17656  18306  1.75    code 75
13:20:28.260 exit  bash             18310  17656  18310  1.80    code 80
13:20:30.113 exit  bash             18313  17656  18313  1.85    code 85
13:20:31.495 exit  sleep            18318  18317  18318  1.38    signal 2 (INT)
13:20:31.495 exit  bash             18317  17656  18317  1.38    signal 2 (INT)
#
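
The merge by lexical sort mentioned above can be sketched in Python as
follows. This is only an illustration: merge_traces is a hypothetical helper,
and the 'exec' lines are invented sample data standing in for a second tool
that follows the same TIME/LABEL format.

```python
def merge_traces(*traces):
    """Merge labeled trace lines from several tools with a plain lexical sort."""
    lines = [line for trace in traces for line in trace]
    # Because every line starts with a fixed-width HH:MM:SS.mmm timestamp,
    # sorting the strings also sorts the events by time of day.
    return sorted(lines)


exit_lines = [
    "13:20:22.997 exit  bash  18300  17656  18300  1.65  code 65",
    "13:20:24.701 exit  bash  18303  17656  18303  1.70  code 70",
]
# Hypothetical output of an exec-tracing tool using the same line format.
exec_lines = [
    "13:20:21.340 exec  bash  18300  17656",
    "13:20:23.001 exec  bash  18303  17656",
]

merged = merge_traces(exit_lines, exec_lines)
for line in merged:
    print(line)
```

Since the timestamp carries no date, a sort like this only orders events
correctly within a single day.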

USAGE message:

# ./exitsnoop.py -h
usage: exitsnoop.py [-h] [-t] [--utc] [-p PID] [--label LABEL] [-x] [--per-thread]

Trace all process termination (exit, fatal signal)

optional arguments:
  -h, --help         show this help message and exit
  -t, --timestamp    include timestamp (local time default)
  --utc              include timestamp in UTC (-t implied)
  -p PID, --pid PID  trace this PID only
  --label LABEL      label each line
  -x, --failed       trace only fails, exclude exit(0)
  --per-thread       trace per thread termination

examples:
    exitsnoop                # trace all process termination
    exitsnoop -x             # trace only fails, exclude exit(0)
    exitsnoop -t             # include timestamps (local time)
    exitsnoop --utc          # include timestamps (UTC)
    exitsnoop -p 181         # only trace PID 181
    exitsnoop --label=exit   # label each output line with 'exit'
    exitsnoop --per-thread   # trace per thread termination

Exit status:

    0 EX_OK        Success
    2              argparse error
   70 EX_SOFTWARE  syntax error detected by compiler, or
                   verifier error from kernel
   77 EX_NOPERM    Need sudo (CAP_SYS_ADMIN) for BPF() system call
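
A wrapper script could translate these status values back into messages. The
sketch below is a hypothetical helper, not part of exitsnoop; it relies on the
EX_* constants that Python's os module exposes on Unix, and the message
strings are simply copied from the table above.

```python
import os

# Status values from the table above. The os.EX_* constants are only
# available on Unix-like platforms.
EXIT_STATUS = {
    os.EX_OK: "success",
    2: "argparse error",
    os.EX_SOFTWARE: "syntax error detected by compiler, or verifier error from kernel",
    os.EX_NOPERM: "need sudo (CAP_SYS_ADMIN)",
}


def describe_status(returncode):
    """Return a short description for an exitsnoop exit status (hypothetical)."""
    return EXIT_STATUS.get(returncode, "unknown status %d" % returncode)


for status in (0, 2, 70, 77, 1):
    print(status, describe_status(status))
```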

About process termination in Linux
----------------------------------

A program/process on Linux terminates normally
- by explicitly invoking the exit(int) system call
- in C/C++ by returning an int from main(),
  ...which is then used as the value for exit()
- by reaching the end of main() without a return
  ...which is equivalent to return 0 (C99 and C++)
Notes:
- Linux keeps only the least significant eight bits of the exit value
- an exit value of 0 means success
- an exit value of 1-255 means an error
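
The eight-bit truncation noted above can be observed from Python. The sketch
below assumes a Linux host and uses a hypothetical helper, child_exit_value,
that starts a child Python interpreter which exits with the requested value
and reports what the parent sees.

```python
import subprocess
import sys


def child_exit_value(value):
    """Run a child that calls exit(value) and return the status the parent sees."""
    proc = subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit(%d)" % value]
    )
    return proc.returncode


# 65 fits in eight bits and is reported unchanged.
print(child_exit_value(65))
# 256 + 65 does not fit; on Linux only the low eight bits survive.
print(child_exit_value(256 + 65))
```

On Linux both calls are expected to report 65, which is why exit values above
255 cannot be told apart from their low eight bits.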

A process terminates abnormally if it
- receives a signal which is not ignored or blocked and has no handler
  ... for most signals the default action is to terminate, optionally
  with a core dump
- is selected by the kernel's "Out of Memory Killer",
  equivalent to being sent SIGKILL (9), which cannot be ignored or blocked
Notes:
- any signal can be sent asynchronously via the kill() system call
- synchronous signals are the result of the CPU detecting
  a fault or trap during execution of the program. A kernel handler
  is dispatched which determines the cause and the corresponding
  signal. Examples are
  - attempting to fetch data or instructions at invalid or
    privileged addresses
  - attempting to divide by zero, unmasked floating point exceptions
  - hitting a breakpoint
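
An asynchronous fatal signal can be reproduced with a short Python sketch.
It assumes a POSIX system where the child has no SIGTERM handler and does not
ignore the signal; terminate_child is a hypothetical helper name. Python's
subprocess module reports death by signal N as a return code of -N.

```python
import signal
import subprocess
import sys


def terminate_child(sig):
    """Start a sleeping child, send it `sig`, and return its return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    # Delivered through the kill() system call; the child has no handler
    # for the signal, so the default action (terminate) applies.
    child.send_signal(sig)
    return child.wait()


returncode = terminate_child(signal.SIGTERM)
print(returncode)  # negative signal number when the child was killed by it
```

A trace taken with exitsnoop while this runs would be expected to show the
child exiting with a signal reason rather than a 'code' value.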

Linux keeps process termination information in 'exit_code', an int
within struct 'task_struct' defined in <linux/sched.h>
- if the process terminated normally:
  - the exit value is in bits 15:8
  - the least significant 8 bits of exit_code are zero (bits 7:0)
- if the process terminated abnormally:
  - the signal number (>= 1) is in bits 6:0
  - bit 7 indicates a 'core dump' action; whether a core dump was
    actually done depends on ulimit.
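
The bit layout above can be decoded with a few shifts and masks. The Python
sketch below is an illustration only and is not taken from exitsnoop's source;
describe_exit_code is a hypothetical name, and the output strings (including
the ", core dumped" suffix) are assumptions modeled on the EXIT_CODE column
shown earlier.

```python
import signal


def describe_exit_code(exit_code):
    """Decode a kernel-style exit_code int into a short description."""
    sig_info = exit_code & 0xFF          # bits 7:0, zero on normal exit
    if sig_info == 0:
        value = (exit_code >> 8) & 0xFF  # bits 15:8 hold the exit value
        return "0" if value == 0 else "code %d" % value

    signum = sig_info & 0x7F             # bits 6:0 hold the signal number
    try:
        name = signal.Signals(signum).name
        if name.startswith("SIG"):
            name = name[3:]              # e.g. "SIGINT" -> "INT"
    except ValueError:
        name = "unknown"                 # no symbolic name on this platform
    text = "signal %d (%s)" % (signum, name)
    if sig_info & 0x80:                  # bit 7 is the core dump flag
        text += ", core dumped"
    return text


for raw in (0, 65 << 8, 2, 11 | 0x80):
    print(hex(raw), describe_exit_code(raw))
```

The wait status returned to a parent by wait() uses the same layout, so the
same masks apply there.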

Success is indicated with an exit value of zero.
The meaning of a non-zero exit value depends on the program.
Some programs document their exit values and their meaning.
This script uses exit values as defined in <sysexits.h>

References:

https://github.com/torvalds/linux/blob/master/kernel/exit.c
https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/signal.h
https://code.woboq.org/userspace/glibc/misc/sysexits.h.html