| Demonstrations of trace. |
| |
| |
| trace probes functions you specify and displays trace messages if a particular |
| condition is met. You can control the message format to display function |
| arguments and return values. |
| |
| For example, suppose you want to trace all commands being exec'd across the |
| system: |
| |
| # trace 'sys_execve "%s", arg1' |
| TIME PID COMM FUNC - |
| 05:11:51 4402 bash sys_execve /usr/bin/man |
| 05:11:51 4411 man sys_execve /usr/local/bin/less |
| 05:11:51 4411 man sys_execve /usr/bin/less |
| 05:11:51 4410 man sys_execve /usr/local/bin/nroff |
| 05:11:51 4410 man sys_execve /usr/bin/nroff |
| 05:11:51 4409 man sys_execve /usr/local/bin/tbl |
| 05:11:51 4409 man sys_execve /usr/bin/tbl |
| 05:11:51 4408 man sys_execve /usr/local/bin/preconv |
| 05:11:51 4408 man sys_execve /usr/bin/preconv |
| 05:11:51 4415 nroff sys_execve /usr/bin/locale |
| 05:11:51 4416 nroff sys_execve /usr/bin/groff |
| 05:11:51 4418 groff sys_execve /usr/bin/grotty |
| 05:11:51 4417 groff sys_execve /usr/bin/troff |
| ^C |
| |
| The ::sys_execve syntax specifies that you want an entry probe (which is the |
| default), in a kernel function (which is the default) called sys_execve. Next, |
| the format string to print is simply "%s", which prints a string. Finally, the |
| value to print is the first argument to the sys_execve function, which happens |
| to be the command that is exec'd. The above trace was generated by executing |
| "man ls" in a separate shell. As you see, man executes a number of additional |
| programs to finally display the man page. |
| |
| Next, suppose you are looking for large reads across the system. Let's trace |
| the read system call and inspect the third argument, which is the number of |
| bytes to be read: |
| |
| # trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' |
| TIME PID COMM FUNC - |
| 05:18:23 4490 dd sys_read read 1048576 bytes |
| 05:18:23 4490 dd sys_read read 1048576 bytes |
| 05:18:23 4490 dd sys_read read 1048576 bytes |
| 05:18:23 4490 dd sys_read read 1048576 bytes |
| ^C |
| |
| During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4". |
| The individual reads are visible, with the custom format message printed for |
| each read. The parenthesized expression "(arg3 > 20000)" is a filter that is |
| evaluated for each invocation of the probe before printing anything. |
| |
| You can also trace user functions. For example, let's simulate the bashreadline |
| script, which attaches to the readline function in bash and prints its return |
| value, effectively snooping all bash shell input across the system: |
| |
| # trace 'r:bash:readline "%s", retval' |
| TIME PID COMM FUNC - |
| 05:24:50 2740 bash readline echo hi! |
| 05:24:53 2740 bash readline man ls |
| ^C |
| |
| The special retval keywords stands for the function's return value, and can |
| be used only in a retprobe, specified by the 'r' prefix. The next component |
| of the probe is the library that contains the desired function. It's OK to |
| specify executables too, as long as they can be found in the PATH. Or, you |
| can specify the full path to the executable (e.g. "/usr/bin/bash"). |
| |
| Multiple probes can be combined on the same command line. For example, let's |
| trace failed read and write calls on the libc level: |
| |
| # trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \ |
| 'r:c:write ((int)retval < 0) "write failed: %d", retval' |
| TIME PID COMM FUNC - |
| 05:31:57 3388 bash write write failed: -1 |
| 05:32:00 3388 bash write write failed: -1 |
| ^C |
| |
| Note that the retval variable must be cast to int before comparing to zero. |
| The reason is that the default type for argN and retval is an unsigned 64-bit |
| integer, which can never be smaller than 0. |
| |
| trace has also some basic support for kernel tracepoints. For example, let's |
| trace the block:block_rq_complete tracepoint and print out the number of sectors |
| transferred: |
| |
| # trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector' |
| TIME PID COMM FUNC - |
| 01:23:51 0 swapper/0 block_rq_complete sectors=8 |
| 01:23:55 10017 kworker/u64: block_rq_complete sectors=1 |
| 01:23:55 0 swapper/0 block_rq_complete sectors=8 |
| ^C |
| |
| To discover the tracepoint structure format (which you can refer to as the "tp" |
| variable), use the tplist tool. For example: |
| |
| # tplist -v block:block_rq_complete |
| block:block_rq_complete |
| dev_t dev; |
| sector_t sector; |
| unsigned int nr_sector; |
| int errors; |
| char rwbs[8]; |
| |
| This output tells you that you can use "tp.dev", "tp.sector", etc. in your |
| predicate and trace arguments. |
| |
| As a final example, let's trace open syscalls for a specific process. By |
| default, tracing is system-wide, but the -p switch overrides this: |
| |
| # trace -p 2740 'do_sys_open "%s", arg2' |
| TIME PID COMM FUNC - |
| 05:36:16 15872 ls do_sys_open /etc/ld.so.cache |
| 05:36:16 15872 ls do_sys_open /lib64/libselinux.so.1 |
| 05:36:16 15872 ls do_sys_open /lib64/libcap.so.2 |
| 05:36:16 15872 ls do_sys_open /lib64/libacl.so.1 |
| 05:36:16 15872 ls do_sys_open /lib64/libc.so.6 |
| 05:36:16 15872 ls do_sys_open /lib64/libpcre.so.1 |
| 05:36:16 15872 ls do_sys_open /lib64/libdl.so.2 |
| 05:36:16 15872 ls do_sys_open /lib64/libattr.so.1 |
| 05:36:16 15872 ls do_sys_open /lib64/libpthread.so.0 |
| 05:36:16 15872 ls do_sys_open /usr/lib/locale/locale-archive |
| 05:36:16 15872 ls do_sys_open /home/vagrant |
| ^C |
| |
| In this example, we traced the "ls ~" command as it was opening its shared |
| libraries and then accessing the /home/vagrant directory listing. |
| |
| |
| USAGE message: |
| |
| # trace -h |
| usage: trace.py [-h] [-p PID] [-v] [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-o] |
| probe [probe ...] |
| |
| Attach to functions and print trace messages. |
| |
| positional arguments: |
| probe probe specifier (see examples) |
| |
| optional arguments: |
| -h, --help show this help message and exit |
| -p PID, --pid PID id of the process to trace (optional) |
| -v, --verbose print resulting BPF program code before executing |
| -Z STRING_SIZE, --string-size STRING_SIZE |
| maximum size to read from strings |
| -S, --include-self do not filter trace's own pid from the trace |
| -M MAX_EVENTS, --max-events MAX_EVENTS |
| number of events to print before quitting |
| -o, --offset use relative time from first traced message |
| |
| EXAMPLES: |
| |
| trace do_sys_open |
| Trace the open syscall and print a default trace message when entered |
| trace 'do_sys_open "%s", arg2' |
| Trace the open syscall and print the filename being opened |
| trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' |
| Trace the read syscall and print a message for reads >20000 bytes |
| trace r::do_sys_return |
| Trace the return from the open syscall |
| trace 'c:open (arg2 == 42) "%s %d", arg1, arg2' |
| Trace the open() call from libc only if the flags (arg2) argument is 42 |
| trace 'c:malloc "size = %d", arg1' |
| Trace malloc calls and print the size being allocated |
| trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3' |
| Trace the write() call from libc to monitor writes to STDOUT |
| trace 'r::__kmalloc (retval == 0) "kmalloc failed!" |
| Trace returns from __kmalloc which returned a null pointer |
| trace 'r:c:malloc (retval) "allocated = %p", retval |
| Trace returns from malloc and print non-NULL allocated buffers |
| trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector' |
| Trace the block_rq_complete kernel tracepoint and print # of tx sectors |
| trace 'u:pthread:pthread_create (arg4 != 0)' |
| Trace the USDT probe pthread_create when its 4th argument is non-zero |
| |