Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 1 | Demonstrations of trace. |
| 2 | |
| 3 | |
| 4 | trace probes functions you specify and displays trace messages if a particular |
ShelbyFrances | 69abacc | 2017-02-08 05:56:53 +0300 | [diff] [blame] | 5 | condition is met. You can control the message format to display function |
| 6 | arguments and return values. |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 7 | |
| 8 | For example, suppose you want to trace all commands being exec'd across the |
| 9 | system: |
| 10 | |
Sasha Goldshtein | 8acd015 | 2016-02-22 02:25:03 -0800 | [diff] [blame] | 11 | # trace 'sys_execve "%s", arg1' |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 12 | PID COMM FUNC - |
| 13 | 4402 bash sys_execve /usr/bin/man |
| 14 | 4411 man sys_execve /usr/local/bin/less |
| 15 | 4411 man sys_execve /usr/bin/less |
| 16 | 4410 man sys_execve /usr/local/bin/nroff |
| 17 | 4410 man sys_execve /usr/bin/nroff |
| 18 | 4409 man sys_execve /usr/local/bin/tbl |
| 19 | 4409 man sys_execve /usr/bin/tbl |
| 20 | 4408 man sys_execve /usr/local/bin/preconv |
| 21 | 4408 man sys_execve /usr/bin/preconv |
| 22 | 4415 nroff sys_execve /usr/bin/locale |
| 23 | 4416 nroff sys_execve /usr/bin/groff |
| 24 | 4418 groff sys_execve /usr/bin/grotty |
| 25 | 4417 groff sys_execve /usr/bin/troff |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 26 | ^C |
| 27 | |
| 28 | The ::sys_execve syntax specifies that you want an entry probe (which is the |
| 29 | default), in a kernel function (which is the default) called sys_execve. Next, |
| 30 | the format string to print is simply "%s", which prints a string. Finally, the |
| 31 | value to print is the first argument to the sys_execve function, which happens |
| 32 | to be the command that is exec'd. The above trace was generated by executing |
| 33 | "man ls" in a separate shell. As you see, man executes a number of additional |
| 34 | programs to finally display the man page. |
| 35 | |
| 36 | Next, suppose you are looking for large reads across the system. Let's trace |
| 37 | the read system call and inspect the third argument, which is the number of |
| 38 | bytes to be read: |
| 39 | |
Sasha Goldshtein | 8acd015 | 2016-02-22 02:25:03 -0800 | [diff] [blame] | 40 | # trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 41 | PID COMM FUNC - |
| 42 | 4490 dd sys_read read 1048576 bytes |
| 43 | 4490 dd sys_read read 1048576 bytes |
| 44 | 4490 dd sys_read read 1048576 bytes |
| 45 | 4490 dd sys_read read 1048576 bytes |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 46 | ^C |
| 47 | |
| 48 | During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4". |
| 49 | The individual reads are visible, with the custom format message printed for |
| 50 | each read. The parenthesized expression "(arg3 > 20000)" is a filter that is |
| 51 | evaluated for each invocation of the probe before printing anything. |
| 52 | |
tty5 | 5cf529e | 2019-12-06 17:52:56 +0800 | [diff] [blame] | 53 | Event message filter is useful while you only interesting the specific event. |
| 54 | Like the program open thousands file and you only want to see the "temp" file |
| 55 | and print stack. |
| 56 | |
Sumanth Korikkar | 7cbd074 | 2020-04-27 09:09:28 -0500 | [diff] [blame] | 57 | # trace 'do_sys_open "%s", arg2@user' -UK -f temp |
tty5 | 5cf529e | 2019-12-06 17:52:56 +0800 | [diff] [blame] | 58 | PID TID COMM FUNC - |
| 59 | 9557 9557 a.out do_sys_open temp.1 |
| 60 | do_sys_open+0x1 [kernel] |
| 61 | do_syscall_64+0x5b [kernel] |
| 62 | entry_SYSCALL_64_after_hwframe+0x44 [kernel] |
| 63 | __open_nocancel+0x7 [libc-2.17.so] |
| 64 | __libc_start_main+0xf5 [libc-2.17.so] |
| 65 | 9558 9558 a.out do_sys_open temp.2 |
| 66 | do_sys_open+0x1 [kernel] |
| 67 | do_syscall_64+0x5b [kernel] |
| 68 | entry_SYSCALL_64_after_hwframe+0x44 [kernel] |
| 69 | __open_nocancel+0x7 [libc-2.17.so] |
| 70 | __libc_start_main+0xf5 [libc-2.17.so] |
| 71 | |
tty5 | 9ce7b7e | 2019-12-04 22:49:38 +0800 | [diff] [blame] | 72 | Process name filter is porting from tools/opensnoop |
| 73 | |
Sumanth Korikkar | 7cbd074 | 2020-04-27 09:09:28 -0500 | [diff] [blame] | 74 | # trace 'do_sys_open "%s", arg2@user' -UK -n out |
tty5 | 9ce7b7e | 2019-12-04 22:49:38 +0800 | [diff] [blame] | 75 | PID TID COMM FUNC - |
| 76 | 9557 9557 a.out do_sys_open temp.1 |
| 77 | do_sys_open+0x1 [kernel] |
| 78 | do_syscall_64+0x5b [kernel] |
| 79 | entry_SYSCALL_64_after_hwframe+0x44 [kernel] |
| 80 | __open_nocancel+0x7 [libc-2.17.so] |
| 81 | __libc_start_main+0xf5 [libc-2.17.so] |
| 82 | |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 83 | You can also trace user functions. For example, let's simulate the bashreadline |
| 84 | script, which attaches to the readline function in bash and prints its return |
| 85 | value, effectively snooping all bash shell input across the system: |
| 86 | |
| 87 | # trace 'r:bash:readline "%s", retval' |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 88 | PID COMM FUNC - |
| 89 | 2740 bash readline echo hi! |
| 90 | 2740 bash readline man ls |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 91 | ^C |
| 92 | |
Mirek Klimos | e538228 | 2018-01-26 14:52:50 -0800 | [diff] [blame] | 93 | The special retval keyword stands for the function's return value, and can |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 94 | be used only in a retprobe, specified by the 'r' prefix. The next component |
| 95 | of the probe is the library that contains the desired function. It's OK to |
| 96 | specify executables too, as long as they can be found in the PATH. Or, you |
| 97 | can specify the full path to the executable (e.g. "/usr/bin/bash"). |
| 98 | |
Mirek Klimos | e538228 | 2018-01-26 14:52:50 -0800 | [diff] [blame] | 99 | Sometimes it can be useful to see where in code the events happen. There are |
| 100 | flags to print the kernel stack (-K), the user stack (-U) and optionally |
| 101 | include the virtual address in the stacks as well (-a): |
| 102 | |
| 103 | # trace.py -U -a 'r::sys_futex "%d", retval' |
| 104 | PID TID COMM FUNC - |
| 105 | 793922 793951 poller sys_futex 0 |
| 106 | 7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so] |
| 107 | 627fef folly::FunctionScheduler::run()+0x46f [router] |
| 108 | 7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21] |
| 109 | 7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so] |
| 110 | 7f6c7223fa7d clone+0x6d [libc-2.23.so] |
| 111 | |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 112 | Multiple probes can be combined on the same command line. For example, let's |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 113 | trace failed read and write calls on the libc level, and include a time column: |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 114 | |
| 115 | # trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \ |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 116 | 'r:c:write ((int)retval < 0) "write failed: %d", retval' -T |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 117 | TIME PID COMM FUNC - |
| 118 | 05:31:57 3388 bash write write failed: -1 |
| 119 | 05:32:00 3388 bash write write failed: -1 |
| 120 | ^C |
| 121 | |
| 122 | Note that the retval variable must be cast to int before comparing to zero. |
| 123 | The reason is that the default type for argN and retval is an unsigned 64-bit |
| 124 | integer, which can never be smaller than 0. |
| 125 | |
Sasha Goldshtein | fd60d55 | 2016-03-01 12:15:34 -0800 | [diff] [blame] | 126 | trace has also some basic support for kernel tracepoints. For example, let's |
| 127 | trace the block:block_rq_complete tracepoint and print out the number of sectors |
| 128 | transferred: |
| 129 | |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 130 | # trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T |
Sasha Goldshtein | fd60d55 | 2016-03-01 12:15:34 -0800 | [diff] [blame] | 131 | TIME PID COMM FUNC - |
| 132 | 01:23:51 0 swapper/0 block_rq_complete sectors=8 |
| 133 | 01:23:55 10017 kworker/u64: block_rq_complete sectors=1 |
| 134 | 01:23:55 0 swapper/0 block_rq_complete sectors=8 |
| 135 | ^C |
| 136 | |
vijunag | 9924e64 | 2019-01-23 12:35:33 +0530 | [diff] [blame] | 137 | Suppose that you want to trace a system-call in a short-lived process, you can use |
| 138 | the -s option to trace. The option is followed by list of libraries/executables to |
| 139 | use for symbol resolution. |
| 140 | # trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U |
| 141 | Note: Kernel bpf will report stack map with ip/build_id |
| 142 | PID TID COMM FUNC |
| 143 | 4175 4175 ping inet_pton |
| 144 | inet_pton+0x136340 [libc.so.6] |
| 145 | getaddrinfo+0xfb510 [libc.so.6] |
| 146 | _init+0x2a08 [ping] |
| 147 | |
| 148 | During the trace, 'ping -c1 google.com' was executed to obtain the above results |
| 149 | |
Sasha Goldshtein | 376ae5c | 2016-10-04 19:49:57 +0300 | [diff] [blame] | 150 | To discover the tracepoint structure format (which you can refer to as the "args" |
| 151 | pointer variable), use the tplist tool. For example: |
Sasha Goldshtein | fd60d55 | 2016-03-01 12:15:34 -0800 | [diff] [blame] | 152 | |
| 153 | # tplist -v block:block_rq_complete |
| 154 | block:block_rq_complete |
| 155 | dev_t dev; |
| 156 | sector_t sector; |
| 157 | unsigned int nr_sector; |
| 158 | int errors; |
| 159 | char rwbs[8]; |
| 160 | |
Sasha Goldshtein | 376ae5c | 2016-10-04 19:49:57 +0300 | [diff] [blame] | 161 | This output tells you that you can use "args->dev", "args->sector", etc. in your |
Sasha Goldshtein | fd60d55 | 2016-03-01 12:15:34 -0800 | [diff] [blame] | 162 | predicate and trace arguments. |
| 163 | |
Sasha Goldshtein | accd4cf | 2016-10-11 07:56:13 -0700 | [diff] [blame] | 164 | |
| 165 | More and more high-level libraries are instrumented with USDT probe support. |
| 166 | These probes can be traced by trace just like kernel tracepoints. For example, |
Teng Qin | c200b6c | 2017-12-16 00:15:55 -0800 | [diff] [blame] | 167 | trace new threads being created and their function name, include time column |
| 168 | and on which CPU it happened: |
Sasha Goldshtein | accd4cf | 2016-10-11 07:56:13 -0700 | [diff] [blame] | 169 | |
Teng Qin | c200b6c | 2017-12-16 00:15:55 -0800 | [diff] [blame] | 170 | # trace 'u:pthread:pthread_create "%U", arg3' -T -C |
| 171 | TIME CPU PID TID COMM FUNC - |
| 172 | 13:22:01 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] |
| 173 | 13:22:01 5 21360 21414 osqueryd pthread_create [unknown] [osqueryd] |
| 174 | 13:22:03 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] |
| 175 | 13:22:04 15 21360 21414 osqueryd pthread_create [unknown] [osqueryd] |
| 176 | 13:22:07 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] |
| 177 | 13:22:07 4 21360 21414 osqueryd pthread_create [unknown] [osqueryd] |
Sasha Goldshtein | accd4cf | 2016-10-11 07:56:13 -0700 | [diff] [blame] | 178 | ^C |
| 179 | |
| 180 | The "%U" format specifier tells trace to resolve arg3 as a user-space symbol, |
| 181 | if possible. Similarly, use "%K" for kernel symbols. |
| 182 | |
| 183 | Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's |
Nikita V. Shirokov | 3953c70 | 2018-07-27 16:13:47 -0700 | [diff] [blame] | 184 | trace Ruby methods being called (this requires a version of Ruby built with |
Sasha Goldshtein | accd4cf | 2016-10-11 07:56:13 -0700 | [diff] [blame] | 185 | the --enable-dtrace configure flag): |
| 186 | |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 187 | # trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T |
Sasha Goldshtein | accd4cf | 2016-10-11 07:56:13 -0700 | [diff] [blame] | 188 | TIME PID COMM FUNC - |
| 189 | 12:08:43 18420 irb method__entry IRB::Context.verbose? |
| 190 | 12:08:43 18420 irb method__entry RubyLex.ungetc |
| 191 | 12:08:43 18420 irb method__entry RuxyLex.debug? |
| 192 | ^C |
| 193 | |
| 194 | In the previous invocation, arg1 and arg2 are the class name and method name |
| 195 | for the Ruby method being invoked. |
| 196 | |
ShelbyFrances | 86cd535 | 2017-02-08 05:56:37 +0300 | [diff] [blame] | 197 | You can also trace exported functions from shared libraries, or an imported |
| 198 | function on the actual executable: |
| 199 | |
| 200 | # sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval' |
| 201 | # tput -V |
| 202 | |
| 203 | PID TID COMM FUNC - |
| 204 | 21720 21720 tput curses_version Version=ncurses 6.0.20160709 |
| 205 | ^C |
| 206 | |
Sasha Goldshtein | accd4cf | 2016-10-11 07:56:13 -0700 | [diff] [blame] | 207 | |
Sasha Goldshtein | f4797b0 | 2016-10-17 01:44:56 -0700 | [diff] [blame] | 208 | Occasionally, it can be useful to filter specific strings. For example, you |
| 209 | might be interested in open() calls that open a specific file: |
| 210 | |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 211 | # trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T |
Sasha Goldshtein | f4797b0 | 2016-10-17 01:44:56 -0700 | [diff] [blame] | 212 | TIME PID COMM FUNC - |
| 213 | 01:43:15 10938 cat open opening test.txt |
| 214 | 01:43:20 10939 cat open opening test.txt |
| 215 | ^C |
| 216 | |
| 217 | |
Sasha Goldshtein | 23e72b8 | 2017-01-17 08:49:36 +0000 | [diff] [blame] | 218 | In the preceding example, as well as in many others, readability may be |
| 219 | improved by providing the function's signature, which names the arguments and |
| 220 | lets you access structure sub-fields, which is hard with the "arg1", "arg2" |
| 221 | convention. For example: |
| 222 | |
| 223 | # trace 'p:c:open(char *filename) "opening %s", filename' |
| 224 | PID TID COMM FUNC - |
| 225 | 17507 17507 cat open opening FAQ.txt |
| 226 | ^C |
| 227 | |
| 228 | # trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec' |
| 229 | PID TID COMM FUNC - |
| 230 | 777 785 automount SyS_nanosleep sleep for 500000000 ns |
| 231 | 777 785 automount SyS_nanosleep sleep for 500000000 ns |
| 232 | 777 785 automount SyS_nanosleep sleep for 500000000 ns |
| 233 | 777 785 automount SyS_nanosleep sleep for 500000000 ns |
| 234 | ^C |
| 235 | |
ShelbyFrances | 69abacc | 2017-02-08 05:56:53 +0300 | [diff] [blame] | 236 | Remember to use the -I argument include the appropriate header file. We didn't |
| 237 | need to do that here because `struct timespec` is used internally by the tool, |
| 238 | so it always includes this header file. |
Sasha Goldshtein | 23e72b8 | 2017-01-17 08:49:36 +0000 | [diff] [blame] | 239 | |
zhenwei pi | 047541c | 2022-01-13 19:56:32 +0800 | [diff] [blame] | 240 | To aggregate amount of trace, you need specify -A with -M EVENTS. A typical |
| 241 | example: |
| 242 | 1, if we find that the sys CPU utilization is higher by 'top' command |
| 243 | 2, then find that the timer interrupt is more normal by 'irqtop' command |
| 244 | 3, to confirm kernel timer setting frequence by 'funccount -i 1 clockevents_program_event' |
| 245 | 4, to trace timer setting by 'trace clockevents_program_event -K -A -M 1000' |
| 246 | |
| 247 | 1294576 1294584 CPU 0/KVM clockevents_program_event |
| 248 | clockevents_program_event+0x1 [kernel] |
| 249 | hrtimer_start_range_ns+0x209 [kernel] |
| 250 | start_sw_timer+0x173 [kvm] |
| 251 | restart_apic_timer+0x6c [kvm] |
| 252 | kvm_set_msr_common+0x442 [kvm] |
| 253 | __kvm_set_msr+0xa2 [kvm] |
| 254 | kvm_emulate_wrmsr+0x36 [kvm] |
| 255 | vcpu_enter_guest+0x326 [kvm] |
| 256 | kvm_arch_vcpu_ioctl_run+0xcc [kvm] |
| 257 | kvm_vcpu_ioctl+0x22f [kvm] |
| 258 | do_vfs_ioctl+0xa1 [kernel] |
| 259 | ksys_ioctl+0x60 [kernel] |
| 260 | __x64_sys_ioctl+0x16 [kernel] |
| 261 | do_syscall_64+0x59 [kernel] |
| 262 | entry_SYSCALL_64_after_hwframe+0x44 [kernel] |
| 263 | -->COUNT 271 |
| 264 | ... |
| 265 | So we can know that 271 timer setting in recent 1000(~27%). |
ShelbyFrances | 69abacc | 2017-02-08 05:56:53 +0300 | [diff] [blame] | 266 | |
| 267 | As a final example, let's trace open syscalls for a specific process. By |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 268 | default, tracing is system-wide, but the -p switch overrides this: |
| 269 | |
Sumanth Korikkar | 7cbd074 | 2020-04-27 09:09:28 -0500 | [diff] [blame] | 270 | # trace -p 2740 'do_sys_open "%s", arg2@user' -T |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 271 | TIME PID COMM FUNC - |
| 272 | 05:36:16 15872 ls do_sys_open /etc/ld.so.cache |
| 273 | 05:36:16 15872 ls do_sys_open /lib64/libselinux.so.1 |
| 274 | 05:36:16 15872 ls do_sys_open /lib64/libcap.so.2 |
| 275 | 05:36:16 15872 ls do_sys_open /lib64/libacl.so.1 |
| 276 | 05:36:16 15872 ls do_sys_open /lib64/libc.so.6 |
| 277 | 05:36:16 15872 ls do_sys_open /lib64/libpcre.so.1 |
| 278 | 05:36:16 15872 ls do_sys_open /lib64/libdl.so.2 |
| 279 | 05:36:16 15872 ls do_sys_open /lib64/libattr.so.1 |
| 280 | 05:36:16 15872 ls do_sys_open /lib64/libpthread.so.0 |
| 281 | 05:36:16 15872 ls do_sys_open /usr/lib/locale/locale-archive |
| 282 | 05:36:16 15872 ls do_sys_open /home/vagrant |
| 283 | ^C |
| 284 | |
| 285 | In this example, we traced the "ls ~" command as it was opening its shared |
| 286 | libraries and then accessing the /home/vagrant directory listing. |
| 287 | |
| 288 | |
Mark Drayton | 5f5687e | 2017-02-20 18:13:03 +0000 | [diff] [blame] | 289 | Lastly, if a high-frequency event is traced you may overflow the perf ring |
| 290 | buffer. This shows as "Lost N samples": |
| 291 | |
| 292 | # trace sys_open |
| 293 | 5087 5087 pgrep sys_open |
| 294 | 5087 5087 pgrep sys_open |
| 295 | 5087 5087 pgrep sys_open |
| 296 | 5087 5087 pgrep sys_open |
| 297 | 5087 5087 pgrep sys_open |
| 298 | Lost 764896 samples |
| 299 | Lost 764896 samples |
| 300 | Lost 764896 samples |
| 301 | |
| 302 | The perf ring buffer size can be changed with -b. The unit is size per-CPU buffer |
| 303 | size and is measured in pages. The value must be a power of two and defaults to |
| 304 | 64 pages. |
| 305 | |
Nikita V. Shirokov | 3953c70 | 2018-07-27 16:13:47 -0700 | [diff] [blame] | 306 | # trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp |
| 307 | PID TID COMM FUNC - |
| 308 | 1855611 1863183 worker sys_setsockopt found |
| 309 | |
| 310 | In this example we are catching setsockopt syscall to change IPv4 IP_TOS |
| 311 | value only for the cases where new TOS value is equal to 108. we are using |
| 312 | STRCMP helper in binary mode (--bin_cmp flag) to compare optval array |
| 313 | against int value of 108 (parametr of setsockopt call) in hex representation |
| 314 | (little endian format) |
| 315 | |
Ferenc Fejes | 8105317 | 2020-08-02 21:45:45 +0200 | [diff] [blame] | 316 | For advanced users there is a possibility to insert the kprobes or uprobes |
| 317 | after a certain offset, rather than the start of the function call |
| 318 | This is useful for tracing register values at different places of the |
| 319 | execution of a function. Lets consider the following example: |
Nikita V. Shirokov | 3953c70 | 2018-07-27 16:13:47 -0700 | [diff] [blame] | 320 | |
Ferenc Fejes | 8105317 | 2020-08-02 21:45:45 +0200 | [diff] [blame] | 321 | int main() |
| 322 | { |
| 323 | int val = 0xdead; |
| 324 | printf("%d\n", val); |
| 325 | val = 0xbeef; |
| 326 | printf("%d\n", val); |
| 327 | } |
| 328 | |
| 329 | After compiling the code with -O3 optimization the object code looks |
| 330 | like the following (with GCC 10 and x86_64 architecture): |
| 331 | |
| 332 | objdump --disassemble=main --prefix-addresses a.out |
| 333 | |
| 334 | 0000000000001060 <main> endbr64 |
| 335 | 0000000000001064 <main+0x4> sub $0x8,%rsp |
| 336 | 0000000000001068 <main+0x8> mov $0xdead,%edx |
| 337 | 000000000000106d <main+0xd> mov $0x1,%edi |
| 338 | 0000000000001072 <main+0x12> xor %eax,%eax |
| 339 | 0000000000001074 <main+0x14> lea 0xf89(%rip),%rsi |
| 340 | 000000000000107b <main+0x1b> callq 0000000000001050 <__printf_chk@plt> |
| 341 | 0000000000001080 <main+0x20> mov $0xbeef,%edx |
| 342 | 0000000000001085 <main+0x25> lea 0xf78(%rip),%rsi |
| 343 | 000000000000108c <main+0x2c> xor %eax,%eax |
| 344 | 000000000000108e <main+0x2e> mov $0x1,%edi |
| 345 | 0000000000001093 <main+0x33> callq 0000000000001050 <__printf_chk@plt> |
| 346 | 0000000000001098 <main+0x38> xor %eax,%eax |
| 347 | 000000000000109a <main+0x3a> add $0x8,%rsp |
| 348 | 000000000000109e <main+0x3e> retq |
| 349 | |
| 350 | The 0xdead and later the 0xbeef values are moved into the edx register. |
Fejes Ferenc | 6432324 | 2020-08-02 22:07:20 +0200 | [diff] [blame] | 351 | As the disassembly shows the edx register contains the 0xdead value |
Ferenc Fejes | 8105317 | 2020-08-02 21:45:45 +0200 | [diff] [blame] | 352 | after the 0xd offset and 0xbeef after the 0x25 offset. To verify this |
| 353 | with trace lets insert probes to those offsets. The following |
Fejes Ferenc | 6432324 | 2020-08-02 22:07:20 +0200 | [diff] [blame] | 354 | command inserts two uprobe one after the 0xd offset and another one |
Ferenc Fejes | 8105317 | 2020-08-02 21:45:45 +0200 | [diff] [blame] | 355 | after the 0x25 offset of the main function. The probe print the |
| 356 | value of the edx register which will show us the correct values. |
| 357 | |
| 358 | trace 'p:/tmp/a.out:main+0xd "%x", ctx->dx' 'p:/tmp/a.out:main+0x25 "%x", ctx->dx' |
| 359 | PID TID COMM FUNC - |
| 360 | 25754 25754 a.out main dead |
| 361 | 25754 25754 a.out main beef |
Nikita V. Shirokov | 3953c70 | 2018-07-27 16:13:47 -0700 | [diff] [blame] | 362 | |
Mark Drayton | 5f5687e | 2017-02-20 18:13:03 +0000 | [diff] [blame] | 363 | |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 364 | USAGE message: |
| 365 | |
evilpan | f32f772 | 2021-12-11 00:58:51 +0800 | [diff] [blame] | 366 | usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [--uid UID] [-v] |
| 367 | [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-t] [-u] [-T] [-C] |
| 368 | [-c CGROUP_PATH] [-n NAME] [-f MSG_FILTER] [-B] |
| 369 | [-s SYM_FILE_LIST] [-K] [-U] [-a] [-I header] |
Mark Drayton | aa6c916 | 2016-11-03 15:36:29 +0000 | [diff] [blame] | 370 | probe [probe ...] |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 371 | |
| 372 | Attach to functions and print trace messages. |
| 373 | |
| 374 | positional arguments: |
| 375 | probe probe specifier (see examples) |
| 376 | |
| 377 | optional arguments: |
| 378 | -h, --help show this help message and exit |
Mark Drayton | 5f5687e | 2017-02-20 18:13:03 +0000 | [diff] [blame] | 379 | -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES |
| 380 | number of pages to use for perf_events ring buffer |
| 381 | (default: 64) |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 382 | -p PID, --pid PID id of the process to trace (optional) |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 383 | -L TID, --tid TID id of the thread to trace (optional) |
evilpan | f32f772 | 2021-12-11 00:58:51 +0800 | [diff] [blame] | 384 | --uid UID id of the user to trace (optional) |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 385 | -v, --verbose print resulting BPF program code before executing |
| 386 | -Z STRING_SIZE, --string-size STRING_SIZE |
| 387 | maximum size to read from strings |
| 388 | -S, --include-self do not filter trace's own pid from the trace |
| 389 | -M MAX_EVENTS, --max-events MAX_EVENTS |
| 390 | number of events to print before quitting |
Sasha Goldshtein | 49d50ba | 2016-12-19 10:17:38 +0000 | [diff] [blame] | 391 | -t, --timestamp print timestamp column (offset from trace start) |
evilpan | f32f772 | 2021-12-11 00:58:51 +0800 | [diff] [blame] | 392 | -u, --unix-timestamp print UNIX timestamp instead of offset from trace |
| 393 | start, requires -t |
Mark Drayton | 5f5687e | 2017-02-20 18:13:03 +0000 | [diff] [blame] | 394 | -T, --time print time column |
evilpan | f32f772 | 2021-12-11 00:58:51 +0800 | [diff] [blame] | 395 | -C, --print_cpu print CPU id |
| 396 | -c CGROUP_PATH, --cgroup-path CGROUP_PATH |
| 397 | cgroup path |
tty5 | 9ce7b7e | 2019-12-04 22:49:38 +0800 | [diff] [blame] | 398 | -n NAME, --name NAME only print process names containing this name |
tty5 | 5cf529e | 2019-12-06 17:52:56 +0800 | [diff] [blame] | 399 | -f MSG_FILTER, --msg-filter MSG_FILTER |
evilpan | f32f772 | 2021-12-11 00:58:51 +0800 | [diff] [blame] | 400 | only print the msg of event containing this string |
Nikita V. Shirokov | 3953c70 | 2018-07-27 16:13:47 -0700 | [diff] [blame] | 401 | -B, --bin_cmp allow to use STRCMP with binary values |
evilpan | f32f772 | 2021-12-11 00:58:51 +0800 | [diff] [blame] | 402 | -s SYM_FILE_LIST, --sym_file_list SYM_FILE_LIST |
| 403 | coma separated list of symbol files to use for symbol |
| 404 | resolution |
Sasha Goldshtein | 4725a72 | 2016-10-18 20:54:47 +0300 | [diff] [blame] | 405 | -K, --kernel-stack output kernel stack trace |
| 406 | -U, --user-stack output user stack trace |
Mirek Klimos | e538228 | 2018-01-26 14:52:50 -0800 | [diff] [blame] | 407 | -a, --address print virtual address in stacks |
Sasha Goldshtein | 4725a72 | 2016-10-18 20:54:47 +0300 | [diff] [blame] | 408 | -I header, --include header |
| 409 | additional header files to include in the BPF program |
evilpan | f32f772 | 2021-12-11 00:58:51 +0800 | [diff] [blame] | 410 | as either full path, or relative to current working |
| 411 | directory, or relative to default kernel header search |
| 412 | path |
zhenwei pi | 047541c | 2022-01-13 19:56:32 +0800 | [diff] [blame] | 413 | -A, --aggregate aggregate amount of each trace |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 414 | |
| 415 | EXAMPLES: |
| 416 | |
Sasha Goldshtein | 8acd015 | 2016-02-22 02:25:03 -0800 | [diff] [blame] | 417 | trace do_sys_open |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 418 | Trace the open syscall and print a default trace message when entered |
Ferenc Fejes | 8105317 | 2020-08-02 21:45:45 +0200 | [diff] [blame] | 419 | trace kfree_skb+0x12 |
| 420 | Trace the kfree_skb kernel function after the instruction on the 0x12 offset |
Sumanth Korikkar | 7cbd074 | 2020-04-27 09:09:28 -0500 | [diff] [blame] | 421 | trace 'do_sys_open "%s", arg2@user' |
zhenwei pi | 047541c | 2022-01-13 19:56:32 +0800 | [diff] [blame] | 422 | Trace the open syscall and print the filename being opened @user is |
Sumanth Korikkar | 7cbd074 | 2020-04-27 09:09:28 -0500 | [diff] [blame] | 423 | added to arg2 in kprobes to ensure that char * should be copied from |
| 424 | the userspace stack to the bpf stack. If not specified, previous |
| 425 | behaviour is expected. |
zhenwei pi | 047541c | 2022-01-13 19:56:32 +0800 | [diff] [blame] | 426 | |
Sumanth Korikkar | 7cbd074 | 2020-04-27 09:09:28 -0500 | [diff] [blame] | 427 | trace 'do_sys_open "%s", arg2@user' -n main |
tty5 | 9ce7b7e | 2019-12-04 22:49:38 +0800 | [diff] [blame] | 428 | Trace the open syscall and only print event that process names containing "main" |
evilpan | f32f772 | 2021-12-11 00:58:51 +0800 | [diff] [blame] | 429 | trace 'do_sys_open "%s", arg2@user' --uid 1001 |
| 430 | Trace the open syscall and only print event that processes with user ID 1001 |
Sumanth Korikkar | 7cbd074 | 2020-04-27 09:09:28 -0500 | [diff] [blame] | 431 | trace 'do_sys_open "%s", arg2@user' -f config |
tty5 | 5cf529e | 2019-12-06 17:52:56 +0800 | [diff] [blame] | 432 | Trace the open syscall and print the filename being opened filtered by "config" |
Sasha Goldshtein | 8acd015 | 2016-02-22 02:25:03 -0800 | [diff] [blame] | 433 | trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 434 | Trace the read syscall and print a message for reads >20000 bytes |
Sasha Goldshtein | 23e72b8 | 2017-01-17 08:49:36 +0000 | [diff] [blame] | 435 | trace 'r::do_sys_open "%llx", retval' |
Mark Drayton | aa6c916 | 2016-11-03 15:36:29 +0000 | [diff] [blame] | 436 | Trace the return from the open syscall and print the return value |
Sasha Goldshtein | 8acd015 | 2016-02-22 02:25:03 -0800 | [diff] [blame] | 437 | trace 'c:open (arg2 == 42) "%s %d", arg1, arg2' |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 438 | Trace the open() call from libc only if the flags (arg2) argument is 42 |
Sasha Goldshtein | 8acd015 | 2016-02-22 02:25:03 -0800 | [diff] [blame] | 439 | trace 'c:malloc "size = %d", arg1' |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 440 | Trace malloc calls and print the size being allocated |
Sasha Goldshtein | 8acd015 | 2016-02-22 02:25:03 -0800 | [diff] [blame] | 441 | trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3' |
| 442 | Trace the write() call from libc to monitor writes to STDOUT |
Mark Drayton | 5f5687e | 2017-02-20 18:13:03 +0000 | [diff] [blame] | 443 | trace 'r::__kmalloc (retval == 0) "kmalloc failed!"' |
Sasha Goldshtein | 8acd015 | 2016-02-22 02:25:03 -0800 | [diff] [blame] | 444 | Trace returns from __kmalloc which returned a null pointer |
Mark Drayton | 5f5687e | 2017-02-20 18:13:03 +0000 | [diff] [blame] | 445 | trace 'r:c:malloc (retval) "allocated = %x", retval' |
Sasha Goldshtein | 38847f0 | 2016-02-22 02:19:24 -0800 | [diff] [blame] | 446 | Trace returns from malloc and print non-NULL allocated buffers |
Sasha Goldshtein | 376ae5c | 2016-10-04 19:49:57 +0300 | [diff] [blame] | 447 | trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' |
Sasha Goldshtein | fd60d55 | 2016-03-01 12:15:34 -0800 | [diff] [blame] | 448 | Trace the block_rq_complete kernel tracepoint and print # of tx sectors |
Sasha Goldshtein | 3e39a08 | 2016-03-24 08:39:47 -0700 | [diff] [blame] | 449 | trace 'u:pthread:pthread_create (arg4 != 0)' |
| 450 | Trace the USDT probe pthread_create when its 4th argument is non-zero |
zhenwei pi | 047541c | 2022-01-13 19:56:32 +0800 | [diff] [blame] | 451 | trace 'u:pthread:libpthread:pthread_create (arg4 != 0)' |
| 452 | Ditto, but the provider name "libpthread" is specified. |
Sasha Goldshtein | 23e72b8 | 2017-01-17 08:49:36 +0000 | [diff] [blame] | 453 | trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec' |
| 454 | Trace the nanosleep syscall and print the sleep duration in ns |
yonghong-song | c2a530b | 2019-10-20 09:35:55 -0700 | [diff] [blame] | 455 | trace -c /sys/fs/cgroup/system.slice/workload.service '__x64_sys_nanosleep' '__x64_sys_clone' |
| 456 | Trace nanosleep/clone syscall calls only under workload.service |
| 457 | cgroup hierarchy. |
Yonghong Song | f4470dc | 2017-12-13 14:12:13 -0800 | [diff] [blame] | 458 | trace -I 'linux/fs.h' \ |
| 459 | 'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops' |
| 460 | Trace the uprobe_register inode mapping ops, and the symbol can be found |
| 461 | in /proc/kallsyms |
| 462 | trace -I 'kernel/sched/sched.h' \ |
| 463 | 'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining' |
| 464 | Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined |
| 465 | in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel |
| 466 | package. So this command needs to run at the kernel source tree root directory |
| 467 | so that the added header file can be found by the compiler. |
zhenwei pi | 047541c | 2022-01-13 19:56:32 +0800 | [diff] [blame] | 468 | trace -I 'net/sock.h' \ |
tehnerd | 86293f0 | 2018-01-23 21:21:58 -0800 | [diff] [blame] | 469 | 'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)' |
| 470 | Trace udpv6 sendmsg calls only if socket's destination port is equal |
| 471 | to 53 (DNS; 13568 in big endian order) |
Yonghong Song | f92fef2 | 2018-01-24 20:51:46 -0800 | [diff] [blame] | 472 | trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users' |
| 473 | Trace the number of users accessing the file system of the current task |
vijunag | 9924e64 | 2019-01-23 12:35:33 +0530 | [diff] [blame] | 474 | trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U |
| 475 | Trace inet_pton system call and use the specified libraries/executables for |
| 476 | symbol resolution. |