Denys Vlasenko | dfa0acc | 2011-08-31 15:58:06 +0200 | 1 | This document describes Linux ptrace implementation in Linux kernels |
| 2 | version 3.0.0. (Update this notice if you update the document |
| 3 | to reflect newer kernels). |
| 4 | |
| 5 | |
| 6 | Ptrace userspace API. |
| 7 | |
| 8 | Ptrace API (ab)uses standard Unix parent/child signaling over waitpid. |
| 9 | An unfortunate effect of it is that resulting API is complex and has |
| 10 | subtle quirks. This document aims to describe these quirks. |
| 11 | |
| 12 | Debugged processes (tracees) first need to be attached to the debugging |
| 13 | process (tracer). Attachment and subsequent commands are per-thread: in |
| 14 | multi-threaded process, every thread can be individually attached to a |
| 15 | (potentially different) tracer, or left not attached and thus not |
| 16 | debugged. Therefore, "tracee" always means "(one) thread", never "a |
| 17 | (possibly multi-threaded) process". Ptrace commands are always sent to |
| 18 | a specific tracee using ptrace(PTRACE_foo, pid, ...), where pid is a |
| 19 | TID of the corresponding Linux thread. |
| 20 | |
| 21 | After attachment, each tracee can be in one of two states: running or stopped. |
| 22 | |
| 23 | There are many kinds of states when tracee is stopped, and in ptrace |
| 24 | discussions they are often conflated. Therefore, it is important to use |
| 25 | precise terms. |
| 26 | |
| 27 | In this document, any stopped state in which tracee is ready to accept |
| 28 | ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can |
| 29 | be further subdivided into signal-delivery-stop, group-stop, |
| 30 | syscall-stop and so on. They are described in detail later. |
| 31 | |
| 32 | |
| 33 | 1.x Death under ptrace. |
| 34 | |
| 35 | When a (possibly multi-threaded) process receives a killing signal (a |
| 36 | signal set to SIG_DFL and whose default action is to kill the process), |
| 37 | all threads exit. Tracees report their death to the tracer(s). This is |
| 38 | not a ptrace-stop (because tracer can't query tracee status such as |
| 39 | register contents, cannot restart tracee etc) but the notification |
| 40 | about this event is delivered through waitpid API similarly to |
| 41 | ptrace-stop. |
| 42 | |
| 43 | Note that killing signal will first cause signal-delivery-stop (on one |
| 44 | tracee only), and only after it is injected by tracer (or after it was |
| 45 | dispatched to a thread which isn't traced), death from signal will |
| 46 | happen on ALL tracees within multi-threaded process. |
| 47 | |
| 48 | SIGKILL operates similarly, with exceptions. No signal-delivery-stop is |
| 49 | generated for SIGKILL and therefore tracer can't suppress it. SIGKILL |
| 50 | kills even within syscalls (syscall-exit-stop is not generated prior to |
| 51 | death by SIGKILL). The net effect is that SIGKILL always kills the |
| 52 | process (all its threads), even if some threads of the process are |
| 53 | ptraced. |
| 54 | |
| 55 | Tracer can kill a tracee with ptrace(PTRACE_KILL, pid, 0, 0). This |
| 56 | operation is deprecated; use kill/tgkill(SIGKILL) instead. |
| 57 | |
| 58 | ^^^ Oleg prefers to deprecate it instead of describing (and needing to |
| 59 | support) PTRACE_KILL's quirks. |
| 60 | |
| 61 | When tracee executes exit syscall, it reports its death to its tracer. |
| 62 | Other threads are not affected. |
| 63 | |
| 64 | When any thread executes exit_group syscall, every tracee in its thread |
| 65 | group reports its death to its tracer. |
| 66 | |
| 67 | If PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen |
| 68 | before actual death. This applies to exits on exit syscall, exit_group |
| 69 | syscall, signal deaths (except SIGKILL), and when threads are torn down |
| 70 | on execve in multi-threaded process. |
| 71 | |
| 72 | Tracer cannot assume that ptrace-stopped tracee exists. There are many |
| 73 | scenarios when tracee may die while stopped (such as SIGKILL). |
| 74 | Therefore, tracer must always be prepared to handle ESRCH error on any |
| 75 | ptrace operation. Unfortunately, the same error is returned if tracee |
| 76 | exists but is not ptrace-stopped (for commands which require stopped |
| 77 | tracee), or if it is not traced by process which issued ptrace call. |
| 78 | Tracer needs to keep track of stopped/running state, and interpret |
| 79 | ESRCH as "tracee died unexpectedly" only if it knows that tracee has |
| 80 | been observed to enter ptrace-stop. Note that there is no guarantee |
| 81 | that waitpid(WNOHANG) will reliably report tracee's death status if |
| 82 | ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead. |
| 83 | IOW: tracee may be "not yet fully dead" but already refusing ptrace ops. |
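The ESRCH rule above can be sketched as a small decision helper. This is only an illustration: the enum names and the idea of keeping a per-tracee "known stopped" flag are assumptions of this sketch, not part of the ptrace API.

```c
#include <errno.h>

/* Hypothetical tracer-side bookkeeping: what the tracer last observed. */
enum tracee_run_state { TRACEE_RUNNING, TRACEE_STOPPED };

enum ptrace_err_meaning {
    ERR_NOT_ESRCH,    /* some other error: handle separately */
    ERR_TRACEE_DIED,  /* tracee was seen in ptrace-stop, so it must have died */
    ERR_AMBIGUOUS     /* tracee may be dead, not stopped, or not ours */
};

/* Interpret errno from a failed ptrace call using the tracked state. */
static enum ptrace_err_meaning
interpret_ptrace_errno(int err, enum tracee_run_state st)
{
    if (err != ESRCH)
        return ERR_NOT_ESRCH;
    return st == TRACEE_STOPPED ? ERR_TRACEE_DIED : ERR_AMBIGUOUS;
}
```

Even in the ERR_TRACEE_DIED case, tracer should not expect waitpid(WNOHANG) to report the death immediately, as noted above.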
| 84 | |
| 85 | Tracer cannot assume that tracee ALWAYS ends its life by reporting |
| 86 | WIFEXITED(status) or WIFSIGNALED(status). |
| 87 | |
| 88 | ??? or can it? Do we include such a promise into ptrace API? |
| 89 | |
| 90 | |
| 91 | 1.x Stopped states. |
| 92 | |
| 93 | When running tracee enters ptrace-stop, it notifies its tracer using |
| 94 | waitpid API. Tracer should use waitpid family of syscalls to wait for |
| 95 | tracee to stop. Most of this document assumes that tracer waits with: |
| 96 | |
| 97 | pid = waitpid(pid_or_minus_1, &status, __WALL); |
| 98 | |
| 99 | Ptrace-stopped tracees are reported as returns with pid > 0 and |
| 100 | WIFSTOPPED(status) == true. |
| 101 | |
| 102 | ??? Do we require __WALL usage, or will just using 0 be ok? Are the |
| 103 | rules different if user wants to use waitid? Will waitid require |
| 104 | WEXITED? |
| 105 | |
| 106 | __WALL value does not include WSTOPPED and WEXITED bits, but implies |
| 107 | their functionality. |
| 108 | |
| 109 | Setting of WCONTINUED bit in waitpid flags is not recommended: the |
| 110 | continued state is per-process and consuming it can confuse real parent |
| 111 | of the tracee. |
| 112 | |
| 113 | Use of WNOHANG bit in waitpid flags may cause waitpid to return 0 ("no |
| 114 | wait results available yet") even if tracer knows there should be a |
| 115 | notification. Example: kill(tracee, SIGKILL); waitpid(tracee, &status, |
| 116 | __WALL | WNOHANG); |
| 117 | |
| 118 | ??? waitid usage? WNOWAIT? |
| 119 | |
| 120 | ??? describe how wait notifications queue (or not queue) |
| 121 | |
| 122 | The following kinds of ptrace-stops exist: signal-delivery-stops, |
| 123 | group-stop, PTRACE_EVENT stops, syscall-stops [, SINGLESTEP, SYSEMU, |
| 124 | SYSEMU_SINGLESTEP]. They all are reported as waitpid result with |
| 125 | WIFSTOPPED(status) == true. They may be differentiated by checking |
| 126 | (status >> 8) value, and if looking at (status >> 8) value doesn't |
| 127 | resolve ambiguity, by querying PTRACE_GETSIGINFO. (Note: |
| 128 | WSTOPSIG(status) macro returns ((status >> 8) & 0xff) value). |
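A first-pass classification of a waitpid status along these lines might look like the sketch below. The enum and function names are hypothetical. The syscall-stop case assumes PTRACE_O_TRACESYSGOOD is set. The WK_SIGNAL_STOP bucket is deliberately coarse, because telling signal-delivery-stop from group-stop still needs PTRACE_GETSIGINFO.

```c
#include <signal.h>
#include <sys/wait.h>

enum wait_kind {
    WK_EXITED,        /* WIFEXITED */
    WK_KILLED,        /* WIFSIGNALED */
    WK_EVENT_STOP,    /* PTRACE_EVENT stop */
    WK_SYSCALL_STOP,  /* syscall-stop, with PTRACE_O_TRACESYSGOOD set */
    WK_SIGNAL_STOP,   /* signal-delivery-stop or group-stop (ambiguous) */
    WK_OTHER
};

static enum wait_kind classify_wait_status(int status)
{
    if (WIFEXITED(status))
        return WK_EXITED;
    if (WIFSIGNALED(status))
        return WK_KILLED;
    if (!WIFSTOPPED(status))
        return WK_OTHER;

    int sig = WSTOPSIG(status);          /* (status >> 8) & 0xff */
    int event = (status >> 16) & 0xff;   /* nonzero for PTRACE_EVENT stops */

    if (sig == SIGTRAP && event != 0)
        return WK_EVENT_STOP;
    if (sig == (SIGTRAP | 0x80))
        return WK_SYSCALL_STOP;
    return WK_SIGNAL_STOP;
}
```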
| 129 | |
| 130 | |
| 131 | 1.x.x Signal-delivery-stop |
| 132 | |
| 133 | When (possibly multi-threaded) process receives any signal except |
| 134 | SIGKILL, kernel selects a thread which handles the signal (if signal is |
| 135 | generated with t[g]kill, thread selection is done by user). If selected |
| 136 | thread is traced, it enters signal-delivery-stop. By this point, signal |
| 137 | is not yet delivered to the process, and can be suppressed by tracer. |
| 138 | If tracer doesn't suppress the signal, it passes signal to tracee in |
| 139 | the next ptrace request. This second step of signal delivery is called |
| 140 | "signal injection" in this document. Note that if signal is blocked, |
| 141 | signal-delivery-stop doesn't happen until signal is unblocked, with the |
| 142 | usual exception that SIGSTOP can't be blocked. |
| 143 | |
| 144 | Signal-delivery-stop is observed by tracer as waitpid returning with |
| 145 | WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. If |
| 146 | WSTOPSIG(status) == SIGTRAP, this may be a different kind of |
| 147 | ptrace-stop - see "Syscall-stops" and "execve" sections below for |
| 148 | details. If WSTOPSIG(status) == stopping signal, this may be a |
| 149 | group-stop - see below. |
| 150 | |
| 151 | |
| 152 | 1.x.x Signal injection and suppression. |
| 153 | |
| 154 | After signal-delivery-stop is observed by tracer, tracer should restart |
| 155 | tracee with |
| 156 | |
| 157 | ptrace(PTRACE_rest, pid, 0, sig) |
| 158 | |
| 159 | call, where PTRACE_rest is one of the restarting ptrace ops. If sig is |
| 160 | 0, then signal is not delivered. Otherwise, signal sig is delivered. |
| 161 | This operation is called "signal injection" in this document, to |
| 162 | distinguish it from signal-delivery-stop. |
| 163 | |
| 164 | Note that sig value may be different from WSTOPSIG(status) value - |
| 165 | tracer can cause a different signal to be injected. |
| 166 | |
| 167 | Note that suppressed signal still causes syscalls to return |
| 168 | prematurely. Restartable syscalls will be restarted (tracer will |
| 169 | observe tracee to execute restart_syscall(2) syscall if tracer uses |
| 170 | PTRACE_SYSCALL), non-restartable syscalls (for example, nanosleep) may |
| 171 | return with -EINTR even though no observable signal is injected to the |
| 172 | tracee. |
| 173 | |
| 174 | Note that restarting ptrace commands issued in ptrace-stops other than |
| 175 | signal-delivery-stop are not guaranteed to inject a signal, even if sig |
| 176 | is nonzero. No error is reported, nonzero sig may simply be ignored. |
| 177 | Ptrace users should not try to "create new signal" this way: use |
| 178 | tgkill(2) instead. |
| 179 | |
| 180 | This is a cause of confusion among ptrace users. One typical scenario |
| 181 | is that tracer observes group-stop, mistakes it for |
| 182 | signal-delivery-stop, restarts tracee with ptrace(PTRACE_rest, pid, 0, |
| 183 | stopsig) with the intention of injecting stopsig, but stopsig gets |
| 184 | ignored and tracee continues to run. |
| 185 | |
| 186 | SIGCONT signal has a side effect of waking up (all threads of) |
| 187 | group-stopped process. This side effect happens before |
| 188 | signal-delivery-stop. Tracer can't suppress this side-effect (it can |
| 189 | only suppress signal injection, which only causes SIGCONT handler to |
| 190 | not be executed in the tracee, if such handler is installed). In fact, |
| 191 | waking up from group-stop may be followed by signal-delivery-stop for |
| 192 | signal(s) *other than* SIGCONT, if they were pending when SIGCONT was |
| 193 | delivered. IOW: SIGCONT may not be the first signal observed by the |
| 194 | tracee after it was sent. |
| 195 | |
| 196 | Stopping signals cause (all threads of) process to enter group-stop. |
| 197 | This side effect happens after signal injection, and therefore can be |
| 198 | suppressed by tracer. |
| 199 | |
| 200 | PTRACE_GETSIGINFO can be used to retrieve siginfo_t structure which |
| 201 | corresponds to delivered signal. PTRACE_SETSIGINFO may be used to |
| 202 | modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t, |
| 203 | si_signo field and sig parameter in restarting command must match, |
| 204 | otherwise the result is undefined. |
| 205 | |
| 206 | |
| 207 | 1.x.x Group-stop |
| 208 | |
| 209 | When a (possibly multi-threaded) process receives a stopping signal, |
| 210 | all threads stop. If some threads are traced, they enter a group-stop. |
| 211 | Note that stopping signal will first cause signal-delivery-stop (on one |
| 212 | tracee only), and only after it is injected by tracer (or after it was |
| 213 | dispatched to a thread which isn't traced), group-stop will be |
| 214 | initiated on ALL tracees within multi-threaded process. As usual, every |
| 215 | tracee reports its group-stop separately to corresponding tracer. |
| 216 | |
| 217 | Group-stop is observed by tracer as waitpid returning with |
| 218 | WIFSTOPPED(status) == true, WSTOPSIG(status) == signal. The same result |
| 219 | is returned by some other classes of ptrace-stops, therefore the |
| 220 | recommended practice is to perform |
| 221 | |
| 222 | ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo) |
| 223 | |
| 224 | call. The call can be avoided if signal number is not SIGSTOP, SIGTSTP, |
| 225 | SIGTTIN or SIGTTOU - only these four signals are stopping signals. If |
| 226 | tracer sees something else, it can't be group-stop. Otherwise, tracer |
| 227 | needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails with |
| 228 | EINVAL, then it is definitely a group-stop. (Other failure codes are |
| 229 | possible, such as ESRCH "no such process" if SIGKILL killed the tracee). |
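The detection procedure above can be sketched as follows. Function names are hypothetical. The return convention (1 = group-stop, 0 = some other stop, -1 = other failure with errno set) is a choice made for this sketch only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Only these four signals are stopping signals. */
static int is_stopping_signal(int sig)
{
    return sig == SIGSTOP || sig == SIGTSTP ||
           sig == SIGTTIN || sig == SIGTTOU;
}

/* 1: group-stop, 0: another kind of stop, -1: other error (e.g. ESRCH). */
static int is_group_stop(pid_t pid, int status)
{
    siginfo_t si;

    if (!WIFSTOPPED(status) || !is_stopping_signal(WSTOPSIG(status)))
        return 0;   /* can't be group-stop, no ptrace call needed */
    if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == 0)
        return 0;   /* siginfo exists: not a group-stop */
    return errno == EINVAL ? 1 : -1;
}
```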
| 230 | |
| 231 | As of kernel 2.6.38, after tracer sees tracee ptrace-stop and until it |
| 232 | restarts or kills it, tracee will not run, and will not send |
| 233 | notifications (except SIGKILL death) to tracer, even if tracer enters |
| 234 | into another waitpid call. |
| 235 | |
| 236 | Currently, it causes a problem with transparent handling of stopping |
| 237 | signals: if tracer restarts tracee after group-stop, SIGSTOP is |
| 238 | effectively ignored: tracee doesn't remain stopped, it runs. If tracer |
| 239 | doesn't restart tracee before entering into next waitpid, future |
| 240 | SIGCONT will not be reported to the tracer, which would make SIGCONT |
| 241 | have no effect. |
| 242 | |
| 243 | |
| 244 | 1.x.x PTRACE_EVENT stops |
| 245 | |
| 246 | If tracer sets PTRACE_O_TRACEfoo options, tracee will enter ptrace-stops |
| 247 | called PTRACE_EVENT stops. |
| 248 | |
| 249 | PTRACE_EVENT stops are observed by tracer as waitpid returning with |
| 250 | WIFSTOPPED(status) == true, WSTOPSIG(status) == SIGTRAP. Additional bit |
| 251 | is set in a higher byte of status word: value ((status >> 8) & 0xffff) |
| 252 | will be (SIGTRAP | PTRACE_EVENT_foo << 8). The following events exist: |
| 253 | |
| 254 | PTRACE_EVENT_VFORK - stop before return from vfork/clone+CLONE_VFORK. |
| 255 | When tracee is continued after this, it will wait for child to |
| 256 | exit/exec before continuing its execution (IOW: usual behavior on |
| 257 | vfork). |
| 258 | |
| 259 | PTRACE_EVENT_FORK - stop before return from fork/clone+SIGCHLD |
| 260 | |
| 261 | PTRACE_EVENT_CLONE - stop before return from clone |
| 262 | |
| 263 | PTRACE_EVENT_VFORK_DONE - stop before return from |
| 264 | vfork/clone+CLONE_VFORK, but after vfork child unblocked this tracee by |
| 265 | exiting or exec'ing. |
| 266 | |
| 267 | For all four stops described above: stop occurs in parent, not in newly |
| 268 | created thread. PTRACE_GETEVENTMSG can be used to retrieve new thread's |
| 269 | tid. |
| 270 | |
| 271 | PTRACE_EVENT_EXEC - stop before return from exec. |
| 272 | |
| 273 | PTRACE_EVENT_EXIT - stop before exit (including death from exit_group), |
| 274 | signal death, or exit caused by execve in multi-threaded process. |
| 275 | PTRACE_GETEVENTMSG returns exit status. Registers can be examined |
| 276 | (unlike when "real" exit happens). The tracee is still alive, it needs |
| 277 | to be PTRACE_CONTed or PTRACE_DETACHed to finish exit. |
| 278 | |
| 279 | PTRACE_GETSIGINFO on PTRACE_EVENT stops returns si_signo = SIGTRAP, |
| 280 | si_code = (event << 8) | SIGTRAP. |
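The status encoding described in this section can be decoded with a helper like the one below. The helper names are hypothetical. The PTRACE_EVENT_foo constants are taken from <sys/ptrace.h> as provided by glibc.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Returns PTRACE_EVENT_foo value, or 0 if status is not an event stop. */
static int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (status >> 16) & 0xff;
}

/* si_code that PTRACE_GETSIGINFO is expected to report for this event. */
static int event_si_code(int event)
{
    return (event << 8) | SIGTRAP;
}

static const char *ptrace_event_name(int event)
{
    switch (event) {
    case PTRACE_EVENT_FORK:       return "FORK";
    case PTRACE_EVENT_VFORK:      return "VFORK";
    case PTRACE_EVENT_CLONE:      return "CLONE";
    case PTRACE_EVENT_EXEC:       return "EXEC";
    case PTRACE_EVENT_VFORK_DONE: return "VFORK_DONE";
    case PTRACE_EVENT_EXIT:       return "EXIT";
    default:                      return "(none)";
    }
}
```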
| 281 | |
| 282 | |
| 283 | 1.x.x Syscall-stops |
| 284 | |
| 285 | If tracee was restarted by PTRACE_SYSCALL, tracee enters |
| 286 | syscall-enter-stop just prior to entering any syscall. If tracer |
| 287 | restarts it with PTRACE_SYSCALL, tracee enters syscall-exit-stop when |
| 288 | syscall is finished, or if it is interrupted by a signal. (That is, |
| 289 | signal-delivery-stop never happens between syscall-enter-stop and |
| 290 | syscall-exit-stop, it happens *after* syscall-exit-stop). |
| 291 | |
| 292 | Other possibilities are that tracee may stop in a PTRACE_EVENT stop, |
| 293 | exit (if it entered exit or exit_group syscall), be killed by SIGKILL, |
| 294 | or die silently (if execve syscall happened in another thread). |
| 295 | |
| 296 | Syscall-enter-stop and syscall-exit-stop are observed by tracer as |
| 297 | waitpid returning with WIFSTOPPED(status) == true, WSTOPSIG(status) == |
| 298 | SIGTRAP. If PTRACE_O_TRACESYSGOOD option was set by tracer, then |
| 299 | WSTOPSIG(status) == (SIGTRAP | 0x80). |
| 300 | |
| 301 | Syscall-stops can be distinguished from signal-delivery-stop with |
| 302 | SIGTRAP by querying PTRACE_GETSIGINFO: si_code <= 0 if sent by usual |
| 303 | suspects like [tg]kill/sigqueue/etc; or = SI_KERNEL (0x80) if sent by |
| 304 | kernel, whereas syscall-stops have si_code = SIGTRAP or (SIGTRAP | |
| 305 | 0x80). However, syscall-stops happen very often (twice per syscall), |
| 306 | and performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat |
| 307 | expensive. |
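For tracers which do query siginfo, the si_code test above might be written as follows. This is a sketch with hypothetical function names. It checks a siginfo_t which was already filled in by PTRACE_GETSIGINFO.

```c
#define _GNU_SOURCE
#include <signal.h>

/* 1 if siginfo looks like a syscall-stop: si_code is SIGTRAP or
 * (SIGTRAP | 0x80). */
static int siginfo_is_syscall_stop(const siginfo_t *si)
{
    return si->si_signo == SIGTRAP &&
           (si->si_code == SIGTRAP || si->si_code == (SIGTRAP | 0x80));
}

/* 1 if siginfo looks like a real SIGTRAP signal: sent by userspace
 * (si_code <= 0) or by kernel (SI_KERNEL). */
static int siginfo_is_real_sigtrap(const siginfo_t *si)
{
    return si->si_signo == SIGTRAP &&
           (si->si_code <= 0 || si->si_code == SI_KERNEL);
}
```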
| 308 | |
| 309 | On some architectures they can be distinguished by examining registers. |
| 310 | For example, on x86 rax = -ENOSYS in syscall-enter-stop. Since SIGTRAP |
| 311 | (like any other signal) always happens *after* syscall-exit-stop, and |
| 312 | at this point rax almost never contains -ENOSYS, SIGTRAP looks like |
| 313 | "syscall-stop which is not syscall-enter-stop", IOW: it looks like a |
| 314 | "stray syscall-exit-stop" and can be detected this way. But such |
| 315 | detection is fragile and is best avoided. |
| 316 | |
| 317 | Using PTRACE_O_TRACESYSGOOD option is a recommended method, since it is |
| 318 | reliable and does not incur performance penalty. |
| 319 | |
| 320 | Syscall-enter-stop and syscall-exit-stop are indistinguishable from |
| 321 | each other by tracer. Tracer needs to keep track of the sequence of |
| 322 | ptrace-stops in order to not misinterpret syscall-enter-stop as |
| 323 | syscall-exit-stop or vice versa. The rule is that syscall-enter-stop is |
| 324 | always followed by syscall-exit-stop, PTRACE_EVENT stop or tracee's |
| 325 | death - no other kinds of ptrace-stop can occur in between. |
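A minimal way to keep track of the sequence is a per-tracee flag which flips on every syscall-stop. The sketch below makes two assumptions: the tracer always restarts the tracee with PTRACE_SYSCALL, and it can recognize syscall-stops (for example via PTRACE_O_TRACESYSGOOD). All names are hypothetical.

```c
/* Hypothetical per-tracee state kept by the tracer. */
struct syscall_tracker {
    int in_syscall;   /* 1 between syscall-enter-stop and syscall-exit-stop */
};

enum stop_kind { STOP_SYSCALL, STOP_EVENT, STOP_OTHER };
enum syscall_phase { PHASE_NONE, PHASE_ENTER, PHASE_EXIT };

/* Call once per observed ptrace-stop. PTRACE_EVENT stops may occur
 * between enter and exit and do not change the flag. */
static enum syscall_phase
track_stop(struct syscall_tracker *t, enum stop_kind kind)
{
    if (kind != STOP_SYSCALL)
        return PHASE_NONE;
    t->in_syscall = !t->in_syscall;
    return t->in_syscall ? PHASE_ENTER : PHASE_EXIT;
}
```

If the tracer restarts the tracee with a command other than PTRACE_SYSCALL after syscall-enter-stop, it would need to clear in_syscall itself, since no syscall-exit-stop will follow.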
| 326 | |
| 327 | If after syscall-enter-stop tracer uses restarting command other than |
| 328 | PTRACE_SYSCALL, syscall-exit-stop is not generated. |
| 329 | |
| 330 | PTRACE_GETSIGINFO on syscall-stops returns si_signo = SIGTRAP, si_code |
| 331 | = SIGTRAP or (SIGTRAP | 0x80). |
| 332 | |
| 333 | |
| 334 | 1.x.x SINGLESTEP, SYSEMU, SYSEMU_SINGLESTEP |
| 335 | |
| 336 | ??? document PTRACE_SINGLESTEP, PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP |
| 337 | |
| 338 | |
| 339 | 1.x Informational and restarting ptrace commands. |
| 340 | |
| 341 | Most ptrace commands (all except ATTACH, TRACEME, KILL) require tracee |
| 342 | to be in ptrace-stop, otherwise they fail with ESRCH. |
| 343 | |
| 344 | When tracee is in ptrace-stop, tracer can read and write data to tracee |
| 345 | using informational commands. They leave tracee in ptrace-stopped state: |
| 346 | |
| 347 | longv = ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0); |
| 348 | ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val); |
| 349 | ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct); |
| 350 | ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct); |
| 351 | ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo); |
| 352 | ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo); |
| 353 | ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var); |
| 354 | ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags); |
| 355 | |
| 356 | Note that some errors are not reported. For example, setting siginfo |
| 357 | may have no effect in some ptrace-stops, yet the call may succeed |
| 358 | (return 0 and not set errno). |
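One detail of the PEEK commands is worth a sketch. The glibc wrapper returns the fetched word as the call's result, so a result of -1 is ambiguous. The usual practice (described in ptrace(2), not in this document) is to clear errno before the call and check it afterwards. The helper name below is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Reads one word from tracee memory at addr.
 * Returns 0 and stores the word in *out, or -1 with errno set. */
static int peek_word(pid_t pid, void *addr, long *out)
{
    long v;

    errno = 0;
    v = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (v == -1 && errno != 0)
        return -1;   /* real failure, e.g. ESRCH if tracee is not stopped */
    *out = v;        /* note: -1 may be a legitimate data value */
    return 0;
}
```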
| 359 | |
| 360 | ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags) affects one tracee. |
| 361 | Current flags are replaced. Flags are inherited by new tracees created |
| 362 | and "auto-attached" via active PTRACE_O_TRACE[V]FORK or |
| 363 | PTRACE_O_TRACECLONE options. |
| 364 | |
| 365 | Another group of commands makes ptrace-stopped tracee run. They have |
| 366 | the form: |
| 367 | |
| 368 | ptrace(PTRACE_cmd, pid, 0, sig); |
| 369 | |
| 370 | where cmd is CONT, DETACH, SYSCALL, SINGLESTEP, SYSEMU, or |
| 371 | SYSEMU_SINGLESTEP. If tracee is in signal-delivery-stop, sig is the |
| 372 | signal to be injected. Otherwise, sig may be ignored. |
| 373 | |
| 374 | |
| 375 | 1.x Attaching and detaching |
| 376 | |
| 377 | A thread can be attached to tracer using ptrace(PTRACE_ATTACH, pid, 0, |
| 378 | 0) call. This also sends SIGSTOP to this thread. If tracer wants this |
| 379 | SIGSTOP to have no effect, it needs to suppress it. Note that if other |
| 380 | signals are concurrently sent to this thread during attach, tracer may |
| 381 | see tracee enter signal-delivery-stop with other signal(s) first! The |
| 382 | usual practice is to reinject these signals until SIGSTOP is seen, then |
| 383 | suppress SIGSTOP injection. The design bug here is that attach and |
| 384 | concurrent SIGSTOP are racing and SIGSTOP may be lost. |
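The usual attach practice described above can be sketched like this. The function name is hypothetical and error handling is minimal. The race with a concurrent SIGSTOP mentioned above is not solved by this code.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Attaches to thread tid and waits until its SIGSTOP signal-delivery-stop
 * is seen. Returns 0 with the tracee left stopped, or -1 on failure. */
static int attach_and_stop(pid_t tid)
{
    int status;

    if (ptrace(PTRACE_ATTACH, tid, NULL, NULL) == -1)
        return -1;
    for (;;) {
        if (waitpid(tid, &status, __WALL) != tid)
            return -1;
        if (!WIFSTOPPED(status))
            return -1;   /* tracee died while we were attaching */
        if (WSTOPSIG(status) == SIGSTOP)
            return 0;    /* caller later restarts it with sig == 0 */
        /* Another signal arrived first: reinject it and keep waiting. */
        if (ptrace(PTRACE_CONT, tid, NULL,
                   (void *)(long)WSTOPSIG(status)) == -1)
            return -1;
    }
}
```

After a successful return, restarting the tracee with ptrace(PTRACE_CONT, tid, 0, 0) suppresses the SIGSTOP.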
| 385 | |
| 386 | ??? Describe how to attach to a thread which is already group-stopped. |
| 387 | |
| 388 | Since attaching sends SIGSTOP and tracer usually suppresses it, this |
| 389 | may cause stray EINTR return from the currently executing syscall in |
| 390 | the tracee, as described in "signal injection and suppression" section. |
| 391 | |
| 392 | ptrace(PTRACE_TRACEME, 0, 0, 0) request turns current thread into a |
| 393 | tracee. It continues to run (doesn't enter ptrace-stop). A common |
| 394 | practice is to follow ptrace(PTRACE_TRACEME) with raise(SIGSTOP) and |
| 395 | allow parent (which is our tracer now) to observe our |
| 396 | signal-delivery-stop. |
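The PTRACE_TRACEME practice can be sketched end to end as below. The function name is hypothetical. The child simply exits with the given code instead of calling execve, which keeps the example short.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child which asks to be traced, observes its SIGSTOP
 * signal-delivery-stop, suppresses the signal and returns the child's
 * exit code, or -1 if anything unexpected happens. */
static int run_traced_child(int exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);    /* parent observes signal-delivery-stop */
        _exit(exit_code);
    }
    if (waitpid(pid, &status, __WALL) != pid || !WIFSTOPPED(status))
        return -1;
    if (WSTOPSIG(status) != SIGSTOP) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, __WALL);
        return -1;
    }
    /* sig == 0 suppresses SIGSTOP, so the child does not group-stop. */
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1)
        return -1;
    if (waitpid(pid, &status, __WALL) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```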
| 397 | |
| 398 | If PTRACE_O_TRACE[V]FORK or PTRACE_O_TRACECLONE options are in effect, |
| 399 | then children created by (vfork or clone(CLONE_VFORK)), (fork or |
| 400 | clone(SIGCHLD)) and (other kinds of clone) respectively are |
| 401 | automatically attached to the same tracer which traced their parent. |
| 402 | SIGSTOP is delivered to them, causing them to enter |
| 403 | signal-delivery-stop after they exit syscall which created them. |
| 404 | |
| 405 | Detaching of tracee is performed by ptrace(PTRACE_DETACH, pid, 0, sig). |
| 406 | PTRACE_DETACH is a restarting operation, therefore it requires tracee |
| 407 | to be in ptrace-stop. If tracee is in signal-delivery-stop, signal can |
| 408 | be injected. Otherwise, sig parameter may be silently ignored. |
| 409 | |
| 410 | If tracee is running when tracer wants to detach it, the usual solution |
| 411 | is to send SIGSTOP (using tgkill, to make sure it goes to the correct |
| 412 | thread), wait for tracee to stop in signal-delivery-stop for SIGSTOP |
| 413 | and then detach it (suppressing SIGSTOP injection). Design bug is that |
| 414 | this can race with concurrent SIGSTOPs. Another complication is that |
| 415 | tracee may enter other ptrace-stops and needs to be restarted and |
| 416 | waited for again, until SIGSTOP is seen. Yet another complication is to |
| 417 | be sure that tracee is not already ptrace-stopped, because no signal |
| 418 | delivery happens while it is - not even SIGSTOP. |
| 419 | |
| 420 | ??? Describe how to detach from a group-stopped tracee so that it |
| 421 | doesn't run, but continues to wait for SIGCONT. |
| 422 | |
| 423 | If tracer dies, all tracees are automatically detached and restarted, |
| 424 | unless they were in group-stop. Handling of restart from group-stop is |
| 425 | currently buggy, but "as planned" behavior is to leave tracee stopped |
| 426 | and waiting for SIGCONT. If tracee is restarted from |
| 427 | signal-delivery-stop, pending signal is injected. |
| 428 | |
| 429 | |
| 430 | 1.x execve under ptrace. |
| 431 | |
| 432 | During execve, kernel destroys all other threads in the process, and |
| 433 | resets execve'ing thread tid to tgid (process id). This looks very |
| 434 | confusing to tracers: |
| 435 | |
| 436 | All other threads stop in PTRACE_EVENT_EXIT stop, if requested by active |
| 437 | ptrace option. Then all other threads except thread group leader report |
| 438 | death as if they exited via exit syscall with exit code 0. Then |
| 439 | PTRACE_EVENT_EXEC stop happens, if requested by active ptrace option |
| 440 | (on which tracee - leader? execve-ing one?). |
| 441 | |
| 442 | The execve-ing tracee changes its pid while it is in execve syscall. |
| 443 | (Remember, under ptrace 'pid' returned from waitpid, or fed into ptrace |
| 444 | calls, is tracee's tid). That is, pid is reset to process id, which |
| 445 | coincides with thread group leader tid. |
| 446 | |
| 447 | If thread group leader has reported its death by this time, for tracer |
| 448 | this looks like dead thread leader "reappears from nowhere". If thread |
| 449 | group leader was still alive, for tracer this may look as if thread |
| 450 | group leader returns from a different syscall than it entered, or even |
| 451 | "returned from syscall even though it was not in any syscall". If |
| 452 | thread group leader was not traced (or was traced by a different |
| 453 | tracer), during execve it will appear as if it has become a tracee of |
| 454 | the tracer of execve'ing tracee. All these effects are the artifacts of |
| 455 | pid change. |
| 456 | |
| 457 | PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this |
| 458 | case. It enables PTRACE_EVENT_EXEC stop which occurs before execve |
| 459 | syscall return. |
| 460 | |
| 461 | Pid change happens before PTRACE_EVENT_EXEC stop, not after. |
| 462 | |
| 463 | When tracer receives PTRACE_EVENT_EXEC stop notification, it is |
| 464 | guaranteed that except this tracee and thread group leader, no other |
| 465 | threads from the process are alive. |
| 466 | |
| 467 | On receiving this notification, tracer should clean up all its internal |
| 468 | data structures about all threads of this process, and retain only one |
| 469 | data structure, one which describes single still running tracee, with |
| 470 | pid = tgid = process id. |
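The cleanup step can be sketched over a simple array of tracer-side records. The record layout and function name are hypothetical. The sketch assumes the tracer stores a tgid for every tracee, which it needs to do anyway to know thread group relations.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical tracer-side record for one tracee. */
struct tracee_rec {
    pid_t tid;
    pid_t tgid;
};

/* On PTRACE_EVENT_EXEC for process tgid: drop all records of that thread
 * group and keep a single one with tid == tgid. Records of other
 * processes are preserved in order. Returns the new record count. */
static size_t collapse_on_exec(struct tracee_rec *tab, size_t n, pid_t tgid)
{
    size_t out = 0;
    int kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (tab[i].tgid != tgid) {
            tab[out++] = tab[i];
            continue;
        }
        if (!kept) {
            tab[out].tid = tgid;
            tab[out].tgid = tgid;
            out++;
            kept = 1;
        }
    }
    return out;
}
```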
| 471 | |
| 472 | Currently, there is no way to retrieve former pid of execve-ing tracee. |
| 473 | If tracer doesn't keep track of its tracees' thread group relations, it |
| 474 | may be unable to know which tracee execve-ed and therefore no longer |
| 475 | exists under old pid due to pid change. |
| 476 | |
| 477 | Example: two threads execve at the same time: |
| 478 | |
| 479 | ** we get syscall-enter-stop in thread 1: ** |
| 480 | PID1 execve("/bin/foo", "foo" <unfinished ...> |
| 481 | ** we issue PTRACE_SYSCALL for thread 1 ** |
| 482 | ** we get syscall-enter-stop in thread 2: ** |
| 483 | PID2 execve("/bin/bar", "bar" <unfinished ...> |
| 484 | ** we issue PTRACE_SYSCALL for thread 2 ** |
| 485 | ** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL ** |
| 486 | ** we get syscall-exit-stop for PID0: ** |
| 487 | PID0 <... execve resumed> ) = 0 |
| 488 | |
| 489 | In this situation there is no way to know which execve succeeded. |
| 490 | |
| 491 | If PTRACE_O_TRACEEXEC option is NOT in effect for the execve'ing |
| 492 | tracee, kernel delivers an extra SIGTRAP to tracee after execve syscall |
| 493 | returns. This is an ordinary signal (similar to one which can be |
| 494 | generated by "kill -TRAP"), not a special kind of ptrace-stop. |
| 495 | GETSIGINFO on it has si_code = 0 (SI_USER). It can be blocked by signal |
| 496 | mask, and thus can happen (much) later. |
| 497 | |
| 498 | Usually, tracer (for example, strace) would not want to show this extra |
| 499 | post-execve SIGTRAP signal to the user, and would suppress its delivery |
| 500 | to the tracee (if SIGTRAP is set to SIG_DFL, it is a killing signal). |
| 501 | However, determining *which* SIGTRAP to suppress is not easy. Setting |
| 502 | PTRACE_O_TRACEEXEC option and thus suppressing this extra SIGTRAP is |
| 503 | the recommended approach. |
| 504 | |
| 505 | |
| 506 | 1.x Real parent |
| 507 | |
| 508 | Ptrace API (ab)uses standard Unix parent/child signaling over waitpid. |
| 509 | This used to cause real parent of the process to stop receiving several |
| 510 | kinds of waitpid notifications when child process is traced by some |
| 511 | other process. |
| 512 | |
| 513 | Many of these bugs have been fixed, but as of 2.6.38 several still |
| 514 | exist. |
| 515 | |
| 516 | As of 2.6.38, the following is believed to work correctly: |
| 517 | |
| 518 | - exit/death by signal is reported first to tracer, then, when tracer |
| 519 | consumes waitpid result, to real parent (to real parent only when the |
| 520 | whole multi-threaded process exits). If they are the same process, the |
| 521 | report is sent only once. |
| 522 | |
| 523 | |
| 524 | 1.x Known bugs |
| 525 | |
| 526 | Following bugs still exist: |
| 527 | |
| 528 | Group-stop notifications are sent to tracer, but not to real parent. |
| 529 | Last confirmed on 2.6.38.6. |
| 530 | |
| 531 | If thread group leader is traced and exits by calling exit syscall, |
| 532 | PTRACE_EVENT_EXIT stop will happen for it (if requested), but subsequent |
| 533 | WIFEXITED notification will not be delivered until all other threads |
| 534 | exit. As explained above, if one of other threads execve's, thread |
| 535 | group leader death will *never* be reported. If execve-ed thread is not |
| 536 | traced by this tracer, tracer will never know that execve happened. |
| 537 | |
| 538 | ??? need to test this scenario |
| 539 | |
| 540 | One possible workaround is to detach thread group leader instead of |
| 541 | restarting it in this case. Last confirmed on 2.6.38.6. |
| 542 | |
| 543 | SIGKILL signal may still cause PTRACE_EVENT_EXIT stop before actual |
| 544 | signal death. This may be changed in the future - SIGKILL is meant to |
| 545 | always immediately kill tasks even under ptrace. Last confirmed on |
| 546 | 2.6.38.6. |