/* Make a thread the running thread.  The thread must previously have
   been sleeping, and not holding the CPU semaphore.  This will set the
   thread state to VgTs_Runnable, and the thread will attempt to take
   the CPU semaphore.  By the time it returns, tid will be the running
   thread. */
extern void VG_(set_running) ( ThreadId tid );

/* Set a thread into a sleeping state.  Before the call, the thread
   must be runnable, and holding the CPU semaphore.  When this call
   returns, the thread will be set to the specified sleeping state,
   and will not be holding the CPU semaphore.  Note that another
   thread could be running by the time this call returns, so the
   caller must be careful not to touch any shared state.  It is also
   the caller's responsibility to actually block until the thread is
   ready to run again. */
extern void VG_(set_sleeping) ( ThreadId tid, ThreadStatus state );


The master semaphore is run_sema in vg_scheduler.c.
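
A minimal sketch of the pipe-as-semaphore idea (not the actual
vg_scheduler.c code; the single-token-byte scheme and the sema_* names
are illustrative only):

#include <unistd.h>

static int sema_fds[2];              /* [0] = read end, [1] = write end */

static void sema_init(void)
{
   char token = 'T';
   pipe(sema_fds);
   write(sema_fds[1], &token, 1);    /* semaphore starts out available */
}

/* cf. VG_(set_running): take the CPU by reading the token. */
static void sema_down(void)
{
   char token;
   while (read(sema_fds[0], &token, 1) != 1)
      ;                              /* retry if interrupted by a signal */
}

/* cf. VG_(set_sleeping): give up the CPU by writing the token back. */
static void sema_up(void)
{
   char token = 'T';
   write(sema_fds[1], &token, 1);
}

Only one thread can hold the token at a time, so read() blocks until the
current holder writes it back.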

(what happens at a fork?)

VG_(scheduler_init) registers sched_fork_cleanup as a child atfork
handler.  sched_fork_cleanup, among other things, reinitializes the
semaphore with a new pipe, so the child process has its own.
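
A sketch of the same idea using plain pthread_atfork() (Valgrind has its
own atfork machinery, so this is illustrative only; it reuses the
sema_fds/sema_init names from the sketch above):

#include <pthread.h>
#include <unistd.h>

static void sched_fork_cleanup_sketch(void)
{
   /* Runs in the child only: drop the pipe shared with the parent and
      create a fresh one, with the token available to the child. */
   close(sema_fds[0]);
   close(sema_fds[1]);
   sema_init();
}

static void register_fork_handler(void)
{
   pthread_atfork(NULL /* prepare */, NULL /* parent */,
                  sched_fork_cleanup_sketch /* child */);
}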

--------------------------------------------------------------------

Re: New World signal handling
From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Julian Seward <jseward@acm.org>
Date: Mon Mar 14 09:03:51 2005

Well, the big-picture things to be clear about are:

   1. signal handlers are process-wide global state
   2. signal masks are per-thread (there's no notion of a process-wide
      signal mask)
   3. a signal can be targeted to either
         1. the whole process (any eligible thread is picked for
            delivery), or
         2. a specific thread

1 is why it is always a bug to temporarily reset a signal handler (say,
for SIGSEGV), because if any other thread happens to be sent one in that
window it will cause havoc (I think there's still one instance of this
in the symtab stuff).
2 is the meat of your questions; more below.
3 is responsible for some of the nitty-gritty detail in the signal stuff,
so it's worth bearing in mind to understand it all.  (Note that even if a
signal is targeting the whole process, it's only ever delivered to one
particular thread; there's no such thing as a broadcast signal.)

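To make point 3 concrete, this is roughly how the two kinds of targeting
look from user space (a hypothetical example, nothing to do with
Valgrind's own code):

#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

void send_examples(pid_t some_tid)
{
   /* Process-directed: the kernel picks any thread whose mask allows
      delivery. */
   kill(getpid(), SIGUSR1);

   /* Thread-directed: only the thread with kernel thread id some_tid
      can receive it. */
   syscall(SYS_tgkill, getpid(), some_tid, SIGUSR1);
}
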
While a thread is running core code or generated code, it has almost
all of its signals blocked (all but the fault signals: SEGV, BUS, ILL, etc).

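The mask in question looks something like this (a sketch, not the actual
core code; the exact set of fault signals left unblocked is illustrative):

#include <pthread.h>
#include <signal.h>

void block_all_but_faults(void)
{
   sigset_t mask;
   sigfillset(&mask);
   sigdelset(&mask, SIGSEGV);
   sigdelset(&mask, SIGBUS);
   sigdelset(&mask, SIGILL);
   sigdelset(&mask, SIGFPE);
   sigdelset(&mask, SIGTRAP);
   pthread_sigmask(SIG_SETMASK, &mask, NULL);
}
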
Every N basic blocks, each thread calls VG_(poll_signals) to see what
signals are pending for it.  poll_signals grabs the next pending signal
which the client signal mask doesn't block, and sets it up for delivery;
it uses the sigtimedwait() syscall to fetch blocked pending signals
rather than have them delivered to a signal handler.  This means that
we avoid the complexity of having signals delivered asynchronously via
the signal handlers; we can just poll for them synchronously when
they're easy to deal with.

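The core of that polling looks roughly like this (a sketch assuming
client_blocked holds the client's current mask; the real
VG_(poll_signals) also consults the per-thread queues described below):

#include <errno.h>
#include <signal.h>
#include <time.h>

int poll_one_signal(const sigset_t *client_blocked, siginfo_t *info)
{
   sigset_t wanted;
   struct timespec zero = { 0, 0 };   /* zero timeout: just poll */
   int s, signo;

   /* Ask for exactly those signals the client would accept.
      64 covers Linux's signal range. */
   sigfillset(&wanted);
   for (s = 1; s <= 64; s++)
      if (sigismember(client_blocked, s))
         sigdelset(&wanted, s);

   signo = sigtimedwait(&wanted, info, &zero);
   if (signo < 0 && errno == EAGAIN)
      return 0;                       /* nothing pending right now */
   return signo;                      /* caller sets up delivery */
}
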
Fault signals, being caused by a specific instruction, are the exception
because they can't be held off; if they're blocked when an instruction
raises one, the kernel will just summarily kill the process.  Therefore,
they need to be always unblocked, and the signal handler is called when
an instruction raises one of these exceptions.  (It's also necessary to
call poll_signals after any syscall which may raise a signal, since
signal-raising syscalls are considered to be synchronous with respect to
their signal; ie, calling kill(getpid(), SIGUSR1) will call the handler
for SIGUSR1 before kill is seen to complete.)

The one time when the thread's real signal mask actually matches the
client's requested signal mask is while running a blocking syscall.  We
have to set things up to accept signals during a syscall so that we get
the right signal-interrupts-syscall semantics.  The tricky part about
this is that there's no general atomic
set-signal-mask-and-block-in-syscall mechanism, so we need to fake it
with the stuff in VGA_(_client_syscall)/VGA_(interrupted_syscall).
These two basically form an explicit state machine, where the state
variable is the instruction pointer, which allows it to determine what
point the syscall got to when the async signal happens.  By keeping the
window where signals are actually unblocked very narrow, the number of
possible states is pretty small.

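The underlying problem can be sketched like this (purely illustrative;
do_the_syscall and the two mask arguments are stand-ins, and the real
code lives in VGA_(_client_syscall)/VGA_(interrupted_syscall)):

#include <pthread.h>
#include <signal.h>

extern long do_the_syscall(void);   /* stand-in for the real dispatcher */

long syscall_with_client_mask(const sigset_t *client_mask,
                              const sigset_t *valgrind_mask)
{
   long res;

   pthread_sigmask(SIG_SETMASK, client_mask, NULL);
   /* window A: an async signal may arrive before the syscall starts   */
   res = do_the_syscall();
   /* window B: an async signal may arrive after the syscall completes */
   pthread_sigmask(SIG_SETMASK, valgrind_mask, NULL);

   return res;
}

The signal handler works out which of the windows (or the syscall
itself) was interrupted by looking at the saved instruction pointer;
because the windows are tiny, only a handful of states are possible.
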
This is all quite nice because the kernel does almost all the work of
determining which thread should get a signal, what the correct action is
for a syscall when it has been interrupted, etc.  Particularly nice
is that we don't need to worry about all the queuing semantics, and the
per-signal special cases (which is, roughly, signals 1-32 are not queued
except when they are, and signals 33-64 are queued except when they aren't).

BUT, there's another complexity: because the Unix signal mechanism has
been overloaded to deal with two separate kinds of events (asynchronous
signals raised by kill(), and synchronous faults raised by an
instruction), we can't block a signal for one form and not the other.
That is, because we have to leave SIGSEGV unblocked for faulting
instructions, it also leaves us open to getting an async SIGSEGV sent
with kill(pid, SIGSEGV).

To handle this case, there's a small per-thread signal queue (I'm using
tid 0's queue for "signals sent to the whole process" - a hack, I'll
admit).  If an async SIGSEGV (etc) signal appears, then it is pushed
onto the appropriate queue.  VG_(poll_signals) also checks these queues
for pending signals to decide what signal to deliver next.  These queues
are only manipulated with *all* signals blocked, so there's no risk of
two concurrent async signal handlers modifying the queues at once.
Also, because the likelihood of actually being sent an async SIGSEGV is
pretty low, the queues are only allocated on demand.
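
In outline (illustrative names and sizes only; Valgrind blocks all
signals via the handler's installation mask rather than an explicit call
here, and uses its own allocator):

#include <pthread.h>
#include <signal.h>
#include <stdlib.h>

#define QUEUE_LEN   8
#define MAX_THREADS 64

typedef struct {
   siginfo_t infos[QUEUE_LEN];
   int       count;
} SigQueue;

static SigQueue *queues[MAX_THREADS];  /* slot 0: "whole process" queue */

static void queue_push(int tid, const siginfo_t *si)   /* tid < MAX_THREADS */
{
   sigset_t all, saved;

   /* Only ever touch a queue with every signal blocked, so two handlers
      can never race on it. */
   sigfillset(&all);
   pthread_sigmask(SIG_SETMASK, &all, &saved);

   if (queues[tid] == NULL)
      queues[tid] = calloc(1, sizeof(SigQueue));   /* on-demand alloc */
   if (queues[tid] != NULL && queues[tid]->count < QUEUE_LEN)
      queues[tid]->infos[queues[tid]->count++] = *si;

   pthread_sigmask(SIG_SETMASK, &saved, NULL);
}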


There are two mechanisms to prevent disaster if multiple threads get
signals concurrently.  One is that a signal handler is set up to block a
set of signals while the signal is being delivered.  Valgrind's handlers
block all signals, so there's no risk of a new signal being delivered to
the same thread until the old handler has finished.
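
The first mechanism is just the blocking mask supplied when the handler
is installed; something like this (a sketch, not the core's actual
installation code):

#include <signal.h>
#include <string.h>

extern void async_signal_handler(int sig, siginfo_t *info, void *uc);

void install_async_handler(int sig)
{
   struct sigaction sa;

   memset(&sa, 0, sizeof(sa));
   sa.sa_sigaction = async_signal_handler;
   sa.sa_flags     = SA_SIGINFO;
   sigfillset(&sa.sa_mask);     /* block *everything* while it runs */
   sigaction(sig, &sa, NULL);
}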

The other is that if the thread which receives the signal is not running
(ie, doesn't hold the run_sema, which implies it must be waiting for a
syscall to complete), then the signal handler will grab the run_sema
before making any global state changes.  Since the only time we can get
an async signal asynchronously is during a blocking syscall, this should
be all the time.  (And since synchronous signals are always the result of
running an instruction, we should already be holding run_sema.)


Valgrind will occasionally generate signals for itself.  These are always
synchronous faults, a result of an instruction fetch or of something an
instruction did.  The two mechanisms are the synth_fault_* functions,
which are used to signal a problem while fetching an instruction, or
getting generated code to call a helper which contains a fault-raising
instruction (used to deal with illegal/unimplemented instructions and
for instructions whose only job is to raise exceptions).

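The helper in the second mechanism is conceptually nothing more than
this (a sketch; __builtin_trap() is a GCC builtin which emits an
illegal/trapping instruction, ud2 on x86):

void helper_raise_fault(void)
{
   /* Never returns: the CPU faults here and the kernel delivers the
      corresponding fault signal, which then follows the normal
      delivery path described below. */
   __builtin_trap();
}
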
That all explains how signals come in, but the second part is how they
get delivered.

The main function for this is VG_(deliver_signal).  There are three cases:

   1. the process is ignoring the signal (SIG_IGN)
   2. the process is using the default handler (SIG_DFL)
   3. the process has a handler for the signal

In general, VG_(deliver_signal) shouldn't be called for ignored signals;
if it has been called, it assumes the ignore is being overridden (if an
instruction gets a SEGV etc, SIG_IGN is ignored and treated as SIG_DFL).

VG_(deliver_signal) handles the default handler case, and the
client-specified signal handler case.

The default handler case is relatively easy: the signal's default action
is either Terminate, or Ignore.  We can ignore Ignore.

Terminate always kills the entire process; there's no such thing as a
thread-specific signal death.  Terminate comes in two forms: with
coredump, or without.  vg_default_action() will write a core file, and
then will tell all the threads to start terminating; it then longjmps
back to the current thread's scheduler loop.  The scheduler loop will
terminate immediately, and the master_tid thread will wait for all the
others to exit before shutting down the process (this is the same
mechanism as exit_group).

Delivering a signal to a client-side handler modifies the thread state so
that there's a signal frame on the stack, and the instruction pointer is
pointing to the handler.  The fiddly bit is that there are two
completely different signal frame formats: old and RT.  While in theory
the exact shape of these frames on the stack is abstracted, there are real
programs which know exactly where various parts of the structures are on
the stack (most notably, g++'s exception throwing code), which is why it
has to have two separate pieces of code for each frame format.  Another
tricky case is dealing with the client stack running out/overflowing
while setting up the signal frame.

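Which format is needed is driven by how the client registered its
handler; the two client-visible flavours look like this (hypothetical
handlers, shown only to make the old-vs-RT distinction concrete):

#include <signal.h>
#include <string.h>

static void old_style(int sig)                        /* old frame format */
{
   (void)sig;
}

static void rt_style(int sig, siginfo_t *si, void *ucontext) /* RT frame */
{
   (void)sig; (void)si; (void)ucontext;
}

void register_both(void)
{
   struct sigaction sa;

   memset(&sa, 0, sizeof(sa));
   sa.sa_handler = old_style;        /* no SA_SIGINFO: old format,
                                        resumed with sigreturn       */
   sigaction(SIGUSR1, &sa, NULL);

   memset(&sa, 0, sizeof(sa));
   sa.sa_sigaction = rt_style;
   sa.sa_flags     = SA_SIGINFO;     /* RT format, rt_sigreturn      */
   sigaction(SIGUSR2, &sa, NULL);
}
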
Signal return is also interesting.  There are two syscalls, sigreturn
and rt_sigreturn, which a signal handler will use to resume execution.
The client will call the right one for the frame it was passed, so the
core doesn't need to track that state.  The tricky part is moving the
frame's register state back into the thread's state, particularly all
the FPU state reformatting gunk.  Also, *sigreturn checks for new
pending signals after the old frame has been cleaned up, since there's a
requirement that all deliverable pending signals are delivered before
the mainline code makes progress.  This means that a program could
live-lock on signals, but that's what would happen running natively...

Another thing to watch for: programs which unwind the stack (like gdb,
or exception throwers) recognize the existence of a signal frame by
looking at the code the return address points to: if it is one of the
two specific signal return sequences, it knows it's a signal frame.
That's why the signal handler return address must point to a very
specific set of instructions.


What else.  Ah, the two internal signals.

SIGVGKILL is pretty straightforward: it's just used to dislodge a thread
from being blocked in a syscall, so that we can get the thread to
terminate in a timely fashion.

SIGVGCHLD is used by a thread to tell the master_tid that it has
exited.  However, the only time the master_tid cares about this is when
it has already exited, and it's waiting for everyone else to exit.  If
the master_tid hasn't exited, then this signal is ignored.  It isn't
enough to simply block it, because that will cause a pile of queued
SIGVGCHLDs to build up, eventually clogging the kernel's signal delivery
mechanism.  If it's unblocked and ignored, it doesn't interrupt syscalls
and it doesn't accumulate.

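The unblocked-but-ignored trick is just the normal SIG_IGN disposition
(sketch; SIGVGCHLD is an internal Valgrind name, so a real-time signal
is used below purely as a stand-in):

#include <signal.h>
#include <string.h>

void ignore_internal_signal(void)
{
   int sig_vgchld = SIGRTMIN + 1;   /* stand-in, not Valgrind's choice */
   struct sigaction sa;

   memset(&sa, 0, sizeof(sa));
   sa.sa_handler = SIG_IGN;
   /* Discarded on arrival: never queued, never interrupts a syscall. */
   sigaction(sig_vgchld, &sa, NULL);
}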

I hope that helps clarify things.  And explains why there's so much stuff
in there: it's tracking a very complex and arcane underlying set of
machinery.

J

--------------------------------------------------------------------

>I've been seeing references to 'master thread' around the place.
>What distinguishes the master thread from the rest?  Where does
>the requirement to have a master thread come from?
>
It used to be tid 1, but I had to generalize it.

The master_tid isn't very special; its main job is at process shutdown.
It waits for all the other threads to exit, and then produces all the
final reports.  Until it exits, it's just a normal thread, with no other
responsibilities.

The alternative to having a master thread would be to make whichever
thread exits last be responsible for emitting all the output.  That
would work, but it would make the results a bit asynchronous (that is,
if the main thread exits and the others hang around for a while, anyone
waiting on the process would see it as having exited, but no results
would have been produced).

VG_(master_tid) is a variable to handle the case where a threaded program
forks.  In the first process, the master_tid will be 1.  If that program
creates a few threads, and then, say, thread 3 forks, the child process
will have a single thread in it.  In the child, master_tid will be 3.
It was easier to make the master thread a variable than to try to work
out how to rename thread 3 to 1 after a fork.

J

--------------------------------------------------------------------

Re: Fwd: Documentation of kernel's signal routing ?
From: David Woodhouse <...>
To: Julian Seward <jseward@acm.org>

> Regarding sys_clone created threads.  I have a vague idea that
> there is a notion of 'thread group'.  I further understand that if
> one thread in a group calls sys_exit_group then all threads in that
> group exit.  Whereas if a thread calls sys_exit then just that
> thread exits.
>
> I'm pretty hazy on this:

Hmm, so am I :)

> * Is the above correct?

Yes, I believe so.

> * How is thread-group membership defined/changed?

By specifying CLONE_THREAD in the flags to clone(), you remain part of
the same thread group as the parent.  In a single-threaded process, the
thread group id (tgid) is the same as the pid.

Linux just has tasks, which sometimes happen to share VM -- and now with
NPTL we also share other stuff like signals, etc.  The 'pid' in Linux is
what POSIX would call the 'thread id', and the 'tgid' in Linux is
equivalent to the POSIX 'pid'.

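That terminology can be seen directly from user space (a quick
illustration, not part of the original mail):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
   pid_t tgid = getpid();               /* POSIX pid == Linux tgid      */
   pid_t tid  = syscall(SYS_gettid);    /* POSIX thread id == Linux pid */

   /* In the initial thread the two are equal; in any other thread of
      the same process, tgid stays the same and tid differs. */
   printf("tgid=%d tid=%d\n", (int)tgid, (int)tid);
   return 0;
}
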
> * Do you know offhand how LinuxThreads and NPTL use thread groups?

I believe that LT doesn't use the kernel's concept of thread groups at
all.  LT predates the kernel's support for proper POSIX-like sharing of
anything much but memory, so uses only the CLONE_VM (and possibly
CLONE_FILES) flags.  I don't _think_ it uses CLONE_SIGHAND -- it does
most of its work by propagating signals manually between threads.

NPTL uses thread groups as generated by the CLONE_THREAD flag, which is
what invokes the POSIX-related thread semantics.

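Roughly, the two flag sets contrast like this (a sketch; the exact flags
each library passes are from memory and not exhaustive):

#define _GNU_SOURCE
#include <sched.h>

/* LinuxThreads-era thread: shares memory (and perhaps files) but is its
   own thread group, i.e. its own "process" as far as the kernel cares. */
static const int linuxthreads_flags = CLONE_VM | CLONE_FILES;

/* NPTL-style thread: same thread group, shared signal handlers, files,
   filesystem info and SysV semaphore undo state. */
static const int nptl_flags = CLONE_VM | CLONE_FS | CLONE_FILES |
                              CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM;
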
> Is it the case that each LinuxThreads thread is in its own
> group whereas all NPTL threads [in a process] are in a single
> group?

Yes, that's my understanding.

--
dwmw2