#
# Copyright (c) 2006 Steven Rostedt
# Licensed under the GNU Free Documentation License, Version 1.2
#

RT-mutex implementation design
------------------------------

This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists. For that please see
Documentation/rt-mutex.txt. This document does explain problems that happen
without this code, but only as background needed to understand what the
code is actually doing.

The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as reasons for the
decisions that were made to implement PI in the manner that was done.


Unbounded Priority Inversion
----------------------------

Priority inversion is when a lower priority process executes while a higher
priority process wants to run. This happens for several reasons, and
most of the time it can't be helped. Anytime a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource. This is a priority inversion. What we want to prevent
is something called unbounded priority inversion. That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.

The classic example of unbounded priority inversion is where you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between. A tries to grab a lock
that C owns, so it must wait and let C run to release the lock. But in the
meantime, B executes, and since B is of a higher priority than C, it preempts C,
but by doing so, it is in fact preempting A which is a higher priority process.
Now there's no way of knowing how long A will be sleeping waiting for C
to release the lock, because for all we know, B is a CPU hog and will
never give C a chance to release the lock. This is called unbounded priority
inversion.

Here's a little ASCII art to show the problem.

   grab lock L1 (owned by C)
     |
A ---+
        C preempted by B
          |
C    +----+

B         +-------->
                B now keeps A from running.


Priority Inheritance (PI)
-------------------------

There are several ways to solve this issue, but other ways are out of scope
for this document. Here we only discuss PI.

PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process. To make this easier
to understand, let's use the previous example, with processes A, B, and C again.

This time, when A blocks on the lock owned by C, C would inherit the priority
of A. So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A. As soon as C releases the lock, it loses its
inherited priority, and A then can continue with the resource that C had.
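The A, B, C example can be sketched as a toy scheduler decision. This is
only an illustration and not code from rtmutex.c: the demo_task structure,
its fields, and both functions are made up here, and the sketch assumes that
a lower "prio" number means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task model; lower "prio" number = higher priority. */
struct demo_task {
        const char *name;
        int normal_prio;        /* priority the task was given */
        int inherited_prio;     /* lent by a blocked waiter, -1 if none */
        int runnable;           /* 0 while blocked on a lock */
};

/* The better (numerically lower) of the normal and inherited priority. */
static int demo_effective_prio(const struct demo_task *t)
{
        if (t->inherited_prio >= 0 && t->inherited_prio < t->normal_prio)
                return t->inherited_prio;
        return t->normal_prio;
}

/* Pick the runnable task with the best effective priority. */
static const struct demo_task *
demo_pick_next(const struct demo_task *tasks, size_t n)
{
        const struct demo_task *best = NULL;
        size_t i;

        assert(tasks != NULL);
        for (i = 0; i < n; i++) {
                if (!tasks[i].runnable)
                        continue;
                if (!best ||
                    demo_effective_prio(&tasks[i]) < demo_effective_prio(best))
                        best = &tasks[i];
        }
        return best;
}
```

With A blocked and nothing inherited, demo_pick_next chooses B over C, which
is the inversion. Once C's inherited_prio is set to A's priority, C is chosen
instead and can release the lock.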

Terminology
-----------

Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.

PI chain - The PI chain is an ordered series of locks and processes that cause
           processes to inherit priorities from a previous process that is
           blocked on one of its locks. This is described in more detail
           later in this document.

mutex    - In this document, to differentiate between locks that implement
           PI and spin locks that are used in the PI code, from now on
           the PI locks will be called a mutex.

lock     - In this document from now on, I will use the term lock when
           referring to spin locks that are used to protect parts of the PI
           algorithm. These locks disable preemption for UP (when
           CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
           entering critical sections simultaneously.

spin lock - Same as lock above.

waiter   - A waiter is a struct that is stored on the stack of a blocked
           process. Since the scope of the waiter is within the code for
           a process being blocked on the mutex, it is fine to allocate
           the waiter on the process's stack (local variable). This
           structure holds a pointer to the task, as well as the mutex that
           the task is blocked on. It also has the plist node structures to
           place the task in the wait_list of a mutex as well as the
           pi_list of a mutex owner task (described below).

           waiter is sometimes used in reference to the task that is waiting
           on a mutex. This is the same as waiter->task.

waiters  - A list of processes that are blocked on a mutex.

top waiter - The highest priority process waiting on a specific mutex.

top pi waiter - The highest priority process waiting on one of the mutexes
                that a specific process owns.

Note: task and process are used interchangeably in this document, mostly to
      differentiate between two processes that are being described together.


PI chain
--------

The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place. Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.

Example:

   Process:  A, B, C, D, E
   Mutexes:  L1, L2, L3, L4

   A owns: L1
           B blocked on L1
           B owns L2
                  C blocked on L2
                  C owns L3
                         D blocked on L3
                         D owns L4
                                E blocked on L4

The chain would be:

   E->L4->D->L3->C->L2->B->L1->A

To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on mutex L5.

The chain for F would be:

   F->L5->B->L1->A

Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.

Here we show both chains:

   E->L4->D->L3->C->L2-+
                       |
                       +->B->L1->A
                       |
                 F->L5-+

For PI to work, the processes at the right end of these chains (or we may
also call it the Top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.

Also since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes. If we add another process G that is
blocked on mutex L2:

   G->L2->B->L1->A

And once again, to show how this can grow I will show the merging chains
again.

   E->L4->D->L3->C-+
                   +->L2-+
                   |     |
                 G-+     +->B->L1->A
                         |
                   F->L5-+
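To make the chain structure concrete, here is a minimal sketch of following
a chain from any process to its top. The demo_task and demo_mutex types and
the demo_chain_top function are hypothetical simplifications for this
document only; the real structures hold much more, and the sketch assumes
there is no deadlock (no cycle) in the chain.

```c
#include <assert.h>
#include <stddef.h>

struct demo_mutex;

/* Hypothetical, simplified types; the real kernel structures differ. */
struct demo_task {
        const char *name;
        struct demo_mutex *blocked_on;  /* at most one mutex, or NULL */
};

struct demo_mutex {
        const char *name;
        struct demo_task *owner;        /* NULL if the mutex is free */
};

/*
 * Follow task -> mutex -> owner links until reaching a task that is not
 * blocked. Returns that task (the top of the chain) and stores the number
 * of processes visited in *depth. Assumes the chain has no cycle.
 */
static struct demo_task *demo_chain_top(struct demo_task *task, int *depth)
{
        assert(task != NULL && depth != NULL);

        *depth = 1;
        while (task->blocked_on && task->blocked_on->owner) {
                task = task->blocked_on->owner;
                (*depth)++;
        }
        return task;
}
```

Because each task has only one blocked_on pointer, the walk never has to
choose between branches. That is the "chains merge but never diverge"
property described above.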


Plist
-----

Before I go further and talk about how the PI chain is stored through lists
on both mutexes and processes, I'll explain the plist. This is similar to
the struct list_head functionality that is already in the kernel.
The implementation of plist is out of scope for this document, but it is
very important to understand what it does.

There are a few differences between plist and list, the most important one
being that plist is a priority sorted linked list. This means that the
priorities of the plist are sorted, such that it takes O(1) to retrieve the
highest priority item in the list. Obviously this is useful to store processes
based on their priorities.

Another difference, which is important for implementation, is that, unlike
list, the head of the list is a different element than the nodes of a list.
So the head of the list is declared as struct plist_head and nodes that will
be added to the list are declared as struct plist_node.
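The behaviour that matters here can be shown with a much simpler list than
the real plist. The sketch below is not the kernel implementation: the
demo_phead and demo_pnode types and both functions are invented for
illustration, and it assumes a lower "prio" number means higher priority.
It only shows a separate head type, sorted insertion, and O(1) access to the
highest priority node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct plist_head / struct plist_node. */
struct demo_pnode {
        int prio;                       /* lower number = higher priority */
        struct demo_pnode *next;
};

struct demo_phead {
        struct demo_pnode *first;       /* always the highest priority node */
};

/* Insert in priority order; equal priorities keep insertion order. */
static void demo_plist_add(struct demo_pnode *node, struct demo_phead *head)
{
        struct demo_pnode **link;

        assert(node != NULL && head != NULL);
        link = &head->first;
        while (*link && (*link)->prio <= node->prio)
                link = &(*link)->next;
        node->next = *link;
        *link = node;
}

/* O(1): the head points straight at the highest priority node. */
static struct demo_pnode *demo_plist_first(const struct demo_phead *head)
{
        return head->first;
}
```

The cost is paid at insertion time, which walks the list, so that looking up
the top waiter stays cheap.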


Mutex Waiter List
-----------------

Every mutex keeps track of all the waiters that are blocked on itself. The mutex
has a plist to store these waiters by priority. This list is protected by
a spin lock that is located in the struct of the mutex. This lock is called
wait_lock. Since the modification of the waiter list is never done in
interrupt context, the wait_lock can be taken without disabling interrupts.


Task PI List
------------

To keep track of the PI chains, each process has its own PI list. This is
a list of all top waiters of the mutexes that are owned by the process.
Note that this list only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.

The top of the task's PI list is always the highest priority task that
is waiting on a mutex that is owned by the task. So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this list.

This list is stored in the task structure of a process as a plist called
pi_list. This list is protected by a spin lock also in the task structure,
called pi_lock. This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.


Depth of the PI Chain
---------------------

The maximum depth of the PI chain is not dynamic, and could actually be
defined. But it is very complex to figure out, since it depends on all
the nesting of mutexes. Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way.

void func1(void)
{
        mutex_lock(L1);

        /* do anything */

        mutex_unlock(L1);
}

void func2(void)
{
        mutex_lock(L1);
        mutex_lock(L2);

        /* do something */

        mutex_unlock(L2);
        mutex_unlock(L1);
}

void func3(void)
{
        mutex_lock(L2);
        mutex_lock(L3);

        /* do something else */

        mutex_unlock(L3);
        mutex_unlock(L2);
}

void func4(void)
{
        mutex_lock(L3);

        /* do something again */

        mutex_unlock(L3);
}

Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D which run functions func1, func2, func3 and func4
respectively, and such that D runs first and A last. With D being preempted
in func4 in the "do something again" area, we have a locking that follows:

   D owns L3
          C blocked on L3
          C owns L2
                 B blocked on L2
                 B owns L1
                        A blocked on L1

And thus we have the chain A->L1->B->L2->C->L3->D.

This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though they only have at most a locking
depth of two. So, although the locking depth is defined at compile time,
it still is very difficult to find the possibilities of that depth.

Now since mutexes can be defined by user-land applications, we don't want a DoS
(denial of service) type of application that nests large amounts of mutexes to
create a large PI chain, and have the code holding spin locks while looking at
a large amount of data. So to prevent this, the implementation not only
implements a maximum lock depth, but also only holds at most two different
locks at a time, as it walks the PI chain. More about this below.


Mutex owner and flags
---------------------

The mutex structure contains a pointer to the owner of the mutex. If the
mutex is not owned, this owner is set to NULL. Since all architectures
have the task structure on at least a four byte alignment (and if this is
not true, the rtmutex.c code will be broken!), this allows for the two
least significant bits to be used as flags. This part is also described
in Documentation/rt-mutex.txt, but will also be briefly described here.

Bit 0 is used as the "Pending Owner" flag. This is described later.
Bit 1 is used as the "Has Waiters" flag. This is also described later
  in more detail, but is set whenever there are waiters on a mutex.
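A minimal sketch of packing flags into the low bits of an aligned pointer
follows. The macro and function names are made up for this illustration and
are not the ones used in rtmutex.c; only the bit positions come from the
text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative names; bit positions follow the text above. */
#define DEMO_OWNER_PENDING      1UL     /* bit 0: "Pending Owner" */
#define DEMO_HAS_WAITERS        2UL     /* bit 1: "Has Waiters" */
#define DEMO_FLAG_MASK          (DEMO_OWNER_PENDING | DEMO_HAS_WAITERS)

/* Hypothetical task type, forced to at least four byte alignment. */
struct demo_task {
        _Alignas(4) int id;
};

/* Pack a task pointer and flag bits into one word. */
static uintptr_t demo_owner_encode(struct demo_task *task, unsigned long flags)
{
        /* The trick only works if the two low pointer bits are zero. */
        assert(((uintptr_t)task & DEMO_FLAG_MASK) == 0);
        return (uintptr_t)task | (flags & DEMO_FLAG_MASK);
}

/* Recover the task pointer by masking the flag bits off. */
static struct demo_task *demo_owner_task(uintptr_t owner)
{
        return (struct demo_task *)(owner & ~(uintptr_t)DEMO_FLAG_MASK);
}

/* Recover just the flag bits. */
static unsigned long demo_owner_flags(uintptr_t owner)
{
        return (unsigned long)(owner & DEMO_FLAG_MASK);
}
```

Keeping the owner and both flags in a single word is what lets one atomic
operation test and update all three at once.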


cmpxchg Tricks
--------------

Some architectures implement an atomic cmpxchg (Compare and Exchange). This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.

cmpxchg is basically the following function performed atomically:

unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
{
        unsigned long T = *A;
        if (*A == *B) {
                *A = *C;
        }
        return T;
}
#define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)

This is really nice to have, since it allows you to only update a variable
if the variable is what you expect it to be. You know if it succeeded if
the return value (the old value of A) is equal to B.

The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time. But if CMPXCHG is supported, then this helps greatly
to keep the fast path short.

The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it. This will also be explained
later in this document.
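As a rough user-space analogy, the fast paths can be sketched with C11
atomics. This is an assumption-laden illustration, not the kernel's
rt_mutex_cmpxchg: the demo_mutex type and both functions are hypothetical,
and the owner is represented as a plain integer word whose two low bits are
the flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_HAS_WAITERS 2UL    /* bit 1 of the owner word */

/* Hypothetical mutex: 0 when unowned, else task pointer plus flag bits. */
struct demo_mutex {
        _Atomic uintptr_t owner;
};

/* Fast lock: succeeds only if the owner word is exactly 0. */
static int demo_fast_lock(struct demo_mutex *m, uintptr_t self)
{
        uintptr_t expected = 0;

        assert((self & 3UL) == 0);      /* low bits are reserved for flags */
        return atomic_compare_exchange_strong(&m->owner, &expected, self);
}

/* Fast unlock: succeeds only if the word is exactly self, with no flags. */
static int demo_fast_unlock(struct demo_mutex *m, uintptr_t self)
{
        uintptr_t expected = self;

        return atomic_compare_exchange_strong(&m->owner, &expected,
                                              (uintptr_t)0);
}
```

When either function returns 0 the caller would fall back to a slow path.
Note how a set flag bit makes the unlock comparison fail even for the real
owner, which is how the flags push the owner off the fast path.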


Priority adjustments
--------------------

The implementation of the PI code in rtmutex.c has several places where a
process must adjust its priority. With the help of the pi_list of a
process it is rather easy to know what needs to be adjusted.

The functions implementing the task adjustments are rt_mutex_adjust_prio,
__rt_mutex_adjust_prio (same as the former, but expects the task pi_lock
to already be taken), rt_mutex_getprio, and rt_mutex_setprio.

rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio.

rt_mutex_getprio returns the priority that the task should have. Either the
task's own normal priority, or if a process of a higher priority is waiting on
a mutex owned by the task, then that higher priority should be returned.
Since the pi_list of a task holds a priority-ordered list of all the top
waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs
to compare the top pi waiter to its own normal priority, and return the higher
priority back.

(Note: if looking at the code, you will notice that the lower number of
       prio is returned. This is because the prio field in the task structure
       is an inverse order of the actual priority. So a "prio" of 5 is
       of higher priority than a "prio" of 10.)

__rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the
result does not equal the task's current priority, then rt_mutex_setprio
is called to adjust the priority of the task to the new priority.
Note that rt_mutex_setprio is defined in kernel/sched/core.c to implement the
actual change in priority.

It is interesting to note that __rt_mutex_adjust_prio can either increase
or decrease the priority of the task. In the case that a higher priority
process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio
would increase/boost the task's priority. But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task. That is because the pi_list
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
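The logic described above can be sketched as follows. The demo_task type and
the two functions are hypothetical simplifications, not the kernel code: the
pi_list is reduced to a pointer to the top pi waiter's priority, and a plain
assignment stands in for rt_mutex_setprio.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified task; lower "prio" number = higher priority. */
struct demo_task {
        int normal_prio;                /* the task's own priority */
        int prio;                       /* current, possibly boosted */
        const int *top_pi_waiter_prio;  /* NULL if the pi_list is empty */
};

/* The priority the task should have right now. */
static int demo_getprio(const struct demo_task *task)
{
        assert(task != NULL);
        if (task->top_pi_waiter_prio &&
            *task->top_pi_waiter_prio < task->normal_prio)
                return *task->top_pi_waiter_prio;
        return task->normal_prio;
}

/* Boost or unboost as needed; returns 1 if the priority changed. */
static int demo_adjust_prio(struct demo_task *task)
{
        int prio = demo_getprio(task);

        if (prio == task->prio)
                return 0;
        task->prio = prio;      /* stands in for rt_mutex_setprio */
        return 1;
}
```

The same function boosts when a higher priority waiter shows up and unboosts
when that waiter goes away, because it always recomputes the target from the
current top pi waiter.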


High level overview of the PI chain walk
----------------------------------------

The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.

The implementation has gone through several iterations, and has ended up
with what we believe is the best. It walks the PI chain by only grabbing
at most two locks at a time, and is very efficient.

The rt_mutex_adjust_prio_chain can be used either to boost or lower process
priorities.

rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
check for deadlocking, the mutex that the task owns, and a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
parameter may be NULL for deboosting).

For this explanation, I will not mention deadlock detection. This explanation
will try to stay at a high level.

When this function is called, there are no locks held. That also means
that the state of the owner and lock can change while this function runs.

Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it. This means that the task is set to the priority that it
should be at, but the plist nodes of the task's waiter have not been updated
with the new priorities, and that this task may not be in the proper locations
in the pi_lists and wait_lists that the task is blocked on. This function
solves all that.

A loop is entered, where task is the owner to be checked for PI changes that
was passed by parameter (for the first iteration). The pi_lock of this task is
taken to prevent any more changes to the pi_list of the task. This also
prevents new tasks from completing the blocking on a mutex that is owned by this
task.

If the task is not blocked on a mutex then the loop is exited. We are at
the top of the PI chain.

A check is now done to see if the original waiter (the process that is blocked
on the current mutex) is the top pi waiter of the task. That is, is this
waiter on the top of the task's pi_list. If it is not, it either means that
there is another process higher in priority that is blocked on one of the
mutexes that the task owns, or that the waiter has just woken up via a signal
or timeout and has left the PI chain. In either case, the loop is exited, since
we don't need to do any more changes to the priority of the current task, or any
task that owns a mutex that this current task is waiting on. A priority chain
walk is only needed when a new top pi waiter is made to a task.

The next check sees if the task's waiter plist node has the priority equal to
the priority the task is set at. If they are equal, then we are done with
the loop. Remember that the function started with the priority of the
task adjusted, but the plist nodes that hold the task in other processes'
pi_lists have not been adjusted.

Next, we look at the mutex that the task is blocked on. The mutex's wait_lock
is taken. This is done by a spin_trylock, because the locking order of the
pi_lock and wait_lock goes in the opposite direction. If we fail to grab the
lock, the pi_lock is released, and we restart the loop.

Now that we have both the pi_lock of the task as well as the wait_lock of
the mutex the task is blocked on, we update the task's waiter's plist node
that is located on the mutex's wait_list.

Now we release the pi_lock of the task.

Next the owner of the mutex has its pi_lock taken, so we can update the
task's entry in the owner's pi_list. If the task is the highest priority
process on the mutex's wait_list, then we remove the previous top waiter
from the owner's pi_list, and replace it with the task.

Note: It is possible that the task was the current top waiter on the mutex,
      in which case the task is not yet on the pi_list of the waiter. This
      is OK, since plist_del does nothing if the plist node is not on any
      list.

If the task was not the top waiter of the mutex, but it was before we
did the priority updates, that means we are deboosting/lowering the
task. In this case, the task is removed from the pi_list of the owner,
and the new top waiter is added.

Lastly, we unlock both the pi_lock of the owner and the mutex's
wait_lock, and continue the loop again. On the next iteration of the
loop, the previous owner of the mutex will be the task that will be
processed.

Note: One might think that the owner of this mutex might have changed
      since we just grab the mutex's wait_lock. And one could be right.
      The important thing to remember is that the owner could not have
      become the task that is being processed in the PI chain, since
      we have taken that task's pi_lock at the beginning of the loop.
      So as long as there is an owner of this mutex that is not the same
      process as the task being worked on, we are OK.

      Looking closely at the code, one might be confused. The check for the
      end of the PI chain is when the task isn't blocked on anything or the
      task's waiter structure "task" element is NULL. This check is
      protected only by the task's pi_lock. But the code to unlock the mutex
      sets the task's waiter structure "task" element to NULL with only
      the protection of the mutex's wait_lock, which was not taken yet.
      Isn't this a race condition if the task becomes the new owner?

      The answer is No! The trick is the spin_trylock of the mutex's
      wait_lock. If we fail that lock, we release the pi_lock of the
      task and continue the loop, doing the end of PI chain check again.

      In the code to release the lock, the wait_lock of the mutex is held
      the entire time, and it is not let go when we grab the pi_lock of the
      new owner of the mutex. So if the switch of a new owner were to happen
      after the check for end of the PI chain and the grabbing of the
      wait_lock, the unlocking code would spin on the new owner's pi_lock
      but never give up the wait_lock. So the PI chain loop is guaranteed to
      fail the spin_trylock on the wait_lock, release the pi_lock, and
      try again.

      If you don't quite understand the above, that's OK. You don't have to,
      unless you really want to make a proof out of it ;)
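The trylock-and-retry pattern at the heart of the walk can be sketched in
user space with C11 atomics. Everything below is hypothetical: the
demo_spinlock type is a toy test-and-set lock standing in for the kernel's
spin locks, and the max_tries bound exists only so the sketch can be
exercised from a single thread.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy test-and-set spin lock, standing in for the kernel's spin locks. */
struct demo_spinlock {
        atomic_flag held;
};

static void demo_spin_lock(struct demo_spinlock *l)
{
        while (atomic_flag_test_and_set_explicit(&l->held,
                                                 memory_order_acquire))
                ;       /* spin until the holder clears the flag */
}

static int demo_spin_trylock(struct demo_spinlock *l)
{
        return !atomic_flag_test_and_set_explicit(&l->held,
                                                  memory_order_acquire);
}

static void demo_spin_unlock(struct demo_spinlock *l)
{
        atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Take pi_lock, then only *try* wait_lock. On failure drop pi_lock and
 * start over, so we never wait for wait_lock while holding pi_lock.
 * Returns 1 with both locks held, or 0 with neither held.
 */
static int demo_lock_both(struct demo_spinlock *pi_lock,
                          struct demo_spinlock *wait_lock, int max_tries)
{
        assert(max_tries > 0);
        while (max_tries--) {
                demo_spin_lock(pi_lock);
                /* the end-of-chain checks would be redone here each pass */
                if (demo_spin_trylock(wait_lock))
                        return 1;
                demo_spin_unlock(pi_lock);
        }
        return 0;
}
```

Backing off instead of spinning on the second lock is what avoids a deadlock
against code that takes the same two locks in the opposite order, and it
also forces the checks made under pi_lock to be repeated on every attempt.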


Pending Owners and Lock stealing
--------------------------------

One of the flags in the owner field of the mutex structure is "Pending Owner".
What this means is that an owner was chosen by the process releasing the
mutex, but that owner has yet to wake up and actually take the mutex.

Why is this important? Why can't we just give the mutex to another process
and be done with it?

The PI code is to help with real-time processes, and to let the highest
priority process run as long as possible with minimal latency and delay.
If a high priority process owns a mutex that a lower priority process is
blocked on, when the mutex is released it would be given to the lower priority
process. What if the higher priority process wants to take that mutex again?
The high priority process would fail to take that mutex that it just gave up
and it would need to boost the lower priority process to run with full
latency of that critical section (since the low priority process just entered
it).

There's no reason a high priority process that gives up a mutex should be
penalized if it tries to take that mutex again. If the new owner of the
mutex has not woken up yet, there's no reason that the higher priority process
could not take that mutex away.

To solve this, we introduced Pending Ownership and Lock Stealing. When a
new process is given a mutex that it was blocked on, it is only given
pending ownership. This means that it's the new owner, unless a higher
priority process comes in and tries to grab that mutex. If a higher priority
process does come along and wants that mutex, we let the higher priority
process "steal" the mutex from the pending owner (only if it is still pending)
and continue with the mutex.
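The stealing rule can be sketched as a single decision function. The types
and the demo_try_take name are hypothetical and leave out all locking and
waiter accounting. The sketch assumes a lower "prio" number means higher
priority, and that a pending owner of equal priority is not stolen from.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task {
        int prio;                       /* lower number = higher priority */
};

struct demo_mutex {
        struct demo_task *owner;        /* NULL if unowned */
        int pending;                    /* 1 until the owner has woken up */
};

/* Returns 1 if "current" now owns the mutex, 0 if it has to wait. */
static int demo_try_take(struct demo_mutex *m, struct demo_task *current)
{
        assert(m != NULL && current != NULL);

        if (m->owner && m->owner != current) {
                if (!m->pending)
                        return 0;       /* really owned: nothing to steal */
                if (current->prio >= m->owner->prio)
                        return 0;       /* pending owner is not lower */
        }
        /* Free, stolen, or the pending owner claiming its mutex. */
        m->owner = current;
        m->pending = 0;
        return 1;
}
```

Once the pending flag is cleared the mutex is truly owned, so a later
attempt by any other task fails regardless of priority.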


Taking of a mutex (The walk through)
------------------------------------

OK, now let's take a look at the detailed walk through of what happens when
taking a mutex.

The first thing that is tried is the fast taking of the mutex. This is
done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails). Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.

If there is contention on the lock, whether it has an owner or a pending
owner, we go about the slow path (rt_mutex_slowlock).

The slow path function is where the task's waiter structure is created on
the stack. This is because the waiter structure is only needed for the
scope of this function. The waiter structure holds the nodes to store
the task on the wait_list of the mutex, and if need be, the pi_list of
the owner.

The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.

We then call try_to_take_rt_mutex. This is where the architecture that
does not implement CMPXCHG would always grab the lock (if there's no
contention).

try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path. The first thing that is done here is an atomic setting of
the "Has Waiters" flag of the mutex's owner field. Yes, this could really
be false, because if the mutex has no owner, there are no waiters and
the current task also won't have any waiters. But we don't have the lock
yet, so we assume we are going to be a waiter. The reason for this is to
play nice for those architectures that do have CMPXCHG. By setting this flag
now, the owner of the mutex can't release the mutex without going into the
slow unlock path, and it would then need to grab the wait_lock, which this
code currently holds. So setting the "Has Waiters" flag forces the owner
to synchronize with this code.

Now that we know that we can't have any races with the owner releasing the
mutex, we check to see if we can take the ownership. This is done if the
mutex doesn't have an owner, or if we can steal the mutex from a pending
owner. Let's look at the situations we have here.
| 598 | |
| 599 | 1) Has owner that is pending |
| 600 | ---------------------------- |
| 601 | |
| 602 | The mutex has a owner, but it hasn't woken up and the mutex flag |
| 603 | "Pending Owner" is set. The first check is to see if the owner isn't the |
| 604 | current task. This is because this function is also used for the pending |
| 605 | owner to grab the mutex. When a pending owner wakes up, it checks to see |
| 606 | if it can take the mutex, and this is done if the owner is already set to |
| 607 | itself. If so, we succeed and leave the function, clearing the "Pending |
| 608 | Owner" bit. |
| 609 | |
| 610 | If the pending owner is not current, we check to see if the current priority is |
| 611 | higher than the pending owner. If not, we fail the function and return. |
| 612 | |
| 613 | There's also something special about a pending owner. That is a pending owner |
| 614 | is never blocked on a mutex. So there is no PI chain to worry about. It also |
| 615 | means that if the mutex doesn't have any waiters, there's no accounting needed |
| 616 | to update the pending owner's pi_list, since we only worry about processes |
| 617 | blocked on the current mutex. |
| 618 | |
| 619 | If there are waiters on this mutex, and we just stole the ownership, we need |
| 620 | to take the top waiter, remove it from the pi_list of the pending owner, and |
| 621 | add it to the current pi_list. Note that at this moment, the pending owner |
| 622 | is no longer on the list of waiters. This is fine, since the pending owner |
| 623 | would add itself back when it realizes that it had the ownership stolen |
| 624 | from itself. When the pending owner tries to grab the mutex, it will fail |
| 625 | in try_to_take_rt_mutex if the owner field points to another process. |
| 626 | |
| 627 | 2) No owner |
| 628 | ----------- |
| 629 | |
| 630 | If there is no owner (or we successfully stole the lock), we set the owner |
| 631 | of the mutex to current, and set the flag of "Has Waiters" if the current |
| 632 | mutex actually has waiters, or we clear the flag if it doesn't. See, it was |
| 633 | OK that we set that flag early, since now it is cleared. |
| 634 | |
| 635 | 3) Failed to grab ownership |
| 636 | --------------------------- |
| 637 | |
| 638 | The most interesting case is when we fail to take ownership. This means that |
| 639 | there exists an owner, or there's a pending owner with equal or higher |
| 640 | priority than the current task. |
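
The three situations above can be condensed into a small user-space sketch.
This is not the code from rtmutex.c. The struct layouts, the field names
(pending, has_waiters, nr_waiters) and the function name sketch_try_to_take
are assumptions made for illustration. The flags are kept as separate fields
instead of bits in the owner pointer, no locks are taken, and the pi_list
accounting that a steal requires is left out. A lower prio value means a
higher priority, following the kernel convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; lower prio value = higher priority. */
struct sketch_task {
	int prio;
};

/* Hypothetical stand-in for the mutex; the flags are plain fields here. */
struct sketch_mutex {
	struct sketch_task *owner;	/* real or pending owner, NULL if none */
	int pending;			/* "Pending Owner" flag */
	int has_waiters;		/* "Has Waiters" flag */
	int nr_waiters;			/* number of tasks on the wait_list */
};

/* Returns 1 if current now owns the mutex, 0 if it has to wait. */
static int sketch_try_to_take(struct sketch_mutex *m,
			      struct sketch_task *current)
{
	assert(m != NULL && current != NULL);

	/* Assume we will be a waiter; this forces the slow unlock path. */
	m->has_waiters = 1;

	if (m->owner != NULL) {
		/* Case 3: a real (non-pending) owner holds the mutex. */
		if (!m->pending)
			return 0;
		/*
		 * Case 1: the owner is only pending. If it is not us, we
		 * may steal the mutex only with a strictly higher priority.
		 */
		if (m->owner != current && current->prio >= m->owner->prio)
			return 0;
	}

	/* Case 2: no owner, or we are the pending owner, or we stole it. */
	m->owner = current;
	m->pending = 0;
	m->has_waiters = (m->nr_waiters > 0);
	return 1;
}
```

Note how the early setting of has_waiters is harmless: on success the flag
is recomputed from the actual number of waiters.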
| 641 | |
| 642 | We'll continue on the failed case. |
| 643 | |
| 644 | If the mutex has a timeout, we set up a timer to go off to break us out |
| 645 | of this mutex if we failed to get it after a specified amount of time. |
| 646 | |
| 647 | Now we enter a loop that will continue to try to take ownership of the mutex, or |
| 648 | fail from a timeout or signal. |
| 649 | |
| 650 | Once again we try to take the mutex. This will usually fail the first time |
| 651 | in the loop, since it had just failed to get the mutex. But the second time |
| 652 | in the loop, this would likely succeed, since the task would likely be |
| 653 | the pending owner. |
| 654 | |
| 655 | If the task is sleeping in the TASK_INTERRUPTIBLE state, a check for signals |
| 656 | and timeout is done here. |
| 657 | |
| 658 | The waiter structure has a "task" field that points to the task that is blocked |
| 659 | on the mutex. This field can be NULL the first time it goes through the loop |
| 660 | or if the task is a pending owner and had its mutex stolen. If the "task" |
| 661 | field is NULL then we need to set up the accounting for it. |
| 662 | |
| 663 | Task blocks on mutex |
| 664 | -------------------- |
| 665 | |
| 666 | The accounting of a mutex and process is done with the waiter structure of |
| 667 | the process. The "task" field is set to the process, and the "lock" field |
| 668 | to the mutex. The plist nodes are initialized to the process's current |
| 669 | priority. |
| 670 | |
| 671 | Since the wait_lock was taken at the entry of the slow lock, we can safely |
| 672 | add the waiter to the wait_list. If the current process is the highest |
| 673 | priority process currently waiting on this mutex, then we remove the |
| 674 | previous top waiter process (if it exists) from the pi_list of the owner, |
| 675 | and add the current process to that list. Since the pi_list of the owner |
| 676 | has changed, we call rt_mutex_adjust_prio on the owner to see if the owner |
| 677 | should adjust its priority accordingly. |
| 678 | |
| 679 | If the owner is also blocked on a lock, and had its pi_list changed |
| 680 | (or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead |
| 681 | and run rt_mutex_adjust_prio_chain on the owner, as described earlier. |
| 682 | |
| 683 | Now all locks are released, and if the current process is still blocked on a |
| 684 | mutex (waiter "task" field is not NULL), then we go to sleep (call schedule). |
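
The enqueue step described above can be sketched in user space as follows.
This is a simplification under stated assumptions, not the kernel code: all
names are hypothetical, a sorted singly linked list stands in for the plist,
and only one mutex exists, so the owner's pi_list collapses to "the top
waiter of this mutex". No locks are taken, and the chain walk for an owner
that is itself blocked is only signalled through the return value.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_task {
	int normal_prio;	/* priority without any boosting */
	int prio;		/* effective priority; lower value = higher */
};

struct sketch_waiter {
	struct sketch_task *task;	/* blocked task, NULL if not blocked */
	struct sketch_waiter *next;
};

struct sketch_mutex {
	struct sketch_task *owner;
	struct sketch_waiter *wait_list;	/* sorted, top waiter first */
};

/*
 * Queue current on the mutex. Returns 1 if the owner's priority was
 * raised, which is when a walk of the PI chain would be needed.
 */
static int sketch_block_on(struct sketch_mutex *m, struct sketch_waiter *w,
			   struct sketch_task *current)
{
	struct sketch_waiter **pos = &m->wait_list;

	assert(m->owner != NULL && m->owner != current);

	w->task = current;

	/* Keep the list sorted; equal priorities queue in FIFO order. */
	while (*pos != NULL && (*pos)->task->prio <= current->prio)
		pos = &(*pos)->next;
	w->next = *pos;
	*pos = w;

	/* Only a new top waiter can change what the owner inherits. */
	if (m->wait_list == w && current->prio < m->owner->prio) {
		m->owner->prio = current->prio;
		return 1;
	}
	return 0;
}
```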
| 685 | |
| 686 | Waking up in the loop |
| 687 | --------------------- |
| 688 | |
| 689 | The task can wake up from schedule for a few reasons: |
| 690 | 1) we were given pending ownership of the mutex. |
| 691 | 2) we received a signal and were TASK_INTERRUPTIBLE |
| 692 | 3) we had a timeout and were TASK_INTERRUPTIBLE |
| 693 | |
| 694 | In any of these cases, we continue the loop and once again try to grab the |
| 695 | ownership of the mutex. If we succeed, we exit the loop. If we fail and there |
| 696 | was a signal or timeout, we also exit the loop. If we fail because the mutex |
| 697 | was stolen, we simply add ourselves back on the lists and go back to sleep. |
| 698 | |
| 699 | Note: Because of timeouts and signals, the steal mutex algorithm needs to |
| 700 | be careful. A task that wakes up for one of these reasons is |
| 701 | still on the wait_list. And because priorities can change dynamically, |
| 702 | especially for SCHED_OTHER tasks, the current process can be the |
| 703 | highest priority task on the wait_list. |
| 704 | |
| 705 | Failed to get mutex on Timeout or Signal |
| 706 | ---------------------------------------- |
| 707 | |
| 708 | If a timeout or signal occurred, the waiter's "task" field would not be |
| 709 | NULL and the task needs to be taken off the wait_list of the mutex and perhaps |
| 710 | pi_list of the owner. If this process was a high priority process, then |
| 711 | the rt_mutex_adjust_prio_chain needs to be executed again on the owner, |
| 712 | but this time it will be lowering the priorities. |
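
A minimal user-space sketch of this cleanup, under the same kind of
assumptions as any illustration here: the names are hypothetical, a singly
linked sorted list replaces the plist, only one mutex exists (so the owner's
effective priority depends on its own normal priority and the top waiter of
this mutex), and no locks are taken. The return value says whether the
owner's priority changed, which is the case where the chain walk would have
to run again.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_task {
	int normal_prio;	/* priority without any boosting */
	int prio;		/* effective priority; lower value = higher */
};

struct sketch_waiter {
	struct sketch_task *task;	/* blocked task, NULL if not blocked */
	struct sketch_waiter *next;
};

struct sketch_mutex {
	struct sketch_task *owner;
	struct sketch_waiter *wait_list;	/* sorted, top waiter first */
};

/* Take w off the wait_list and recompute the owner's priority. */
static int sketch_remove_waiter(struct sketch_mutex *m,
				struct sketch_waiter *w)
{
	struct sketch_waiter **pos = &m->wait_list;
	int old_prio, new_prio;

	assert(m->owner != NULL);
	old_prio = m->owner->prio;
	new_prio = m->owner->normal_prio;

	while (*pos != NULL && *pos != w)
		pos = &(*pos)->next;
	assert(*pos == w);	/* w must be queued on this mutex */

	*pos = w->next;
	w->next = NULL;
	w->task = NULL;		/* no longer blocked on the mutex */

	/* The owner inherits from whoever is the top waiter now, if any. */
	if (m->wait_list != NULL && m->wait_list->task->prio < new_prio)
		new_prio = m->wait_list->task->prio;
	m->owner->prio = new_prio;

	return new_prio != old_prio;
}
```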
| 713 | |
| 714 | |
| 715 | Unlocking the Mutex |
| 716 | ------------------- |
| 717 | |
| 718 | The unlocking of a mutex also has a fast path for those architectures with |
| 719 | CMPXCHG. Since the taking of a mutex on contention always sets the |
| 720 | "Has Waiters" flag in the mutex's owner field, we use this to know if we need to |
| 721 | take the slow path when unlocking the mutex. If the mutex doesn't have any |
| 722 | waiters, the owner field of the mutex would equal the current process and |
| 723 | the mutex can be unlocked by just replacing the owner field with NULL. |
| 724 | |
| 725 | If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), |
| 726 | the slow unlock path is taken. |
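
The interaction between the "Has Waiters" bit and the fast unlock can be
sketched with C11 atomics in user space. This is an illustration only: the
names and the exact bit values are assumptions of the sketch, the
compare-and-exchange from <stdatomic.h> stands in for the architecture's
CMPXCHG, and the kernel sets the flag while holding the wait_lock rather
than with an atomic OR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag bits kept in the low bits of the owner word. */
#define SKETCH_PENDING_OWNER	1UL
#define SKETCH_HAS_WAITERS	2UL
#define SKETCH_FLAG_MASK	3UL

/* Aligned to at least 4 bytes so the two low pointer bits are free. */
struct sketch_task {
	_Alignas(4) int prio;
};

struct sketch_mutex {
	_Atomic uintptr_t owner;	/* task pointer plus flag bits */
};

/* Owner task with the flag bits masked off. */
static struct sketch_task *sketch_owner(struct sketch_mutex *m)
{
	uintptr_t val = atomic_load(&m->owner);

	return (struct sketch_task *)(val & ~(uintptr_t)SKETCH_FLAG_MASK);
}

/* What a contending locker does before looking at the owner. */
static void sketch_mark_waiters(struct sketch_mutex *m)
{
	atomic_fetch_or(&m->owner, (uintptr_t)SKETCH_HAS_WAITERS);
}

/*
 * Returns 1 if the mutex was released on the fast path, 0 if the caller
 * has to take the slow unlock path. The exchange only succeeds when the
 * owner word is exactly the current task with no flag bits set.
 */
static int sketch_fast_unlock(struct sketch_mutex *m,
			      struct sketch_task *current)
{
	uintptr_t expected = (uintptr_t)current;

	assert((expected & SKETCH_FLAG_MASK) == 0);
	return atomic_compare_exchange_strong(&m->owner, &expected,
					      (uintptr_t)0);
}
```

Once sketch_mark_waiters has run, the owner word no longer compares equal to
the bare task pointer, so the fast unlock fails and the owner is pushed into
the slow path.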
| 727 | |
| 728 | The first thing done in the slow unlock path is to take the wait_lock of the |
| 729 | mutex. This synchronizes the locking and unlocking of the mutex. |
| 730 | |
| 731 | A check is made to see if the mutex has waiters or not. On architectures that |
| 732 | do not have CMPXCHG, this is the location that the owner of the mutex will |
| 733 | determine if a waiter needs to be awoken or not. On architectures that |
| 734 | do have CMPXCHG, that check is done in the fast path, but it is still needed |
| 735 | in the slow path too. If a waiter of a mutex woke up because of a signal |
| 736 | or timeout between the time the owner failed the fast path CMPXCHG check and |
| 737 | the grabbing of the wait_lock, the mutex may not have any waiters, thus the |
| 738 | owner still needs to make this check. If there are no waiters then the mutex |
| 739 | owner field is set to NULL, the wait_lock is released and nothing more is |
| 740 | needed. |
| 741 | |
| 742 | If there are waiters, then we need to wake one up and give that waiter |
| 743 | pending ownership. |
| 744 | |
| 745 | In the wake up code, the pi_lock of the current owner is taken. The top |
| 746 | waiter of the lock is found and removed from the wait_list of the mutex |
| 747 | as well as the pi_list of the current owner. The task field of the new |
| 748 | pending owner's waiter structure is set to NULL, and the owner field of the |
| 749 | mutex is set to the new owner with the "Pending Owner" bit set, as well |
| 750 | as the "Has Waiters" bit if there still are other processes blocked on the |
| 751 | mutex. |
| 752 | |
| 753 | The pi_lock of the previous owner is released, and the new pending owner's |
| 754 | pi_lock is taken. Remember that this is the trick to prevent the race |
| 755 | condition in rt_mutex_adjust_prio_chain from adding itself as a waiter |
| 756 | on the mutex. |
| 757 | |
| 758 | We now clear the "pi_blocked_on" field of the new pending owner, and if |
| 759 | the mutex still has waiters pending, we add the new top waiter to the pi_list |
| 760 | of the pending owner. |
| 761 | |
| 762 | Finally we unlock the pi_lock of the pending owner and wake it up. |
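
The slow unlock path as a whole can be sketched in user space like this.
It is a simplified model, not the kernel implementation: the names are
hypothetical, the flags are separate fields rather than bits in the owner
pointer, a singly linked list replaces the plist, and the wait_lock, the
pi_lock handling and the pi_list updates are reduced to comments. The
function returns the task that would be woken up, or NULL if there was no
waiter.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_waiter;

struct sketch_task {
	int prio;				/* lower value = higher priority */
	struct sketch_waiter *pi_blocked_on;	/* waiter we sleep on, or NULL */
};

struct sketch_waiter {
	struct sketch_task *task;	/* blocked task, NULL if not blocked */
	struct sketch_waiter *next;
};

struct sketch_mutex {
	struct sketch_task *owner;
	int pending;			/* "Pending Owner" flag */
	int has_waiters;		/* "Has Waiters" flag */
	struct sketch_waiter *wait_list;	/* sorted, top waiter first */
};

static struct sketch_task *sketch_slow_unlock(struct sketch_mutex *m,
					      struct sketch_task *current)
{
	struct sketch_waiter *top;
	struct sketch_task *pendowner;

	assert(m->owner == current);

	/* The wait_lock of the mutex would be taken here. */
	top = m->wait_list;
	if (top == NULL) {
		/* A waiter may have left on a signal or timeout. */
		m->owner = NULL;
		m->pending = 0;
		m->has_waiters = 0;
		return NULL;
	}

	/* Under the old owner's pi_lock: dequeue the top waiter. */
	m->wait_list = top->next;
	top->next = NULL;
	pendowner = top->task;
	top->task = NULL;

	m->owner = pendowner;
	m->pending = 1;
	m->has_waiters = (m->wait_list != NULL);

	/* Under the pending owner's pi_lock: it is no longer blocked. */
	pendowner->pi_blocked_on = NULL;

	return pendowner;	/* the caller would wake this task up */
}
```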
| 763 | |
| 764 | |
| 765 | Contact |
| 766 | ------- |
| 767 | |
| 768 | For updates on this document, please email Steven Rostedt <rostedt@goodmis.org> |
| 769 | |
| 770 | |
| 771 | Credits |
| 772 | ------- |
| 773 | |
| 774 | Author: Steven Rostedt <rostedt@goodmis.org> |
| 775 | |
| 776 | Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap |
| 777 | |
| 778 | Updates |
| 779 | ------- |
| 780 | |
| 781 | This document was originally written for 2.6.17-rc3-mm1 |