blob: 14ab5787b9a77cd485899c704aef0212634aa212 [file] [log] [blame]
Mauro Carvalho Chehabdb4df482017-05-14 13:35:29 -03001================
Darren Hartb30505c2009-05-07 15:40:14 -07002Futex Requeue PI
Mauro Carvalho Chehabdb4df482017-05-14 13:35:29 -03003================
Darren Hartb30505c2009-05-07 15:40:14 -07004
5Requeueing of tasks from a non-PI futex to a PI futex requires
6special handling in order to ensure the underlying rt_mutex is never
7left without an owner if it has waiters; doing so would break the PI
8boosting logic [see rt-mutex-desgin.txt] For the purposes of
9brevity, this action will be referred to as "requeue_pi" throughout
10this document. Priority inheritance is abbreviated throughout as
11"PI".
12
13Motivation
14----------
15
16Without requeue_pi, the glibc implementation of
17pthread_cond_broadcast() must resort to waking all the tasks waiting
18on a pthread_condvar and letting them try to sort out which task
19gets to run first in classic thundering-herd formation. An ideal
20implementation would wake the highest-priority waiter, and leave the
21rest to the natural wakeup inherent in unlocking the mutex
22associated with the condvar.
23
Mauro Carvalho Chehabdb4df482017-05-14 13:35:29 -030024Consider the simplified glibc calls::
Darren Hartb30505c2009-05-07 15:40:14 -070025
Mauro Carvalho Chehabdb4df482017-05-14 13:35:29 -030026 /* caller must lock mutex */
27 pthread_cond_wait(cond, mutex)
28 {
29 lock(cond->__data.__lock);
30 unlock(mutex);
31 do {
32 unlock(cond->__data.__lock);
33 futex_wait(cond->__data.__futex);
34 lock(cond->__data.__lock);
35 } while(...)
36 unlock(cond->__data.__lock);
37 lock(mutex);
38 }
Darren Hartb30505c2009-05-07 15:40:14 -070039
Mauro Carvalho Chehabdb4df482017-05-14 13:35:29 -030040 pthread_cond_broadcast(cond)
41 {
42 lock(cond->__data.__lock);
43 unlock(cond->__data.__lock);
44 futex_requeue(cond->data.__futex, cond->mutex);
45 }
Darren Hartb30505c2009-05-07 15:40:14 -070046
47Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
48has waiters. Note that pthread_cond_wait() attempts to lock the
49mutex only after it has returned to user space. This will leave the
50underlying rt_mutex with waiters, and no owner, breaking the
51previously mentioned PI-boosting algorithms.
52
53In order to support PI-aware pthread_condvar's, the kernel needs to
54be able to requeue tasks to PI futexes. This support implies that
55upon a successful futex_wait system call, the caller would return to
56user space already holding the PI futex. The glibc implementation
Mauro Carvalho Chehabdb4df482017-05-14 13:35:29 -030057would be modified as follows::
Darren Hartb30505c2009-05-07 15:40:14 -070058
59
Mauro Carvalho Chehabdb4df482017-05-14 13:35:29 -030060 /* caller must lock mutex */
61 pthread_cond_wait_pi(cond, mutex)
62 {
63 lock(cond->__data.__lock);
64 unlock(mutex);
65 do {
66 unlock(cond->__data.__lock);
67 futex_wait_requeue_pi(cond->__data.__futex);
68 lock(cond->__data.__lock);
69 } while(...)
70 unlock(cond->__data.__lock);
71 /* the kernel acquired the mutex for us */
72 }
Darren Hartb30505c2009-05-07 15:40:14 -070073
Mauro Carvalho Chehabdb4df482017-05-14 13:35:29 -030074 pthread_cond_broadcast_pi(cond)
75 {
76 lock(cond->__data.__lock);
77 unlock(cond->__data.__lock);
78 futex_requeue_pi(cond->data.__futex, cond->mutex);
79 }
Darren Hartb30505c2009-05-07 15:40:14 -070080
81The actual glibc implementation will likely test for PI and make the
82necessary changes inside the existing calls rather than creating new
83calls for the PI cases. Similar changes are needed for
84pthread_cond_timedwait() and pthread_cond_signal().
85
86Implementation
87--------------
88
89In order to ensure the rt_mutex has an owner if it has waiters, it
90is necessary for both the requeue code, as well as the waiting code,
91to be able to acquire the rt_mutex before returning to user space.
92The requeue code cannot simply wake the waiter and leave it to
93acquire the rt_mutex as it would open a race window between the
94requeue call returning to user space and the waiter waking and
95starting to run. This is especially true in the uncontended case.
96
97The solution involves two new rt_mutex helper routines,
98rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
99allow the requeue code to acquire an uncontended rt_mutex on behalf
100of the waiter and to enqueue the waiter on a contended rt_mutex.
101Two new system calls provide the kernel<->user interface to
Michael Kerrisk40a35502015-01-16 20:27:57 +0100102requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.
Darren Hartb30505c2009-05-07 15:40:14 -0700103
104FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
105and pthread_cond_timedwait()) to block on the initial futex and wait
106to be requeued to a PI-aware futex. The implementation is the
107result of a high-speed collision between futex_wait() and
108futex_lock_pi(), with some extra logic to check for the additional
109wake-up scenarios.
110
Michael Kerrisk40a35502015-01-16 20:27:57 +0100111FUTEX_CMP_REQUEUE_PI is called by the waker
Darren Hartb30505c2009-05-07 15:40:14 -0700112(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
113possibly wake the waiting tasks. Internally, this system call is
114still handled by futex_requeue (by passing requeue_pi=1). Before
115requeueing, futex_requeue() attempts to acquire the requeue target
116PI futex on behalf of the top waiter. If it can, this waiter is
117woken. futex_requeue() then proceeds to requeue the remaining
118nr_wake+nr_requeue tasks to the PI futex, calling
119rt_mutex_start_proxy_lock() prior to each requeue to prepare the
120task as a waiter on the underlying rt_mutex. It is possible that
121the lock can be acquired at this stage as well, if so, the next
122waiter is woken to finish the acquisition of the lock.
123
Michael Kerrisk40a35502015-01-16 20:27:57 +0100124FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
Darren Hartb30505c2009-05-07 15:40:14 -0700125their sum is all that really matters. futex_requeue() will wake or
126requeue up to nr_wake + nr_requeue tasks. It will wake only as many
127tasks as it can acquire the lock for, which in the majority of cases
128should be 0 as good programming practice dictates that the caller of
129either pthread_cond_broadcast() or pthread_cond_signal() acquire the
Michael Kerrisk40a35502015-01-16 20:27:57 +0100130mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
Darren Hartb30505c2009-05-07 15:40:14 -0700131nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
132signal.