Dave Martin | 9762f12 | 2012-08-17 16:07:01 +0100 | [diff] [blame] | 1 | vlocks for Bare-Metal Mutual Exclusion |
| 2 | ====================================== |
| 3 | |
| 4 | Voting Locks, or "vlocks" provide a simple low-level mutual exclusion |
| 5 | mechanism, with reasonable but minimal requirements on the memory |
| 6 | system. |
| 7 | |
| 8 | These are intended to be used to coordinate critical activity among CPUs |
| 9 | which are otherwise non-coherent, in situations where the hardware |
| 10 | provides no other mechanism to support this and ordinary spinlocks |
| 11 | cannot be used. |
| 12 | |
| 13 | |
| 14 | vlocks make use of the atomicity provided by the memory system for |
| 15 | writes to a single memory location. To arbitrate, every CPU "votes for |
| 16 | itself", by storing a unique number to a common memory location. The |
| 17 | final value seen in that memory location when all the votes have been |
| 18 | cast identifies the winner. |
| 19 | |
| 20 | In order to make sure that the election produces an unambiguous result |
| 21 | in finite time, a CPU will only enter the election in the first place if |
| 22 | no winner has been chosen and the election does not appear to have |
| 23 | started yet. |
| 24 | |
| 25 | |
| 26 | Algorithm |
| 27 | --------- |
| 28 | |
| 29 | The easiest way to explain the vlocks algorithm is with some pseudo-code: |
| 30 | |
| 31 | |
| 32 | int currently_voting[NR_CPUS] = { 0, }; |
| 33 | int last_vote = -1; /* no votes yet */ |
| 34 | |
| 35 | bool vlock_trylock(int this_cpu) |
| 36 | { |
| 37 | /* signal our desire to vote */ |
| 38 | currently_voting[this_cpu] = 1; |
| 39 | if (last_vote != -1) { |
| 40 | /* someone already volunteered himself */ |
| 41 | currently_voting[this_cpu] = 0; |
| 42 | return false; /* not ourself */ |
| 43 | } |
| 44 | |
| 45 | /* let's suggest ourself */ |
| 46 | last_vote = this_cpu; |
| 47 | currently_voting[this_cpu] = 0; |
| 48 | |
| 49 | /* then wait until everyone else is done voting */ |
| 50 | for_each_cpu(i) { |
| 51 | while (currently_voting[i] != 0) |
| 52 | /* wait */; |
| 53 | } |
| 54 | |
| 55 | /* result */ |
| 56 | if (last_vote == this_cpu) |
| 57 | return true; /* we won */ |
| 58 | return false; |
| 59 | } |
| 60 | |
| 61 | bool vlock_unlock(void) |
| 62 | { |
| 63 | last_vote = -1; |
| 64 | } |
| 65 | |
| 66 | |
| 67 | The currently_voting[] array provides a way for the CPUs to determine |
| 68 | whether an election is in progress, and plays a role analogous to the |
| 69 | "entering" array in Lamport's bakery algorithm [1]. |
| 70 | |
| 71 | However, once the election has started, the underlying memory system |
| 72 | atomicity is used to pick the winner. This avoids the need for a static |
| 73 | priority rule to act as a tie-breaker, or any counters which could |
| 74 | overflow. |
| 75 | |
| 76 | As long as the last_vote variable is globally visible to all CPUs, it |
| 77 | will contain only one value that won't change once every CPU has cleared |
| 78 | its currently_voting flag. |
| 79 | |
| 80 | |
| 81 | Features and limitations |
| 82 | ------------------------ |
| 83 | |
| 84 | * vlocks are not intended to be fair. In the contended case, it is the |
| 85 | _last_ CPU which attempts to get the lock which will be most likely |
| 86 | to win. |
| 87 | |
| 88 | vlocks are therefore best suited to situations where it is necessary |
| 89 | to pick a unique winner, but it does not matter which CPU actually |
| 90 | wins. |
| 91 | |
| 92 | * Like other similar mechanisms, vlocks will not scale well to a large |
| 93 | number of CPUs. |
| 94 | |
| 95 | vlocks can be cascaded in a voting hierarchy to permit better scaling |
| 96 | if necessary, as in the following hypothetical example for 4096 CPUs: |
| 97 | |
| 98 | /* first level: local election */ |
| 99 | my_town = towns[(this_cpu >> 4) & 0xf]; |
| 100 | I_won = vlock_trylock(my_town, this_cpu & 0xf); |
| 101 | if (I_won) { |
| 102 | /* we won the town election, let's go for the state */ |
| 103 | my_state = states[(this_cpu >> 8) & 0xf]; |
| 104 | I_won = vlock_lock(my_state, this_cpu & 0xf)); |
| 105 | if (I_won) { |
| 106 | /* and so on */ |
| 107 | I_won = vlock_lock(the_whole_country, this_cpu & 0xf]; |
| 108 | if (I_won) { |
| 109 | /* ... */ |
| 110 | } |
| 111 | vlock_unlock(the_whole_country); |
| 112 | } |
| 113 | vlock_unlock(my_state); |
| 114 | } |
| 115 | vlock_unlock(my_town); |
| 116 | |
| 117 | |
| 118 | ARM implementation |
| 119 | ------------------ |
| 120 | |
| 121 | The current ARM implementation [2] contains some optimisations beyond |
| 122 | the basic algorithm: |
| 123 | |
| 124 | * By packing the members of the currently_voting array close together, |
| 125 | we can read the whole array in one transaction (providing the number |
| 126 | of CPUs potentially contending the lock is small enough). This |
| 127 | reduces the number of round-trips required to external memory. |
| 128 | |
| 129 | In the ARM implementation, this means that we can use a single load |
| 130 | and comparison: |
| 131 | |
| 132 | LDR Rt, [Rn] |
| 133 | CMP Rt, #0 |
| 134 | |
| 135 | ...in place of code equivalent to: |
| 136 | |
| 137 | LDRB Rt, [Rn] |
| 138 | CMP Rt, #0 |
| 139 | LDRBEQ Rt, [Rn, #1] |
| 140 | CMPEQ Rt, #0 |
| 141 | LDRBEQ Rt, [Rn, #2] |
| 142 | CMPEQ Rt, #0 |
| 143 | LDRBEQ Rt, [Rn, #3] |
| 144 | CMPEQ Rt, #0 |
| 145 | |
| 146 | This cuts down on the fast-path latency, as well as potentially |
| 147 | reducing bus contention in contended cases. |
| 148 | |
| 149 | The optimisation relies on the fact that the ARM memory system |
| 150 | guarantees coherency between overlapping memory accesses of |
| 151 | different sizes, similarly to many other architectures. Note that |
| 152 | we do not care which element of currently_voting appears in which |
| 153 | bits of Rt, so there is no need to worry about endianness in this |
| 154 | optimisation. |
| 155 | |
| 156 | If there are too many CPUs to read the currently_voting array in |
| 157 | one transaction then multiple transations are still required. The |
| 158 | implementation uses a simple loop of word-sized loads for this |
| 159 | case. The number of transactions is still fewer than would be |
| 160 | required if bytes were loaded individually. |
| 161 | |
| 162 | |
| 163 | In principle, we could aggregate further by using LDRD or LDM, but |
| 164 | to keep the code simple this was not attempted in the initial |
| 165 | implementation. |
| 166 | |
| 167 | |
| 168 | * vlocks are currently only used to coordinate between CPUs which are |
| 169 | unable to enable their caches yet. This means that the |
| 170 | implementation removes many of the barriers which would be required |
| 171 | when executing the algorithm in cached memory. |
| 172 | |
| 173 | packing of the currently_voting array does not work with cached |
| 174 | memory unless all CPUs contending the lock are cache-coherent, due |
| 175 | to cache writebacks from one CPU clobbering values written by other |
| 176 | CPUs. (Though if all the CPUs are cache-coherent, you should be |
| 177 | probably be using proper spinlocks instead anyway). |
| 178 | |
| 179 | |
| 180 | * The "no votes yet" value used for the last_vote variable is 0 (not |
| 181 | -1 as in the pseudocode). This allows statically-allocated vlocks |
| 182 | to be implicitly initialised to an unlocked state simply by putting |
| 183 | them in .bss. |
| 184 | |
| 185 | An offset is added to each CPU's ID for the purpose of setting this |
| 186 | variable, so that no CPU uses the value 0 for its ID. |
| 187 | |
| 188 | |
| 189 | Colophon |
| 190 | -------- |
| 191 | |
| 192 | Originally created and documented by Dave Martin for Linaro Limited, for |
| 193 | use in ARM-based big.LITTLE platforms, with review and input gratefully |
| 194 | received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for |
| 195 | grabbing most of this text out of the relevant mail thread and writing |
| 196 | up the pseudocode. |
| 197 | |
| 198 | Copyright (C) 2012-2013 Linaro Limited |
| 199 | Distributed under the terms of Version 2 of the GNU General Public |
| 200 | License, as defined in linux/COPYING. |
| 201 | |
| 202 | |
| 203 | References |
| 204 | ---------- |
| 205 | |
| 206 | [1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming |
| 207 | Problem", Communications of the ACM 17, 8 (August 1974), 453-455. |
| 208 | |
| 209 | http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm |
| 210 | |
| 211 | [2] linux/arch/arm/common/vlock.S, www.kernel.org. |