| ============================ |
| LINUX KERNEL MEMORY BARRIERS |
| ============================ |
| |
| By: David Howells <dhowells@redhat.com> |
| |
| Contents: |
| |
| (*) Abstract memory access model. |
| |
| - Device operations. |
| - Guarantees. |
| |
| (*) What are memory barriers? |
| |
| - Varieties of memory barrier. |
| - What may not be assumed about memory barriers? |
| - Data dependency barriers. |
| - Control dependencies. |
| - SMP barrier pairing. |
| - Examples of memory barrier sequences. |
| |
| (*) Explicit kernel barriers. |
| |
| - Compiler barrier. |
| - The CPU memory barriers. |
| - MMIO write barrier. |
| |
| (*) Implicit kernel memory barriers. |
| |
| - Locking functions. |
| - Interrupt disabling functions. |
| - Miscellaneous functions. |
| |
| (*) Inter-CPU locking barrier effects. |
| |
| - Locks vs memory accesses. |
| - Locks vs I/O accesses. |
| |
| (*) Where are memory barriers needed? |
| |
| - Interprocessor interaction. |
| - Atomic operations. |
| - Accessing devices. |
| - Interrupts. |
| |
| (*) Kernel I/O barrier effects. |
| |
| (*) Assumed minimum execution ordering model. |
| |
| (*) The effects of the cpu cache. |
| |
| - Cache coherency. |
| - Cache coherency vs DMA. |
| - Cache coherency vs MMIO. |
| |
| (*) The things CPUs get up to. |
| |
| - And then there's the Alpha. |
| |
| (*) References. |
| |
| |
| ============================ |
| ABSTRACT MEMORY ACCESS MODEL |
| ============================ |
| |
| Consider the following abstract model of the system: |
| |
| : : |
| : : |
| : : |
| +-------+ : +--------+ : +-------+ |
| | | : | | : | | |
| | | : | | : | | |
| | CPU 1 |<----->| Memory |<----->| CPU 2 | |
| | | : | | : | | |
| | | : | | : | | |
| +-------+ : +--------+ : +-------+ |
| ^ : ^ : ^ |
| | : | : | |
| | : | : | |
| | : v : | |
| | : +--------+ : | |
| | : | | : | |
| | : | | : | |
| +---------->| Device |<----------+ |
| : | | : |
| : | | : |
| : +--------+ : |
| : : |
| |
| Each CPU executes a program that generates memory access operations. In the |
| abstract CPU, memory operation ordering is very relaxed, and a CPU may actually |
| perform the memory operations in any order it likes, provided program causality |
| appears to be maintained. Similarly, the compiler may also arrange the |
| instructions it emits in any order it likes, provided it doesn't affect the |
| apparent operation of the program. |
| |
| So in the above diagram, the effects of the memory operations performed by a |
| CPU are perceived by the rest of the system as the operations cross the |
| interface between the CPU and rest of the system (the dotted lines). |
| |
| |
| For example, consider the following sequence of events: |
| |
| CPU 1 CPU 2 |
| =============== =============== |
| { A == 1; B == 2 } |
| A = 3; x = A; |
| B = 4; y = B; |
| |
| The set of accesses as seen by the memory system in the middle can be arranged |
| in 24 different combinations: |
| |
| STORE A=3, STORE B=4, x=LOAD A->3, y=LOAD B->4 |
| STORE A=3, STORE B=4, y=LOAD B->4, x=LOAD A->3 |
| STORE A=3, x=LOAD A->3, STORE B=4, y=LOAD B->4 |
| STORE A=3, x=LOAD A->3, y=LOAD B->2, STORE B=4 |
| STORE A=3, y=LOAD B->2, STORE B=4, x=LOAD A->3 |
| STORE A=3, y=LOAD B->2, x=LOAD A->3, STORE B=4 |
| STORE B=4, STORE A=3, x=LOAD A->3, y=LOAD B->4 |
| STORE B=4, ... |
| ... |
| |
| and can thus result in four different combinations of values: |
| |
| x == 1, y == 2 |
| x == 1, y == 4 |
| x == 3, y == 2 |
| x == 3, y == 4 |
| |
| |
| Furthermore, the stores committed by a CPU to the memory system may not be |
| perceived by the loads made by another CPU in the same order as the stores were |
| committed. |
| |
| |
| As a further example, consider this sequence of events: |
| |
| CPU 1 CPU 2 |
| =============== =============== |
| { A == 1, B == 2, C = 3, P == &A, Q == &C } |
| B = 4; Q = P; |
| P = &B D = *Q; |
| |
| There is an obvious data dependency here, as the value loaded into D depends on |
| the address retrieved from P by CPU 2. At the end of the sequence, any of the |
| following results are possible: |
| |
| (Q == &A) and (D == 1) |
| (Q == &B) and (D == 2) |
| (Q == &B) and (D == 4) |
| |
| Note that CPU 2 will never try and load C into D because the CPU will load P |
| into Q before issuing the load of *Q. |
| |
| |
| DEVICE OPERATIONS |
| ----------------- |
| |
| Some devices present their control interfaces as collections of memory |
| locations, but the order in which the control registers are accessed is very |
| important. For instance, imagine an ethernet card with a set of internal |
| registers that are accessed through an address port register (A) and a data |
| port register (D). To read internal register 5, the following code might then |
| be used: |
| |
| *A = 5; |
| x = *D; |
| |
| but this might show up as either of the following two sequences: |
| |
| STORE *A = 5, x = LOAD *D |
| x = LOAD *D, STORE *A = 5 |
| |
| the second of which will almost certainly result in a malfunction, since it set |
| the address _after_ attempting to read the register. |
| |
| |
| GUARANTEES |
| ---------- |
| |
| There are some minimal guarantees that may be expected of a CPU: |
| |
| (*) On any given CPU, dependent memory accesses will be issued in order, with |
| respect to itself. This means that for: |
| |
| Q = P; D = *Q; |
| |
| the CPU will issue the following memory operations: |
| |
| Q = LOAD P, D = LOAD *Q |
| |
| and always in that order. |
| |
| (*) Overlapping loads and stores within a particular CPU will appear to be |
| ordered within that CPU. This means that for: |
| |
| a = *X; *X = b; |
| |
| the CPU will only issue the following sequence of memory operations: |
| |
| a = LOAD *X, STORE *X = b |
| |
| And for: |
| |
| *X = c; d = *X; |
| |
| the CPU will only issue: |
| |
| STORE *X = c, d = LOAD *X |
| |
| (Loads and stores overlap if they are targetted at overlapping pieces of |
| memory). |
| |
| And there are a number of things that _must_ or _must_not_ be assumed: |
| |
| (*) It _must_not_ be assumed that independent loads and stores will be issued |
| in the order given. This means that for: |
| |
| X = *A; Y = *B; *D = Z; |
| |
| we may get any of the following sequences: |
| |
| X = LOAD *A, Y = LOAD *B, STORE *D = Z |
| X = LOAD *A, STORE *D = Z, Y = LOAD *B |
| Y = LOAD *B, X = LOAD *A, STORE *D = Z |
| Y = LOAD *B, STORE *D = Z, X = LOAD *A |
| STORE *D = Z, X = LOAD *A, Y = LOAD *B |
| STORE *D = Z, Y = LOAD *B, X = LOAD *A |
| |
| (*) It _must_ be assumed that overlapping memory accesses may be merged or |
| discarded. This means that for: |
| |
| X = *A; Y = *(A + 4); |
| |
| we may get any one of the following sequences: |
| |
| X = LOAD *A; Y = LOAD *(A + 4); |
| Y = LOAD *(A + 4); X = LOAD *A; |
| {X, Y} = LOAD {*A, *(A + 4) }; |
| |
| And for: |
| |
| *A = X; Y = *A; |
| |
| we may get either of: |
| |
| STORE *A = X; Y = LOAD *A; |
| STORE *A = Y; |
| |
| |
| ========================= |
| WHAT ARE MEMORY BARRIERS? |
| ========================= |
| |
| As can be seen above, independent memory operations are effectively performed |
| in random order, but this can be a problem for CPU-CPU interaction and for I/O. |
| What is required is some way of intervening to instruct the compiler and the |
| CPU to restrict the order. |
| |
| Memory barriers are such interventions. They impose a perceived partial |
| ordering between the memory operations specified on either side of the barrier. |
| They request that the sequence of memory events generated appears to other |
| parts of the system as if the barrier is effective on that CPU. |
| |
| |
| VARIETIES OF MEMORY BARRIER |
| --------------------------- |
| |
| Memory barriers come in four basic varieties: |
| |
| (1) Write (or store) memory barriers. |
| |
| A write memory barrier gives a guarantee that all the STORE operations |
| specified before the barrier will appear to happen before all the STORE |
| operations specified after the barrier with respect to the other |
| components of the system. |
| |
| A write barrier is a partial ordering on stores only; it is not required |
| to have any effect on loads. |
| |
| A CPU can be viewed as as commiting a sequence of store operations to the |
| memory system as time progresses. All stores before a write barrier will |
| occur in the sequence _before_ all the stores after the write barrier. |
| |
| [!] Note that write barriers should normally be paired with read or data |
| dependency barriers; see the "SMP barrier pairing" subsection. |
| |
| |
| (2) Data dependency barriers. |
| |
| A data dependency barrier is a weaker form of read barrier. In the case |
| where two loads are performed such that the second depends on the result |
| of the first (eg: the first load retrieves the address to which the second |
| load will be directed), a data dependency barrier would be required to |
| make sure that the target of the second load is updated before the address |
| obtained by the first load is accessed. |
| |
| A data dependency barrier is a partial ordering on interdependent loads |
| only; it is not required to have any effect on stores, independent loads |
| or overlapping loads. |
| |
| As mentioned in (1), the other CPUs in the system can be viewed as |
| committing sequences of stores to the memory system that the CPU being |
| considered can then perceive. A data dependency barrier issued by the CPU |
| under consideration guarantees that for any load preceding it, if that |
| load touches one of a sequence of stores from another CPU, then by the |
| time the barrier completes, the effects of all the stores prior to that |
| touched by the load will be perceptible to any loads issued after the data |
| dependency barrier. |
| |
| See the "Examples of memory barrier sequences" subsection for diagrams |
| showing the ordering constraints. |
| |
| [!] Note that the first load really has to have a _data_ dependency and |
| not a control dependency. If the address for the second load is dependent |
| on the first load, but the dependency is through a conditional rather than |
| actually loading the address itself, then it's a _control_ dependency and |
| a full read barrier or better is required. See the "Control dependencies" |
| subsection for more information. |
| |
| [!] Note that data dependency barriers should normally be paired with |
| write barriers; see the "SMP barrier pairing" subsection. |
| |
| |
| (3) Read (or load) memory barriers. |
| |
| A read barrier is a data dependency barrier plus a guarantee that all the |
| LOAD operations specified before the barrier will appear to happen before |
| all the LOAD operations specified after the barrier with respect to the |
| other components of the system. |
| |
| A read barrier is a partial ordering on loads only; it is not required to |
| have any effect on stores. |
| |
| Read memory barriers imply data dependency barriers, and so can substitute |
| for them. |
| |
| [!] Note that read barriers should normally be paired with write barriers; |
| see the "SMP barrier pairing" subsection. |
| |
| |
| (4) General memory barriers. |
| |
| A general memory barrier is a combination of both a read memory barrier |
| and a write memory barrier. It is a partial ordering over both loads and |
| stores. |
| |
| General memory barriers imply both read and write memory barriers, and so |
| can substitute for either. |
| |
| |
| And a couple of implicit varieties: |
| |
| (5) LOCK operations. |
| |
| This acts as a one-way permeable barrier. It guarantees that all memory |
| operations after the LOCK operation will appear to happen after the LOCK |
| operation with respect to the other components of the system. |
| |
| Memory operations that occur before a LOCK operation may appear to happen |
| after it completes. |
| |
| A LOCK operation should almost always be paired with an UNLOCK operation. |
| |
| |
| (6) UNLOCK operations. |
| |
| This also acts as a one-way permeable barrier. It guarantees that all |
| memory operations before the UNLOCK operation will appear to happen before |
| the UNLOCK operation with respect to the other components of the system. |
| |
| Memory operations that occur after an UNLOCK operation may appear to |
| happen before it completes. |
| |
| LOCK and UNLOCK operations are guaranteed to appear with respect to each |
| other strictly in the order specified. |
| |
| The use of LOCK and UNLOCK operations generally precludes the need for |
| other sorts of memory barrier (but note the exceptions mentioned in the |
| subsection "MMIO write barrier"). |
| |
| |
| Memory barriers are only required where there's a possibility of interaction |
| between two CPUs or between a CPU and a device. If it can be guaranteed that |
| there won't be any such interaction in any particular piece of code, then |
| memory barriers are unnecessary in that piece of code. |
| |
| |
| Note that these are the _minimum_ guarantees. Different architectures may give |
| more substantial guarantees, but they may _not_ be relied upon outside of arch |
| specific code. |
| |
| |
| WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS? |
| ---------------------------------------------- |
| |
| There are certain things that the Linux kernel memory barriers do not guarantee: |
| |
| (*) There is no guarantee that any of the memory accesses specified before a |
| memory barrier will be _complete_ by the completion of a memory barrier |
| instruction; the barrier can be considered to draw a line in that CPU's |
| access queue that accesses of the appropriate type may not cross. |
| |
| (*) There is no guarantee that issuing a memory barrier on one CPU will have |
| any direct effect on another CPU or any other hardware in the system. The |
| indirect effect will be the order in which the second CPU sees the effects |
| of the first CPU's accesses occur, but see the next point: |
| |
| (*) There is no guarantee that the a CPU will see the correct order of effects |
| from a second CPU's accesses, even _if_ the second CPU uses a memory |
| barrier, unless the first CPU _also_ uses a matching memory barrier (see |
| the subsection on "SMP Barrier Pairing"). |
| |
| (*) There is no guarantee that some intervening piece of off-the-CPU |
| hardware[*] will not reorder the memory accesses. CPU cache coherency |
| mechanisms should propagate the indirect effects of a memory barrier |
| between CPUs, but might not do so in order. |
| |
| [*] For information on bus mastering DMA and coherency please read: |
| |
| Documentation/pci.txt |
| Documentation/DMA-mapping.txt |
| Documentation/DMA-API.txt |
| |
| |
| DATA DEPENDENCY BARRIERS |
| ------------------------ |
| |
| The usage requirements of data dependency barriers are a little subtle, and |
| it's not always obvious that they're needed. To illustrate, consider the |
| following sequence of events: |
| |
| CPU 1 CPU 2 |
| =============== =============== |
| { A == 1, B == 2, C = 3, P == &A, Q == &C } |
| B = 4; |
| <write barrier> |
| P = &B |
| Q = P; |
| D = *Q; |
| |
| There's a clear data dependency here, and it would seem that by the end of the |
| sequence, Q must be either &A or &B, and that: |
| |
| (Q == &A) implies (D == 1) |
| (Q == &B) implies (D == 4) |
| |
| But! CPU 2's perception of P may be updated _before_ its perception of B, thus |
| leading to the following situation: |
| |
| (Q == &B) and (D == 2) ???? |
| |
| Whilst this may seem like a failure of coherency or causality maintenance, it |
| isn't, and this behaviour can be observed on certain real CPUs (such as the DEC |
| Alpha). |
| |
| To deal with this, a data dependency barrier must be inserted between the |
| address load and the data load: |
| |
| CPU 1 CPU 2 |
| =============== =============== |
| { A == 1, B == 2, C = 3, P == &A, Q == &C } |
| B = 4; |
| <write barrier> |
| P = &B |
| Q = P; |
| <data dependency barrier> |
| D = *Q; |
| |
| This enforces the occurrence of one of the two implications, and prevents the |
| third possibility from arising. |
| |
| [!] Note that this extremely counterintuitive situation arises most easily on |
| machines with split caches, so that, for example, one cache bank processes |
| even-numbered cache lines and the other bank processes odd-numbered cache |
| lines. The pointer P might be stored in an odd-numbered cache line, and the |
| variable B might be stored in an even-numbered cache line. Then, if the |
| even-numbered bank of the reading CPU's cache is extremely busy while the |
| odd-numbered bank is idle, one can see the new value of the pointer P (&B), |
| but the old value of the variable B (1). |
| |
| |
| Another example of where data dependency barriers might by required is where a |
| number is read from memory and then used to calculate the index for an array |
| access: |
| |
| CPU 1 CPU 2 |
| =============== =============== |
| { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 } |
| M[1] = 4; |
| <write barrier> |
| P = 1 |
| Q = P; |
| <data dependency barrier> |
| D = M[Q]; |
| |
| |
| The data dependency barrier is very important to the RCU system, for example. |
| See rcu_dereference() in include/linux/rcupdate.h. This permits the current |
| target of an RCU'd pointer to be replaced with a new modified target, without |
| the replacement target appearing to be incompletely initialised. |
| |
| See also the subsection on "Cache Coherency" for a more thorough example. |
| |
| |
| CONTROL DEPENDENCIES |
| -------------------- |
| |
| A control dependency requires a full read memory barrier, not simply a data |
| dependency barrier to make it work correctly. Consider the following bit of |
| code: |
| |
| q = &a; |
| if (p) |
| q = &b; |
| <data dependency barrier> |
| x = *q; |
| |
| This will not have the desired effect because there is no actual data |
| dependency, but rather a control dependency that the CPU may short-circuit by |
| attempting to predict the outcome in advance. In such a case what's actually |
| required is: |
| |
| q = &a; |
| if (p) |
| q = &b; |
| <read barrier> |
| x = *q; |
| |
| |
| SMP BARRIER PAIRING |
| ------------------- |
| |
| When dealing with CPU-CPU interactions, certain types of memory barrier should |
| always be paired. A lack of appropriate pairing is almost certainly an error. |
| |
| A write barrier should always be paired with a data dependency barrier or read |
| barrier, though a general barrier would also be viable. Similarly a read |
| barrier or a data dependency barrier should always be paired with at least an |
| write barrier, though, again, a general barrier is viable: |
| |
| CPU 1 CPU 2 |
| =============== =============== |
| a = 1; |
| <write barrier> |
| b = 2; x = a; |
| <read barrier> |
| y = b; |
| |
| Or: |
| |
| CPU 1 CPU 2 |
| =============== =============================== |
| a = 1; |
| <write barrier> |
| b = &a; x = b; |
| <data dependency barrier> |
| y = *x; |
| |
| Basically, the read barrier always has to be there, even though it can be of |
| the "weaker" type. |
| |
| |
| EXAMPLES OF MEMORY BARRIER SEQUENCES |
| ------------------------------------ |
| |
| Firstly, write barriers act as a partial orderings on store operations. |
| Consider the following sequence of events: |
| |
| CPU 1 |
| ======================= |
| STORE A = 1 |
| STORE B = 2 |
| STORE C = 3 |
| <write barrier> |
| STORE D = 4 |
| STORE E = 5 |
| |
| This sequence of events is committed to the memory coherence system in an order |
| that the rest of the system might perceive as the unordered set of { STORE A, |
| STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E |
| }: |
| |
| +-------+ : : |
| | | +------+ |
| | |------>| C=3 | } /\ |
| | | : +------+ }----- \ -----> Events perceptible |
| | | : | A=1 | } \/ to rest of system |
| | | : +------+ } |
| | CPU 1 | : | B=2 | } |
| | | +------+ } |
| | | wwwwwwwwwwwwwwww } <--- At this point the write barrier |
| | | +------+ } requires all stores prior to the |
| | | : | E=5 | } barrier to be committed before |
| | | : +------+ } further stores may be take place. |
| | |------>| D=4 | } |
| | | +------+ |
| +-------+ : : |
| | |
| | Sequence in which stores committed to memory system |
| | by CPU 1 |
| V |
| |
| |
| Secondly, data dependency barriers act as a partial orderings on data-dependent |
| loads. Consider the following sequence of events: |
| |
| CPU 1 CPU 2 |
| ======================= ======================= |
| STORE A = 1 |
| STORE B = 2 |
| <write barrier> |
| STORE C = &B LOAD X |
| STORE D = 4 LOAD C (gets &B) |
| LOAD *C (reads B) |
| |
| Without intervention, CPU 2 may perceive the events on CPU 1 in some |
| effectively random order, despite the write barrier issued by CPU 1: |
| |
| +-------+ : : : : |
| | | +------+ +-------+ | Sequence of update |
| | |------>| B=2 |----- --->| Y->8 | | of perception on |
| | | : +------+ \ +-------+ | CPU 2 |
| | CPU 1 | : | A=1 | \ --->| C->&Y | V |
| | | +------+ | +-------+ |
| | | wwwwwwwwwwwwwwww | : : |
| | | +------+ | : : |
| | | : | C=&B |--- | : : +-------+ |
| | | : +------+ \ | +-------+ | | |
| | |------>| D=4 | ----------->| C->&B |------>| | |
| | | +------+ | +-------+ | | |
| +-------+ : : | : : | | |
| | : : | | |
| | : : | CPU 2 | |
| | +-------+ | | |
| Apparently incorrect ---> | | B->7 |------>| | |
| perception of B (!) | +-------+ | | |
| | : : | | |
| | +-------+ | | |
| The load of X holds ---> \ | X->9 |------>| | |
| up the maintenance \ +-------+ | | |
| of coherence of B ----->| B->2 | +-------+ |
| +-------+ |
| : : |
| |
| |
| In the above example, CPU 2 perceives that B is 7, despite the load of *C |
| (which would be B) coming after the the LOAD of C. |
| |
| If, however, a data dependency barrier were to be placed between the load of C |
| and the load of *C (ie: B) on CPU 2, then the following will occur: |
| |
| +-------+ : : : : |
| | | +------+ +-------+ |
| | |------>| B=2 |----- --->| Y->8 | |
| | | : +------+ \ +-------+ |
| | CPU 1 | : | A=1 | \ --->| C->&Y | |
| | | +------+ | +-------+ |
| | | wwwwwwwwwwwwwwww | : : |
| | | +------+ | : : |
| | | : | C=&B |--- | : : +-------+ |
| | | : +------+ \ | +-------+ | | |
| | |------>| D=4 | ----------->| C->&B |------>| | |
| | | +------+ | +-------+ | | |
| +-------+ : : | : : | | |
| | : : | | |
| | : : | CPU 2 | |
| | +-------+ | | |
| \ | X->9 |------>| | |
| \ +-------+ | | |
| ----->| B->2 | | | |
| +-------+ | | |
| Makes sure all effects ---> ddddddddddddddddd | | |
| prior to the store of C +-------+ | | |
| are perceptible to | B->2 |------>| | |
| successive loads +-------+ | | |
| : : +-------+ |
| |
| |
| And thirdly, a read barrier acts as a partial order on loads. Consider the |
| following sequence of events: |
| |
| CPU 1 CPU 2 |
| ======================= ======================= |
| STORE A=1 |
| STORE B=2 |
| STORE C=3 |
| <write barrier> |
| STORE D=4 |
| STORE E=5 |
| LOAD A |
| LOAD B |
| LOAD C |
| LOAD D |
| LOAD E |
| |
| Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in |
| some effectively random order, despite the write barrier issued by CPU 1: |
| |
| +-------+ : : |
| | | +------+ |
| | |------>| C=3 | } |
| | | : +------+ } |
| | | : | A=1 | } |
| | | : +------+ } |
| | CPU 1 | : | B=2 | }--- |
| | | +------+ } \ |
| | | wwwwwwwwwwwww} \ |
| | | +------+ } \ : : +-------+ |
| | | : | E=5 | } \ +-------+ | | |
| | | : +------+ } \ { | C->3 |------>| | |
| | |------>| D=4 | } \ { +-------+ : | | |
| | | +------+ \ { | E->5 | : | | |
| +-------+ : : \ { +-------+ : | | |
| Transfer -->{ | A->1 | : | CPU 2 | |
| from CPU 1 { +-------+ : | | |
| to CPU 2 { | D->4 | : | | |
| { +-------+ : | | |
| { | B->2 |------>| | |
| +-------+ | | |
| : : +-------+ |
| |
| |
| If, however, a read barrier were to be placed between the load of C and the |
| load of D on CPU 2, then the partial ordering imposed by CPU 1 will be |
| perceived correctly by CPU 2. |
| |
| +-------+ : : |
| | | +------+ |
| | |------>| C=3 | } |
| | | : +------+ } |
| | | : | A=1 | }--- |
| | | : +------+ } \ |
| | CPU 1 | : | B=2 | } \ |
| | | +------+ \ |
| | | wwwwwwwwwwwwwwww \ |
| | | +------+ \ : : +-------+ |
| | | : | E=5 | } \ +-------+ | | |
| | | : +------+ }--- \ { | C->3 |------>| | |
| | |------>| D=4 | } \ \ { +-------+ : | | |
| | | +------+ \ -->{ | B->2 | : | | |
| +-------+ : : \ { +-------+ : | | |
| \ { | A->1 | : | CPU 2 | |
| \ +-------+ | | |
| At this point the read ----> \ rrrrrrrrrrrrrrrrr | | |
| barrier causes all effects \ +-------+ | | |
| prior to the storage of C \ { | E->5 | : | | |
| to be perceptible to CPU 2 -->{ +-------+ : | | |
| { | D->4 |------>| | |
| +-------+ | | |
| : : +-------+ |
| |
| |
| ======================== |
| EXPLICIT KERNEL BARRIERS |
| ======================== |
| |
| The Linux kernel has a variety of different barriers that act at different |
| levels: |
| |
| (*) Compiler barrier. |
| |
| (*) CPU memory barriers. |
| |
| (*) MMIO write barrier. |
| |
| |
| COMPILER BARRIER |
| ---------------- |
| |
| The Linux kernel has an explicit compiler barrier function that prevents the |
| compiler from moving the memory accesses either side of it to the other side: |
| |
| barrier(); |
| |
| This a general barrier - lesser varieties of compiler barrier do not exist. |
| |
| The compiler barrier has no direct effect on the CPU, which may then reorder |
| things however it wishes. |
| |
| |
| CPU MEMORY BARRIERS |
| ------------------- |
| |
| The Linux kernel has eight basic CPU memory barriers: |
| |
| TYPE MANDATORY SMP CONDITIONAL |
| =============== ======================= =========================== |
| GENERAL mb() smp_mb() |
| WRITE wmb() smp_wmb() |
| READ rmb() smp_rmb() |
| DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() |
| |
| |
| All CPU memory barriers unconditionally imply compiler barriers. |
| |
| SMP memory barriers are reduced to compiler barriers on uniprocessor compiled |
| systems because it is assumed that a CPU will be appear to be self-consistent, |
| and will order overlapping accesses correctly with respect to itself. |
| |
| [!] Note that SMP memory barriers _must_ be used to control the ordering of |
| references to shared memory on SMP systems, though the use of locking instead |
| is sufficient. |
| |
| Mandatory barriers should not be used to control SMP effects, since mandatory |
| barriers unnecessarily impose overhead on UP systems. They may, however, be |
| used to control MMIO effects on accesses through relaxed memory I/O windows. |
| These are required even on non-SMP systems as they affect the order in which |
| memory operations appear to a device by prohibiting both the compiler and the |
| CPU from reordering them. |
| |
| |
| There are some more advanced barrier functions: |
| |
| (*) set_mb(var, value) |
| (*) set_wmb(var, value) |
| |
| These assign the value to the variable and then insert at least a write |
| barrier after it, depending on the function. They aren't guaranteed to |
| insert anything more than a compiler barrier in a UP compilation. |
| |
| |
| (*) smp_mb__before_atomic_dec(); |
| (*) smp_mb__after_atomic_dec(); |
| (*) smp_mb__before_atomic_inc(); |
| (*) smp_mb__after_atomic_inc(); |
| |
| These are for use with atomic add, subtract, increment and decrement |
| functions, especially when used for reference counting. These functions |
| do not imply memory barriers. |
| |
| As an example, consider a piece of code that marks an object as being dead |
| and then decrements the object's reference count: |
| |
| obj->dead = 1; |
| smp_mb__before_atomic_dec(); |
| atomic_dec(&obj->ref_count); |
| |
| This makes sure that the death mark on the object is perceived to be set |
| *before* the reference counter is decremented. |
| |
| See Documentation/atomic_ops.txt for more information. See the "Atomic |
| operations" subsection for information on where to use these. |
| |
| |
| (*) smp_mb__before_clear_bit(void); |
| (*) smp_mb__after_clear_bit(void); |
| |
| These are for use similar to the atomic inc/dec barriers. These are |
| typically used for bitwise unlocking operations, so care must be taken as |
| there are no implicit memory barriers here either. |
| |
| Consider implementing an unlock operation of some nature by clearing a |
| locking bit. The clear_bit() would then need to be barriered like this: |
| |
| smp_mb__before_clear_bit(); |
| clear_bit( ... ); |
| |
| This prevents memory operations before the clear leaking to after it. See |
| the subsection on "Locking Functions" with reference to UNLOCK operation |
| implications. |
| |
| See Documentation/atomic_ops.txt for more information. See the "Atomic |
| operations" subsection for information on where to use these. |
| |
| |
| MMIO WRITE BARRIER |
| ------------------ |
| |
| The Linux kernel also has a special barrier for use with memory-mapped I/O |
| writes: |
| |
| mmiowb(); |
| |
| This is a variation on the mandatory write barrier that causes writes to weakly |
| ordered I/O regions to be partially ordered. Its effects may go beyond the |
| CPU->Hardware interface and actually affect the hardware at some level. |
| |
| See the subsection "Locks vs I/O accesses" for more information. |
| |
| |
| =============================== |
| IMPLICIT KERNEL MEMORY BARRIERS |
| =============================== |
| |
| Some of the other functions in the linux kernel imply memory barriers, amongst |
| which are locking, scheduling and memory allocation functions. |
| |
| This specification is a _minimum_ guarantee; any particular architecture may |
| provide more substantial guarantees, but these may not be relied upon outside |
| of arch specific code. |
| |
| |
| LOCKING FUNCTIONS |
| ----------------- |
| |
| The Linux kernel has a number of locking constructs: |
| |
| (*) spin locks |
| (*) R/W spin locks |
| (*) mutexes |
| (*) semaphores |
| (*) R/W semaphores |
| (*) RCU |
| |
| In all cases there are variants on "LOCK" operations and "UNLOCK" operations |
| for each construct. These operations all imply certain barriers: |
| |
| (1) LOCK operation implication: |
| |
| Memory operations issued after the LOCK will be completed after the LOCK |
| operation has completed. |
| |
| Memory operations issued before the LOCK may be completed after the LOCK |
| operation has completed. |
| |
| (2) UNLOCK operation implication: |
| |
| Memory operations issued before the UNLOCK will be completed before the |
| UNLOCK operation has completed. |
| |
| Memory operations issued after the UNLOCK may be completed before the |
| UNLOCK operation has completed. |
| |
| (3) LOCK vs LOCK implication: |
| |
| All LOCK operations issued before another LOCK operation will be completed |
| before that LOCK operation. |
| |
| (4) LOCK vs UNLOCK implication: |
| |
| All LOCK operations issued before an UNLOCK operation will be completed |
| before the UNLOCK operation. |
| |
| All UNLOCK operations issued before a LOCK operation will be completed |
| before the LOCK operation. |
| |
| (5) Failed conditional LOCK implication: |
| |
| Certain variants of the LOCK operation may fail, either due to being |
| unable to get the lock immediately, or due to receiving an unblocked |
| signal whilst asleep waiting for the lock to become available. Failed |
| locks do not imply any sort of barrier. |
| |
| Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is |
| equivalent to a full barrier, but a LOCK followed by an UNLOCK is not. |
| |
| [!] Note: one of the consequence of LOCKs and UNLOCKs being only one-way |
| barriers is that the effects instructions outside of a critical section may |
| seep into the inside of the critical section. |
| |
| Locks and semaphores may not provide any guarantee of ordering on UP compiled |
| systems, and so cannot be counted on in such a situation to actually achieve |
| anything at all - especially with respect to I/O accesses - unless combined |
| with interrupt disabling operations. |
| |
| See also the section on "Inter-CPU locking barrier effects". |
| |
| |
| As an example, consider the following: |
| |
| *A = a; |
| *B = b; |
| LOCK |
| *C = c; |
| *D = d; |
| UNLOCK |
| *E = e; |
| *F = f; |
| |
| The following sequence of events is acceptable: |
| |
| LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK |
| |
| [+] Note that {*F,*A} indicates a combined access. |
| |
| But none of the following are: |
| |
| {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E |
| *A, *B, *C, LOCK, *D, UNLOCK, *E, *F |
| *A, *B, LOCK, *C, UNLOCK, *D, *E, *F |
| *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E |
| |
| |
| |
| INTERRUPT DISABLING FUNCTIONS |
| ----------------------------- |
| |
| Functions that disable interrupts (LOCK equivalent) and enable interrupts |
| (UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O |
| barriers are required in such a situation, they must be provided from some |
| other means. |
| |
| |
| MISCELLANEOUS FUNCTIONS |
| ----------------------- |
| |
| Other functions that imply barriers: |
| |
| (*) schedule() and similar imply full memory barriers. |
| |
| (*) Memory allocation and release functions imply full memory barriers. |
| |
| |
| ================================= |
| INTER-CPU LOCKING BARRIER EFFECTS |
| ================================= |
| |
| On SMP systems locking primitives give a more substantial form of barrier: one |
| that does affect memory access ordering on other CPUs, within the context of |
| conflict on any particular lock. |
| |
| |
| LOCKS VS MEMORY ACCESSES |
| ------------------------ |
| |
| Consider the following: the system has a pair of spinlocks (N) and (Q), and |
| three CPUs; then should the following sequence of events occur: |
| |
| CPU 1 CPU 2 |
| =============================== =============================== |
| *A = a; *E = e; |
| LOCK M LOCK Q |
| *B = b; *F = f; |
| *C = c; *G = g; |
| UNLOCK M UNLOCK Q |
| *D = d; *H = h; |
| |
| Then there is no guarantee as to what order CPU #3 will see the accesses to *A |
| through *H occur in, other than the constraints imposed by the separate locks |
| on the separate CPUs. It might, for example, see: |
| |
| *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M |
| |
| But it won't see any of: |
| |
| *B, *C or *D preceding LOCK M |
| *A, *B or *C following UNLOCK M |
| *F, *G or *H preceding LOCK Q |
| *E, *F or *G following UNLOCK Q |
| |
| |
| However, if the following occurs: |
| |
| CPU 1 CPU 2 |
| =============================== =============================== |
| *A = a; |
| LOCK M [1] |
| *B = b; |
| *C = c; |
| UNLOCK M [1] |
| *D = d; *E = e; |
| LOCK M [2] |
| *F = f; |
| *G = g; |
| UNLOCK M [2] |
| *H = h; |
| |
| CPU #3 might see: |
| |
| *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], |
| LOCK M [2], *H, *F, *G, UNLOCK M [2], *D |
| |
| But assuming CPU #1 gets the lock first, it won't see any of: |
| |
| *B, *C, *D, *F, *G or *H preceding LOCK M [1] |
| *A, *B or *C following UNLOCK M [1] |
| *F, *G or *H preceding LOCK M [2] |
| *A, *B, *C, *E, *F or *G following UNLOCK M [2] |
| |
| |
| LOCKS VS I/O ACCESSES |
| --------------------- |
| |
| Under certain circumstances (especially involving NUMA), I/O accesses within |
| two spinlocked sections on two different CPUs may be seen as interleaved by the |
| PCI bridge, because the PCI bridge does not necessarily participate in the |
| cache-coherence protocol, and is therefore incapable of issuing the required |
| read memory barriers. |
| |
| For example: |
| |
| CPU 1 CPU 2 |
| =============================== =============================== |
| spin_lock(Q) |
| writel(0, ADDR) |
| writel(1, DATA); |
| spin_unlock(Q); |
| spin_lock(Q); |
| writel(4, ADDR); |
| writel(5, DATA); |
| spin_unlock(Q); |
| |
| may be seen by the PCI bridge as follows: |
| |
| STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5 |
| |
| which would probably cause the hardware to malfunction. |
| |
| |
| What is necessary here is to intervene with an mmiowb() before dropping the |
| spinlock, for example: |
| |
| CPU 1 CPU 2 |
| =============================== =============================== |
| spin_lock(Q) |
| writel(0, ADDR) |
| writel(1, DATA); |
| mmiowb(); |
| spin_unlock(Q); |
| spin_lock(Q); |
| writel(4, ADDR); |
| writel(5, DATA); |
| mmiowb(); |
| spin_unlock(Q); |
| |
| this will ensure that the two stores issued on CPU #1 appear at the PCI bridge |
| before either of the stores issued on CPU #2. |
| |
| |
| Furthermore, following a store by a load to the same device obviates the need |
| for an mmiowb(), because the load forces the store to complete before the load |
| is performed: |
| |
| CPU 1 CPU 2 |
| =============================== =============================== |
| spin_lock(Q) |
| writel(0, ADDR) |
| a = readl(DATA); |
| spin_unlock(Q); |
| spin_lock(Q); |
| writel(4, ADDR); |
| b = readl(DATA); |
| spin_unlock(Q); |
| |
| |
| See Documentation/DocBook/deviceiobook.tmpl for more information. |
| |
| |
| ================================= |
| WHERE ARE MEMORY BARRIERS NEEDED? |
| ================================= |
| |
| Under normal operation, memory operation reordering is generally not going to |
| be a problem as a single-threaded linear piece of code will still appear to |
| work correctly, even if it's in an SMP kernel. There are, however, three |
| circumstances in which reordering definitely _could_ be a problem: |
| |
| (*) Interprocessor interaction. |
| |
| (*) Atomic operations. |
| |
| (*) Accessing devices (I/O). |
| |
| (*) Interrupts. |
| |
| |
| INTERPROCESSOR INTERACTION |
| -------------------------- |
| |
| When there's a system with more than one processor, more than one CPU in the |
| system may be working on the same data set at the same time. This can cause |
| synchronisation problems, and the usual way of dealing with them is to use |
| locks. Locks, however, are quite expensive, and so it may be preferable to |
| operate without the use of a lock if at all possible. In such a case |
| operations that affect both CPUs may have to be carefully ordered to prevent |
| a malfunction. |
| |
| Consider, for example, the R/W semaphore slow path. Here a waiting process is |
| queued on the semaphore, by virtue of it having a piece of its stack linked to |
| the semaphore's list of waiting processes: |
| |
| struct rw_semaphore { |
| ... |
| spinlock_t lock; |
| struct list_head waiters; |
| }; |
| |
| struct rwsem_waiter { |
| struct list_head list; |
| struct task_struct *task; |
| }; |
| |
| To wake up a particular waiter, the up_read() or up_write() functions have to: |
| |
| (1) read the next pointer from this waiter's record to know as to where the |
| next waiter record is; |
| |
| (4) read the pointer to the waiter's task structure; |
| |
| (3) clear the task pointer to tell the waiter it has been given the semaphore; |
| |
| (4) call wake_up_process() on the task; and |
| |
| (5) release the reference held on the waiter's task struct. |
| |
| In otherwords, it has to perform this sequence of events: |
| |
| LOAD waiter->list.next; |
| LOAD waiter->task; |
| STORE waiter->task; |
| CALL wakeup |
| RELEASE task |
| |
| and if any of these steps occur out of order, then the whole thing may |
| malfunction. |
| |
| Once it has queued itself and dropped the semaphore lock, the waiter does not |
| get the lock again; it instead just waits for its task pointer to be cleared |
| before proceeding. Since the record is on the waiter's stack, this means that |
| if the task pointer is cleared _before_ the next pointer in the list is read, |
| another CPU might start processing the waiter and might clobber the waiter's |
| stack before the up*() function has a chance to read the next pointer. |
| |
| Consider then what might happen to the above sequence of events: |
| |
| CPU 1 CPU 2 |
| =============================== =============================== |
| down_xxx() |
| Queue waiter |
| Sleep |
| up_yyy() |
| LOAD waiter->task; |
| STORE waiter->task; |
| Woken up by other event |
| <preempt> |
| Resume processing |
| down_xxx() returns |
| call foo() |
| foo() clobbers *waiter |
| </preempt> |
| LOAD waiter->list.next; |
| --- OOPS --- |
| |
| This could be dealt with using the semaphore lock, but then the down_xxx() |
| function has to needlessly get the spinlock again after being woken up. |
| |
| The way to deal with this is to insert a general SMP memory barrier: |
| |
| LOAD waiter->list.next; |
| LOAD waiter->task; |
| smp_mb(); |
| STORE waiter->task; |
| CALL wakeup |
| RELEASE task |
| |
| In this case, the barrier makes a guarantee that all memory accesses before the |
| barrier will appear to happen before all the memory accesses after the barrier |
| with respect to the other CPUs on the system. It does _not_ guarantee that all |
| the memory accesses before the barrier will be complete by the time the barrier |
| instruction itself is complete. |
| |
| On a UP system - where this wouldn't be a problem - the smp_mb() is just a |
| compiler barrier, thus making sure the compiler emits the instructions in the |
| right order without actually intervening in the CPU. Since there there's only |
| one CPU, that CPU's dependency ordering logic will take care of everything |
| else. |
| |
| |
| ATOMIC OPERATIONS |
| ----------------- |
| |
| Though they are technically interprocessor interaction considerations, atomic |
| operations are noted specially as they do _not_ generally imply memory |
| barriers. The possible offenders include: |
| |
| xchg(); |
| cmpxchg(); |
| test_and_set_bit(); |
| test_and_clear_bit(); |
| test_and_change_bit(); |
| atomic_cmpxchg(); |
| atomic_inc_return(); |
| atomic_dec_return(); |
| atomic_add_return(); |
| atomic_sub_return(); |
| atomic_inc_and_test(); |
| atomic_dec_and_test(); |
| atomic_sub_and_test(); |
| atomic_add_negative(); |
| atomic_add_unless(); |
| |
| These may be used for such things as implementing LOCK operations or controlling |
| the lifetime of objects by decreasing their reference counts. In such cases |
| they need preceding memory barriers. |
| |
| The following may also be possible offenders as they may be used as UNLOCK |
| operations. |
| |
| set_bit(); |
| clear_bit(); |
| change_bit(); |
| atomic_set(); |
| |
| |
| The following are a little tricky: |
| |
| atomic_add(); |
| atomic_sub(); |
| atomic_inc(); |
| atomic_dec(); |
| |
| If they're used for statistics generation, then they probably don't need memory |
| barriers, unless there's a coupling between statistical data. |
| |
| If they're used for reference counting on an object to control its lifetime, |
| they probably don't need memory barriers because either the reference count |
| will be adjusted inside a locked section, or the caller will already hold |
| sufficient references to make the lock, and thus a memory barrier unnecessary. |
| |
| If they're used for constructing a lock of some description, then they probably |
| do need memory barriers as a lock primitive generally has to do things in a |
| specific order. |
| |
| |
| Basically, each usage case has to be carefully considered as to whether memory |
| barriers are needed or not. The simplest rule is probably: if the atomic |
| operation is protected by a lock, then it does not require a barrier unless |
| there's another operation within the critical section with respect to which an |
| ordering must be maintained. |
| |
| See Documentation/atomic_ops.txt for more information. |
| |
| |
| ACCESSING DEVICES |
| ----------------- |
| |
| Many devices can be memory mapped, and so appear to the CPU as if they're just |
| a set of memory locations. To control such a device, the driver usually has to |
| make the right memory accesses in exactly the right order. |
| |
| However, having a clever CPU or a clever compiler creates a potential problem |
| in that the carefully sequenced accesses in the driver code won't reach the |
| device in the requisite order if the CPU or the compiler thinks it is more |
| efficient to reorder, combine or merge accesses - something that would cause |
| the device to malfunction. |
| |
| Inside of the Linux kernel, I/O should be done through the appropriate accessor |
| routines - such as inb() or writel() - which know how to make such accesses |
| appropriately sequential. Whilst this, for the most part, renders the explicit |
| use of memory barriers unnecessary, there are a couple of situations where they |
| might be needed: |
| |
| (1) On some systems, I/O stores are not strongly ordered across all CPUs, and |
| so for _all_ general drivers locks should be used and mmiowb() must be |
| issued prior to unlocking the critical section. |
| |
| (2) If the accessor functions are used to refer to an I/O memory window with |
| relaxed memory access properties, then _mandatory_ memory barriers are |
| required to enforce ordering. |
| |
| See Documentation/DocBook/deviceiobook.tmpl for more information. |
| |
| |
| INTERRUPTS |
| ---------- |
| |
| A driver may be interrupted by its own interrupt service routine, and thus the |
| two parts of the driver may interfere with each other's attempts to control or |
| access the device. |
| |
| This may be alleviated - at least in part - by disabling local interrupts (a |
| form of locking), such that the critical operations are all contained within |
| the interrupt-disabled section in the driver. Whilst the driver's interrupt |
| routine is executing, the driver's core may not run on the same CPU, and its |
| interrupt is not permitted to happen again until the current interrupt has been |
| handled, thus the interrupt handler does not need to lock against that. |
| |
| However, consider a driver that was talking to an ethernet card that sports an |
| address register and a data register. If that driver's core talks to the card |
| under interrupt-disablement and then the driver's interrupt handler is invoked: |
| |
| LOCAL IRQ DISABLE |
| writew(ADDR, 3); |
| writew(DATA, y); |
| LOCAL IRQ ENABLE |
| <interrupt> |
| writew(ADDR, 4); |
| q = readw(DATA); |
| </interrupt> |
| |
| The store to the data register might happen after the second store to the |
| address register if ordering rules are sufficiently relaxed: |
| |
| STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA |
| |
| |
| If ordering rules are relaxed, it must be assumed that accesses done inside an |
| interrupt disabled section may leak outside of it and may interleave with |
| accesses performed in an interrupt - and vice versa - unless implicit or |
| explicit barriers are used. |
| |
| Normally this won't be a problem because the I/O accesses done inside such |
| sections will include synchronous load operations on strictly ordered I/O |
| registers that form implicit I/O barriers. If this isn't sufficient then an |
| mmiowb() may need to be used explicitly. |
| |
| |
| A similar situation may occur between an interrupt routine and two routines |
| running on separate CPUs that communicate with each other. If such a case is |
| likely, then interrupt-disabling locks should be used to guarantee ordering. |
| |
| |
| ========================== |
| KERNEL I/O BARRIER EFFECTS |
| ========================== |
| |
| When accessing I/O memory, drivers should use the appropriate accessor |
| functions: |
| |
| (*) inX(), outX(): |
| |
| These are intended to talk to I/O space rather than memory space, but |
| that's primarily a CPU-specific concept. The i386 and x86_64 processors do |
| indeed have special I/O space access cycles and instructions, but many |
| CPUs don't have such a concept. |
| |
| The PCI bus, amongst others, defines an I/O space concept - which on such |
| CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O |
| space. However, it may also mapped as a virtual I/O space in the CPU's |
| memory map, particularly on those CPUs that don't support alternate |
| I/O spaces. |
| |
| Accesses to this space may be fully synchronous (as on i386), but |
| intermediary bridges (such as the PCI host bridge) may not fully honour |
| that. |
| |
| They are guaranteed to be fully ordered with respect to each other. |
| |
| They are not guaranteed to be fully ordered with respect to other types of |
| memory and I/O operation. |
| |
| (*) readX(), writeX(): |
| |
| Whether these are guaranteed to be fully ordered and uncombined with |
| respect to each other on the issuing CPU depends on the characteristics |
| defined for the memory window through which they're accessing. On later |
| i386 architecture machines, for example, this is controlled by way of the |
| MTRR registers. |
| |
| Ordinarily, these will be guaranteed to be fully ordered and uncombined,, |
| provided they're not accessing a prefetchable device. |
| |
| However, intermediary hardware (such as a PCI bridge) may indulge in |
| deferral if it so wishes; to flush a store, a load from the same location |
| is preferred[*], but a load from the same device or from configuration |
| space should suffice for PCI. |
| |
| [*] NOTE! attempting to load from the same location as was written to may |
| cause a malfunction - consider the 16550 Rx/Tx serial registers for |
| example. |
| |
| Used with prefetchable I/O memory, an mmiowb() barrier may be required to |
| force stores to be ordered. |
| |
| Please refer to the PCI specification for more information on interactions |
| between PCI transactions. |
| |
| (*) readX_relaxed() |
| |
| These are similar to readX(), but are not guaranteed to be ordered in any |
| way. Be aware that there is no I/O read barrier available. |
| |
| (*) ioreadX(), iowriteX() |
| |
| These will perform as appropriate for the type of access they're actually |
| doing, be it inX()/outX() or readX()/writeX(). |
| |
| |
| ======================================== |
| ASSUMED MINIMUM EXECUTION ORDERING MODEL |
| ======================================== |
| |
| It has to be assumed that the conceptual CPU is weakly-ordered but that it will |
| maintain the appearance of program causality with respect to itself. Some CPUs |
| (such as i386 or x86_64) are more constrained than others (such as powerpc or |
| frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside |
| of arch-specific code. |
| |
| This means that it must be considered that the CPU will execute its instruction |
| stream in any order it feels like - or even in parallel - provided that if an |
| instruction in the stream depends on the an earlier instruction, then that |
| earlier instruction must be sufficiently complete[*] before the later |
| instruction may proceed; in other words: provided that the appearance of |
| causality is maintained. |
| |
| [*] Some instructions have more than one effect - such as changing the |
| condition codes, changing registers or changing memory - and different |
| instructions may depend on different effects. |
| |
| A CPU may also discard any instruction sequence that winds up having no |
| ultimate effect. For example, if two adjacent instructions both load an |
| immediate value into the same register, the first may be discarded. |
| |
| |
| Similarly, it has to be assumed that compiler might reorder the instruction |
| stream in any way it sees fit, again provided the appearance of causality is |
| maintained. |
| |
| |
| ============================ |
| THE EFFECTS OF THE CPU CACHE |
| ============================ |
| |
| The way cached memory operations are perceived across the system is affected to |
| a certain extent by the caches that lie between CPUs and memory, and by the |
| memory coherence system that maintains the consistency of state in the system. |
| |
| As far as the way a CPU interacts with another part of the system through the |
| caches goes, the memory system has to include the CPU's caches, and memory |
| barriers for the most part act at the interface between the CPU and its cache |
| (memory barriers logically act on the dotted line in the following diagram): |
| |
| <--- CPU ---> : <----------- Memory -----------> |
| : |
| +--------+ +--------+ : +--------+ +-----------+ |
| | | | | : | | | | +--------+ |
| | CPU | | Memory | : | CPU | | | | | |
| | Core |--->| Access |----->| Cache |<-->| | | | |
| | | | Queue | : | | | |--->| Memory | |
| | | | | : | | | | | | |
| +--------+ +--------+ : +--------+ | | | | |
| : | Cache | +--------+ |
| : | Coherency | |
| : | Mechanism | +--------+ |
| +--------+ +--------+ : +--------+ | | | | |
| | | | | : | | | | | | |
| | CPU | | Memory | : | CPU | | |--->| Device | |
| | Core |--->| Access |----->| Cache |<-->| | | | |
| | | | Queue | : | | | | | | |
| | | | | : | | | | +--------+ |
| +--------+ +--------+ : +--------+ +-----------+ |
| : |
| : |
| |
| Although any particular load or store may not actually appear outside of the |
| CPU that issued it since it may have been satisfied within the CPU's own cache, |
| it will still appear as if the full memory access had taken place as far as the |
| other CPUs are concerned since the cache coherency mechanisms will migrate the |
| cacheline over to the accessing CPU and propagate the effects upon conflict. |
| |
| The CPU core may execute instructions in any order it deems fit, provided the |
| expected program causality appears to be maintained. Some of the instructions |
| generate load and store operations which then go into the queue of memory |
| accesses to be performed. The core may place these in the queue in any order |
| it wishes, and continue execution until it is forced to wait for an instruction |
| to complete. |
| |
| What memory barriers are concerned with is controlling the order in which |
| accesses cross from the CPU side of things to the memory side of things, and |
| the order in which the effects are perceived to happen by the other observers |
| in the system. |
| |
| [!] Memory barriers are _not_ needed within a given CPU, as CPUs always see |
| their own loads and stores as if they had happened in program order. |
| |
| [!] MMIO or other device accesses may bypass the cache system. This depends on |
| the properties of the memory window through which devices are accessed and/or |
| the use of any special device communication instructions the CPU may have. |
| |
| |
| CACHE COHERENCY |
| --------------- |
| |
| Life isn't quite as simple as it may appear above, however: for while the |
| caches are expected to be coherent, there's no guarantee that that coherency |
| will be ordered. This means that whilst changes made on one CPU will |
| eventually become visible on all CPUs, there's no guarantee that they will |
| become apparent in the same order on those other CPUs. |
| |
| |
| Consider dealing with a system that has pair of CPUs (1 & 2), each of which has |
| a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D): |
| |
| : |
| : +--------+ |
| : +---------+ | | |
| +--------+ : +--->| Cache A |<------->| | |
| | | : | +---------+ | | |
| | CPU 1 |<---+ | | |
| | | : | +---------+ | | |
| +--------+ : +--->| Cache B |<------->| | |
| : +---------+ | | |
| : | Memory | |
| : +---------+ | System | |
| +--------+ : +--->| Cache C |<------->| | |
| | | : | +---------+ | | |
| | CPU 2 |<---+ | | |
| | | : | +---------+ | | |
| +--------+ : +--->| Cache D |<------->| | |
| : +---------+ | | |
| : +--------+ |
| : |
| |
| Imagine the system has the following properties: |
| |
| (*) an odd-numbered cache line may be in cache A, cache C or it may still be |
| resident in memory; |
| |
| (*) an even-numbered cache line may be in cache B, cache D or it may still be |
| resident in memory; |
| |
| (*) whilst the CPU core is interrogating one cache, the other cache may be |
| making use of the bus to access the rest of the system - perhaps to |
| displace a dirty cacheline or to do a speculative load; |
| |
| (*) each cache has a queue of operations that need to be applied to that cache |
| to maintain coherency with the rest of the system; |
| |
| (*) the coherency queue is not flushed by normal loads to lines already |
| present in the cache, even though the contents of the queue may |
| potentially effect those loads. |
| |
| Imagine, then, that two writes are made on the first CPU, with a write barrier |
| between them to guarantee that they will appear to reach that CPU's caches in |
| the requisite order: |
| |
| CPU 1 CPU 2 COMMENT |
| =============== =============== ======================================= |
| u == 0, v == 1 and p == &u, q == &u |
| v = 2; |
| smp_wmb(); Make sure change to v visible before |
| change to p |
| <A:modify v=2> v is now in cache A exclusively |
| p = &v; |
| <B:modify p=&v> p is now in cache B exclusively |
| |
| The write memory barrier forces the other CPUs in the system to perceive that |
| the local CPU's caches have apparently been updated in the correct order. But |
| now imagine that the second CPU that wants to read those values: |
| |
| CPU 1 CPU 2 COMMENT |
| =============== =============== ======================================= |
| ... |
| q = p; |
| x = *q; |
| |
| The above pair of reads may then fail to happen in expected order, as the |
| cacheline holding p may get updated in one of the second CPU's caches whilst |
| the update to the cacheline holding v is delayed in the other of the second |
| CPU's caches by some other cache event: |
| |
| CPU 1 CPU 2 COMMENT |
| =============== =============== ======================================= |
| u == 0, v == 1 and p == &u, q == &u |
| v = 2; |
| smp_wmb(); |
| <A:modify v=2> <C:busy> |
| <C:queue v=2> |
| p = &b; q = p; |
| <D:request p> |
| <B:modify p=&v> <D:commit p=&v> |
| <D:read p> |
| x = *q; |
| <C:read *q> Reads from v before v updated in cache |
| <C:unbusy> |
| <C:commit v=2> |
| |
| Basically, whilst both cachelines will be updated on CPU 2 eventually, there's |
| no guarantee that, without intervention, the order of update will be the same |
| as that committed on CPU 1. |
| |
| |
| To intervene, we need to interpolate a data dependency barrier or a read |
| barrier between the loads. This will force the cache to commit its coherency |
| queue before processing any further requests: |
| |
| CPU 1 CPU 2 COMMENT |
| =============== =============== ======================================= |
| u == 0, v == 1 and p == &u, q == &u |
| v = 2; |
| smp_wmb(); |
| <A:modify v=2> <C:busy> |
| <C:queue v=2> |
| p = &b; q = p; |
| <D:request p> |
| <B:modify p=&v> <D:commit p=&v> |
| <D:read p> |
| smp_read_barrier_depends() |
| <C:unbusy> |
| <C:commit v=2> |
| x = *q; |
| <C:read *q> Reads from v after v updated in cache |
| |
| |
| This sort of problem can be encountered on DEC Alpha processors as they have a |
| split cache that improves performance by making better use of the data bus. |
| Whilst most CPUs do imply a data dependency barrier on the read when a memory |
| access depends on a read, not all do, so it may not be relied on. |
| |
| Other CPUs may also have split caches, but must coordinate between the various |
| cachelets for normal memory accesss. The semantics of the Alpha removes the |
| need for coordination in absence of memory barriers. |
| |
| |
| CACHE COHERENCY VS DMA |
| ---------------------- |
| |
| Not all systems maintain cache coherency with respect to devices doing DMA. In |
| such cases, a device attempting DMA may obtain stale data from RAM because |
| dirty cache lines may be resident in the caches of various CPUs, and may not |
| have been written back to RAM yet. To deal with this, the appropriate part of |
| the kernel must flush the overlapping bits of cache on each CPU (and maybe |
| invalidate them as well). |
| |
| In addition, the data DMA'd to RAM by a device may be overwritten by dirty |
| cache lines being written back to RAM from a CPU's cache after the device has |
| installed its own data, or cache lines simply present in a CPUs cache may |
| simply obscure the fact that RAM has been updated, until at such time as the |
| cacheline is discarded from the CPU's cache and reloaded. To deal with this, |
| the appropriate part of the kernel must invalidate the overlapping bits of the |
| cache on each CPU. |
| |
| See Documentation/cachetlb.txt for more information on cache management. |
| |
| |
| CACHE COHERENCY VS MMIO |
| ----------------------- |
| |
| Memory mapped I/O usually takes place through memory locations that are part of |
| a window in the CPU's memory space that have different properties assigned than |
| the usual RAM directed window. |
| |
| Amongst these properties is usually the fact that such accesses bypass the |
| caching entirely and go directly to the device buses. This means MMIO accesses |
| may, in effect, overtake accesses to cached memory that were emitted earlier. |
| A memory barrier isn't sufficient in such a case, but rather the cache must be |
| flushed between the cached memory write and the MMIO access if the two are in |
| any way dependent. |
| |
| |
| ========================= |
| THE THINGS CPUS GET UP TO |
| ========================= |
| |
| A programmer might take it for granted that the CPU will perform memory |
| operations in exactly the order specified, so that if a CPU is, for example, |
| given the following piece of code to execute: |
| |
| a = *A; |
| *B = b; |
| c = *C; |
| d = *D; |
| *E = e; |
| |
| They would then expect that the CPU will complete the memory operation for each |
| instruction before moving on to the next one, leading to a definite sequence of |
| operations as seen by external observers in the system: |
| |
| LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E. |
| |
| |
| Reality is, of course, much messier. With many CPUs and compilers, the above |
| assumption doesn't hold because: |
| |
| (*) loads are more likely to need to be completed immediately to permit |
| execution progress, whereas stores can often be deferred without a |
| problem; |
| |
| (*) loads may be done speculatively, and the result discarded should it prove |
| to have been unnecessary; |
| |
| (*) loads may be done speculatively, leading to the result having being |
| fetched at the wrong time in the expected sequence of events; |
| |
| (*) the order of the memory accesses may be rearranged to promote better use |
| of the CPU buses and caches; |
| |
| (*) loads and stores may be combined to improve performance when talking to |
| memory or I/O hardware that can do batched accesses of adjacent locations, |
| thus cutting down on transaction setup costs (memory and PCI devices may |
| both be able to do this); and |
| |
| (*) the CPU's data cache may affect the ordering, and whilst cache-coherency |
| mechanisms may alleviate this - once the store has actually hit the cache |
| - there's no guarantee that the coherency management will be propagated in |
| order to other CPUs. |
| |
| So what another CPU, say, might actually observe from the above piece of code |
| is: |
| |
| LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B |
| |
| (Where "LOAD {*C,*D}" is a combined load) |
| |
| |
| However, it is guaranteed that a CPU will be self-consistent: it will see its |
| _own_ accesses appear to be correctly ordered, without the need for a memory |
| barrier. For instance with the following code: |
| |
| U = *A; |
| *A = V; |
| *A = W; |
| X = *A; |
| *A = Y; |
| Z = *A; |
| |
| and assuming no intervention by an external influence, it can be assumed that |
| the final result will appear to be: |
| |
| U == the original value of *A |
| X == W |
| Z == Y |
| *A == Y |
| |
| The code above may cause the CPU to generate the full sequence of memory |
| accesses: |
| |
| U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A |
| |
| in that order, but, without intervention, the sequence may have almost any |
| combination of elements combined or discarded, provided the program's view of |
| the world remains consistent. |
| |
| The compiler may also combine, discard or defer elements of the sequence before |
| the CPU even sees them. |
| |
| For instance: |
| |
| *A = V; |
| *A = W; |
| |
| may be reduced to: |
| |
| *A = W; |
| |
| since, without a write barrier, it can be assumed that the effect of the |
| storage of V to *A is lost. Similarly: |
| |
| *A = Y; |
| Z = *A; |
| |
| may, without a memory barrier, be reduced to: |
| |
| *A = Y; |
| Z = Y; |
| |
| and the LOAD operation never appear outside of the CPU. |
| |
| |
| AND THEN THERE'S THE ALPHA |
| -------------------------- |
| |
| The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that, |
| some versions of the Alpha CPU have a split data cache, permitting them to have |
| two semantically related cache lines updating at separate times. This is where |
| the data dependency barrier really becomes necessary as this synchronises both |
| caches with the memory coherence system, thus making it seem like pointer |
| changes vs new data occur in the right order. |
| |
| The Alpha defines the Linux's kernel's memory barrier model. |
| |
| See the subsection on "Cache Coherency" above. |
| |
| |
| ========== |
| REFERENCES |
| ========== |
| |
| Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek, |
| Digital Press) |
| Chapter 5.2: Physical Address Space Characteristics |
| Chapter 5.4: Caches and Write Buffers |
| Chapter 5.5: Data Sharing |
| Chapter 5.6: Read/Write Ordering |
| |
| AMD64 Architecture Programmer's Manual Volume 2: System Programming |
| Chapter 7.1: Memory-Access Ordering |
| Chapter 7.4: Buffering and Combining Memory Writes |
| |
| IA-32 Intel Architecture Software Developer's Manual, Volume 3: |
| System Programming Guide |
| Chapter 7.1: Locked Atomic Operations |
| Chapter 7.2: Memory Ordering |
| Chapter 7.4: Serializing Instructions |
| |
| The SPARC Architecture Manual, Version 9 |
| Chapter 8: Memory Models |
| Appendix D: Formal Specification of the Memory Models |
| Appendix J: Programming with the Memory Models |
| |
| UltraSPARC Programmer Reference Manual |
| Chapter 5: Memory Accesses and Cacheability |
| Chapter 15: Sparc-V9 Memory Models |
| |
| UltraSPARC III Cu User's Manual |
| Chapter 9: Memory Models |
| |
| UltraSPARC IIIi Processor User's Manual |
| Chapter 8: Memory Models |
| |
| UltraSPARC Architecture 2005 |
| Chapter 9: Memory |
| Appendix D: Formal Specifications of the Memory Models |
| |
| UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005 |
| Chapter 8: Memory Models |
| Appendix F: Caches and Cache Coherency |
| |
| Solaris Internals, Core Kernel Architecture, p63-68: |
| Chapter 3.3: Hardware Considerations for Locks and |
| Synchronization |
| |
| Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching |
| for Kernel Programmers: |
| Chapter 13: Other Memory Models |
| |
| Intel Itanium Architecture Software Developer's Manual: Volume 1: |
| Section 2.6: Speculation |
| Section 4.4: Memory Access |