Please note that the "What is RCU?" LWN series is an excellent place
to start learning about RCU:

1.	What is RCU, Fundamentally?  http://lwn.net/Articles/262464/
2.	What is RCU? Part 2: Usage   http://lwn.net/Articles/263130/
3.	RCU part 3: the RCU API      http://lwn.net/Articles/264090/
4.	The RCU API, 2010 Edition    http://lwn.net/Articles/418853/
5.	The RCU API, 2014 Edition    http://lwn.net/Articles/609904/


What is RCU?

RCU is a synchronization mechanism, added to the Linux kernel during
the 2.5 development effort, that is optimized for read-mostly
situations.  Although RCU is actually quite simple once you understand
it, getting there can sometimes be a challenge.  Part of the problem is
that most of the past descriptions of RCU have been written with the
mistaken assumption that there is "one true way" to describe RCU.
Instead, the experience has been that different people must take
different paths to arrive at an understanding of RCU.  This document
provides several different paths, as follows:

1.	RCU OVERVIEW
2.	WHAT IS RCU'S CORE API?
3.	WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
4.	WHAT IF MY UPDATING THREAD CANNOT BLOCK?
5.	WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
6.	ANALOGY WITH READER-WRITER LOCKING
7.	FULL LIST OF RCU APIs
8.	ANSWERS TO QUICK QUIZZES

People who prefer starting with a conceptual overview should focus on
Section 1, though most readers will profit by reading this section at
some point.  People who prefer to start with an API that they can then
experiment with should focus on Section 2.  People who prefer to start
with example uses should focus on Sections 3 and 4.  People who need to
understand the RCU implementation should focus on Section 5, then dive
into the kernel source code.  People who reason best by analogy should
focus on Section 6.  Section 7 serves as an index to the docbook API
documentation, and Section 8 is the traditional answer key.

So, start with the section that makes the most sense to you and your
preferred method of learning.  If you need to know everything about
everything, feel free to read the whole thing -- but if you are really
that type of person, you have perused the source code and will therefore
never need this document anyway.  ;-)


1.  RCU OVERVIEW

The basic idea behind RCU is to split updates into "removal" and
"reclamation" phases.  The removal phase removes references to data items
within a data structure (possibly by replacing them with references to
new versions of these data items), and can run concurrently with readers.
The reason that it is safe to run the removal phase concurrently with
readers is that the semantics of modern CPUs guarantee that readers will
see either the old or the new version of the data structure rather than
a partially updated reference.  The reclamation phase does the work of
reclaiming (e.g., freeing) the data items removed from the data
structure during the removal phase.  Because reclaiming data items can
disrupt any readers concurrently referencing those data items, the
reclamation phase must not start until readers no longer hold references
to those data items.

Splitting the update into removal and reclamation phases permits the
updater to perform the removal phase immediately, and to defer the
reclamation phase until all readers active during the removal phase have
completed, either by blocking until they finish or by registering a
callback that is invoked after they finish.  Only readers that are active
during the removal phase need be considered, because any reader starting
after the removal phase will be unable to gain a reference to the removed
data items, and therefore cannot be disrupted by the reclamation phase.

So the typical RCU update sequence goes something like the following:

a.	Remove pointers to a data structure, so that subsequent
	readers cannot gain a reference to it.

b.	Wait for all previous readers to complete their RCU read-side
	critical sections.

c.	At this point, there cannot be any readers who hold references
	to the data structure, so it now may safely be reclaimed
	(e.g., kfree()d).

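In kernel code, steps (a)-(c) often take the following shape (a minimal
sketch, assuming a hypothetical element "p" already linked into an
RCU-protected list):

	/* Step (a): unlink the element so new readers cannot find it. */
	list_del_rcu(&p->list);

	/* Step (b): wait for all pre-existing readers to finish. */
	synchronize_rcu();

	/* Step (c): no reader can still hold a reference, so reclaim. */
	kfree(p);
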
Step (b) above is the key idea underlying RCU's deferred destruction.
The ability to wait until all readers are done allows RCU readers to
use much lighter-weight synchronization, in some cases, absolutely no
synchronization at all.  In contrast, in more conventional lock-based
schemes, readers must use heavy-weight synchronization in order to
prevent an updater from deleting the data structure out from under them.
This is because lock-based updaters typically update data items in place,
and must therefore exclude readers.  In contrast, RCU-based updaters
typically take advantage of the fact that writes to single aligned
pointers are atomic on modern CPUs, allowing atomic insertion, removal,
and replacement of data items in a linked structure without disrupting
readers.  Concurrent RCU readers can then continue accessing the old
versions, and can dispense with the atomic operations, memory barriers,
and communications cache misses that are so expensive on present-day
SMP computer systems, even in the absence of lock contention.

In the three-step procedure shown above, the updater is performing both
the removal and the reclamation step, but it is often helpful for an
entirely different thread to do the reclamation, as is in fact the case
in the Linux kernel's directory-entry cache (dcache).  Even if the same
thread performs both the update step (step (a) above) and the reclamation
step (step (c) above), it is often helpful to think of them separately.
For example, RCU readers and updaters need not communicate at all,
but RCU provides implicit low-overhead communication between readers
and reclaimers, namely, in step (b) above.

So how the heck can a reclaimer tell when a reader is done, given
that readers are not doing any sort of synchronization operations???
Read on to learn about how RCU's API makes this easy.


2.  WHAT IS RCU'S CORE API?

The core RCU API is quite small:

a.	rcu_read_lock()
b.	rcu_read_unlock()
c.	synchronize_rcu() / call_rcu()
d.	rcu_assign_pointer()
e.	rcu_dereference()

There are many other members of the RCU API, but the rest can be
expressed in terms of these five, though most implementations instead
express synchronize_rcu() in terms of the call_rcu() callback API.

The five core RCU APIs are described below, the other 18 will be
enumerated later.  See the kernel docbook documentation for more info,
or look directly at the function header comments.

rcu_read_lock()

	void rcu_read_lock(void);

	Used by a reader to inform the reclaimer that the reader is
	entering an RCU read-side critical section.  It is illegal
	to block while in an RCU read-side critical section, though
	kernels built with CONFIG_PREEMPT_RCU can preempt RCU
	read-side critical sections.  Any RCU-protected data structure
	accessed during an RCU read-side critical section is guaranteed to
	remain unreclaimed for the full duration of that critical section.
	Reference counts may be used in conjunction with RCU to maintain
	longer-term references to data structures.

rcu_read_unlock()

	void rcu_read_unlock(void);

	Used by a reader to inform the reclaimer that the reader is
	exiting an RCU read-side critical section.  Note that RCU
	read-side critical sections may be nested and/or overlapping.

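	Taken together, rcu_read_lock() and rcu_read_unlock() bracket
	the reader as in the following sketch (gbl_ptr and
	do_something_with() are hypothetical names):

		rcu_read_lock();
		p = rcu_dereference(gbl_ptr);
		if (p != NULL)
			do_something_with(p->data);
		rcu_read_unlock();
		/* p must not be dereferenced after this point. */
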
synchronize_rcu()

	void synchronize_rcu(void);

	Marks the end of updater code and the beginning of reclaimer
	code.  It does this by blocking until all pre-existing RCU
	read-side critical sections on all CPUs have completed.
	Note that synchronize_rcu() will -not- necessarily wait for
	any subsequent RCU read-side critical sections to complete.
	For example, consider the following sequence of events:

	         CPU 0                 CPU 1                  CPU 2
	     -----------------   -------------------------   ---------------
	 1.  rcu_read_lock()
	 2.                      enters synchronize_rcu()
	 3.                                                  rcu_read_lock()
	 4.  rcu_read_unlock()
	 5.                      exits synchronize_rcu()
	 6.                                                  rcu_read_unlock()

	To reiterate, synchronize_rcu() waits only for ongoing RCU
	read-side critical sections to complete, not necessarily for
	any that begin after synchronize_rcu() is invoked.

	Of course, synchronize_rcu() does not necessarily return
	-immediately- after the last pre-existing RCU read-side critical
	section completes.  For one thing, there might well be scheduling
	delays.  For another thing, many RCU implementations process
	requests in batches in order to improve efficiencies, which can
	further delay synchronize_rcu().

	Since synchronize_rcu() is the API that must figure out when
	readers are done, its implementation is key to RCU.  For RCU
	to be useful in all but the most read-intensive situations,
	synchronize_rcu()'s overhead must also be quite small.

	The call_rcu() API is a callback form of synchronize_rcu(),
	and is described in more detail in a later section.  Instead of
	blocking, it registers a function and argument which are invoked
	after all ongoing RCU read-side critical sections have completed.
	This callback variant is particularly useful in situations where
	it is illegal to block or where update-side performance is
	critically important.

	However, the call_rcu() API should not be used lightly, as use
	of the synchronize_rcu() API generally results in simpler code.
	In addition, the synchronize_rcu() API has the nice property
	of automatically limiting update rate should grace periods
	be delayed.  This property results in system resilience in the
	face of denial-of-service attacks.  Code using call_rcu() should
	limit update rate in order to gain this same sort of resilience.
	See checklist.txt for some approaches to limiting the update rate.

rcu_assign_pointer()

	typeof(p) rcu_assign_pointer(p, typeof(p) v);

	Yes, rcu_assign_pointer() -is- implemented as a macro, though it
	would be cool to be able to declare a function in this manner.
	(Compiler experts will no doubt disagree.)

	The updater uses this function to assign a new value to an
	RCU-protected pointer, in order to safely communicate the change
	in value from the updater to the reader.  This function returns
	the new value, and also executes any memory-barrier instructions
	required for a given CPU architecture.

	Perhaps just as important, it serves to document (1) which
	pointers are protected by RCU and (2) the point at which a
	given structure becomes accessible to other CPUs.  That said,
	rcu_assign_pointer() is most frequently used indirectly, via
	the _rcu list-manipulation primitives such as list_add_rcu().

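	For example, an updater might fully initialize a new structure
	and only then make it visible (a minimal sketch; gbl_ptr is a
	hypothetical RCU-protected pointer):

		p = kmalloc(sizeof(*p), GFP_KERNEL);
		p->a = 1;
		p->b = 2;
		/* The initialization above is guaranteed to be seen
		 * by any reader that sees the new value of gbl_ptr. */
		rcu_assign_pointer(gbl_ptr, p);
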
rcu_dereference()

	typeof(p) rcu_dereference(p);

	Like rcu_assign_pointer(), rcu_dereference() must be implemented
	as a macro.

	The reader uses rcu_dereference() to fetch an RCU-protected
	pointer, which returns a value that may then be safely
	dereferenced.  Note that rcu_dereference() does not actually
	dereference the pointer, instead, it protects the pointer for
	later dereferencing.  It also executes any needed memory-barrier
	instructions for a given CPU architecture.  Currently, only Alpha
	needs memory barriers within rcu_dereference() -- on other CPUs,
	it compiles to nothing, not even a compiler directive.

	Common coding practice uses rcu_dereference() to copy an
	RCU-protected pointer to a local variable, then dereferences
	this local variable, for example as follows:

		p = rcu_dereference(head.next);
		return p->data;

	However, in this case, one could just as easily combine these
	into one statement:

		return rcu_dereference(head.next)->data;

	If you are going to be fetching multiple fields from the
	RCU-protected structure, using the local variable is of
	course preferred.  Repeated rcu_dereference() calls look
	ugly, do not guarantee that the same pointer will be returned
	if an update happened while in the critical section, and incur
	unnecessary overhead on Alpha CPUs.

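	For example, when two fields are needed, fetch the pointer once
	into a local variable (a sketch assuming a hypothetical
	structure with fields "a" and "b"):

		rcu_read_lock();
		p = rcu_dereference(gbl_foo);
		a = p->a;	/* Both fields come from the same
		b = p->b;	 * version of the structure, which
				 * repeated rcu_dereference() calls
				 * would not guarantee. */
		rcu_read_unlock();
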
	Note that the value returned by rcu_dereference() is valid
	only within the enclosing RCU read-side critical section.
	For example, the following is -not- legal:

		rcu_read_lock();
		p = rcu_dereference(head.next);
		rcu_read_unlock();
		x = p->address;	/* BUG!!! */
		rcu_read_lock();
		y = p->data;	/* BUG!!! */
		rcu_read_unlock();

	Holding a reference from one RCU read-side critical section
	to another is just as illegal as holding a reference from
	one lock-based critical section to another!  Similarly,
	using a reference outside of the critical section in which
	it was acquired is just as illegal as doing so with normal
	locking.

	As with rcu_assign_pointer(), an important function of
	rcu_dereference() is to document which pointers are protected by
	RCU, in particular, flagging a pointer that is subject to changing
	at any time, including immediately after the rcu_dereference().
	And, again like rcu_assign_pointer(), rcu_dereference() is
	typically used indirectly, via the _rcu list-manipulation
	primitives, such as list_for_each_entry_rcu().

The following diagram shows how each API communicates among the
reader, updater, and reclaimer.


	    rcu_assign_pointer()
	                            +--------+
	    +---------------------->| reader |---------+
	    |                       +--------+         |
	    |                           |              |
	    |                           |              | Protect:
	    |                           |              | rcu_read_lock()
	    |                           |              | rcu_read_unlock()
	    |        rcu_dereference()  |              |
	    +---------+                 |              |
	    | updater |<----------------+              |
	    +---------+                                V
	         |                               +-----------+
	         +------------------------------>| reclaimer |
	                                         +-----------+
	                                           Defer:
	                                           synchronize_rcu() & call_rcu()


The RCU infrastructure observes the time sequence of rcu_read_lock(),
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
order to determine when (1) synchronize_rcu() invocations may return
to their callers and (2) call_rcu() callbacks may be invoked.  Efficient
implementations of the RCU infrastructure make heavy use of batching in
order to amortize their overhead over many uses of the corresponding APIs.

There are no fewer than three RCU mechanisms in the Linux kernel; the
diagram above shows the first one, which is by far the most commonly used.
The rcu_dereference() and rcu_assign_pointer() primitives are used for
all three mechanisms, but different defer and protect primitives are
used as follows:

	Defer			Protect

a.	synchronize_rcu()	rcu_read_lock() / rcu_read_unlock()
	call_rcu()		rcu_dereference()

b.	synchronize_rcu_bh()	rcu_read_lock_bh() / rcu_read_unlock_bh()
	call_rcu_bh()		rcu_dereference_bh()

c.	synchronize_sched()	rcu_read_lock_sched() / rcu_read_unlock_sched()
	call_rcu_sched()	preempt_disable() / preempt_enable()
				local_irq_save() / local_irq_restore()
				hardirq enter / hardirq exit
				NMI enter / NMI exit
				rcu_dereference_sched()

These three mechanisms are used as follows:

a.	RCU applied to normal data structures.

b.	RCU applied to networking data structures that may be subjected
	to remote denial-of-service attacks.

c.	RCU applied to scheduler and interrupt/NMI-handler tasks.

Again, most uses will be of (a).  The (b) and (c) cases are important
for specialized uses, but are relatively uncommon.

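As a sketch of case (b), a networking read path might use the _bh
variants in the same pattern as the ordinary primitives (gbl_table and
its fields are hypothetical names, not real kernel symbols):

	rcu_read_lock_bh();
	p = rcu_dereference_bh(gbl_table);
	if (p != NULL)
		count = p->pkt_count;
	rcu_read_unlock_bh();
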

3.  WHAT ARE SOME EXAMPLE USES OF CORE RCU API?

This section shows a simple use of the core RCU API to protect a
global pointer to a dynamically allocated structure.  More-typical
uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt.

	struct foo {
		int a;
		char b;
		long c;
	};
	DEFINE_SPINLOCK(foo_mutex);

	struct foo __rcu *gbl_foo;

	/*
	 * Create a new struct foo that is the same as the one currently
	 * pointed to by gbl_foo, except that field "a" is replaced
	 * with "new_a".  Points gbl_foo to the new structure, and
	 * frees up the old structure after a grace period.
	 *
	 * Uses rcu_assign_pointer() to ensure that concurrent readers
	 * see the initialized version of the new structure.
	 *
	 * Uses synchronize_rcu() to ensure that any readers that might
	 * have references to the old structure complete before freeing
	 * the old structure.
	 */
	void foo_update_a(int new_a)
	{
		struct foo *new_fp;
		struct foo *old_fp;

		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
		spin_lock(&foo_mutex);
		old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
		*new_fp = *old_fp;
		new_fp->a = new_a;
		rcu_assign_pointer(gbl_foo, new_fp);
		spin_unlock(&foo_mutex);
		synchronize_rcu();
		kfree(old_fp);
	}

	/*
	 * Return the value of field "a" of the current gbl_foo
	 * structure.  Use rcu_read_lock() and rcu_read_unlock()
	 * to ensure that the structure does not get deleted out
	 * from under us, and use rcu_dereference() to ensure that
	 * we see the initialized version of the structure (important
	 * for DEC Alpha and for people reading the code).
	 */
	int foo_get_a(void)
	{
		int retval;

		rcu_read_lock();
		retval = rcu_dereference(gbl_foo)->a;
		rcu_read_unlock();
		return retval;
	}

So, to sum up:

o	Use rcu_read_lock() and rcu_read_unlock() to guard RCU
	read-side critical sections.

o	Within an RCU read-side critical section, use rcu_dereference()
	to dereference RCU-protected pointers.

o	Use some solid scheme (such as locks or semaphores) to
	keep concurrent updates from interfering with each other.

o	Use rcu_assign_pointer() to update an RCU-protected pointer.
	This primitive protects concurrent readers from the updater,
	-not- concurrent updates from each other!  You therefore still
	need to use locking (or something similar) to keep concurrent
	rcu_assign_pointer() primitives from interfering with each other.

o	Use synchronize_rcu() -after- removing a data element from an
	RCU-protected data structure, but -before- reclaiming/freeing
	the data element, in order to wait for the completion of all
	RCU read-side critical sections that might be referencing that
	data item.

See checklist.txt for additional rules to follow when using RCU.
And again, more-typical uses of RCU may be found in listRCU.txt,
arrayRCU.txt, and NMI-RCU.txt.


4.  WHAT IF MY UPDATING THREAD CANNOT BLOCK?

In the example above, foo_update_a() blocks until a grace period elapses.
This is quite simple, but in some cases one cannot afford to wait so
long -- there might be other high-priority work to be done.

In such cases, one uses call_rcu() rather than synchronize_rcu().
The call_rcu() API is as follows:

	void call_rcu(struct rcu_head *head,
		      void (*func)(struct rcu_head *head));

This function invokes func(head) after a grace period has elapsed.
This invocation might happen from either softirq or process context,
so the function is not permitted to block.  The foo struct needs to
have an rcu_head structure added, perhaps as follows:

	struct foo {
		int a;
		char b;
		long c;
		struct rcu_head rcu;
	};

The foo_update_a() function might then be written as follows:

	/*
	 * Create a new struct foo that is the same as the one currently
	 * pointed to by gbl_foo, except that field "a" is replaced
	 * with "new_a".  Points gbl_foo to the new structure, and
	 * frees up the old structure after a grace period.
	 *
	 * Uses rcu_assign_pointer() to ensure that concurrent readers
	 * see the initialized version of the new structure.
	 *
	 * Uses call_rcu() to ensure that any readers that might have
	 * references to the old structure complete before freeing the
	 * old structure.
	 */
	void foo_update_a(int new_a)
	{
		struct foo *new_fp;
		struct foo *old_fp;

		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
		spin_lock(&foo_mutex);
		old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
		*new_fp = *old_fp;
		new_fp->a = new_a;
		rcu_assign_pointer(gbl_foo, new_fp);
		spin_unlock(&foo_mutex);
		call_rcu(&old_fp->rcu, foo_reclaim);
	}

The foo_reclaim() function might appear as follows:

	void foo_reclaim(struct rcu_head *rp)
	{
		struct foo *fp = container_of(rp, struct foo, rcu);

		foo_cleanup(fp->a);

		kfree(fp);
	}

The container_of() primitive is a macro that, given a pointer into a
struct, the type of the struct, and the pointed-to field within the
struct, returns a pointer to the beginning of the struct.

The use of call_rcu() permits the caller of foo_update_a() to
immediately regain control, without needing to worry further about the
old version of the newly updated element.  It also clearly shows the
RCU distinction between updater, namely foo_update_a(), and reclaimer,
namely foo_reclaim().

The summary of advice is the same as for the previous section, except
that we are now using call_rcu() rather than synchronize_rcu():

o	Use call_rcu() -after- removing a data element from an
	RCU-protected data structure in order to register a callback
	function that will be invoked after the completion of all RCU
	read-side critical sections that might be referencing that
	data item.

If the callback for call_rcu() is not doing anything more than calling
kfree() on the structure, you can use kfree_rcu() instead of call_rcu()
to avoid having to write your own callback:

	kfree_rcu(old_fp, rcu);

Again, see checklist.txt for additional rules governing the use of RCU.


5.  WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?

One of the nice things about RCU is that it has extremely simple "toy"
implementations that are a good first step towards understanding the
production-quality implementations in the Linux kernel.  This section
presents two such "toy" implementations of RCU, one that is implemented
in terms of familiar locking primitives, and another that more closely
resembles "classic" RCU.  Both are way too simple for real-world use,
lacking both functionality and performance.  However, they are useful
in getting a feel for how RCU works.  See kernel/rcupdate.c for a
production-quality implementation, and see:

	http://www.rdrop.com/users/paulmck/RCU

for papers describing the Linux kernel RCU implementation.  The OLS'01
and OLS'02 papers are a good introduction, and the dissertation provides
more details on the current implementation as of early 2004.


5A.  "TOY" IMPLEMENTATION #1: LOCKING

This section presents a "toy" RCU implementation that is based on
familiar locking primitives.  Its overhead makes it a non-starter for
real-life use, as does its lack of scalability.  It is also unsuitable
for realtime use, since it allows scheduling latency to "bleed" from
one read-side critical section to another.

However, it is probably the easiest implementation to relate to, so is
a good starting point.

It is extremely simple:

	static DEFINE_RWLOCK(rcu_gp_mutex);

	void rcu_read_lock(void)
	{
		read_lock(&rcu_gp_mutex);
	}

	void rcu_read_unlock(void)
	{
		read_unlock(&rcu_gp_mutex);
	}

	void synchronize_rcu(void)
	{
		write_lock(&rcu_gp_mutex);
		write_unlock(&rcu_gp_mutex);
	}

[You can ignore rcu_assign_pointer() and rcu_dereference() without
missing much.  But here they are anyway.  And whatever you do, don't
forget about them when submitting patches making use of RCU!]

	#define rcu_assign_pointer(p, v)	({ \
							smp_wmb(); \
							(p) = (v); \
						})

	#define rcu_dereference(p)	({ \
					typeof(p) _________p1 = p; \
					smp_read_barrier_depends(); \
					(_________p1); \
					})


The rcu_read_lock() and rcu_read_unlock() primitives read-acquire
and release a global reader-writer lock.  The synchronize_rcu()
primitive write-acquires this same lock, then immediately releases
it.  This means that once synchronize_rcu() exits, all RCU read-side
critical sections that were in progress before synchronize_rcu() was
called are guaranteed to have completed -- there is no way that
synchronize_rcu() would have been able to write-acquire the lock
otherwise.

It is possible to nest rcu_read_lock(), since reader-writer locks may
be recursively acquired.  Note also that rcu_read_lock() is immune
from deadlock (an important property of RCU).  The reason for this is
that the only thing that can block rcu_read_lock() is a synchronize_rcu().
But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
so there can be no deadlock cycle.

Quick Quiz #1:	Why is this argument naive?  How could a deadlock
		occur when using this algorithm in a real-world Linux
		kernel?  How could this deadlock be avoided?


5B.  "TOY" EXAMPLE #2: CLASSIC RCU

This section presents a "toy" RCU implementation that is based on
"classic RCU".  It is also short on performance (but only for updates) and
on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT
kernels.  The definitions of rcu_dereference() and rcu_assign_pointer()
are the same as those shown in the preceding section, so they are omitted.

	void rcu_read_lock(void) { }

	void rcu_read_unlock(void) { }

	void synchronize_rcu(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			run_on(cpu);
	}

Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
This is the great strength of classic RCU in a non-preemptive kernel:
read-side overhead is precisely zero, at least on non-Alpha CPUs.
And there is absolutely no way that rcu_read_lock() can possibly
participate in a deadlock cycle!

The implementation of synchronize_rcu() simply schedules itself on each
CPU in turn.  The run_on() primitive can be implemented straightforwardly
in terms of the sched_setaffinity() primitive.  Of course, a somewhat less
"toy" implementation would restore the affinity upon completion rather
than just leaving all tasks running on the last CPU, but when I said
"toy", I meant -toy-!

So how the heck is this supposed to work???

Remember that it is illegal to block while in an RCU read-side critical
section.  Therefore, if a given CPU executes a context switch, we know
that it must have completed all preceding RCU read-side critical sections.
Once -all- CPUs have executed a context switch, then -all- preceding
RCU read-side critical sections will have completed.

So, suppose that we remove a data item from its structure and then invoke
synchronize_rcu().  Once synchronize_rcu() returns, we are guaranteed
that there are no RCU read-side critical sections holding a reference
to that data item, so we can safely reclaim it.

Quick Quiz #2:	Give an example where Classic RCU's read-side
		overhead is -negative-.

Quick Quiz #3:	If it is illegal to block in an RCU read-side
		critical section, what the heck do you do in
		PREEMPT_RT, where normal spinlocks can block???


6.  ANALOGY WITH READER-WRITER LOCKING

Although RCU can be used in many different ways, a very common use of
RCU is analogous to reader-writer locking.  The following unified
diff shows how closely related RCU and reader-writer locking can be.

@@ -5,5 +5,5 @@ struct el {
 	int data;
 	/* Other data fields */
 };
-rwlock_t listmutex;
+spinlock_t listmutex;
 struct el head;

@@ -13,15 +14,15 @@
 	struct list_head *lp;
 	struct el *p;

-	read_lock(&listmutex);
-	list_for_each_entry(p, head, lp) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, head, lp) {
 		if (p->key == key) {
 			*result = p->data;
-			read_unlock(&listmutex);
+			rcu_read_unlock();
 			return 1;
 		}
 	}
-	read_unlock(&listmutex);
+	rcu_read_unlock();
 	return 0;
 }

@@ -29,15 +30,16 @@
 {
 	struct el *p;

-	write_lock(&listmutex);
+	spin_lock(&listmutex);
 	list_for_each_entry(p, head, lp) {
 		if (p->key == key) {
-			list_del(&p->list);
-			write_unlock(&listmutex);
+			list_del_rcu(&p->list);
+			spin_unlock(&listmutex);
+			synchronize_rcu();
 			kfree(p);
 			return 1;
 		}
 	}
-	write_unlock(&listmutex);
+	spin_unlock(&listmutex);
 	return 0;
 }

Or, for those who prefer a side-by-side listing:

 1 struct el {                          1 struct el {
 2   struct list_head list;             2   struct list_head list;
 3   long key;                          3   long key;
 4   spinlock_t mutex;                  4   spinlock_t mutex;
 5   int data;                          5   int data;
 6   /* Other data fields */            6   /* Other data fields */
 7 };                                   7 };
 8 rwlock_t listmutex;                  8 spinlock_t listmutex;
 9 struct el head;                      9 struct el head;

 1 int search(long key, int *result)    1 int search(long key, int *result)
 2 {                                    2 {
 3   struct list_head *lp;              3   struct list_head *lp;
 4   struct el *p;                      4   struct el *p;
 5                                      5
 6   read_lock(&listmutex);             6   rcu_read_lock();
 7   list_for_each_entry(p, head, lp) { 7   list_for_each_entry_rcu(p, head, lp) {
 8     if (p->key == key) {             8     if (p->key == key) {
 9       *result = p->data;             9       *result = p->data;
10       read_unlock(&listmutex);      10       rcu_read_unlock();
11       return 1;                     11       return 1;
12     }                               12     }
13   }                                 13   }
14   read_unlock(&listmutex);          14   rcu_read_unlock();
15   return 0;                         15   return 0;
16 }                                   16 }

 1 int delete(long key)                 1 int delete(long key)
 2 {                                    2 {
 3   struct el *p;                      3   struct el *p;
 4                                      4
 5   write_lock(&listmutex);            5   spin_lock(&listmutex);
 6   list_for_each_entry(p, head, lp) { 6   list_for_each_entry(p, head, lp) {
 7     if (p->key == key) {             7     if (p->key == key) {
 8       list_del(&p->list);            8       list_del_rcu(&p->list);
 9       write_unlock(&listmutex);      9       spin_unlock(&listmutex);
                                       10       synchronize_rcu();
10       kfree(p);                     11       kfree(p);
11       return 1;                     12       return 1;
12     }                               13     }
13   }                                 14   }
14   write_unlock(&listmutex);         15   spin_unlock(&listmutex);
15   return 0;                         16   return 0;
16 }                                   17 }


Either way, the differences are quite small.  Read-side locking moves
to rcu_read_lock() and rcu_read_unlock(), update-side locking moves from
a reader-writer lock to a simple spinlock, and a synchronize_rcu()
precedes the kfree().

However, there is one potential catch: the read-side and update-side
critical sections can now run concurrently.  In many cases, this will
not be a problem, but it is necessary to check carefully regardless.
For example, if multiple independent list updates must be seen as
a single atomic update, converting to RCU will require special care.

Also, the presence of synchronize_rcu() means that the RCU version of
delete() can now block.  If this is a problem, there is a callback-based
mechanism that never blocks, namely call_rcu() or kfree_rcu(), that can
be used in place of synchronize_rcu().


7.  FULL LIST OF RCU APIs

The RCU APIs are documented in docbook-format header comments in the
Linux-kernel source code, but it helps to have a full list of the
APIs, since there does not appear to be a way to categorize them
in docbook.  Here is the list, by category.

RCU list traversal:

        list_entry_rcu
        list_first_entry_rcu
        list_next_rcu
        list_for_each_entry_rcu
        list_for_each_entry_continue_rcu
        hlist_first_rcu
        hlist_next_rcu
        hlist_pprev_rcu
        hlist_for_each_entry_rcu
        hlist_for_each_entry_rcu_bh
        hlist_for_each_entry_continue_rcu
        hlist_for_each_entry_continue_rcu_bh
        hlist_nulls_first_rcu
        hlist_nulls_for_each_entry_rcu
        hlist_bl_first_rcu
        hlist_bl_for_each_entry_rcu

RCU pointer/list update:

        rcu_assign_pointer
        list_add_rcu
        list_add_tail_rcu
        list_del_rcu
        list_replace_rcu
        hlist_add_behind_rcu
        hlist_add_before_rcu
        hlist_add_head_rcu
        hlist_del_rcu
        hlist_del_init_rcu
        hlist_replace_rcu
        list_splice_init_rcu
        hlist_nulls_del_init_rcu
        hlist_nulls_del_rcu
        hlist_nulls_add_head_rcu
        hlist_bl_add_head_rcu
        hlist_bl_del_init_rcu
        hlist_bl_del_rcu
        hlist_bl_set_first_rcu

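The first entry above, rcu_assign_pointer, and its read-side partner
rcu_dereference form the publish/subscribe pair at the heart of the API.
As a rough userspace analogy (an assumption of this sketch, not the
kernel's implementation), they can be modeled with C11 atomics: a
store-release to publish and a dependency-ordered load to subscribe.

```c
/* Hedged userspace analogy: rcu_assign_pointer() behaves roughly like
 * a store-release, and rcu_dereference() roughly like a
 * dependency-ordered load.  The macro names here are local stand-ins,
 * not the kernel's definitions. */
#include <stdatomic.h>
#include <stdlib.h>

struct foo {
        int a;
};

static _Atomic(struct foo *) gp;        /* the RCU-protected pointer */

#define rcu_assign_pointer(p, v) \
        atomic_store_explicit(&(p), (v), memory_order_release)

#define rcu_dereference(p) \
        atomic_load_explicit(&(p), memory_order_consume)

/* Publish a new structure: initialize it fully, then make it visible. */
static void publish(int a)
{
        struct foo *p = malloc(sizeof(*p));

        p->a = a;                       /* initialization happens first... */
        rcu_assign_pointer(gp, p);      /* ...then the release-store publishes */
}

/* Subscribe: the ordered load guarantees that dereferencing the
 * returned pointer sees the fully initialized fields. */
static int read_a(void)
{
        struct foo *p = rcu_dereference(gp);

        return p != NULL ? p->a : -1;
}
```

In the kernel, readers would of course bracket the rcu_dereference()
and the uses of the resulting pointer with rcu_read_lock() and
rcu_read_unlock().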
RCU:    Critical sections       Grace period            Barrier

        rcu_read_lock           synchronize_net         rcu_barrier
        rcu_read_unlock         synchronize_rcu
        rcu_dereference         synchronize_rcu_expedited
        rcu_read_lock_held      call_rcu
        rcu_dereference_check   kfree_rcu
        rcu_dereference_protected

bh:     Critical sections       Grace period            Barrier

        rcu_read_lock_bh        call_rcu_bh             rcu_barrier_bh
        rcu_read_unlock_bh      synchronize_rcu_bh
        rcu_dereference_bh      synchronize_rcu_bh_expedited
        rcu_dereference_bh_check
        rcu_dereference_bh_protected
        rcu_read_lock_bh_held

sched:  Critical sections       Grace period            Barrier

        rcu_read_lock_sched     synchronize_sched       rcu_barrier_sched
        rcu_read_unlock_sched   call_rcu_sched
        [preempt_disable]       synchronize_sched_expedited
        [and friends]
        rcu_read_lock_sched_notrace
        rcu_read_unlock_sched_notrace
        rcu_dereference_sched
        rcu_dereference_sched_check
        rcu_dereference_sched_protected
        rcu_read_lock_sched_held


SRCU:   Critical sections       Grace period            Barrier

        srcu_read_lock          synchronize_srcu        srcu_barrier
        srcu_read_unlock        call_srcu
        srcu_dereference        synchronize_srcu_expedited
        srcu_dereference_check
        srcu_read_lock_held

SRCU:   Initialization/cleanup
        init_srcu_struct
        cleanup_srcu_struct

All: lockdep-checked RCU-protected pointer access

        rcu_access_pointer
        rcu_dereference_raw
        RCU_LOCKDEP_WARN
        rcu_sleep_check
        RCU_NONIDLE

See the comment headers in the source code (or the docbook generated
from them) for more information.

However, given that there are no fewer than four families of RCU APIs
in the Linux kernel, how do you choose which one to use?  The following
list can be helpful:

a.      Will readers need to block?  If so, you need SRCU.

b.      What about the -rt patchset?  If readers would need to block
        in a non-rt kernel, you need SRCU.  If readers would block
        in a -rt kernel, but not in a non-rt kernel, SRCU is not
        necessary.

c.      Do you need to treat NMI handlers, hardirq handlers,
        and code segments with preemption disabled (whether
        via preempt_disable(), local_irq_save(), local_bh_disable(),
        or some other mechanism) as if they were explicit RCU readers?
        If so, RCU-sched is the only choice that will work for you.

d.      Do you need RCU grace periods to complete even in the face
        of softirq monopolization of one or more of the CPUs?  For
        example, is your code subject to network-based denial-of-service
        attacks?  If so, you need RCU-bh.

e.      Is your workload too update-intensive for normal use of
        RCU, but inappropriate for other synchronization mechanisms?
        If so, consider SLAB_DESTROY_BY_RCU.  But please be careful!

f.      Do you need read-side critical sections that are respected
        even though they are in the middle of the idle loop, during
        user-mode execution, or on an offlined CPU?  If so, SRCU is the
        only choice that will work for you.

g.      Otherwise, use RCU.

Of course, this all assumes that you have determined that RCU is in fact
the right tool for your job.


8.  ANSWERS TO QUICK QUIZZES

Quick Quiz #1:  Why is this argument naive?  How could a deadlock
                occur when using this algorithm in a real-world Linux
                kernel?  [Referring to the lock-based "toy" RCU
                algorithm.]

Answer:         Consider the following sequence of events:

                1.      CPU 0 acquires some unrelated lock, call it
                        "problematic_lock", disabling irq via
                        spin_lock_irqsave().

                2.      CPU 1 enters synchronize_rcu(), write-acquiring
                        rcu_gp_mutex.

                3.      CPU 0 enters rcu_read_lock(), but must wait
                        because CPU 1 holds rcu_gp_mutex.

                4.      CPU 1 is interrupted, and the irq handler
                        attempts to acquire problematic_lock.

                The system is now deadlocked.

                One way to avoid this deadlock is to use an approach like
                that of CONFIG_PREEMPT_RT, where all normal spinlocks
                become blocking locks, and all irq handlers execute in
                the context of special tasks.  In this case, in step 4
                above, the irq handler would block, allowing CPU 1 to
                release rcu_gp_mutex, avoiding the deadlock.

                Even in the absence of deadlock, this RCU implementation
                allows latency to "bleed" from readers to other
                readers through synchronize_rcu().  To see this,
                consider task A in an RCU read-side critical section
                (thus read-holding rcu_gp_mutex), task B blocked
                attempting to write-acquire rcu_gp_mutex, and
                task C blocked in rcu_read_lock() attempting to
                read-acquire rcu_gp_mutex.  Task A's RCU read-side
                latency is holding up task C, albeit indirectly via
                task B.

                Realtime RCU implementations therefore use a counter-based
                approach where tasks in RCU read-side critical sections
                cannot be blocked by tasks executing synchronize_rcu().

Quick Quiz #2:  Give an example where Classic RCU's read-side
                overhead is -negative-.

Answer:         Imagine a single-CPU system with a non-CONFIG_PREEMPT
                kernel where a routing table is used by process-context
                code, but can be updated by irq-context code (for example,
                by an "ICMP REDIRECT" packet).  The usual way of handling
                this would be to have the process-context code disable
                interrupts while searching the routing table.  Use of
                RCU allows such interrupt-disabling to be dispensed with.
                Thus, without RCU, you pay the cost of disabling interrupts,
                and with RCU you don't.

                One can argue that the overhead of RCU in this
                case is negative with respect to the single-CPU
                interrupt-disabling approach.  Others might argue that
                the overhead of RCU is merely zero, and that replacing
                the positive overhead of the interrupt-disabling scheme
                with the zero-overhead RCU scheme does not constitute
                negative overhead.

                In real life, of course, things are more complex.  But
                even the theoretical possibility of negative overhead for
                a synchronization primitive is a bit unexpected.  ;-)

Quick Quiz #3:  If it is illegal to block in an RCU read-side
                critical section, what the heck do you do in
                PREEMPT_RT, where normal spinlocks can block???

Answer:         Just as PREEMPT_RT permits preemption of spinlock
                critical sections, it permits preemption of RCU
                read-side critical sections.  It also permits
                spinlocks blocking while in RCU read-side critical
                sections.

                Why the apparent inconsistency?  Because it is
                possible to use priority boosting to keep the RCU
                grace periods short if need be (for example, if running
                short of memory).  In contrast, if blocking waiting
                for (say) network reception, there is no way to know
                what should be boosted.  Especially given that the
                process we need to boost might well be a human being
                who just went out for a pizza or something.  And although
                a computer-operated cattle prod might arouse serious
                interest, it might also provoke serious objections.
                Besides, how does the computer know what pizza parlor
                the human being went to???


ACKNOWLEDGEMENTS

My thanks to the people who helped make this human-readable, including
Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern.


For more information, see http://www.rdrop.com/users/paulmck/RCU.