| Using RCU's CPU Stall Detector |
| |
| The CONFIG_RCU_CPU_STALL_DETECTOR kernel config parameter enables |
| RCU's CPU stall detector, which detects conditions that unduly delay |
| RCU grace periods. The stall detector's idea of what constitutes |
| "unduly delayed" is controlled by a pair of C preprocessor macros: |
| |
| RCU_SECONDS_TILL_STALL_CHECK |
| |
| This macro defines the period of time that RCU will wait from |
| the beginning of a grace period until it issues an RCU CPU |
| stall warning. It is normally ten seconds. |
| |
| RCU_SECONDS_TILL_STALL_RECHECK |
| |
| This macro defines the period of time that RCU will wait after |
| issuing a stall warning until it issues another stall warning. |
| It is normally set to thirty seconds. |
| |
| RCU_STALL_RAT_DELAY |
| |
| The CPU stall detector tries to make the offending CPU rat on itself, |
| as this often gives better-quality stack traces. However, if |
| the offending CPU does not detect its own stall in the number |
| of jiffies specified by RCU_STALL_RAT_DELAY, then other CPUs will |
| complain. This is normally set to two jiffies. |
| |
| The following problems can result in an RCU CPU stall warning: |
| |
| o A CPU looping in an RCU read-side critical section. |
| |
| o A CPU looping with interrupts disabled. |
| |
| o A CPU looping with preemption disabled. |
| |
| o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel |
| without invoking schedule(). |
| |
| o A bug in the RCU implementation. |
| |
| o A hardware failure. This is quite unlikely, but has occurred |
| at least once in a former life. A CPU failed in a running system, |
| becoming unresponsive, but not causing an immediate crash. |
| This resulted in a series of RCU CPU stall warnings, eventually |
| leading the realization that the CPU had failed. |
| |
| The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning. |
| SRCU does not do so directly, but its calls to synchronize_sched() will |
| result in RCU-sched detecting any CPU stalls that might be occurring. |
| |
| To diagnose the cause of the stall, inspect the stack traces. The offending |
| function will usually be near the top of the stack. If you have a series |
| of stall warnings from a single extended stall, comparing the stack traces |
| can often help determine where the stall is occurring, which will usually |
| be in the function nearest the top of the stack that stays the same from |
| trace to trace. |
| |
| RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE. |