An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel
free to update it with notes about any area that is not clear.

---

MCA/INIT are completely asynchronous. They can occur at any time, when
the OS is in any state, including when one of the cpus is already
holding a spinlock. Trying to get any lock from MCA/INIT state is
asking for deadlock. Also the state of structures that are protected
by locks is indeterminate, including linked lists.

---

The complicated ia64 MCA process. All of this is mandated by Intel's
specification for ia64 SAL, error recovery and unwind; it is not as
if we have a choice here.

* MCA occurs on one cpu, usually due to a double bit memory error.
  This is the monarch cpu.

* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
  to all the other cpus, the slaves.

* Slave cpus that receive the MCA interrupt call down into SAL; they
  end up spinning disabled while the MCA is being serviced.

* If any slave cpu was already spinning disabled when the MCA occurred
  then it cannot service the MCA interrupt. SAL waits ~20 seconds then
  sends an unmaskable INIT event to the slave cpus that have not
  already rendezvoused.

* Because MCA/INIT can be delivered at any time, including when the cpu
  is down in PAL in physical mode, the registers at the time of the
  event are _completely_ undefined. In particular the MCA/INIT
  handlers cannot rely on the thread pointer: PAL physical mode can
  (and does) modify TP. It is allowed to do that as long as it resets
  TP on return. However MCA/INIT events expose us to these PAL
  internal TP changes. Hence curr_task().

* If an MCA/INIT event occurs while the kernel was running (not user
  space) and the kernel has called PAL then the MCA/INIT handler cannot
  assume that the kernel stack is in a fit state to be used, mainly
  because PAL may or may not maintain the stack pointer internally.
  Because the MCA/INIT handlers cannot trust the kernel stack, they
  have to use their own, per-cpu stacks. The MCA/INIT stacks are
  preformatted with just enough task state to let the relevant handlers
  do their job.

* Unlike most other architectures, the ia64 struct task is embedded in
  the kernel stack[1]. So switching to a new kernel stack means that
  we switch to a new task as well. Because various bits of the kernel
  assume that current points into the struct task, switching to a new
  stack also means a new value for current.

* Once all slaves have rendezvoused and are spinning disabled, the
  monarch is entered. The monarch now tries to diagnose the problem
  and decide if it can recover or not.

* Part of the monarch's job is to look at the state of all the other
  tasks. The only way to do that on ia64 is to call the unwinder,
  as mandated by Intel.

* The starting point for the unwind depends on whether a task is
  running or not. That is, whether it is on a cpu or is blocked. The
  monarch has to determine whether or not a task is on a cpu before it
  knows how to start unwinding it. The tasks that received an MCA or
  INIT event are no longer running; they have been converted to blocked
  tasks. But (and it's a big but), the cpus that received the MCA
  rendezvous interrupt are still running on their normal kernel stacks!

* To distinguish between these two cases, the monarch must know which
  tasks are on a cpu and which are not. Hence each slave cpu that
  switches to an MCA/INIT stack registers its new stack using
  set_curr_task(), so the monarch can tell that the _original_ task is
  no longer running on that cpu. That gives us a decent chance of
  getting a valid backtrace of the _original_ task.

* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
  nested error, we want diagnostics on the MCA/INIT handler that
  failed, not on the task that was originally running. Again this
  requires set_curr_task() so the MCA/INIT handlers can register their
  own stack as running on that cpu. Then a recursive error gets a
  trace of the failing handler's "task".

[1] My (Keith Owens) original design called for ia64 to separate its
    struct task and the kernel stacks. Then the MCA/INIT data would be
    chained stacks like i386 interrupt stacks. But that required
    radical surgery on the rest of ia64, plus extra hard wired TLB
    entries with their associated performance degradation. David
    Mosberger vetoed that approach. Which meant that separate kernel
    stacks meant separate "tasks" for the MCA/INIT handlers.
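
The registration step described in the list above can be pictured with
a small user-space model. This is only a sketch, not the kernel code:
every toy_* name, the NR_CPUS value and the is_mca_handler field are
made up for illustration, and the real curr_task()/set_curr_task()
hooks live in the scheduler and operate on real task structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4                /* hypothetical cpu count for this model */

/* Toy stand-in for a task; not the kernel's struct task. */
struct toy_task {
    int pid;
    bool is_mca_handler;         /* loosely models _TIF_MCA_INIT */
};

/* Per-cpu record of which task is considered to be "on cpu". */
static struct toy_task *cpu_curr[NR_CPUS];

static struct toy_task *toy_curr_task(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return cpu_curr[cpu];
}

static void toy_set_curr_task(int cpu, struct toy_task *t)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    cpu_curr[cpu] = t;
}

/*
 * Handler entry: register the handler's own "task" as running on this
 * cpu and hand back the task that was running before the event.
 */
static struct toy_task *toy_handler_enter(int cpu, struct toy_task *handler)
{
    struct toy_task *prev = toy_curr_task(cpu);

    toy_set_curr_task(cpu, handler);
    return prev;
}

/* Monarch's question: is this task registered as running on any cpu? */
static bool toy_task_on_cpu(const struct toy_task *t)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_curr[cpu] == t)
            return true;
    return false;
}
```

In this model, once toy_handler_enter() has run on a cpu, the original
task no longer shows up as on-cpu, so the monarch would unwind it as a
blocked task, while a nested error would find the handler's "task"
registered instead.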

---

INIT is less complicated than MCA. Pressing the nmi button or using
the equivalent command on the management console sends INIT to all
cpus. SAL picks one of the cpus as the monarch and the rest are
slaves. All the OS INIT handlers are entered at approximately the same
time. The OS monarch prints the state of all tasks and returns, after
which the slaves return and the system resumes.

At least that is what is supposed to happen. Alas there are broken
versions of SAL out there. Some drive all the cpus as monarchs. Some
drive them all as slaves. Some drive one cpu as monarch, wait for that
cpu to return from the OS then drive the rest as slaves. Some versions
of SAL cannot even cope with returning from the OS; they spin inside
SAL on resume. The OS INIT code has workarounds for some of these
broken SAL symptoms, but some simply cannot be fixed from the OS side.

---

The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
violations. Unfortunately MCA/INIT start off as massive layer
violations (can occur at _any_ time) and they build from there.

At least ia64 makes an attempt at recovering from hardware errors, but
it is a difficult problem because of the asynchronous nature of these
errors. When processing an unmaskable interrupt we sometimes need
special code to cope with our inability to take any locks.
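
One general shape that such special code can take is "try once, never
spin". The following is a user-space sketch using C11 atomics, not
anything taken from mca.c; all toy_* names are hypothetical. Note that
the fallback path must also avoid touching the data the lock protects,
since that data may be half-updated.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Try to take the lock exactly once; never wait for it. */
static bool toy_trylock(atomic_flag *lock)
{
    assert(lock != NULL);
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static void toy_unlock(atomic_flag *lock)
{
    assert(lock != NULL);
    atomic_flag_clear_explicit(lock, memory_order_release);
}

enum toy_path { TOY_LOCKED_PATH, TOY_LOCKLESS_FALLBACK };

/*
 * Model of a handler that would like to use a locked resource.  If the
 * lock is busy it may be held by the very code that was interrupted,
 * so spinning could deadlock; take a fallback path instead.
 */
static enum toy_path toy_handler_report(atomic_flag *lock)
{
    if (toy_trylock(lock)) {
        /* Lock was free: the normal path is safe to use. */
        toy_unlock(lock);
        return TOY_LOCKED_PATH;
    }
    /* Lock is held: do not wait, and do not touch what it protects. */
    return TOY_LOCKLESS_FALLBACK;
}
```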

---

How is ia64 MCA/INIT different from x86 NMI?

* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
  all cpus.

* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
  per cpu.

* x86 has a separate struct task which points to one of multiple kernel
  stacks. ia64 has the struct task embedded in the single kernel
  stack, so switching stack means switching task.

* x86 does not call the BIOS so the NMI handler does not have to worry
  about any registers having changed. MCA/INIT can occur while the cpu
  is in PAL in physical mode, with undefined registers and an undefined
  kernel stack.

* i386 backtrace is not very sensitive to whether a process is running
  or not. ia64 unwind is very, very sensitive to whether a process is
  running or not.

---

What happens when MCA/INIT is delivered while a cpu is running user
space code?

The user mode registers are stored in the RSE area of the MCA/INIT
stack on entry to the OS and are restored from there on return to SAL,
so user mode registers are preserved across a recoverable MCA/INIT.
Since the OS has no idea what unwind data is available for the user
space stack, MCA/INIT never tries to backtrace user space. Which means
that the OS does not bother making the user space process look like a
blocked task, i.e. the OS does not copy pt_regs and switch_stack to the
user space stack. Also the OS has no idea how big the user space RSE
and memory stacks are, which makes it too risky to copy the saved state
to a user mode stack.

---

How do we get a backtrace on the tasks that were running when MCA/INIT
was delivered?

mca.c::ia64_mca_modify_original_stack(). That identifies and
verifies the original kernel stack, copies the dirty registers from
the MCA/INIT stack's RSE to the original stack's RSE, copies the
skeleton struct pt_regs and switch_stack to the original stack, fills
in the skeleton structures from the PAL minstate area and updates the
original stack's thread.ksp. That makes the original stack look
exactly like any other blocked task, i.e. it now appears to be
sleeping. To get a backtrace, just start with thread.ksp for the
original task and unwind like any other sleeping task.
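
The verify, copy and update-ksp sequence can be illustrated with a
deliberately simplified user-space model. It is a sketch only: the
toy_* structures, their fields, the 1024 byte stack and the use of
plain offsets are all invented here, and it leaves out the RSE copy and
the PAL minstate fill-in that the real function performs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Invented stand-ins; the real pt_regs and switch_stack are far larger. */
struct toy_pt_regs      { uint64_t ip, sp; };
struct toy_switch_stack { uint64_t preserved[4]; };

struct toy_task {
    unsigned char stack[1024];   /* toy kernel stack, grows downwards */
    size_t ksp;                  /* offset of the saved state of a blocked task */
};

/*
 * Make an interrupted task look blocked: place skeleton pt_regs and
 * switch_stack just below the stack offset it was using (old_sp) and
 * point ksp at them.  Returns 0 on success, -1 if old_sp does not pass
 * the sanity checks, in which case the stack is left untouched.
 */
static int toy_modify_original_stack(struct toy_task *orig, size_t old_sp,
                                     const struct toy_pt_regs *regs,
                                     const struct toy_switch_stack *sw)
{
    size_t need = sizeof(*regs) + sizeof(*sw);

    assert(orig && regs && sw);

    /* Verify: old_sp must lie inside the stack with room below it. */
    if (old_sp > sizeof(orig->stack) || old_sp < need)
        return -1;

    size_t regs_off = old_sp - sizeof(*regs);
    size_t sw_off = regs_off - sizeof(*sw);

    /* memcpy avoids any alignment assumptions about the byte buffer. */
    memcpy(orig->stack + regs_off, regs, sizeof(*regs));
    memcpy(orig->stack + sw_off, sw, sizeof(*sw));

    /* An unwinder for blocked tasks would start from here. */
    orig->ksp = sw_off;
    return 0;
}
```

The point of the model is the ordering: nothing is written to the
original stack until it has been verified, and ksp is updated last, so
a stack that fails verification is never presented as a blocked task.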

---

How do we identify the tasks that were running when MCA/INIT was
delivered?

If the previous task has been verified and converted to a blocked
state, then sos->prev_task on the MCA/INIT stack is updated to point to
the previous task. You can look at that field in dumps or debuggers.
To help distinguish between the handler and the original tasks,
handlers have _TIF_MCA_INIT set in thread_info.flags.

The sos data is always in the MCA/INIT handler stack, at offset
MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
ia64_sal_os_state), with 16 byte alignment for all structures.
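
As a worked example of that arithmetic, here is a small sketch. It
assumes that "16 byte alignment" means rounding each intermediate
offset down to a multiple of 16, and it takes the sizes as parameters
because the real values depend on the kernel configuration; check
mca_asm.h for the authoritative definition. The toy_* names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round an offset down to a 16 byte boundary. */
static size_t toy_align16_down(size_t x)
{
    return x & ~(size_t)15;
}

/*
 * Offset of the sos data from the base of the handler stack, following
 * the formula in the text: start at the top of the stack, step down
 * past pt_regs, then past the sal_os_state, aligning after each step.
 */
static size_t toy_sos_offset(size_t stack_size, size_t pt_regs_size,
                             size_t sos_size)
{
    assert(pt_regs_size <= stack_size);
    size_t pt_regs_off = toy_align16_down(stack_size - pt_regs_size);

    assert(sos_size <= pt_regs_off);
    return toy_align16_down(pt_regs_off - sos_size);
}
```

With made-up sizes of 32768 for the stack, 400 for pt_regs and 1000 for
the sos structure, the steps are 32768 - 400 = 32368 (already aligned),
then 32368 - 1000 = 31368, which rounds down to 31360.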

Also the comm field of the MCA/INIT task is modified to include the pid
of the original task, for humans to use. For example, a comm field of
'MCA 12159' means that pid 12159 was running when the MCA was
delivered.
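
When post-processing a dump, the pid can be pulled back out of such a
comm string. The helpers below are a sketch for that purpose and are
not kernel code: the toy_* names are hypothetical, the 16 byte buffer
is an assumption about the comm field size, and accepting an "INIT"
prefix is assumed by analogy with the 'MCA 12159' example above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_COMM_LEN 16          /* assumed size of a comm buffer */

/* Build a comm string such as "MCA 12159"; -1 if it does not fit. */
static int toy_format_comm(char *comm, size_t len, const char *event, int pid)
{
    assert(comm && event);
    int n = snprintf(comm, len, "%s %d", event, pid);

    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/*
 * Extract the original pid from a handler comm string.  Returns the
 * pid, or -1 if the string does not look like "MCA <pid>" or
 * "INIT <pid>".
 */
static int toy_comm_to_pid(const char *comm)
{
    char event[8];
    int pid;

    assert(comm);
    if (sscanf(comm, "%7s %d", event, &pid) != 2)
        return -1;
    if (strcmp(event, "MCA") != 0 && strcmp(event, "INIT") != 0)
        return -1;
    return pid;
}
```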