                   Firmware-Assisted Dump
                   ------------------------
                       July 2011

The goal of firmware-assisted dump is to enable the dump of
a crashed system, and to do so from a fully-reset system, and
to minimize the total elapsed time until the system is back
in production use.

- Firmware-assisted dump (fadump) infrastructure is intended to replace
  the existing phyp assisted dump.
- Fadump uses the same firmware interfaces and memory reservation model
  as phyp assisted dump.
- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
  in the ELF format in the same way as kdump. This helps us reuse the
  kdump infrastructure for dump capture and filtering.
- Unlike phyp dump, userspace tools do not need to refer to any sysfs
  interface while reading /proc/vmcore.
- Unlike phyp dump, fadump allows the user to release all the memory
  reserved for dump with a single operation:
  echo 1 > /sys/kernel/fadump_release_mem
- Once enabled through a kernel boot parameter, fadump can be
  started/stopped through the /sys/kernel/fadump_registered interface
  (see the sysfs files section below) and can be easily integrated with
  the kdump service start/stop init scripts.
| 26 | |
Compared with kdump or other strategies, firmware-assisted
dump offers several practical advantages:

-- Unlike kdump, the system has been reset, and loaded
   with a fresh copy of the kernel. In particular,
   PCI and I/O devices have been reinitialized and are
   in a clean, consistent state.
-- Once the dump is copied out, the memory that held the dump
   is immediately available to the running kernel. Therefore,
   unlike kdump, fadump doesn't need a second reboot to return
   the system to the production configuration.

The above can only be accomplished by coordination with,
and assistance from, the Power firmware. The procedure is
as follows:

-- The first kernel registers the sections of memory with the
   Power firmware for dump preservation during OS initialization.
   These registered sections of memory are reserved by the first
   kernel during early boot.

-- When a system crashes, the Power firmware will save
   the low memory (boot memory, whose size is the larger of 5% of
   system RAM or 256MB) to the previously registered region. It
   will also save system registers and hardware PTEs.

   NOTE: The term 'boot memory' means the size of the low memory chunk
         that is required for a kernel to boot successfully when
         booted with restricted memory. By default, the boot memory
         size will be the larger of 5% of system RAM or 256MB.
         Alternatively, the user can specify the boot memory size
         through the boot parameter 'fadump_reserve_mem=', which
         overrides the default calculated size. Use this option
         if the default boot memory size is not sufficient for the
         second kernel to boot successfully.
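
The default sizing rule in the NOTE above can be sketched in shell as
follows. This is only an illustration of the "larger of 5% of system RAM
or 256MB" rule: the function name default_boot_mem_size is hypothetical,
and any rounding or alignment the kernel may apply is not modelled.

```shell
# Hypothetical helper (not part of the kernel or any tool): print the
# default boot memory size in bytes for a given amount of system RAM,
# also in bytes, following the "larger of 5% or 256MB" rule.
default_boot_mem_size() {
    ram_bytes=$1
    five_percent=$(( ram_bytes / 20 ))      # 5% of system RAM, rounded down
    floor_bytes=$(( 256 * 1024 * 1024 ))    # 256MB minimum
    if [ "$five_percent" -gt "$floor_bytes" ]; then
        echo "$five_percent"
    else
        echo "$floor_bytes"
    fi
}
```

On a system with 1GB of RAM, 5% is well below 256MB, so the 256MB floor
applies. On large systems the 5% term dominates.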
| 62 | |
-- After the low memory (boot memory) area has been saved, the
   firmware will reset PCI and other hardware state. It will
   *not* clear the RAM. It will then launch the bootloader, as
   normal.

-- The freshly booted kernel will notice that there is a new
   node (ibm,dump-kernel) in the device tree, indicating that
   there is crash data available from a previous boot. During
   early boot, the OS will reserve the rest of the memory above
   the boot memory size, effectively booting with a restricted
   memory size. This ensures that the second kernel will not
   touch any of the dump memory area.

-- User-space tools will read /proc/vmcore to obtain the contents
   of memory, which holds the dump of the previously crashed kernel
   in ELF format. The userspace tools may copy this info to disk,
   or to network, NAS, SAN, iSCSI, etc. as desired.

-- Once the userspace tool is done saving the dump, it will echo
   '1' to /sys/kernel/fadump_release_mem to release the reserved
   memory back to general use, except the memory required for
   the next firmware-assisted dump registration.

   e.g.
     # echo 1 > /sys/kernel/fadump_release_mem
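
The last two steps (save the dump, then release the memory) might be
combined in a capture script roughly as below. This is a minimal sketch,
not the actual kdump script: the function name save_dump_and_release and
the destination /var/crash/vmcore are assumptions; only /proc/vmcore and
/sys/kernel/fadump_release_mem come from this document.

```shell
# Hypothetical helper: copy the dump out, then release the reserved
# memory. The three paths are parameters so the function can be tried
# on ordinary files; the default destination is an assumption.
save_dump_and_release() {
    vmcore=${1:-/proc/vmcore}
    dest=${2:-/var/crash/vmcore}
    release=${3:-/sys/kernel/fadump_release_mem}

    # Release the memory only if the dump was copied successfully,
    # because the dump is no longer available once memory is released.
    cp "$vmcore" "$dest" || return 1
    echo 1 > "$release"
}
```

Ordering matters here: writing 1 to fadump_release_mem before the copy
has finished would give up the memory that still holds the dump.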
| 88 | |
Please note that the firmware-assisted dump feature
is only available on Power6 and above systems with recent
firmware versions.

Implementation details:
-----------------------

During boot, a check is made to see if firmware supports
this feature on that particular machine. If it does, then
we check to see if an active dump is waiting for us. If yes,
then everything but boot memory size of RAM is reserved during
early boot (see Fig. 2). Because there is dump data, the
/sys/kernel/fadump_release_mem file is created and the reserved
memory is held. This area is released once the user land scripts
(e.g. kdump scripts) that are run have finished collecting the
dump.

If there is no waiting dump data, then only the memory required
to hold CPU state, HPTE region, boot memory dump and elfcore
header is reserved at the top of memory (see Fig. 1). This area
is *not* released: this region will be kept permanently reserved,
so that it can act as a receptacle for a copy of the boot memory
content in addition to CPU state and HPTE region, in case a
crash does occur.
| 113 | |
  o Memory Reservation during first kernel

  Low memory                                        Top of memory
  0      boot memory size                                       |
  |           |                       |<--Reserved dump area -->|
  V           V                       |   Permanent Reservation V
  +-----------+----------/ /----------+---+----+-----------+----+
  |           |                       |CPU|HPTE|  DUMP     |ELF |
  +-----------+----------/ /----------+---+----+-----------+----+
        |                                           ^
        |                                           |
        \                                           /
         -------------------------------------------
          Boot memory content gets transferred to
          reserved area by firmware at the time of
          crash
                   Fig. 1
| 131 | |
  o Memory Reservation during second kernel after crash

  Low memory                                        Top of memory
  0      boot memory size                                       |
  |           |<------------- Reserved dump area ----------- -->|
  V           V                                                 V
  +-----------+----------/ /----------+---+----+-----------+----+
  |           |                       |CPU|HPTE|  DUMP     |ELF |
  +-----------+----------/ /----------+---+----+-----------+----+
        |                                                    |
        V                                                    V
   Used by second                                    /proc/vmcore
   kernel to boot
                   Fig. 2
| 146 | |
Currently the dump will be copied from /proc/vmcore to
a new file upon user intervention. The dump data available through
/proc/vmcore will be in ELF format. Hence the existing kdump
infrastructure (kdump scripts) to save the dump works fine with
minor modifications.

The tools to examine the dump will be the same as the ones
used for kdump.

How to enable firmware-assisted dump (fadump):
----------------------------------------------

1. Set config option CONFIG_FA_DUMP=y and build the kernel.
2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option.
3. Optionally, the user can also set the 'fadump_reserve_mem=' kernel
   cmdline option to specify the size of the memory to reserve for
   boot memory dump preservation.

NOTE: If firmware-assisted dump fails to reserve memory then it will
      fall back to the existing kdump mechanism if the 'crashkernel='
      option is set on the kernel cmdline.
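
To confirm that step 2 took effect on a running system, a script could
look for the option on the kernel command line. The sketch below is an
assumption about how one might do that: the function name
fadump_requested is hypothetical, and reading /proc/cmdline only shows
that the option was passed, not that firmware supports the feature.

```shell
# Hypothetical helper: succeed if the given kernel command line file
# (default /proc/cmdline) contains the exact option 'fadump=on'.
fadump_requested() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each option on its own line, then look for an exact match so
    # that longer options which merely contain the string do not count.
    tr -s ' \t' '\n' < "$cmdline_file" | grep -qxF 'fadump=on'
}
```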
| 168 | |
Sysfs/debugfs files:
--------------------

The firmware-assisted dump feature uses the sysfs file system to hold
the control files and a debugfs file to display the reserved memory
regions.

Here is the list of files under kernel sysfs:

 /sys/kernel/fadump_enabled

    This is used to display the fadump status.
    0 = fadump is disabled
    1 = fadump is enabled

    This interface can be used by kdump init scripts to identify if
    fadump is enabled in the kernel and act accordingly.
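
    An init script could test this file along the following lines. This
    is a sketch only: the function name fadump_is_enabled is
    hypothetical, and the sysfs directory is taken as a parameter
    (default /sys/kernel) purely so the check can be tried on a scratch
    directory.

```shell
# Hypothetical helper for an init script: succeed only if the
# fadump_enabled file exists, is readable, and reads 1.
fadump_is_enabled() {
    sysfs_dir=${1:-/sys/kernel}
    [ -r "$sysfs_dir/fadump_enabled" ] &&
        [ "$(cat "$sysfs_dir/fadump_enabled")" = "1" ]
}
```

    A missing file is treated the same as a 0, which covers kernels
    built without fadump support.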
| 185 | |
 /sys/kernel/fadump_registered

    This is used to display the fadump registration status as well
    as to control (start/stop) the fadump registration.
    0 = fadump is not registered.
    1 = fadump is registered and ready to handle system crash.

    To register fadump, echo 1 > /sys/kernel/fadump_registered; to
    un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered.
    Once fadump is un-registered, a system crash will not
    be handled and vmcore will not be captured. This interface can be
    easily integrated with kdump service start/stop.
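
    The start/stop integration mentioned above might look roughly like
    this in a service script. It is an illustrative sketch, not the
    kdump service itself: the function name fadump_service and its
    argument convention are assumptions; only the file name and the
    0/1 values come from this document.

```shell
# Hypothetical service helper: start, stop, or query fadump
# registration through the fadump_registered file.
fadump_service() {
    action=$1
    sysfs_dir=${2:-/sys/kernel}
    case "$action" in
        start)  echo 1 > "$sysfs_dir/fadump_registered" ;;
        stop)   echo 0 > "$sysfs_dir/fadump_registered" ;;
        status) [ "$(cat "$sysfs_dir/fadump_registered")" = "1" ] ;;
        *)      echo "usage: fadump_service start|stop|status [sysfs_dir]" >&2
                return 2 ;;
    esac
}
```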
| 198 | |
 /sys/kernel/fadump_release_mem

    This file is available only when fadump is active during
    the second kernel. It is used to release the reserved memory
    regions that are held for saving the crash dump. To release the
    reserved memory, echo 1 to it:

    echo 1 > /sys/kernel/fadump_release_mem

    After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
    file will change to reflect the new memory reservations.

    The existing userspace tools (kdump infrastructure) can be easily
    enhanced to use this interface to release the memory reserved for
    dump and continue without a second reboot.
| 214 | |
Here is the list of files under powerpc debugfs:
(Assuming debugfs is mounted on /sys/kernel/debug directory.)

 /sys/kernel/debug/powerpc/fadump_region

    This file shows the reserved memory regions if fadump is
    enabled; otherwise this file is empty. The output format
    is:
    <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>

    e.g.
      Contents when fadump is registered during first kernel

      # cat /sys/kernel/debug/powerpc/fadump_region
      CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
      HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
      DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0

      Contents when fadump is active during second kernel

      # cat /sys/kernel/debug/powerpc/fadump_region
      CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
      HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
      DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
          : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000

NOTE: Please refer to Documentation/filesystems/debugfs.txt on
      how to mount the debugfs filesystem.
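
The fadump_region format shown above is regular enough to post-process
in shell. The sketch below is one possible way to do so, assuming the
line format stays exactly as documented: the function name
fadump_region_summary is hypothetical, and the "unnamed" label for the
line with an empty region name is a choice made here, not something the
kernel prints.

```shell
# Hypothetical helper: for each region line, print
# "<region> <reserved-bytes> <dumped-bytes>" in decimal, and return
# non-zero if a reserved size does not equal <end> - <start> + 1.
fadump_region_summary() {
    region_file=${1:-/sys/kernel/debug/powerpc/fadump_region}
    status=0
    while IFS= read -r line; do
        # Skip anything that does not look like a region line.
        case "$line" in
            *"["*"-"*"]"*"bytes, Dumped:"*) ;;
            *) continue ;;
        esac
        name=$(printf '%s\n' "$line" | sed 's/[[:space:]]*:.*//; s/^[[:space:]]*//')
        start=$(printf '%s\n' "$line" | sed 's/.*\[\(0x[0-9a-fA-F]*\)-.*/\1/')
        end=$(printf '%s\n' "$line" | sed 's/.*-\(0x[0-9a-fA-F]*\)\].*/\1/')
        size=$(printf '%s\n' "$line" | sed 's/.*\] \(0x[0-9a-fA-F]*\) bytes.*/\1/')
        dumped=$(printf '%s\n' "$line" | sed 's/.*Dumped: \(0x[0-9a-fA-F]*\).*/\1/')
        # Shell arithmetic accepts the 0x-prefixed values directly.
        [ $(( $end - $start + 1 )) -eq $(( $size )) ] || status=1
        echo "${name:-unnamed} $(( $size )) $(( $dumped ))"
    done < "$region_file"
    return $status
}
```

In the examples above every line passes the consistency check, for
instance 0x6fff001f - 0x6ffb0000 + 1 = 0x40020 for the CPU region.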
| 243 | |
| 244 | |
TODO:
-----
 o Need to come up with a better approach to find out a more
   accurate boot memory size that is required for a kernel to
   boot successfully when booted with restricted memory.
 o The fadump implementation introduces a fadump crash info structure
   in the scratch area before the ELF core header. The idea of introducing
   this structure is to pass some important crash info data to the second
   kernel, which will help the second kernel populate the ELF core header
   with correct data before it gets exported through /proc/vmcore. The
   current design does not address the possibility of introducing
   additional fields (in future) to this structure without affecting
   compatibility. Need to come up with a better approach to address this.
   The possible approaches are:
        1. Introduce a version field for version tracking, and bump up the
        version whenever a new field is added to the structure in future.
        The version field can be used to find out what fields are valid
        for the current version of the structure.
        2. Reserve an area of predefined size (say PAGE_SIZE) for this
        structure and keep the unused area reserved (initialized to zero)
        for future field additions.
   The advantage of approach 1 over 2 is we don't need to reserve extra space.
---
Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
This document is based on the original documentation written for phyp
assisted dump by Linas Vepstas and Manish Ahuja.