PCI Bus EEH Error Recovery
--------------------------
Linas Vepstas
<linas@austin.ibm.com>
12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Extended Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to corruption of both user and kernel data,
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by being protected from PCI errors, and by
gaining the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections.  The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards or,
unfortunately quite commonly, to device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access by that card.  Catching this is a powerful
feature, as it prevents what would otherwise have been silent memory
corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.
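
As an aside, the usual way for a driver to avoid this class of bug is
to obtain a bus address from the DMA-mapping API before programming
the card.  The fragment below is a hypothetical sketch (the register
offset, buffer size and function names are invented for illustration;
it is not code from any real driver):

  #include <linux/pci.h>
  #include <asm/io.h>

  #define EX_RX_BUF_SIZE  2048    /* hypothetical buffer size */
  #define EX_RX_DMA_ADDR  0x10    /* hypothetical device register */

  /* Hypothetical helper: hand the card a properly mapped DMA buffer.
   * (Assumes a 32-bit DMA address, for simplicity.)
   */
  static void example_setup_rx(struct pci_dev *pdev,
                               void __iomem *ioaddr, void *rx_buf)
  {
          dma_addr_t bus_addr;

          /* Obtain a bus address that the card is allowed to DMA to. */
          bus_addr = pci_map_single(pdev, rx_buf, EX_RX_BUF_SIZE,
                                    PCI_DMA_FROMDEVICE);

          /* Program the (made-up) DMA address register with the bus
           * address.  Handing the card an unmapped address is exactly
           * the kind of bug that shows up as an EEH-frozen slot.
           */
          writel(bus_addr, ioaddr + EX_RX_DMA_ADDR);
  }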


Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect
and recover from EEH errors will be presented.  This is followed
by an overview of how the current implementation in the Linux
kernel does it.  The actual implementation is subject to change,
and some of the finer points are still being debated.  These
may in turn be swayed if or when other architectures implement
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
This includes access to PCI memory, I/O space, and PCI config
space.  Interrupts, however, will continue to be delivered.

Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces from the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks.  The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case.  If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.
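
For a concrete picture of what "setting up the device config space"
involves, the following is a minimal sketch using the generic
config-space accessors.  It assumes the values were saved earlier (at
probe time or before the reset); the function and parameter names are
illustrative only:

  #include <linux/pci.h>

  /* Illustrative only: restore config space from previously saved
   * values.  Real recovery code also restores the command register
   * and any device-specific setup.
   */
  static void example_restore_config(struct pci_dev *pdev,
                                     const u32 saved_bars[6],
                                     u8 latency, u8 cls, u8 irq_line)
  {
          int i;

          /* Put the six base address registers back. */
          for (i = 0; i < 6; i++)
                  pci_write_config_dword(pdev,
                                         PCI_BASE_ADDRESS_0 + 4 * i,
                                         saved_bars[i]);

          pci_write_config_byte(pdev, PCI_LATENCY_TIMER, latency);
          pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE, cls);
          pci_write_config_byte(pdev, PCI_INTERRUPT_LINE, irq_line);
  }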

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.


Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and again whenever a PCI slot is hot-plugged.  The former is performed
by eeh_init() in arch/ppc64/kernel/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling into the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 systems can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the pci device associated
with that address (if any).

The default include/asm-ppc64/io.h macros readb(), inb(), insb(),
etc. include a check to see if the i/o read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned and
a large number of 0xff reads are a normal part of the scan procedure.
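
From a device driver's point of view, the symptom of an isolated slot
is simply that every MMIO read returns all-ones.  A hypothetical
driver-side check might look like the sketch below (the register
offset and function are invented; on ppc64 the readl() wrapper itself
already performs the firmware query described above):

  #include <linux/types.h>
  #include <linux/errno.h>
  #include <asm/io.h>

  #define EX_STATUS_REG  0x0      /* hypothetical status register */

  /* Hedged example: treat an all-ones read as a possibly frozen slot. */
  static int example_check_frozen(void __iomem *ioaddr)
  {
          u32 status = readl(ioaddr + EX_STATUS_REG);

          if (status == 0xffffffff) {
                  /* Same value a physically removed card would give;
                   * stop issuing I/O and let EEH recovery proceed.
                   */
                  return -EIO;
          }
          return 0;
  }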

If a frozen slot is detected, code in arch/ppc64/kernel/eeh.c will
print a stack trace to syslog (/var/log/messages).  This stack trace
has proven to be very useful to device-driver authors for finding
out at what point the EEH error was detected, as the error itself
usually occurs slightly beforehand.

Next, it uses the Linux kernel notifier chain/work queue mechanism to
allow any interested parties to find out about the failure.  Device
drivers, or other parts of the kernel, can use
eeh_register_notifier(struct notifier_block *) to find out about EEH
events.  The event will include a pointer to the pci device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
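
As a rough sketch of the registration pattern (the callback body and
the interpretation of the event payload are placeholders; the EEH code
defines the actual event contents described above):

  #include <linux/init.h>
  #include <linux/notifier.h>

  /* Placeholder callback: a real one would inspect the event payload
   * (pci device, device node, state) and quiesce the device.
   */
  static int example_eeh_event(struct notifier_block *self,
                               unsigned long event, void *data)
  {
          return NOTIFY_OK;
  }

  static struct notifier_block example_eeh_nb = {
          .notifier_call = example_eeh_event,
  };

  static int __init example_init(void)
  {
          /* Hook into the EEH event chain. */
          eeh_register_notifier(&example_eeh_nb);
          return 0;
  }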

To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second.
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
   located topologically under the pci slot.
eeh_save_bars() and eeh_restore_bars() -- save and restore the PCI
   config-space info for a device and any devices under it.
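
A recovery sequence strung together from these calls might look
roughly like the outline below.  The argument lists shown are
assumptions made purely for illustration; consult
arch/ppc64/kernel/eeh.c for the real prototypes and the actual
ordering used by the default handler:

  /* Assumed-argument outline only -- not the real prototypes. */
  eeh_save_bars(dev);          /* snapshot config space ahead of time */
  rtas_set_slot_reset(dev);    /* pulse #RST via firmware */
  eeh_restore_bars(dev);       /* put the saved config space back */
  rtas_configure_bridge(dev);  /* re-set up any bridges under the slot */
  /* ...then restart the device driver, as described below. */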


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space.  This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs and
any bridges underneath.  It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for ethernet cards).


Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a pci slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of events that causes a device
driver's close function to be called during the first phase of an
EEH reset, using the pcnet32 device driver as the example.

rpa_php_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
{
 calls
 pci_remove_bus_device (struct pci_dev *)  // in /drivers/pci/remove.c
 {
  calls
  pci_destroy_dev (struct pci_dev *)
  {
   calls
   device_unregister (&dev->dev)  // in /drivers/base/core.c
   {
    calls
    device_del (struct device *)
    {
     calls
     bus_remove_device()  // in /drivers/base/bus.c
     {
      calls
      device_release_driver()
      {
       calls
       struct device_driver->remove() which is just
       pci_device_remove()  // in /drivers/pci/pci_driver.c
       {
        calls
        struct pci_driver->remove() which is just
        pcnet32_remove_one()  // in /drivers/net/pcnet32.c
        {
         calls
         unregister_netdev()  // in /net/core/dev.c
         {
          calls
          dev_close()  // in /net/core/dev.c
          {
           calls dev->stop();
           which is just pcnet32_close()  // in pcnet32.c
           {
            which does what you wanted
            to stop the device
           }
          }
         }
         which
         frees pcnet32 device driver memory
        }
}}}}}}}}


To summarize: in drivers/pci/pci_driver.c,
struct device_driver->remove() is just pci_device_remove(),
which calls struct pci_driver->remove(), which is pcnet32_remove_one(),
which calls unregister_netdev() (in net/core/dev.c),
which calls dev_close() (in net/core/dev.c),
which calls dev->stop(), which is pcnet32_close(),
which then does the appropriate shutdown.

---
Following is the analogous stack trace for events sent to user-space
when the pci device is unconfigured.

rpa_php_unconfig_pci_adapter() {              // in rpaphp_pci.c
 calls
 pci_remove_bus_device (struct pci_dev *) {   // in /drivers/pci/remove.c
  calls
  pci_destroy_dev (struct pci_dev *) {
   calls
   device_unregister (&dev->dev) {            // in /drivers/base/core.c
    calls
    device_del(struct device * dev) {         // in /drivers/base/core.c
     calls
     kobject_del() {                          // in /lib/kobject.c
      calls
      kobject_uevent() {                      // in /lib/kobject.c
       calls
       kset_uevent() {                        // in /lib/kobject.c
        calls
        kset->uevent_ops->uevent()   // which is really just
        a call to
        dev_uevent() {                        // in /drivers/base/core.c
         calls
         dev->bus->uevent() which is really just a call to
         pci_uevent() {                       // in drivers/pci/hotplug.c
          which prints device name, etc....
         }
        }
        then kobject_uevent() sends a netlink uevent to userspace
        --> userspace uevent
        (during early boot, nobody listens to netlink events and
        kobject_uevent() executes uevent_helper[], which runs the
        event process /sbin/hotplug)
       }
      }
     kobject_del() then calls sysfs_remove_dir(), which would cause
     any user-space daemon that was watching /sysfs to notice the
     delete event.


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so the current design casts a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-- A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the pci
   card was being rebooted.

-- A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot cleanly
   unmount a file system after the fact without flushing pending
   buffers, and flushing is impossible because I/O has already been
   stopped.  Thus, ideally, the reset should happen at or below the
   block layer, so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until they
   succeed.  Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails.  These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-- If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.


Conclusions
-----------
There's forward progress ...