| |
| |
| PCI Bus EEH Error Recovery |
| -------------------------- |
| Linas Vepstas |
| <linas@austin.ibm.com> |
| 12 January 2005 |
| |
| |
| Overview: |
| --------- |
| The IBM POWER-based pSeries and iSeries computers include PCI bus |
| controller chips that have extended capabilities for detecting and |
| reporting a large variety of PCI bus error conditions. These features |
| go under the name of "EEH", for "Extended Error Handling". The EEH |
| hardware features allow PCI bus errors to be cleared and a PCI |
| card to be "rebooted", without also having to reboot the operating |
| system. |
| |
| This is in contrast to traditional PCI error handling, where the |
| PCI chip is wired directly to the CPU, and an error would cause |
| a CPU machine-check/check-stop condition, halting the CPU entirely. |
| Another "traditional" technique is to ignore such errors, which |
| can lead to corruption of both user data and kernel data, |
| hung/unresponsive adapters, or system crashes/lockups. Thus, |
| the idea behind EEH is that the operating system can become more |
| reliable and robust by protecting it from PCI errors, and giving |
| the OS the ability to "reboot"/recover individual PCI devices. |
| |
| Future systems from other vendors, based on the PCI-E specification, |
| may contain similar features. |
| |
| |
| Causes of EEH Errors |
| -------------------- |
| EEH was originally designed to guard against hardware failure, such |
| as PCI cards dying from heat, humidity, dust, vibration and bad |
| electrical connections. The vast majority of EEH errors seen in |
| "real life" are due to either poorly seated PCI cards or, |
| unfortunately quite commonly, device driver bugs, device firmware |
| bugs, and sometimes PCI card hardware bugs. |
| |
| The most common software bug is one that causes the device to |
| attempt to DMA to a location in system memory that has not been |
| reserved for DMA access for that card. Catching this is a powerful |
| feature of EEH, as it prevents what would otherwise have been silent |
| memory corruption caused by the bad DMA. A number of device driver |
| bugs have been found and fixed in this way over the past few |
| years. Other possible causes of EEH errors include data or |
| address line parity errors (for example, due to poor electrical |
| connectivity due to a poorly seated card), and PCI-X split-completion |
| errors (due to software, device firmware, or device PCI hardware bugs). |
| The vast majority of "true hardware failures" can be cured by |
| physically removing and re-seating the PCI card. |
| |
| |
| Detection and Recovery |
| ---------------------- |
| In the following discussion, a generic overview of how to detect |
| and recover from EEH errors will be presented. This is followed |
| by an overview of how the current implementation in the Linux |
| kernel does it. The actual implementation is subject to change, |
| and some of the finer points are still being debated. These |
| may in turn be swayed if or when other architectures implement |
| similar functionality. |
| |
| When a PCI Host Bridge (PHB, the bus controller connecting the |
| PCI bus to the system CPU electronics complex) detects a PCI error |
| condition, it will "isolate" the affected PCI card. Isolation |
| will block all writes (either to the card from the system, or |
| from the card to the system), and it will cause all reads to |
| return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads). |
| This value was chosen because it is the same value you would |
| get if the device was physically unplugged from the slot. |
| This includes access to PCI memory, I/O space, and PCI config |
| space. Interrupts, however, will continue to be delivered. |
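| |
| As an illustration of the all-ff's convention described above, the |
| following is a minimal sketch of a helper that tests whether a value |
| read with a given access width has every bit set. The function name |
| read_is_all_ones() is hypothetical and is not part of the kernel. |
| An all-ones value is only a hint that the slot may be isolated, since |
| a healthy device can legitimately return the same data. |

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if 'val', read with an access 'width' bytes wide
 * (1, 2 or 4), has every bit set for that width. Hypothetical helper
 * for illustration only.
 */
static bool read_is_all_ones(uint32_t val, unsigned int width)
{
	switch (width) {
	case 1:
		return (val & 0xffu) == 0xffu;
	case 2:
		return (val & 0xffffu) == 0xffffu;
	case 4:
		return val == 0xffffffffu;
	default:
		return false;	/* unsupported access width */
	}
}
```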
| |
| Detection and recovery are performed with the aid of ppc64 |
| firmware. The programming interfaces in the Linux kernel |
| into the firmware are referred to as RTAS (Run-Time Abstraction |
| Services). The Linux kernel does not (should not) access |
| the EEH function in the PCI chipsets directly, primarily because |
| there are a number of different chipsets out there, each with |
| different interfaces and quirks. The firmware provides a |
| uniform abstraction layer that will work with all pSeries |
| and iSeries hardware (and be forwards-compatible). |
| |
| If the OS or device driver suspects that a PCI slot has been |
| EEH-isolated, there is a firmware call it can make to determine if |
| this is the case. If so, then the device driver should put itself |
| into a consistent state (given that it won't be able to complete any |
| pending work) and start recovery of the card. Recovery normally |
| would consist of resetting the PCI device (holding the PCI #RST |
| line high for two seconds), followed by setting up the device |
| config space (the base address registers (BARs), latency timer, |
| cache line size, interrupt line, and so on). This is followed by a |
| reinitialization of the device driver. In a worst-case scenario, |
| the power to the card can be toggled, at least on hot-plug-capable |
| slots. In principle, layers far above the device driver probably |
| do not need to know that the PCI card has been "rebooted" in this |
| way; ideally, there should be at most a pause in Ethernet/disk/USB |
| I/O while the card is being reset. |
| |
| If the card cannot be recovered after three or four resets, the |
| kernel/device driver should assume the worst-case scenario, that the |
| card has died completely, and report this error to the sysadmin. |
| In addition, error messages are reported through RTAS and also through |
| syslogd (/var/log/messages) to alert the sysadmin of PCI resets. |
| The correct way to deal with failed adapters is to use the standard |
| PCI hotplug tools to remove and replace the dead card. |
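| |
| The retry policy described above can be sketched as a bounded loop. |
| This is only an illustration under stated assumptions: the limit of |
| three attempts is one reading of "three or four resets", and the |
| names recover_with_retries(), reset_fn and fake_reset() are |
| hypothetical, not kernel interfaces. |

```c
#include <stdbool.h>

#define MAX_RESET_ATTEMPTS 3	/* assumed; the text says "three or four" */

/* Hypothetical callback: reset the device, report whether it came back. */
typedef bool (*reset_fn)(void *ctx);

/*
 * Reset the device up to MAX_RESET_ATTEMPTS times. Returns the number
 * of resets used on success, or -1 if the device should be treated as
 * dead and reported to the sysadmin.
 */
static int recover_with_retries(reset_fn reset, void *ctx)
{
	for (int attempt = 1; attempt <= MAX_RESET_ATTEMPTS; attempt++) {
		if (reset(ctx))
			return attempt;
	}
	return -1;
}

/* Simulated device: comes back once the counter behind 'ctx' hits zero. */
static bool fake_reset(void *ctx)
{
	int *resets_needed = ctx;

	return --(*resets_needed) <= 0;
}
```

| With fake_reset(), a device that needs two resets is recovered on the |
| second attempt, while one that needs ten is given up on and would be |
| left for the PCI hotplug tools. |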
| |
| |
| Current PPC64 Linux EEH Implementation |
| -------------------------------------- |
| At this time, a generic EEH recovery mechanism has been implemented, |
| so that individual device drivers do not need to be modified to support |
| EEH recovery. This generic mechanism piggy-backs on the PCI hotplug |
| infrastructure, and percolates events up through the userspace/udev |
| infrastructure. Following is a detailed description of how this is |
| accomplished. |
| |
| EEH must be enabled in the PHBs very early during the boot process, |
| and again whenever a PCI slot is hot-plugged. The former is performed |
| by eeh_init() in arch/ppc64/kernel/eeh.c, and the latter by |
| drivers/pci/hotplug/pSeries_pci.c calling into the eeh.c code. |
| EEH must be enabled before a PCI scan of the device can proceed. |
| Current Power5 hardware will not work unless EEH is enabled, |
| although older Power4 can run with it disabled. Effectively, |
| EEH can no longer be turned off. PCI devices *must* be |
| registered with the EEH code; the EEH code needs to know about |
| the I/O address ranges of the PCI device in order to detect an |
| error. Given an arbitrary address, the routine |
| pci_get_device_by_addr() will find the pci device associated |
| with that address (if any). |
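| |
| To illustrate why the I/O address ranges must be registered, here is |
| a minimal sketch of an address-to-device lookup. It assumes a plain |
| array scanned linearly; the data structure actually used by the EEH |
| code is not described in this document. struct io_range and |
| lookup_dev_by_addr() are hypothetical names, and a string stands in |
| for the struct pci_dev pointer. |

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one I/O range claimed by a PCI device. */
struct io_range {
	uint64_t start;		/* first address in the range */
	uint64_t end;		/* last address in the range, inclusive */
	const char *dev_name;	/* stand-in for a struct pci_dev pointer */
};

/*
 * Return the device whose registered range contains 'addr', or NULL
 * if no registered range covers it.
 */
static const char *lookup_dev_by_addr(const struct io_range *tbl,
				      size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++) {
		if (addr >= tbl[i].start && addr <= tbl[i].end)
			return tbl[i].dev_name;
	}
	return NULL;
}
```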
| |
| The default include/asm-ppc64/io.h macros readb(), inb(), insb(), |
| etc. include a check to see if the I/O read returned all-0xff's. |
| If so, these make a call to eeh_dn_check_failure(), which in turn |
| asks the firmware if the all-ff's value is the sign of a true EEH |
| error. If it is not, processing continues as normal. The grand |
| total number of these false alarms or "false positives" can be |
| seen in /proc/ppc64/eeh (subject to change). Normally, almost |
| all of these occur during boot, when the PCI bus is scanned, where |
| a large number of 0xff reads are part of the bus scan procedure. |
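| |
| The check-then-confirm pattern described above can be sketched as |
| follows. This is not the code in io.h or eeh.c: check_read8(), |
| struct eeh_stats and the slot_frozen_fn callback are hypothetical, |
| with the callback standing in for the firmware query. The point of |
| the design is that the firmware is consulted only when a read returns |
| all ones, so ordinary reads pay for a single comparison. |

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the firmware query: is this slot frozen? */
typedef bool (*slot_frozen_fn)(void *slot);

struct eeh_stats {
	unsigned long false_positives;	/* all-ones reads that were real data */
	unsigned long slot_freezes;	/* all-ones reads confirmed as errors */
};

/*
 * Post-process an 8-bit read. Returns true only if the value was 0xff
 * and the firmware stand-in confirmed that the slot is frozen.
 */
static bool check_read8(uint8_t val, void *slot,
			slot_frozen_fn is_frozen, struct eeh_stats *st)
{
	if (val != 0xff)
		return false;		/* fast path: no firmware call */
	if (is_frozen(slot)) {
		st->slot_freezes++;
		return true;
	}
	st->false_positives++;
	return false;
}

/* Illustration stubs for the firmware answer. */
static bool always_frozen(void *slot) { (void)slot; return true; }
static bool never_frozen(void *slot) { (void)slot; return false; }
```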
| |
| If a frozen slot is detected, code in arch/ppc64/kernel/eeh.c will |
| print a stack trace to syslog (/var/log/messages). This stack trace |
| has proven to be very useful to device-driver authors for finding |
| out at what point the EEH error was detected, as the error itself |
| usually occurs slightly beforehand. |
| |
| Next, it uses the Linux kernel notifier chain/work queue mechanism to |
| allow any interested parties to find out about the failure. Device |
| drivers, or other parts of the kernel, can use |
| eeh_register_notifier(struct notifier_block *) to find out about EEH |
| events. The event will include a pointer to the pci device, the |
| device node and some state info. Receivers of the event can "do as |
| they wish"; the default handler will be described further in this |
| section. |
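| |
| For readers unfamiliar with the notifier pattern, the following is a |
| self-contained user-space sketch of the idea: listeners register a |
| callback, and each event is delivered to every listener in turn. It |
| does not use the kernel's notifier API; every name ending in _sketch |
| and the count_event() callback are hypothetical, and the event |
| payload is reduced to a slot name and a state value. |

```c
#include <stddef.h>

/* Hypothetical, simplified event payload. */
struct eeh_event_sketch {
	const char *slot_name;
	int state;
};

struct notifier_sketch {
	int (*call)(struct notifier_sketch *self, struct eeh_event_sketch *ev);
	struct notifier_sketch *next;
	int seen;	/* number of events this listener has received */
};

static struct notifier_sketch *chain_head;

/* Add a listener at the head of the chain. */
static void register_notifier_sketch(struct notifier_sketch *nb)
{
	nb->next = chain_head;
	chain_head = nb;
}

/* Deliver one event to every listener; returns how many were called. */
static int notify_all_sketch(struct eeh_event_sketch *ev)
{
	int n = 0;

	for (struct notifier_sketch *nb = chain_head; nb; nb = nb->next) {
		nb->call(nb, ev);
		n++;
	}
	return n;
}

/* Example listener: just counts the events it is handed. */
static int count_event(struct notifier_sketch *self,
		       struct eeh_event_sketch *ev)
{
	(void)ev;
	self->seen++;
	return 0;
}
```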
| |
| To assist in the recovery of the device, eeh.c exports the |
| following functions: |
| |
| rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second |
| rtas_configure_bridge() -- ask firmware to configure any PCI bridges |
| located topologically under the pci slot. |
| eeh_save_bars() and eeh_restore_bars(): save and restore the PCI |
| config-space info for a device and any devices under it. |
| |
| |
| A handler for the EEH notifier_block events is implemented in |
| drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events(). |
| It saves the device BARs and then calls rpaphp_unconfig_pci_adapter(). |
| This last call causes the device driver for the card to be stopped, |
| which causes uevents to go out to user space. This triggers |
| user-space scripts that might issue commands such as "ifdown eth0" |
| for Ethernet cards, and so on. This handler then sleeps for 5 seconds, |
| hoping to give the user-space scripts enough time to complete. |
| It then resets the PCI card, reconfigures the device BARs, and |
| any bridges underneath. It then calls rpaphp_enable_pci_slot(), |
| which restarts the device driver and triggers more user-space |
| events (for example, calling "ifup eth0" for ethernet cards). |
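| |
| The ordering of the handler's steps can be summarized in a small |
| sketch. This is an outline of the sequence only, not the code of |
| handle_eeh_events(): the enum, struct recovery_ops, run_recovery() |
| and the stub callbacks are hypothetical, and stopping at the first |
| failing step is an assumption made for the illustration. |

```c
/* The steps, in the order the text above describes them. */
enum recovery_step {
	STEP_SAVE_BARS,
	STEP_UNCONFIG,		/* stops the driver, emits user-space events */
	STEP_WAIT,		/* pause so user-space scripts can finish */
	STEP_RESET,
	STEP_RESTORE_BARS,
	STEP_CONFIG_BRIDGES,
	STEP_ENABLE,		/* restarts the driver, emits more events */
	STEP_COUNT
};

/* Hypothetical hook table: one callback per step, 0 means success. */
struct recovery_ops {
	int (*step[STEP_COUNT])(void *slot);
};

/*
 * Run the steps in order and stop at the first failure. Returns
 * STEP_COUNT if every step succeeded, else the index of the step
 * that failed.
 */
static int run_recovery(const struct recovery_ops *ops, void *slot)
{
	for (int i = 0; i < STEP_COUNT; i++) {
		if (ops->step[i] && ops->step[i](slot) != 0)
			return i;
	}
	return STEP_COUNT;
}

/* Illustration stubs: count calls through 'slot', or always fail. */
static int step_ok(void *slot) { ++*(int *)slot; return 0; }
static int step_fail(void *slot) { (void)slot; return -1; }
```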
| |
| |
| Device Shutdown and User-Space Events |
| ------------------------------------- |
| This section documents what happens when a pci slot is unconfigured, |
| focusing on how the device driver gets shut down, and on how the |
| events get delivered to user-space scripts. |
| |
| Following is an example sequence of events that causes a device |
| driver's close function to be called during the first phase of an |
| EEH reset. The sequence uses the pcnet32 device driver as an example. |
| |
| rpaphp_unconfig_pci_adapter (struct slot *) // in rpaphp_pci.c |
| { |
|   calls |
|   pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c |
|   { |
|     calls |
|     pci_destroy_dev (struct pci_dev *) |
|     { |
|       calls |
|       device_unregister (&dev->dev) // in /drivers/base/core.c |
|       { |
|         calls |
|         device_del (struct device *) |
|         { |
|           calls |
|           bus_remove_device() // in /drivers/base/bus.c |
|           { |
|             calls |
|             device_release_driver() |
|             { |
|               calls |
|               struct device_driver->remove() which is just |
|               pci_device_remove() // in /drivers/pci/pci_driver.c |
|               { |
|                 calls |
|                 struct pci_driver->remove() which is just |
|                 pcnet32_remove_one() // in /drivers/net/pcnet32.c |
|                 { |
|                   calls |
|                   unregister_netdev() // in /net/core/dev.c |
|                   { |
|                     calls |
|                     dev_close() // in /net/core/dev.c |
|                     { |
|                       calls dev->stop(); |
|                       which is just pcnet32_close() // in pcnet32.c |
|                       { |
|                         which does what you wanted |
|                         to stop the device |
|                       } |
|                     } |
|                   } |
|                   and then frees pcnet32 device driver memory |
|                 } |
| }}}}}}}} |
| |
| |
| in drivers/pci/pci_driver.c, |
| struct device_driver->remove() is just pci_device_remove() |
| which calls struct pci_driver->remove() which is pcnet32_remove_one() |
| which calls unregister_netdev() (in net/core/dev.c) |
| which calls dev_close() (in net/core/dev.c) |
| which calls dev->stop() which is pcnet32_close() |
| which then does the appropriate shutdown. |
| |
| --- |
| Following is the analogous stack trace for events sent to user-space |
| when the pci device is unconfigured. |
| |
| rpaphp_unconfig_pci_adapter() { // in rpaphp_pci.c |
|   calls |
|   pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c |
|     calls |
|     pci_destroy_dev (struct pci_dev *) { |
|       calls |
|       device_unregister (&dev->dev) { // in /drivers/base/core.c |
|         calls |
|         device_del(struct device * dev) { // in /drivers/base/core.c |
|           calls |
|           kobject_del() { // in /lib/kobject.c |
|             calls |
|             kobject_uevent() { // in /lib/kobject.c |
|               calls |
|               kset_uevent() { // in /lib/kobject.c |
|                 calls |
|                 kset->uevent_ops->uevent() // which is really just |
|                 a call to |
|                 dev_uevent() { // in /drivers/base/core.c |
|                   calls |
|                   dev->bus->uevent() which is really just a call to |
|                   pci_uevent () { // in drivers/pci/hotplug.c |
|                     which prints device name, etc.... |
|                   } |
|                 } |
|               } |
|               then kobject_uevent() sends a netlink uevent to userspace |
|               --> userspace uevent |
|               (during early boot, nobody listens to netlink events and |
|               kobject_uevent() executes uevent_helper[], which runs the |
|               event process /sbin/hotplug) |
|             } |
|             kobject_del() then calls sysfs_remove_dir(), which would |
|             trigger any user-space daemon that was watching sysfs |
|             to notice the delete event. |
|           } |
| }}}}} |
| |
| |
| Pros and Cons of the Current Design |
| ----------------------------------- |
| There are several issues with the current EEH software recovery design, |
| which may be addressed in future revisions. But first, note that the |
| big plus of the current design is that no changes need to be made to |
| individual device drivers, so that the current design throws a wide net. |
| The biggest negative of the design is that it potentially disturbs |
| network daemons and file systems that didn't need to be disturbed. |
| |
| -- A minor complaint is that resetting the network card causes |
| user-space back-to-back ifdown/ifup burps that potentially disturb |
| network daemons that didn't even need to know that the PCI |
| card was being rebooted. |
| |
| -- A more serious concern is that the same reset, for SCSI devices, |
| causes havoc to mounted file systems. Scripts cannot post-facto |
| unmount a file system without flushing pending buffers, but this |
| is impossible, because I/O has already been stopped. Thus, |
| ideally, the reset should happen at or below the block layer, |
| so that the file systems are not disturbed. |
| |
| Reiserfs does not tolerate errors returned from the block device. |
| Ext3fs seems to be tolerant, retrying reads/writes until it does |
| succeed. Both have been only lightly tested in this scenario. |
| |
| The SCSI-generic subsystem already has built-in code for performing |
| SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter |
| (HBA) resets. These are cascaded into a chain of attempted |
| resets if a SCSI command fails. These are completely hidden |
| from the block layer. It would be very natural to add an EEH |
| reset into this chain of events. |
| |
| -- If a SCSI error occurs for the root device, all is lost unless |
| the sysadmin had the foresight to run /bin, /sbin, /etc, /var |
| and so on, out of ramdisk/tmpfs. |
| |
| |
| Conclusions |
| ----------- |
| There's forward progress ... |
| |
| |