Documentation/powerpc/eeh-pci-error-recovery.txt (kernel/msm-4.9)

PCI Bus EEH Error Recovery
--------------------------
Linas Vepstas
<linas@austin.ibm.com>
12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions. These features
go under the name of "EEH", for "Extended Error Handling". The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to data corruption (of both user data and kernel data),
hung/unresponsive adapters, or system crashes/lockups. Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards, or,
unfortunately quite commonly, due to device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card. Catching this is a powerful
feature, as it prevents what would otherwise have been silent memory
corruption caused by the bad DMA. A number of device driver
bugs have been found and fixed in this way over the past few
years. Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect
and recover from EEH errors will be presented. This is followed
by an overview of how the current implementation in the Linux
kernel does it. The actual implementation is subject to change,
and some of the finer points are still being debated. These
may in turn be swayed if or when other architectures implement
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card. Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device were physically unplugged from the slot.
Isolation applies to accesses to PCI memory, I/O space, and PCI
config space. Interrupts, however, will continue to be delivered.
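
As a rough illustration of the isolation behavior described above, the
following user-space sketch models what a driver would observe when it
accesses an isolated slot. The names slot_model, model_read() and
model_write() are hypothetical and exist only for this example; they are
not kernel interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot; not a kernel structure. */
struct slot_model {
    bool isolated;   /* true once the PHB has isolated the slot */
    uint32_t reg;    /* a single pretend device register */
};

/* Reads from an isolated slot return all ones for the access width. */
static uint32_t model_read(const struct slot_model *s, unsigned int width_bits)
{
    uint32_t mask = (width_bits >= 32) ? 0xffffffffu
                                       : ((1u << width_bits) - 1u);

    if (s->isolated)
        return mask;          /* 0xff, 0xffff or 0xffffffff */
    return s->reg & mask;
}

/* Writes to an isolated slot are silently dropped. */
static void model_write(struct slot_model *s, uint32_t val)
{
    if (!s->isolated)
        s->reg = val;
}
```

Note that a healthy device can also legitimately return all ones, which
is why an all-ff's read is only a hint and must be confirmed separately.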

Detection and recovery are performed with the aid of ppc64
firmware. The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services). The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card. Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on). This is followed by a
reinitialization of the device driver. In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots. In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.
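
The generic recovery procedure described above (check for isolation,
reset, restore config space, reinitialize the driver, give up after a few
attempts) can be sketched roughly as follows. This is a user-space model
under stated assumptions: struct recovery_ops, recover_slot(), the fake_*
helpers and the limit of three resets are all invented for illustration
and do not correspond to kernel or firmware interfaces.

```c
#include <stdbool.h>

#define MAX_RESETS 3   /* assumption: the text says "three or four" */

/* Hypothetical per-slot callbacks; none of these are kernel APIs. */
struct recovery_ops {
    bool (*is_isolated)(void *ctx);     /* ask firmware about the slot */
    void (*reset)(void *ctx);           /* assert and release PCI #RST */
    void (*restore_config)(void *ctx);  /* BARs, latency timer, ... */
    bool (*reinit_driver)(void *ctx);   /* true if the device works again */
};

/* Returns the number of resets used, or -1 if the card is presumed dead. */
static int recover_slot(const struct recovery_ops *ops, void *ctx)
{
    int attempt;

    if (!ops->is_isolated(ctx))
        return 0;                       /* nothing to recover */

    for (attempt = 1; attempt <= MAX_RESETS; attempt++) {
        ops->reset(ctx);
        ops->restore_config(ctx);
        if (ops->reinit_driver(ctx))
            return attempt;
    }
    return -1;                          /* caller reports to the sysadmin */
}

/* Fake card for demonstration: comes back after resets_needed resets. */
struct fake_card {
    int resets_seen;
    int resets_needed;
};

static bool fake_isolated(void *ctx) { (void)ctx; return true; }
static void fake_reset(void *ctx) { ((struct fake_card *)ctx)->resets_seen++; }
static void fake_restore(void *ctx) { (void)ctx; }
static bool fake_reinit(void *ctx)
{
    struct fake_card *c = ctx;
    return c->resets_seen >= c->resets_needed;
}

static const struct recovery_ops fake_ops = {
    .is_isolated = fake_isolated,
    .reset = fake_reset,
    .restore_config = fake_restore,
    .reinit_driver = fake_reinit,
};
```

A card that needs two resets would be reported as recovered on the second
attempt, while one that never comes back exhausts MAX_RESETS and is
treated as dead.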


Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery. This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure. Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and whenever a PCI slot is hot-plugged. The former is performed by
eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling into the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 can run with it disabled. Effectively,
EEH can no longer be turned off. PCI devices must be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error. Given an arbitrary address, the routine
pci_get_device_by_addr() will find the PCI device associated
with that address (if any).

The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
etc. include a check to see if the I/O read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error. If it is not, processing continues as normal. The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change). Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.
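
The check described above can be pictured with the following user-space
sketch. It is only a model: struct io_model, firmware_says_frozen() and
checked_readb() are hypothetical names made up for this example, and the
real macros and eeh_dn_check_failure() are considerably more involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real check lives in the kernel and firmware. */
struct io_model {
    uint8_t value;                  /* what the "device" returns on a read */
    bool slot_frozen;               /* what "firmware" would report */
    unsigned long false_positives;  /* all-ones reads that were not errors */
    unsigned long eeh_errors;       /* all-ones reads confirmed as errors */
};

/* Stand-in for the firmware query made on a suspicious read. */
static bool firmware_says_frozen(const struct io_model *m)
{
    return m->slot_frozen;
}

/* Model of a checked 8-bit read: only all-ones values trigger the query. */
static uint8_t checked_readb(struct io_model *m)
{
    uint8_t val = m->slot_frozen ? 0xff : m->value;

    if (val == 0xff) {
        if (firmware_says_frozen(m))
            m->eeh_errors++;
        else
            m->false_positives++;   /* legitimate 0xff data */
    }
    return val;
}
```

The point of this arrangement is that the comparatively expensive
firmware query is made only when a read returns all ones, so ordinary
reads pay just the cost of one comparison.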

If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
syslog (/var/log/messages). This stack trace has proven to be very
useful to device-driver authors for finding out at what point the EEH
error was detected, as the error itself usually occurs slightly
beforehand.

Next, the EEH code uses the Linux kernel notifier chain/work queue
mechanism to allow any interested parties to find out about the failure.
Device drivers, or other parts of the kernel, can use
eeh_register_notifier(struct notifier_block *) to find out about EEH
events. The event will include a pointer to the PCI device, the
device node and some state info. Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
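
To make the notifier idea concrete, here is a small user-space imitation
of a notifier chain. It is a sketch only: struct model_eeh_event,
struct model_notifier, model_register_notifier() and model_notify() are
hypothetical names, and the kernel's struct notifier_block and the actual
EEH event layout differ from what is shown here.

```c
#include <stddef.h>

/* Hypothetical event payload; the field names are illustrative only. */
struct model_eeh_event {
    const char *pci_name;   /* stands in for the PCI device pointer */
    int state;              /* stands in for the state info */
};

/* Simplified, user-space imitation of a notifier block. */
struct model_notifier {
    int (*call)(struct model_notifier *self, struct model_eeh_event *ev);
    struct model_notifier *next;
    int seen;               /* number of events this receiver has seen */
};

static struct model_notifier *chain_head;

/* Add a receiver to the front of the chain. */
static void model_register_notifier(struct model_notifier *nb)
{
    nb->next = chain_head;
    chain_head = nb;
}

/* Deliver one event to every registered receiver; returns the count. */
static int model_notify(struct model_eeh_event *ev)
{
    struct model_notifier *nb;
    int delivered = 0;

    for (nb = chain_head; nb != NULL; nb = nb->next) {
        nb->call(nb, ev);
        delivered++;
    }
    return delivered;
}

/* Example receiver: just counts the events it is handed. */
static int count_event(struct model_notifier *self, struct model_eeh_event *ev)
{
    (void)ev;
    self->seen++;
    return 0;
}
```

Each receiver is free to react as it sees fit; in this model the only
reaction is to bump a counter.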

To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
   located topologically under the PCI slot.
eeh_save_bars() and eeh_restore_bars() -- save and restore the PCI
   config-space info for a device and any devices under it.


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for Ethernet cards, and so on. This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs, and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for Ethernet cards).
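
The ordering of the handler's steps, as described above, is summarized in
the following sketch. It only records the order of operations; the enum
labels and the function model_handle_eeh_event() are made up for this
example and are not kernel symbols, and no real reset or delay is
performed.

```c
#include <stddef.h>

/* Step labels mirror the prose above; they are not kernel symbols. */
enum step {
    SAVE_BARS,           /* save the device BARs */
    UNCONFIG_ADAPTER,    /* stop the driver; uevents go to user space */
    WAIT_FOR_USERSPACE,  /* the 5-second sleep */
    RESET_SLOT,          /* reset the PCI card */
    RESTORE_BARS,        /* reconfigure the device BARs */
    CONFIGURE_BRIDGES,   /* reconfigure any bridges underneath */
    ENABLE_SLOT,         /* restart the driver; more user-space events */
    NUM_STEPS
};

struct trace {
    enum step steps[NUM_STEPS];
    size_t len;
};

static void record(struct trace *t, enum step s)
{
    if (t->len < NUM_STEPS)
        t->steps[t->len++] = s;
}

/* Walks through the recovery steps in the order given in the text. */
static void model_handle_eeh_event(struct trace *t)
{
    record(t, SAVE_BARS);
    record(t, UNCONFIG_ADAPTER);
    record(t, WAIT_FOR_USERSPACE);
    record(t, RESET_SLOT);
    record(t, RESTORE_BARS);
    record(t, CONFIGURE_BRIDGES);
    record(t, ENABLE_SLOT);
}
```

The BARs have to be saved before the adapter is unconfigured and restored
only after the reset, which is why those two steps bracket the sequence.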


Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a PCI slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of events that causes a device driver
close function to be called during the first phase of an EEH reset.
The sequence uses the pcnet32 device driver as an example.

    rpaphp_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in /drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in /drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in /drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in /net/core/dev.c
                      {
                        calls
                        dev_close()  // in /net/core/dev.c
                        {
                          calls dev->stop();
                          which is just pcnet32_close() // in pcnet32.c
                          {
                            which does what you wanted
                            to stop the device
                          }
                        }
                      }
                      which
                      frees pcnet32 device driver memory
                    }
                  }
                }
              }
            }
          }
        }
      }
    }


In drivers/pci/pci_driver.c,
struct device_driver->remove() is just pci_device_remove(),
which calls struct pci_driver->remove(), which is pcnet32_remove_one(),
which calls unregister_netdev() (in net/core/dev.c),
which calls dev_close() (in net/core/dev.c),
which calls dev->stop(), which is pcnet32_close(),
which then does the appropriate shutdown.
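
The chain above is essentially two levels of function-pointer dispatch:
a generic remove hook that forwards to a bus-specific one, which in turn
forwards to the driver's own remove function. The following user-space
sketch imitates that pattern. All model_* names are hypothetical and the
structures are heavily simplified; the real kernel definitions contain
many more fields and steps.

```c
#include <stddef.h>

struct model_device;

/* Generic driver: only the remove hook is modelled here. */
struct model_driver {
    int (*remove)(struct model_device *dev);
};

/* Bus-specific driver that embeds the generic one. */
struct model_pci_driver {
    struct model_driver driver;
    void (*remove)(struct model_device *dev);
};

struct model_device {
    struct model_driver *driver;
    int stopped;
};

/* Recover the enclosing structure from a pointer to its member. */
#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Plays the role of pci_device_remove(): forwards to the PCI driver. */
static int model_pci_device_remove(struct model_device *dev)
{
    struct model_pci_driver *pdrv =
        model_container_of(dev->driver, struct model_pci_driver, driver);

    if (pdrv->remove)
        pdrv->remove(dev);
    dev->driver = NULL;     /* the device is no longer bound */
    return 0;
}

/* Plays the role of a network driver's remove function. */
static void model_nic_remove(struct model_device *dev)
{
    dev->stopped = 1;       /* stands in for unregistering and closing */
}

/* Plays the role of device_release_driver(). */
static void model_release_driver(struct model_device *dev)
{
    if (dev->driver && dev->driver->remove)
        dev->driver->remove(dev);
}

static struct model_pci_driver model_nic_driver = {
    .driver = { .remove = model_pci_device_remove },
    .remove = model_nic_remove,
};
```

Because each layer only knows about the hook directly beneath it, the
hotplug code can shut down any driver without knowing what kind of device
it is dealing with.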

---
Following is the analogous stack trace for events sent to user space
when the PCI device is unconfigured.

    rpaphp_unconfig_pci_adapter() {              // in rpaphp_pci.c
      calls
      pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
        calls
        pci_destroy_dev (struct pci_dev *) {
          calls
          device_unregister (&dev->dev) {        // in /drivers/base/core.c
            calls
            device_del(struct device * dev) {    // in /drivers/base/core.c
              calls
              kobject_del() {                    // in /lib/kobject.c
                calls
                kobject_uevent() {               // in /lib/kobject.c
                  calls
                  kset_uevent() {                // in /lib/kobject.c
                    calls
                    kset->uevent_ops->uevent()   // which is really just
                    a call to
                    dev_uevent() {               // in /drivers/base/core.c
                      calls
                      dev->bus->uevent() which is really just a call to
                      pci_uevent () {            // in drivers/pci/hotplug.c
                        which prints device name, etc....
                      }
                    }
                  }
                  then kobject_uevent() sends a netlink uevent to userspace
                  --> userspace uevent
                  (during early boot, nobody listens to netlink events and
                  kobject_uevent() executes uevent_helper[], which runs the
                  event process /sbin/hotplug)
                }
                kobject_del() then calls sysfs_remove_dir(), which would
                trigger any user-space daemon that was watching /sys
                to notice the delete event.
              }
            }
          }
        }
      }
    }


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions. But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-- A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the PCI
   card was being rebooted.

-- A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems. Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped. Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed. Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets. These are cascaded into a chain of attempted
   resets if a SCSI command fails. These are completely hidden
   from the block layer. It would be very natural to add an EEH
   reset into this chain of events.
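
The idea of adding an EEH reset as one more rung in a cascaded chain of
resets can be sketched as follows. This is purely illustrative and is not
how the SCSI subsystem is implemented: the ladder, the fake reset
functions and the integer "severity" model are all assumptions invented
for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* One rung of a hypothetical escalation ladder. */
struct reset_level {
    const char *name;
    bool (*try_reset)(int fault_severity);
};

/* Each fake level cures faults up to a given, made-up severity. */
static bool device_reset(int sev)   { return sev <= 1; }
static bool bus_reset(int sev)      { return sev <= 2; }
static bool host_reset(int sev)     { return sev <= 3; }
static bool eeh_slot_reset(int sev) { return sev <= 4; }  /* proposed rung */

static const struct reset_level ladder[] = {
    { "device",   device_reset },
    { "bus",      bus_reset },
    { "host",     host_reset },
    { "eeh-slot", eeh_slot_reset },
};

/* Try each level in order; return the index that worked, or -1. */
static int escalate(int fault_severity)
{
    size_t i;

    for (i = 0; i < sizeof(ladder) / sizeof(ladder[0]); i++) {
        if (ladder[i].try_reset(fault_severity))
            return (int)i;
    }
    return -1;   /* every level failed; the error must be reported upward */
}
```

With the extra rung at the end of the ladder, a fault that the three
SCSI-level resets cannot clear still has a chance of being cured without
the block layer, or the file systems above it, ever seeing an error.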

-- If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.


Conclusions
-----------
There's forward progress ...