Thomas Mingarelli | 47bece8 | 2009-06-04 19:50:45 +0000 | [diff] [blame] | 1 | Last reviewed: 06/02/2009 |
| 2 | |
| 3 | HP iLO2 NMI Watchdog Driver |
| 4 | NMI sourcing for iLO2 based ProLiant Servers |
| 5 | Documentation and Driver by |
| 6 | Thomas Mingarelli <thomas.mingarelli@hp.com> |
| 7 | |
| 8 | The HP iLO2 NMI Watchdog driver is a kernel module that provides basic |
| 9 | watchdog functionality and the added benefit of NMI sourcing. Both the |
| 10 | watchdog functionality and the NMI sourcing capability need to be enabled |
Lucas De Marchi | 25985ed | 2011-03-30 22:57:33 -0300 | [diff] [blame] | 11 | by the user. Remember that the two modes are not dependent on one another. |
Thomas Mingarelli | 47bece8 | 2009-06-04 19:50:45 +0000 | [diff] [blame] | 12 | A user can have the NMI sourcing without the watchdog timer and vice-versa. |
| 13 | |
| 14 | Watchdog functionality is enabled like any other common watchdog driver. That |
| 15 | is, an application needs to be started that kicks off the watchdog timer. A |
| 16 | basic application exists in the Documentation/watchdog/src directory called |
| 17 | watchdog-test.c. Simply compile the C file and kick it off. If the system |
| 18 | gets into a bad state and hangs, the HP ProLiant iLO 2 timer register will |
| 19 | not be updated in a timely fashion and a hardware system reset, also known as |
| 20 | an Automatic Server Recovery (ASR) event, will occur. |
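
As a rough illustration of what such an application does, the Python sketch below opens the watchdog device and writes to it at a fixed interval. It is not the watchdog-test.c program itself. The function name pet_watchdog and its parameters are made up for this example, and the final 'V' write assumes the driver honors the Magic Close convention described in Documentation/watchdog/watchdog-api.txt.

```python
import os
import time


def pet_watchdog(path="/dev/watchdog", interval=10.0, count=3, disarm=True):
    """Write to the watchdog device `count` times and return the number of kicks.

    All names and defaults here are illustrative, not part of hpwdt.
    """
    fd = os.open(path, os.O_WRONLY)
    kicks = 0
    try:
        for i in range(count):
            # Any write to the device counts as a keepalive.
            os.write(fd, b"\0")
            kicks += 1
            if i < count - 1:
                time.sleep(interval)
        if disarm:
            # Magic Close (assumption: supported and nowayout is not set):
            # writing 'V' just before close asks the driver to stop the timer.
            os.write(fd, b"V")
    finally:
        os.close(fd)
    return kicks
```

Under the generic watchdog API, opening /dev/watchdog arms the timer, so a call such as pet_watchdog(interval=10.0, count=6) should only be tried as root on a machine that may be reset. The interval has to be shorter than the timer value set through soft_margin.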
| 21 | |
Tom Mingarelli | 44df753 | 2009-06-18 23:28:57 +0000 | [diff] [blame] | 22 | The hpwdt driver also has four (4) module parameters. They are the following: |
Thomas Mingarelli | 47bece8 | 2009-06-04 19:50:45 +0000 | [diff] [blame] | 23 | |
| 24 | soft_margin - allows the user to set the watchdog timer value |
| 25 | allow_kdump - allows the user to save off a kernel dump image after an NMI |
| 26 | nowayout - basic watchdog parameter that does not allow the timer to |
| 27 | be stopped once started, so an impending ASR cannot be escaped. |
Tom Mingarelli | 44df753 | 2009-06-18 23:28:57 +0000 | [diff] [blame] | 28 | priority - determines whether or not the hpwdt driver is first on the |
| 29 | die_notify list to handle NMIs or last. The default value |
| 30 | for this module parameter is 0 or LAST. If the user wants to |
| 31 | enable NMI sourcing then reload the hpwdt driver with |
| 32 | priority=1 (and boot with nmi_watchdog=0). |
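
The parameters above are passed as name=value pairs when the module is loaded. The sketch below only builds such a command line for modprobe. The helper name hpwdt_modprobe_args is hypothetical, and the values used in the usage note are arbitrary examples rather than recommended settings.

```python
# The four parameter names listed in this document.
HPWDT_PARAMS = ("soft_margin", "allow_kdump", "nowayout", "priority")


def hpwdt_modprobe_args(**params):
    """Return an argument list such as ['modprobe', 'hpwdt', 'priority=1'].

    Hypothetical helper: it rejects names that are not in HPWDT_PARAMS and
    does not check whether a value is meaningful to the driver.
    """
    for name in params:
        if name not in HPWDT_PARAMS:
            raise ValueError(f"unknown hpwdt parameter: {name}")
    args = ["modprobe", "hpwdt"]
    # Emit parameters in a fixed order so the result is predictable.
    for name in HPWDT_PARAMS:
        if name in params:
            args.append(f"{name}={int(params[name])}")
    return args
```

For example, hpwdt_modprobe_args(priority=1, soft_margin=60) yields a list that can be handed to subprocess.run(args, check=True) by a root user once any previously loaded hpwdt module has been removed.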
Thomas Mingarelli | 47bece8 | 2009-06-04 19:50:45 +0000 | [diff] [blame] | 33 | |
| 34 | NOTE: More information about watchdog drivers in general, including the ioctl |
| 35 | interface to /dev/watchdog, can be found in |
| 36 | Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt. |
| 37 | |
Tom Mingarelli | 44df753 | 2009-06-18 23:28:57 +0000 | [diff] [blame] | 38 | The priority parameter was introduced because other kernel software (such as |
| 39 | oprofile) relies on handling NMIs. Keeping hpwdt's priority at 0 (or LAST) |
| 40 | lets those users of NMIs for non-critical events work as expected. |
| 41 | |
| 42 | The NMI sourcing capability is disabled by default due to the inability to |
Thomas Mingarelli | 47bece8 | 2009-06-04 19:50:45 +0000 | [diff] [blame] | 43 | distinguish between "NMI Watchdog Ticks" and "HW generated NMI events" in the |
| 44 | Linux kernel. What this means is that the hpwdt NMI handler code is called |
| 45 | each time the NMI signal fires off. This could amount to several thousand |
| 46 | NMIs in a matter of seconds. If a user sees the Linux kernel's "dazed and |
| 47 | confused" message in the logs or if the system gets into a hung state, then |
Tom Mingarelli | 44df753 | 2009-06-18 23:28:57 +0000 | [diff] [blame] | 48 | the hpwdt driver can be reloaded with the "priority" module parameter set |
| 49 | (priority=1). |
Thomas Mingarelli | 47bece8 | 2009-06-04 19:50:45 +0000 | [diff] [blame] | 50 | |
| 51 | 1. If the kernel has not been booted with nmi_watchdog turned off, then |
| 52 | edit /boot/grub/menu.lst and place nmi_watchdog=0 at the end of the |
| 53 | currently booting kernel line. |
| 54 | 2. Reboot the server. |
Tom Mingarelli | 44df753 | 2009-06-18 23:28:57 +0000 | [diff] [blame] | 55 | 3. Once the system comes up, run rmmod hpwdt. |
| 56 | 4. insmod /lib/modules/`uname -r`/kernel/drivers/char/watchdog/hpwdt.ko priority=1 |
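
Before reloading the driver with priority=1, it can help to confirm that the running kernel really was booted with nmi_watchdog=0. The sketch below checks a kernel command line string for that option. The function names are invented for this example, and treating the last occurrence as the effective one is an assumption of the sketch.

```python
def nmi_watchdog_disabled(cmdline):
    """Return True if the command line explicitly sets nmi_watchdog=0.

    If the option appears more than once, this sketch looks at the last
    occurrence (an assumption). A missing option returns False.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            value = token.split("=", 1)[1]
    return value == "0"


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel."""
    with open(path) as f:
        return f.read()
```

On a running system, nmi_watchdog_disabled(read_cmdline()) returning False suggests that steps 1 and 2 still need to be carried out.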
Thomas Mingarelli | 47bece8 | 2009-06-04 19:50:45 +0000 | [diff] [blame] | 57 | |
| 58 | Now, the hpwdt driver can successfully receive and source the NMI and provide a log |
| 59 | message that details the reason for the NMI (as determined by the HP BIOS). |
| 60 | |
| 61 | Below is a list of NMIs the HP BIOS understands along with the associated |
| 62 | code (reason): |
| 63 | |
| 64 | No source found 00h |
| 65 | |
| 66 | Uncorrectable Memory Error 01h |
| 67 | |
| 68 | ASR NMI 1Bh |
| 69 | |
| 70 | PCI Parity Error 20h |
| 71 | |
| 72 | NMI Button Press 27h |
| 73 | |
| 74 | SB_BUS_NMI 28h |
| 75 | |
| 76 | ILO Doorbell NMI 29h |
| 77 | |
| 78 | ILO IOP NMI 2Ah |
| 79 | |
| 80 | ILO Watchdog NMI 2Bh |
| 81 | |
| 82 | Proc Throt NMI 2Ch |
| 83 | |
| 84 | Front Side Bus NMI 2Dh |
| 85 | |
| 86 | PCI Express Error 2Fh |
| 87 | |
| 88 | DMA controller NMI 30h |
| 89 | |
| 90 | Hypertransport/CSI Error 31h |
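
When reading logs, the table above can be kept at hand as a small lookup. The sketch below copies the codes and names from the list. The exact format of the driver's log message is not shown in this document, so the describe_nmi helper and the "Unknown NMI source" label are purely illustrative.

```python
# NMI reason codes understood by the HP BIOS, as listed above.
HPWDT_NMI_REASONS = {
    0x00: "No source found",
    0x01: "Uncorrectable Memory Error",
    0x1B: "ASR NMI",
    0x20: "PCI Parity Error",
    0x27: "NMI Button Press",
    0x28: "SB_BUS_NMI",
    0x29: "ILO Doorbell NMI",
    0x2A: "ILO IOP NMI",
    0x2B: "ILO Watchdog NMI",
    0x2C: "Proc Throt NMI",
    0x2D: "Front Side Bus NMI",
    0x2F: "PCI Express Error",
    0x30: "DMA controller NMI",
    0x31: "Hypertransport/CSI Error",
}


def describe_nmi(code):
    """Return the reason name followed by the code in the table's NNh notation."""
    # Codes missing from the table get a placeholder label (illustrative only).
    name = HPWDT_NMI_REASONS.get(code, "Unknown NMI source")
    return f"{name} ({code:02X}h)"
```

For instance, describe_nmi(0x27) gives "NMI Button Press (27h)".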
| 91 | |
| 92 | |
| 93 | |
| 94 | -- Tom Mingarelli |
| 95 | (thomas.mingarelli@hp.com) |