Last reviewed: 06/02/2009

HP iLO2 NMI Watchdog Driver
NMI sourcing for iLO2 based ProLiant Servers
Documentation and Driver by
Thomas Mingarelli <thomas.mingarelli@hp.com>

The HP iLO2 NMI Watchdog driver is a kernel module that provides basic
watchdog functionality and the added benefit of NMI sourcing. Both the
watchdog functionality and the NMI sourcing capability need to be enabled
by the user. Remember that the two modes are not dependent on one another.
A user can have the NMI sourcing without the watchdog timer and vice versa.

Watchdog functionality is enabled like any other common watchdog driver. That
is, an application needs to be started that kicks off the watchdog timer. A
basic application exists in the Documentation/watchdog/src directory called
watchdog-test.c. Simply compile the C file and kick it off. If the system
gets into a bad state and hangs, the HP ProLiant iLO 2 timer register will
not be updated in a timely fashion and a hardware system reset (also known as
an Automatic Server Recovery (ASR)) event will occur.
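
As a rough illustration of what such an application does, the sketch below
kicks the timer by writing to an already-open watchdog descriptor. It is not
watchdog-test.c; the function names wdt_kick() and wdt_kick_loop() are made up
for this example, and the kick count and interval are assumptions supplied by
the caller.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Kick the watchdog once by writing a single byte to an already-open
 * descriptor (normally /dev/watchdog opened with O_WRONLY).
 * Returns 0 on success, -1 on error.
 */
int wdt_kick(int fd)
{
	ssize_t n;

	/* Retry if the write was interrupted by a signal. */
	do {
		n = write(fd, "\0", 1);
	} while (n < 0 && errno == EINTR);

	return n == 1 ? 0 : -1;
}

/*
 * Kick the watchdog `count` times, sleeping `interval_s` seconds between
 * kicks. A real daemon would loop for as long as the system is healthy;
 * the bounded count here only keeps the sketch easy to try out.
 */
int wdt_kick_loop(int fd, unsigned int count, unsigned int interval_s)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		if (wdt_kick(fd) != 0)
			return -1;
		if (i + 1 < count)
			sleep(interval_s);
	}

	return 0;
}
```

A caller would typically open /dev/watchdog with O_WRONLY and pass the
descriptor in. Opening the device normally starts the timer, so the kick
interval has to be shorter than the configured timeout.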

The hpwdt driver also has three (3) module parameters. They are the following:

 soft_margin - allows the user to set the watchdog timer value
 allow_kdump - allows the user to save off a kernel dump image after an NMI
 nowayout    - standard watchdog parameter: once the timer has been started
               it cannot be stopped, so an impending ASR cannot be escaped

NOTE: More information about watchdog drivers in general, including the ioctl
      interface to /dev/watchdog, can be found in
      Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt.
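
For example, the generic API described in watchdog-api.txt includes a
WDIOC_SETTIMEOUT ioctl. A minimal sketch of calling it follows;
wdt_set_timeout() is a hypothetical helper name, and whether and how a
particular driver honours the request is driver-specific.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Ask the watchdog behind `fd` to use a timeout of `seconds`.
 * Returns the timeout value reported back by the driver, or -1 if the
 * ioctl fails (for example because fd is not a watchdog device).
 */
int wdt_set_timeout(int fd, int seconds)
{
	int timeout = seconds;

	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) < 0)
		return -1;

	return timeout;
}
```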

The NMI sourcing capability is disabled when the driver discovers that the
nmi_watchdog is turned on (nmi_watchdog=1). This is due to the inability to
distinguish between "NMI Watchdog Ticks" and "HW generated NMI events" in the
Linux kernel. In practice, the hpwdt NMI handler code is called each time the
NMI signal fires, which could amount to several thousand NMIs in a matter of
seconds. If a user sees the Linux kernel's "dazed and confused" message in the
logs, or if the system gets into a hung state, then the user should reboot
with nmi_watchdog=0.

 1. If the kernel has not been booted with nmi_watchdog turned off, then
    edit /boot/grub/menu.lst and place nmi_watchdog=0 at the end of the
    currently booting kernel line.
 2. Reboot the server.

Now the hpwdt driver can successfully receive and source the NMI and provide
a log message that details the reason for the NMI (as determined by the HP
BIOS).
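
One way to confirm that the running kernel was booted as described above is to
look for the parameter in the kernel command line (for instance, the contents
of /proc/cmdline). The helper below is only a sketch: nmi_watchdog_setting()
is a hypothetical name, it handles numeric values only, and it assumes that
the last occurrence on the command line is the one that takes effect.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look for "nmi_watchdog=<n>" tokens in a kernel command line string.
 * Returns the numeric value of the last such token, or -1 if no token
 * with a numeric value is present.
 */
int nmi_watchdog_setting(const char *cmdline)
{
	static const char key[] = "nmi_watchdog=";
	const size_t keylen = sizeof(key) - 1;
	const char *p = cmdline;
	int value = -1;

	while ((p = strstr(p, key)) != NULL) {
		/* Accept the match only at the start of a token. */
		if (p == cmdline || p[-1] == ' ' || p[-1] == '\t' ||
		    p[-1] == '\n') {
			char *end;
			long v = strtol(p + keylen, &end, 10);

			/* Ignore tokens with no digits after the '='. */
			if (end != p + keylen)
				value = (int)v;
		}
		p += keylen;
	}

	return value;
}
```

A return value of 0 means nmi_watchdog=0 was given, a positive value means the
nmi_watchdog was requested, and -1 means the command line does not set it
numerically.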

Below is a list of NMIs the HP BIOS understands along with the associated
code (reason):

 No source found                 00h
 Uncorrectable Memory Error      01h
 ASR NMI                         1Bh
 PCI Parity Error                20h
 NMI Button Press                27h
 SB_BUS_NMI                      28h
 ILO Doorbell NMI                29h
 ILO IOP NMI                     2Ah
 ILO Watchdog NMI                2Bh
 Proc Throt NMI                  2Ch
 Front Side Bus NMI              2Dh
 PCI Express Error               2Fh
 DMA controller NMI              30h
 Hypertransport/CSI Error        31h
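
The table above maps naturally onto a small lookup array, which can be handy
when decoding a reason code by hand. The following sketch is not taken from
the hpwdt driver; the struct and function names are hypothetical, and only the
codes and descriptions listed above are used.

```c
#include <stddef.h>

/* One row of the NMI reason table: BIOS code and its description. */
struct hp_nmi_reason {
	unsigned char code;
	const char *name;
};

static const struct hp_nmi_reason hp_nmi_reasons[] = {
	{ 0x00, "No source found" },
	{ 0x01, "Uncorrectable Memory Error" },
	{ 0x1B, "ASR NMI" },
	{ 0x20, "PCI Parity Error" },
	{ 0x27, "NMI Button Press" },
	{ 0x28, "SB_BUS_NMI" },
	{ 0x29, "ILO Doorbell NMI" },
	{ 0x2A, "ILO IOP NMI" },
	{ 0x2B, "ILO Watchdog NMI" },
	{ 0x2C, "Proc Throt NMI" },
	{ 0x2D, "Front Side Bus NMI" },
	{ 0x2F, "PCI Express Error" },
	{ 0x30, "DMA controller NMI" },
	{ 0x31, "Hypertransport/CSI Error" },
};

/*
 * Return the description for a reason code, or NULL if the code is not
 * in the table.
 */
const char *hp_nmi_reason_name(unsigned char code)
{
	size_t i;

	for (i = 0; i < sizeof(hp_nmi_reasons) / sizeof(hp_nmi_reasons[0]); i++) {
		if (hp_nmi_reasons[i].code == code)
			return hp_nmi_reasons[i].name;
	}

	return NULL;
}
```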

 -- Tom Mingarelli
    (thomas.mingarelli@hp.com)