Last reviewed: 05/20/2016

                     HPE iLO NMI Watchdog Driver
              NMI sourcing for iLO based ProLiant Servers
                     Documentation and Driver by
                          Thomas Mingarelli

The HPE iLO NMI Watchdog driver is a kernel module that provides basic
watchdog functionality and the added benefit of NMI sourcing. Both the
watchdog functionality and the NMI sourcing capability need to be enabled
by the user. The two modes are independent of one another: a user can have
NMI sourcing without the watchdog timer and vice versa. All references to
iLO in this document also apply to iLO2 and all subsequent generations.

Watchdog functionality is enabled like any other common watchdog driver. That
is, an application needs to be started that kicks off the watchdog timer. A
basic application named watchdog-test.c exists in
tools/testing/selftests/watchdog/. Simply compile the C file and run it. If
the system gets into a bad state and hangs, the HPE ProLiant iLO timer
register will not be updated in a timely fashion and a hardware system reset,
also known as an Automatic Server Recovery (ASR) event, will occur.

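As a rough illustration of what such an application does, the sketch below
opens a watchdog device node and kicks the timer through the generic
WDIOC_KEEPALIVE ioctl from <linux/watchdog.h>. It is not the selftest program
itself; the helper names wd_open, wd_kick and wd_kick_loop are hypothetical
and exist only for this example.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Open a watchdog device node such as /dev/watchdog. Opening the node
 * starts the timer. Returns the file descriptor, or -1 on failure.
 * (Hypothetical helper, not part of the driver or the selftest.)
 */
int wd_open(const char *path)
{
	return open(path, O_WRONLY);
}

/* Kick the timer once. Returns 0 on success, -1 on failure. */
int wd_kick(int fd)
{
	return ioctl(fd, WDIOC_KEEPALIVE, 0) < 0 ? -1 : 0;
}

/*
 * Kick the timer every `interval` seconds. A negative `count` loops
 * forever; otherwise the loop ends after `count` successful kicks.
 * Returns 0 when the count is exhausted, -1 as soon as a kick fails.
 */
int wd_kick_loop(int fd, unsigned int interval, long count)
{
	while (count != 0) {
		if (wd_kick(fd) < 0)
			return -1;
		sleep(interval);
		if (count > 0)
			count--;
	}
	return 0;
}
```

A caller would typically run wd_kick_loop(wd_open("/dev/watchdog"), 10, -1)
with an interval comfortably shorter than the configured timeout. If the
process stops kicking, the timer expires and the ASR described above follows.
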
The hpwdt driver also has three (3) module parameters. They are the following:

soft_margin - allows the user to set the watchdog timer value.
              Default value is 30 seconds.
allow_kdump - allows the user to save off a kernel dump image after an NMI.
              Default value is 1/ON.
nowayout    - basic watchdog parameter that does not allow the timer to
              be stopped or an impending ASR to be escaped.
              Default value is set when compiling the kernel. If it is set
              to "Y", then there is no way of disabling the watchdog once
              it has been started.

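To make the effect of nowayout more concrete, the sketch below shows the
"magic close" convention described in the generic watchdog API documentation:
an application writes the character 'V' just before closing the device to ask
the driver to stop the timer. This assumes the driver honours that convention,
and the request is expected to be ignored when nowayout is set. The helper
name wd_magic_close is hypothetical.

```c
#include <unistd.h>

/*
 * Write the magic character 'V' and then close the descriptor.
 * Returns 0 on success, -1 if either the write or the close fails.
 * (Hypothetical helper for illustration only.)
 */
int wd_magic_close(int fd)
{
	const char magic = 'V';
	int ret = 0;

	if (write(fd, &magic, 1) != 1)
		ret = -1;
	if (close(fd) < 0)
		ret = -1;
	return ret;
}
```

Closing the device without the magic character leaves the timer running,
which is useful when an unexpected exit of the application should still lead
to a reset.
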
NOTE: More information about watchdog drivers in general, including the ioctl
      interface to /dev/watchdog, can be found in
      Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt.

The NMI sourcing capability is disabled by default due to the inability to
distinguish between "NMI Watchdog Ticks" and "HW generated NMI events" in the
Linux kernel. What this means is that the hpwdt NMI handler code is called
each time the NMI signal fires off. This could amount to several thousand
NMIs in a matter of seconds. If a user sees the Linux kernel's "dazed and
confused" message in the logs or if the system gets into a hung state, then
the hpwdt driver can be reloaded as follows:

1. If the kernel has not been booted with nmi_watchdog turned off, then
   edit the boot loader configuration and place nmi_watchdog=0 at the end
   of the currently booting kernel line. Depending on your Linux
   distribution and platform setup, the file to edit is:
   For non-UEFI systems
      /boot/grub/grub.conf or
      /boot/grub/menu.lst
   For UEFI systems
      /boot/efi/EFI/distroname/grub.conf or
      /boot/efi/efi/distroname/elilo.conf
2. Reboot the server.
3. Once the system comes up, perform a modprobe -r hpwdt
4. insmod /lib/modules/`uname -r`/kernel/drivers/watchdog/hpwdt.ko

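Before reloading the driver it can help to confirm that the running kernel
was really booted with nmi_watchdog=0. The sketch below checks a command line
string for that exact token and applies the check to /proc/cmdline. Both
function names are hypothetical, and the check is deliberately simple: it does
not account for a later nmi_watchdog option overriding an earlier one.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the whitespace-separated command line contains the exact
 * token "nmi_watchdog=0", 0 otherwise.
 */
int cmdline_has_nmi_watchdog_off(const char *cmdline)
{
	static const char token[] = "nmi_watchdog=0";
	const size_t len = sizeof(token) - 1;
	const char *p = cmdline;

	while ((p = strstr(p, token)) != NULL) {
		/* The match must begin and end on a token boundary. */
		int starts = (p == cmdline) || p[-1] == ' ' || p[-1] == '\t';
		int ends = p[len] == '\0' || p[len] == ' ' ||
			   p[len] == '\t' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p += len;
	}
	return 0;
}

/*
 * Apply the check to the running kernel. Returns 1 or 0 as above,
 * or -1 if /proc/cmdline cannot be read.
 */
int running_kernel_has_nmi_watchdog_off(void)
{
	char buf[4096];
	FILE *f = fopen("/proc/cmdline", "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return cmdline_has_nmi_watchdog_off(buf);
}
```
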
Now, the hpwdt can successfully receive and source the NMI and provide a log
message that details the reason for the NMI (as determined by the HPE BIOS).

Below is a list of NMIs the HPE BIOS understands along with the associated
code (reason):

        No source found                00h
        Uncorrectable Memory Error     01h
        ASR NMI                        1Bh
        PCI Parity Error               20h
        NMI Button Press               27h
        SB_BUS_NMI                     28h
        ILO Doorbell NMI               29h
        ILO IOP NMI                    2Ah
        ILO Watchdog NMI               2Bh
        Proc Throt NMI                 2Ch
        Front Side Bus NMI             2Dh
        PCI Express Error              2Fh
        DMA controller NMI             30h
        Hypertransport/CSI Error       31h

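For tooling that post-processes log messages, the list above can be expressed
as a small lookup table. The sketch below is only an illustration built from
the codes in this document; the structure and function names are hypothetical
and are not taken from the driver source.

```c
#include <stddef.h>

/* One entry of the NMI reason list from this document. */
struct nmi_reason {
	unsigned char code;
	const char *text;
};

static const struct nmi_reason nmi_reasons[] = {
	{ 0x00, "No source found" },
	{ 0x01, "Uncorrectable Memory Error" },
	{ 0x1B, "ASR NMI" },
	{ 0x20, "PCI Parity Error" },
	{ 0x27, "NMI Button Press" },
	{ 0x28, "SB_BUS_NMI" },
	{ 0x29, "ILO Doorbell NMI" },
	{ 0x2A, "ILO IOP NMI" },
	{ 0x2B, "ILO Watchdog NMI" },
	{ 0x2C, "Proc Throt NMI" },
	{ 0x2D, "Front Side Bus NMI" },
	{ 0x2F, "PCI Express Error" },
	{ 0x30, "DMA controller NMI" },
	{ 0x31, "Hypertransport/CSI Error" },
};

/* Return the description for a reason code, or NULL if it is not listed. */
const char *nmi_reason_text(unsigned char code)
{
	size_t i;

	for (i = 0; i < sizeof(nmi_reasons) / sizeof(nmi_reasons[0]); i++) {
		if (nmi_reasons[i].code == code)
			return nmi_reasons[i].text;
	}
	return NULL;
}
```

Codes that do not appear in the list, such as 2Eh, yield NULL so that a caller
can report them as unknown rather than guess.
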
-- Tom Mingarelli