| The Linux Watchdog driver API. |
| |
| Copyright 2002 Christer Weingel <wingel@nano-system.com> |
| |
| Some parts of this document are copied verbatim from the sbc60xxwdt |
| driver which is (c) Copyright 2000 Jakob Oestergaard <jakob@ostenfeld.dk> |
| |
| This document describes the state of the Linux 2.4.18 kernel. |
| |
| Introduction: |
| |
| A Watchdog Timer (WDT) is a hardware circuit that can reset the |
| computer system in case of a software fault. You probably knew that |
| already. |
| |
| Usually a userspace daemon will notify the kernel watchdog driver via the |
| /dev/watchdog special device file that userspace is still alive, at |
| regular intervals. When such a notification occurs, the driver will |
| usually tell the hardware watchdog that everything is in order, and |
| that the watchdog should wait for yet another little while to reset |
| the system. If userspace fails (RAM error, kernel bug, whatever), the |
| notifications cease to occur, and the hardware watchdog will reset the |
| system (causing a reboot) after the timeout occurs. |
| |
| The Linux watchdog API is a rather AD hoc construction and different |
| drivers implement different, and sometimes incompatible, parts of it. |
| This file is an attempt to document the existing usage and allow |
| future driver writers to use it as a reference. |
| |
| The simplest API: |
| |
| All drivers support the basic mode of operation, where the watchdog |
| activates as soon as /dev/watchdog is opened and will reboot unless |
| the watchdog is pinged within a certain time, this time is called the |
| timeout or margin. The simplest way to ping the watchdog is to write |
| some data to the device. So a very simple watchdog daemon would look |
| like this: |
| |
| int main(int argc, const char *argv[]) { |
| int fd=open("/dev/watchdog",O_WRONLY); |
| if (fd==-1) { |
| perror("watchdog"); |
| exit(1); |
| } |
| while(1) { |
| write(fd, "\0", 1); |
| sleep(10); |
| } |
| } |
| |
| A more advanced driver could for example check that a HTTP server is |
| still responding before doing the write call to ping the watchdog. |
| |
| When the device is closed, the watchdog is disabled. This is not |
| always such a good idea, since if there is a bug in the watchdog |
| daemon and it crashes the system will not reboot. Because of this, |
| some of the drivers support the configuration option "Disable watchdog |
| shutdown on close", CONFIG_WATCHDOG_NOWAYOUT. If it is set to Y when |
| compiling the kernel, there is no way of disabling the watchdog once |
| it has been started. So, if the watchdog dameon crashes, the system |
| will reboot after the timeout has passed. |
| |
| Some other drivers will not disable the watchdog, unless a specific |
| magic character 'V' has been sent /dev/watchdog just before closing |
| the file. If the userspace daemon closes the file without sending |
| this special character, the driver will assume that the daemon (and |
| userspace in general) died, and will stop pinging the watchdog without |
| disabling it first. This will then cause a reboot. |
| |
| The ioctl API: |
| |
| All conforming drivers also support an ioctl API. |
| |
| Pinging the watchdog using an ioctl: |
| |
| All drivers that have an ioctl interface support at least one ioctl, |
| KEEPALIVE. This ioctl does exactly the same thing as a write to the |
| watchdog device, so the main loop in the above program could be |
| replaced with: |
| |
| while (1) { |
| ioctl(fd, WDIOC_KEEPALIVE, 0); |
| sleep(10); |
| } |
| |
| the argument to the ioctl is ignored. |
| |
| Setting and getting the timeout: |
| |
| For some drivers it is possible to modify the watchdog timeout on the |
| fly with the SETTIMEOUT ioctl, those drivers have the WDIOF_SETTIMEOUT |
| flag set in their option field. The argument is an integer |
| representing the timeout in seconds. The driver returns the real |
| timeout used in the same variable, and this timeout might differ from |
| the requested one due to limitation of the hardware. |
| |
| int timeout = 45; |
| ioctl(fd, WDIOC_SETTIMEOUT, &timeout); |
| printf("The timeout was set to %d seconds\n", timeout); |
| |
| This example might actually print "The timeout was set to 60 seconds" |
| if the device has a granularity of minutes for its timeout. |
| |
| Starting with the Linux 2.4.18 kernel, it is possible to query the |
| current timeout using the GETTIMEOUT ioctl. |
| |
| ioctl(fd, WDIOC_GETTIMEOUT, &timeout); |
| printf("The timeout was is %d seconds\n", timeout); |
| |
| Envinronmental monitoring: |
| |
| All watchdog drivers are required return more information about the system, |
| some do temperature, fan and power level monitoring, some can tell you |
| the reason for the last reboot of the system. The GETSUPPORT ioctl is |
| available to ask what the device can do: |
| |
| struct watchdog_info ident; |
| ioctl(fd, WDIOC_GETSUPPORT, &ident); |
| |
| the fields returned in the ident struct are: |
| |
| identity a string identifying the watchdog driver |
| firmware_version the firmware version of the card if available |
| options a flags describing what the device supports |
| |
| the options field can have the following bits set, and describes what |
| kind of information that the GET_STATUS and GET_BOOT_STATUS ioctls can |
| return. [FIXME -- Is this correct?] |
| |
| WDIOF_OVERHEAT Reset due to CPU overheat |
| |
| The machine was last rebooted by the watchdog because the thermal limit was |
| exceeded |
| |
| WDIOF_FANFAULT Fan failed |
| |
| A system fan monitored by the watchdog card has failed |
| |
| WDIOF_EXTERN1 External relay 1 |
| |
| External monitoring relay/source 1 was triggered. Controllers intended for |
| real world applications include external monitoring pins that will trigger |
| a reset. |
| |
| WDIOF_EXTERN2 External relay 2 |
| |
| External monitoring relay/source 2 was triggered |
| |
| WDIOF_POWERUNDER Power bad/power fault |
| |
| The machine is showing an undervoltage status |
| |
| WDIOF_CARDRESET Card previously reset the CPU |
| |
| The last reboot was caused by the watchdog card |
| |
| WDIOF_POWEROVER Power over voltage |
| |
| The machine is showing an overvoltage status. Note that if one level is |
| under and one over both bits will be set - this may seem odd but makes |
| sense. |
| |
| WDIOF_KEEPALIVEPING Keep alive ping reply |
| |
| The watchdog saw a keepalive ping since it was last queried. |
| |
| WDIOF_SETTIMEOUT Can set/get the timeout |
| |
| |
| For those drivers that return any bits set in the option field, the |
| GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current |
| status, and the status at the last reboot, respectively. |
| |
| int flags; |
| ioctl(fd, WDIOC_GETSTATUS, &flags); |
| |
| or |
| |
| ioctl(fd, WDIOC_GETBOOTSTATUS, &flags); |
| |
| Note that not all devices support these two calls, and some only |
| support the GETBOOTSTATUS call. |
| |
| Some drivers can measure the temperature using the GETTEMP ioctl. The |
| returned value is the temperature in degrees farenheit. |
| |
| int temperature; |
| ioctl(fd, WDIOC_GETTEMP, &temperature); |
| |
| Finally the SETOPTIONS ioctl can be used to control some aspects of |
| the cards operation; right now the pcwd driver is the only one |
| supporting thiss ioctl. |
| |
| int options = 0; |
| ioctl(fd, WDIOC_SETOPTIONS, options); |
| |
| The following options are available: |
| |
| WDIOS_DISABLECARD Turn off the watchdog timer |
| WDIOS_ENABLECARD Turn on the watchdog timer |
| WDIOS_TEMPPANIC Kernel panic on temperature trip |
| |
| [FIXME -- better explanations] |
| |
| Implementations in the current drivers in the kernel tree: |
| |
| Here I have tried to summarize what the different drivers support and |
| where they do strange things compared to the other drivers. |
| |
| acquirewdt.c -- Acquire Single Board Computer |
| |
| This driver has a hardcoded timeout of 1 minute |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns KEEPALIVEPING. GETSTATUS will return 1 if |
| the device is open, 0 if not. [FIXME -- isn't this rather |
| silly? To be able to use the ioctl, the device must be open |
| and so GETSTATUS will always return 1]. |
| |
| advantechwdt.c -- Advantech Single Board Computer |
| |
| Timeout that defaults to 60 seconds, supports SETTIMEOUT. |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns WDIOF_KEEPALIVEPING and WDIOF_SETTIMEOUT. |
| The GETSTATUS call returns if the device is open or not. |
| [FIXME -- silliness again?] |
| |
| eurotechwdt.c -- Eurotech CPU-1220/1410 |
| |
| The timeout can be set using the SETTIMEOUT ioctl and defaults |
| to 60 seconds. |
| |
| Also has a module parameter "ev", event type which controls |
| what should happen on a timeout, the string "int" or anything |
| else that causes a reboot. [FIXME -- better description] |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns CARDRESET and WDIOF_SETTIMEOUT but |
| GETSTATUS is not supported and GETBOOTSTATUS just returns 0. |
| |
| i810-tco.c -- Intel 810 chipset |
| |
| Also has support for a lot of other i8x0 stuff, but the |
| watchdog is one of the things. |
| |
| The timeout is set using the module parameter "i810_margin", |
| which is in steps of 0.6 seconds where 2<i810_margin<64. The |
| driver supports the SETTIMEOUT ioctl. |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT. |
| |
| GETSUPPORT returns WDIOF_SETTIMEOUT. The GETSTATUS call |
| returns some kind of timer value which ist not compatible with |
| the other drivers. GETBOOT status returns some kind of |
| hardware specific boot status. [FIXME -- describe this] |
| |
| ib700wdt.c -- IB700 Single Board Computer |
| |
| Default timeout of 30 seconds and the timeout is settable |
| using the SETTIMEOUT ioctl. Note that only a few timeout |
| values are supported. |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns WDIOF_KEEPALIVEPING and WDIOF_SETTIMEOUT. |
| The GETSTATUS call returns if the device is open or not. |
| [FIXME -- silliness again?] |
| |
| machzwd.c -- MachZ ZF-Logic |
| |
| Hardcoded timeout of 10 seconds |
| |
| Has a module parameter "action" that controls what happens |
| when the timeout runs out which can be 0 = RESET (default), |
| 1 = SMI, 2 = NMI, 3 = SCI. |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT and the magic character |
| 'V' close handling. |
| |
| GETSUPPORT returns WDIOF_KEEPALIVEPING, and the GETSTATUS call |
| returns if the device is open or not. [FIXME -- silliness |
| again?] |
| |
| mixcomwd.c -- MixCom Watchdog |
| |
| [FIXME -- I'm unable to tell what the timeout is] |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns WDIOF_KEEPALIVEPING, GETSTATUS returns if |
| the device is opened or not [FIXME -- I'm not really sure how |
| this works, there seems to be some magic connected to |
| CONFIG_WATCHDOG_NOWAYOUT] |
| |
| pcwd.c -- Berkshire PC Watchdog |
| |
| Hardcoded timeout of 1.5 seconds |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns WDIOF_OVERHEAT|WDIOF_CARDRESET and both |
| GETSTATUS and GETBOOTSTATUS return something useful. |
| |
| The SETOPTIONS call can be used to enable and disable the card |
| and to ask the driver to call panic if the system overheats. |
| |
| sbc60xxwdt.c -- 60xx Single Board Computer |
| |
| Hardcoded timeout of 10 seconds |
| |
| Does not support CONFIG_WATCHDOG_NOWAYOUT, but has the magic |
| character 'V' close handling. |
| |
| No bits set in GETSUPPORT |
| |
| scx200.c -- National SCx200 CPUs |
| |
| Not in the kernel yet. |
| |
| The timeout is set using a module parameter "margin" which |
| defaults to 60 seconds. The timeout can also be set using |
| SETTIMEOUT and read using GETTIMEOUT. |
| |
| Supports a module parameter "nowayout" that is initialized |
| with the value of CONFIG_WATCHDOG_NOWAYOUT. Also supports the |
| magic character 'V' handling. |
| |
| shwdt.c -- SuperH 3/4 processors |
| |
| [FIXME -- I'm unable to tell what the timeout is] |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns WDIOF_KEEPALIVEPING, and the GETSTATUS call |
| returns if the device is open or not. [FIXME -- silliness |
| again?] |
| |
| softdog.c -- Software watchdog |
| |
| The timeout is set with the module parameter "soft_margin" |
| which defaults to 60 seconds, the timeout is also settable |
| using the SETTIMEOUT ioctl. |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| WDIOF_SETTIMEOUT bit set in GETSUPPORT |
| |
| w83877f_wdt.c -- W83877F Computer |
| |
| Hardcoded timeout of 30 seconds |
| |
| Does not support CONFIG_WATCHDOG_NOWAYOUT, but has the magic |
| character 'V' close handling. |
| |
| No bits set in GETSUPPORT |
| |
| w83627hf_wdt.c -- w83627hf watchdog |
| |
| Timeout that defaults to 60 seconds, supports SETTIMEOUT. |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns WDIOF_KEEPALIVEPING and WDIOF_SETTIMEOUT. |
| The GETSTATUS call returns if the device is open or not. |
| |
| wdt.c -- ICS WDT500/501 ISA and |
| wdt_pci.c -- ICS WDT500/501 PCI |
| |
| Default timeout of 60 seconds. The timeout is also settable |
| using the SETTIMEOUT ioctl. |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| GETSUPPORT returns with bits set depending on the actual |
| card. The WDT501 supports a lot of external monitoring, the |
| WDT500 much less. |
| |
| wdt285.c -- Footbridge watchdog |
| |
| The timeout is set with the module parameter "soft_margin" |
| which defaults to 60 seconds. The timeout is also settable |
| using the SETTIMEOUT ioctl. |
| |
| Does not support CONFIG_WATCHDOG_NOWAYOUT |
| |
| WDIOF_SETTIMEOUT bit set in GETSUPPORT |
| |
| wdt977.c -- Netwinder W83977AF chip |
| |
| Hardcoded timeout of 3 minutes |
| |
| Supports CONFIG_WATCHDOG_NOWAYOUT |
| |
| Does not support any ioctls at all. |
| |