Andi Kleen | a98f0dd | 2007-02-13 13:26:23 +0100
Configurable sysfs parameters for the x86-64 machine check code.

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
exception (often followed by a panic); corrected errors cause a
machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents within a bank. The exact meaning
of the banks and subevents is CPU specific.

mcelog knows how to decode them.

When you see the "Machine check errors logged" message in the system
log, mcelog should be run to collect and decode the machine check
entries from /dev/mcelog. Normally mcelog should be run regularly
from a cron job.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

The directory contains some configurable entries:

Entries:

bankNctl
(N = bank number)
        64-bit hex bitmask enabling/disabling specific subevents for bank N.
        When a bit in the bitmask is zero, the respective subevent
        will not be reported.
        By default all events are enabled.
        Note that the BIOS maintains another mask to disable specific
        events per bank. That mask is not visible here.
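The bitmask semantics can be sketched in a few lines of Python. This is
only an illustration of the bit arithmetic described above; the helper
names are made up for this example, the choice of bit 3 is arbitrary, and
the exact string format the kernel accepts when writing bankNctl is not
covered here.

```python
MASK_64 = (1 << 64) - 1  # bankNctl holds a 64-bit mask


def parse_mask(text):
    """Parse the hex string read from a bankNctl file (0x prefix optional)."""
    return int(text.strip(), 16) & MASK_64


def disable_subevent(mask, bit):
    """Return the mask with one subevent bit cleared (subevent not reported)."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    return mask & ~(1 << bit) & MASK_64


def subevent_enabled(mask, bit):
    """True if the bit is set, i.e. the subevent is reported."""
    return bool((mask >> bit) & 1)


if __name__ == "__main__":
    mask = parse_mask("ffffffffffffffff")  # default: all subevents enabled
    mask = disable_subevent(mask, 3)       # bit 3 is a hypothetical choice
    print(format(mask, "x"))               # prints fffffffffffffff7
```

Which bit corresponds to which subevent is CPU specific, so a real mask
has to come from the CPU vendor's documentation.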

The following entries appear for each CPU, but they are actually shared
between all CPUs.

check_interval
        How often to poll for corrected machine check errors, in seconds
        (note that the output is hexadecimal). Default: 5 minutes. When the
        poller finds MCEs, it triggers an exponential speedup (poll more
        often) on the polling interval. When the poller stops finding MCEs,
        it triggers an exponential backoff (poll less often) on the polling
        interval. The check_interval variable is both the initial and
        maximum polling interval.
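The adaptive polling described for check_interval can be modelled roughly
as follows. This is a sketch under stated assumptions, not the kernel's
code: the factor of two and the one-second floor are assumptions made for
illustration, and the function names are hypothetical. Only the hex output
format, the 5 minute default, and the role of check_interval as initial
and maximum interval come from the text above.

```python
def parse_check_interval(text):
    """check_interval is reported in hexadecimal seconds."""
    return int(text.strip(), 16)


def next_interval(current, found_mce, check_interval, floor=1):
    """Compute the next polling interval in seconds.

    Assumption: speedup halves the interval and backoff doubles it.
    The interval never exceeds check_interval and never drops below floor.
    """
    if found_mce:
        return max(floor, current // 2)      # speedup: poll more often
    return min(check_interval, current * 2)  # backoff: poll less often


if __name__ == "__main__":
    limit = parse_check_interval("12c")  # 0x12c = 300 s, the 5 minute default
    interval = limit                     # the initial interval equals the maximum
    for found in (True, True, False, False, False):
        interval = next_interval(interval, found, limit)
        print(found, interval)
```

With these assumptions the interval goes 150, 75, 150, 300, 300 for the
sequence shown in the loop.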

tolerant
        Tolerance level. When a machine check exception occurs for an
        uncorrected machine check, the kernel can take different actions.
        Since machine check exceptions can happen at any time, it is
        sometimes risky for the kernel to kill a process because doing so
        defies normal kernel locking rules. The tolerance level configures
        how hard the kernel tries to recover, even at some risk of deadlock.

        0: always panic
        1: panic if deadlock possible
        2: try to avoid panic
        3: never panic or exit (for testing only)

        Default: 1

        Note that this only makes a difference if the CPU allows recovery
        from a machine check exception. Current x86 CPUs generally do not.
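As a small illustration, the tolerance levels can be kept in a lookup
table and matched against the value read from the tolerant entry. The
helper names below are hypothetical. The path layout follows the
directory and entry names given in this document, and the number base of
the file's output is not assumed, since levels 0 to 3 read the same in
decimal and hexadecimal.

```python
SYSFS_ROOT = "/sys/devices/system/machinecheck"

# Levels as documented for the tolerant entry.
TOLERANT_LEVELS = {
    0: "always panic",
    1: "panic if deadlock possible",
    2: "try to avoid panic",
    3: "never panic or exit (for testing only)",
}
DEFAULT_TOLERANT = 1


def entry_path(cpu, name):
    """Build the sysfs path of one machine check entry for a CPU."""
    return "%s/machinecheck%d/%s" % (SYSFS_ROOT, cpu, name)


def describe_tolerant(text):
    """Map the contents of the tolerant file to its documented meaning."""
    level = int(text.strip())
    if level not in TOLERANT_LEVELS:
        raise ValueError("unknown tolerance level: %d" % level)
    return TOLERANT_LEVELS[level]


if __name__ == "__main__":
    # The entry is shared between all CPUs, so reading CPU 0 is enough.
    with open(entry_path(0, "tolerant")) as f:
        print(describe_tolerant(f.read()))
```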

trigger
        Program to run when a machine check event is detected.
        This is an alternative to running mcelog regularly from cron
        and allows events to be detected faster.
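Setting the trigger amounts to writing a program path into the entry. The
sketch below shows one way to do that from Python. The handler path
/usr/local/sbin/mce-handler is a hypothetical example, the function names
are made up for illustration, and both the absolute-path check and the
trailing newline are assumptions of this sketch rather than documented
requirements. Writing to sysfs normally requires root.

```python
import os

SYSFS_ROOT = "/sys/devices/system/machinecheck"


def trigger_path(cpu=0, root=SYSFS_ROOT):
    """Path of the trigger entry for one CPU."""
    return os.path.join(root, "machinecheck%d" % cpu, "trigger")


def set_trigger(program, cpu=0, root=SYSFS_ROOT):
    """Write the program path into the trigger entry."""
    if not os.path.isabs(program):
        # Assumption of this sketch: only accept absolute paths.
        raise ValueError("program must be an absolute path")
    with open(trigger_path(cpu, root), "w") as f:
        f.write(program + "\n")


def get_trigger(cpu=0, root=SYSFS_ROOT):
    """Read back the currently configured trigger program."""
    with open(trigger_path(cpu, root)) as f:
        return f.read().strip()


if __name__ == "__main__":
    # The entry is shared between all CPUs, so writing it once is enough.
    set_trigger("/usr/local/sbin/mce-handler")  # hypothetical handler
    print(get_trigger())
```

The root parameter exists so the helpers can be pointed at a scratch
directory for testing without touching the real sysfs tree.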

TBD: document entries for AMD threshold interrupt configuration.

For more details about the x86 machine check architecture,
see the Intel and AMD architecture manuals on their developer websites.

For more details about the architecture, see
http://one.firstfloor.org/~andi/mce.pdf