Andi Kleen | a98f0dd | 2007-02-13 13:26:23 +0100 | [diff] [blame] | 1 | |
| 2 | Configurable sysfs parameters for the x86-64 machine check code. |
| 3 | |
| 4 | Machine checks report internal hardware error conditions detected |
| 5 | by the CPU. Uncorrected errors typically cause a machine check |
| 6 | (often with panic), corrected ones cause a machine check log entry. |
| 7 | |
| 8 | Machine checks are organized in banks (normally associated with |
| 9 | a hardware subsystem) and subevents in a bank. The exact meaning |
| 10 | of the banks and subevents is CPU specific. |
| 11 | |
| 12 | mcelog knows how to decode them. |
| 13 | |
| 14 | When you see the "Machine check errors logged" message in the system |
| 15 | log then mcelog should run to collect and decode machine check entries |
| 16 | from /dev/mcelog. Normally mcelog should be run regularly from a cron job. |
| 17 | |
| 18 | Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN |
| 19 | (N = CPU number) |
| 20 | |
| 21 | The directory contains some configurable entries: |
| 22 | |
| 23 | Entries: |
| 24 | |
| 25 | bankNctl |
| 26 | (N bank number) |
| 27 | 64-bit hex bitmask enabling/disabling specific subevents for bank N. |
| 28 | When a bit in the bitmask is zero then the respective |
| 29 | subevent will not be reported. |
| 30 | By default all events are enabled. |
| 31 | Note that the BIOS maintains another mask to disable specific events |
| 32 | per bank. That mask is not visible here. |
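
As an illustration of the bitmask semantics above, here is a minimal Python sketch that builds the path of a bankNctl entry and computes a new mask with one subevent bit cleared. The helper names are hypothetical, and the text format is an assumption: the sketch parses the value as hex with or without a "0x" prefix and emits it with the prefix. It only computes values; actually writing the file requires root and is left out.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # bankNctl is a 64-bit mask


def bank_ctl_path(cpu: int, bank: int) -> str:
    """Path of the bankNctl entry for a given CPU and bank."""
    return f"/sys/devices/system/machinecheck/machinecheck{cpu}/bank{bank}ctl"


def parse_mask(text: str) -> int:
    """Parse a hex mask as read from sysfs (assumed: hex, optional 0x prefix)."""
    return int(text.strip(), 16)


def clear_subevent(mask: int, bit: int) -> int:
    """Return the mask with one subevent bit zeroed, i.e. not reported."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    return mask & ~(1 << bit) & MASK_64


def format_mask(mask: int) -> str:
    """Format a mask for writing back (assumed: 0x-prefixed hex is accepted)."""
    return f"0x{mask:x}"
```

For example, starting from the default of all events enabled, clear_subevent(parse_mask("ffffffffffffffff"), 3) disables subevent 3 and leaves every other subevent enabled.
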
| 33 | |
| 34 | The following entries appear for each CPU, but they are truly shared |
| 35 | between all CPUs. |
| 36 | |
| 37 | check_interval |
| 38 | How often to poll for corrected machine check errors, in seconds |
Tim Hockin | 8a336b0 | 2007-05-02 19:27:19 +0200 | [diff] [blame] | 39 | (Note: the output is hexadecimal). Default 5 minutes. When the poller |
| 40 | finds MCEs it triggers an exponential speedup (poll more often) on |
| 41 | the polling interval. When the poller stops finding MCEs, it |
| 42 | triggers an exponential backoff (poll less often) on the polling |
| 43 | interval. The check_interval variable is both the initial and |
Andi Kleen | 8780e8e | 2009-05-27 21:56:56 +0200 | [diff] [blame] | 44 | maximum polling interval. 0 means no polling for corrected machine |
| 45 | check errors (but some corrected errors might still be reported |
| 46 | in other ways). |
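
The polling behaviour described for check_interval can be sketched in Python as follows. This is a model under stated assumptions, not the kernel's implementation: the factor of two for speedup and backoff and the one-second floor are assumptions made for illustration, and the function names are hypothetical. What comes from the text above is that the value is printed in hexadecimal, that check_interval is both the initial and the maximum interval, and that 0 disables polling.

```python
def parse_check_interval(text: str) -> int:
    """Convert the hexadecimal sysfs output to seconds."""
    return int(text.strip(), 16)


def next_interval(current: int, found_mce: bool, check_interval: int,
                  minimum: int = 1) -> int:
    """Return the next polling interval in seconds.

    Assumed for illustration: the interval halves when MCEs were found
    and doubles when none were found. The floor `minimum` is also an
    assumption. check_interval caps the result; 0 means no polling.
    """
    if check_interval == 0:
        return 0
    if found_mce:
        return max(minimum, current // 2)       # speed up: poll more often
    return min(check_interval, current * 2)     # back off: poll less often
```

With the default of 5 minutes the file reads as 12c, which parse_check_interval converts to 300 seconds. In this model a poll that finds MCEs shortens the next interval to 150 seconds, and a following quiet poll restores it to 300.
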
Andi Kleen | a98f0dd | 2007-02-13 13:26:23 +0100 | [diff] [blame] | 47 | |
| 48 | tolerant |
| 49 | Tolerance level. When a machine check exception occurs for an |
| 50 | uncorrected machine check the kernel can take different actions. |
| 51 | Since machine check exceptions can happen any time it is sometimes |
| 52 | risky for the kernel to kill a process because it defies |
| 53 | normal kernel locking rules. The tolerance level configures |
Tim Hockin | bd78432 | 2007-07-21 17:10:37 +0200 | [diff] [blame] | 54 | how hard the kernel tries to recover even at some risk of |
| 55 | deadlock. Higher tolerant values trade potentially better uptime |
| 56 | against the risk of a crash or even corruption (for tolerant >= 3). |
Andi Kleen | a98f0dd | 2007-02-13 13:26:23 +0100 | [diff] [blame] | 57 | |
Tim Hockin | bd78432 | 2007-07-21 17:10:37 +0200 | [diff] [blame] | 58 | 0: always panic on uncorrected errors, log corrected errors |
| 59 | 1: panic or SIGBUS on uncorrected errors, log corrected errors |
| 60 | 2: SIGBUS or log uncorrected errors, log corrected errors |
| 61 | 3: never panic or SIGBUS, log all errors (for testing only) |
Andi Kleen | a98f0dd | 2007-02-13 13:26:23 +0100 | [diff] [blame] | 62 | |
| 63 | Default: 1 |
| 64 | |
| 65 | Note this only makes a difference if the CPU allows recovery |
| 66 | from a machine check exception. Current x86 CPUs generally do not. |
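
The tolerant levels listed above can be summarised as a small lookup table. The following Python sketch only restates that list in code form so that a script can validate a value before writing it; the names are hypothetical and nothing here reflects how the kernel represents these levels internally.

```python
# Possible kernel actions on an uncorrected error, per tolerant level.
# Corrected errors are logged at every level.
UNCORRECTED_ACTIONS = {
    0: ("panic",),
    1: ("panic", "SIGBUS"),
    2: ("SIGBUS", "log"),
    3: ("log",),            # for testing only
}

DEFAULT_TOLERANT = 1


def uncorrected_actions(level: int) -> tuple:
    """Return the possible actions for a tolerant level, or raise ValueError."""
    if level not in UNCORRECTED_ACTIONS:
        raise ValueError(f"tolerant must be one of {sorted(UNCORRECTED_ACTIONS)}")
    return UNCORRECTED_ACTIONS[level]


def may_panic(level: int) -> bool:
    """True if the kernel may panic on an uncorrected error at this level."""
    return "panic" in uncorrected_actions(level)
```
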
| 67 | |
| 68 | trigger |
| 69 | Program to run when a machine check event is detected. |
| 70 | This is an alternative to running mcelog regularly from cron |
| 71 | and allows events to be detected faster. |
Andi Kleen | 3c07979 | 2009-05-27 21:56:55 +0200 | [diff] [blame] | 72 | monarch_timeout |
| 73 | How long to wait for the other CPUs to machine check too on an |
| 74 | exception. 0 to disable waiting for other CPUs. |
| 75 | Unit: us |
Andi Kleen | a98f0dd | 2007-02-13 13:26:23 +0100 | [diff] [blame] | 76 | |
| 77 | TBD document entries for AMD threshold interrupt configuration |
| 78 | |
| 79 | For more details about the x86 machine check architecture |
| 80 | see the Intel and AMD architecture manuals from their developer websites. |
| 81 | |
| 82 | For more details about the architecture |
| 83 | see http://one.firstfloor.org/~andi/mce.pdf |