APEI Error INJection
~~~~~~~~~~~~~~~~~~~~

EINJ provides a hardware error injection mechanism. It is very useful
for debugging and testing APEI and RAS features in general.

You need to check whether your BIOS supports EINJ first. For that, look
for early boot messages similar to this one:

ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL 00000001 INTL 00000001)

which shows that the BIOS is exposing an EINJ table - it is the
mechanism through which the injection is done.

Alternatively, look in /sys/firmware/acpi/tables for an "EINJ" file,
which is a different representation of the same thing.

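The sysfs check can be scripted. The following is only a minimal
sketch: the function name einj_table_present is made up for this
illustration, and the tables directory is passed as an argument rather
than hard-coded.

```shell
# Hypothetical helper: succeed if an "EINJ" file exists in the given
# ACPI tables directory (normally /sys/firmware/acpi/tables).
einj_table_present() {
    [ -e "$1/EINJ" ]
}
```

For example, "einj_table_present /sys/firmware/acpi/tables && echo yes"
prints "yes" only when the BIOS exposes the table.
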
If neither of the above exists, it does not necessarily mean that EINJ
is unsupported: before you give up, go into BIOS setup to see if the
BIOS has an option to enable error injection. Look for something called
WHEA or similar. Often, you first need to enable an ACPI5 support option
in order to see the APEI, EINJ, ... functionality supported and exposed
by the BIOS menu.

To use EINJ, make sure the following options are enabled in your kernel
configuration:

CONFIG_DEBUG_FS
CONFIG_ACPI_APEI
CONFIG_ACPI_APEI_EINJ

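One way to verify these options is to search the configuration file of
the running kernel. The sketch below is not part of the kernel tree:
einj_check_config is a hypothetical name, and the location of the
config file (often /boot/config-$(uname -r) on distribution kernels) is
an assumption that may not hold on your system.

```shell
# Hypothetical helper: check that each required option is set to y or m
# in the given kernel config file. Prints the missing options and
# returns non-zero if any option is missing.
einj_check_config() {
    cfg=$1
    missing=0
    for opt in CONFIG_DEBUG_FS CONFIG_ACPI_APEI CONFIG_ACPI_APEI_EINJ; do
        if ! grep -q "^${opt}=[ym]\$" "$cfg"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as: einj_check_config "/boot/config-$(uname -r)"
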
The EINJ user interface is in <debugfs mount point>/apei/einj.

The following files belong to it:

- available_error_type

  This file shows which error types are supported:

  Error Type Value    Error Description
  ================    =================
  0x00000001          Processor Correctable
  0x00000002          Processor Uncorrectable non-fatal
  0x00000004          Processor Uncorrectable fatal
  0x00000008          Memory Correctable
  0x00000010          Memory Uncorrectable non-fatal
  0x00000020          Memory Uncorrectable fatal
  0x00000040          PCI Express Correctable
  0x00000080          PCI Express Uncorrectable fatal
  0x00000100          PCI Express Uncorrectable non-fatal
  0x00000200          Platform Correctable
  0x00000400          Platform Uncorrectable non-fatal
  0x00000800          Platform Uncorrectable fatal

  The file contents use the format above, but list only the error
  types that are actually available.

- error_type

  Set the value of the error type being injected. Possible error types
  are defined in the file available_error_type above.

- error_inject

  Write any integer to this file to trigger the error injection. Make
  sure you have specified all necessary error parameters, i.e. this
  write should be the last step when injecting errors.

- flags

  Present for kernel versions 3.13 and above. Used to specify which
  of param{1..4} are valid and should be used by the firmware during
  injection. Value is a bitmask as specified in the ACPI 5.0 spec for
  the SET_ERROR_TYPE_WITH_ADDRESS data structure:

    Bit 0 - Processor APIC field valid (see param3 below).
    Bit 1 - Memory address and mask valid (param1 and param2).
    Bit 2 - PCIe (seg,bus,dev,fn) valid (see param4 below).

  If set to zero, legacy behavior is mimicked where the type of
  injection specifies just one bit set, and param1 is multiplexed.

- param1

  This file is used to set the first error parameter value. Its effect
  depends on the error type specified in error_type. For example, if
  the error type is memory related, param1 should be a valid physical
  memory address. [Unless "flags" is set - see above]

- param2

  Same use as param1 above. For example, if the error type is memory
  related, then param2 should be a physical memory address mask.
  Linux requires page or narrower granularity, say, 0xfffffffffffff000.

- param3

  Used when the 0x1 bit is set in "flags" to specify the APIC id.

- param4

  Used when the 0x4 bit is set in "flags" to specify target PCIe device.

- notrigger

  The error injection mechanism is a two-step process. First inject the
  error, then perform some actions to trigger it. Setting "notrigger"
  to 1 skips the trigger phase, which *may* allow the user to cause the
  error in some other context by a simple access to the CPU, memory
  location, or device that is the target of the error injection. Whether
  this actually works depends on what operations the BIOS includes in
  the trigger phase.

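Since "flags" is a plain bitmask, a value can be assembled from the
three bits listed above. The helper below is only an illustration; the
name einj_flags and the keywords apic, mem and pcie are invented here
and are not part of the EINJ interface.

```shell
# Hypothetical helper: build a "flags" value from symbolic names.
#   apic -> bit 0 (param3 valid)
#   mem  -> bit 1 (param1 and param2 valid)
#   pcie -> bit 2 (param4 valid)
einj_flags() {
    flags=0
    for name in "$@"; do
        case $name in
        apic) flags=$((flags | 0x1)) ;;
        mem)  flags=$((flags | 0x2)) ;;
        pcie) flags=$((flags | 0x4)) ;;
        *) echo "unknown flag: $name" >&2; return 1 ;;
        esac
    done
    printf '0x%x\n' "$flags"
}
```

For instance, "einj_flags apic mem" prints 0x3, a value that marks
param1, param2 and param3 as valid.
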
BIOS versions based on the ACPI 4.0 specification have limited options
for controlling where the errors are injected. Your BIOS may support an
extension (enabled with the param_extension=1 module parameter, or boot
command line einj.param_extension=1). This allows the address and mask
for memory injections to be specified by the param1 and param2 files in
apei/einj.

BIOS versions based on the ACPI 5.0 specification have more control over
the target of the injection. For processor-related errors (type 0x1, 0x2
and 0x4), you can set flags to 0x3 (param3 for bit 0, and param1 and
param2 for bit 1) so that you have more information added to the error
signature being injected. The actual data passed is this:

  memory_address = param1;
  memory_address_range = param2;
  apicid = param3;
  pcie_sbdf = param4;

For memory errors (type 0x8, 0x10 and 0x20) the address is set using
param1 with a mask in param2 (0x0 is equivalent to all ones). For PCI
express errors (type 0x40, 0x80 and 0x100) the segment, bus, device and
function are specified using param1:

  31      24 23    16 15    11 10       8 7        0
  +-------------------------------------------------+
  | segment |   bus  | device | function | reserved |
  +-------------------------------------------------+

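The layout above can be turned into a value with a few shifts. This is
a sketch under the assumption that the field boundaries are exactly as
drawn; the function name einj_sbdf is hypothetical.

```shell
# Hypothetical helper: pack segment, bus, device and function into the
# layout shown above (bits 31-24, 23-16, 15-11 and 10-8). Bits 7-0 are
# reserved and left as zero. Out-of-range inputs are masked.
einj_sbdf() {
    sbdf=$(( (($1 & 0xff) << 24) | (($2 & 0xff) << 16) ))
    sbdf=$(( sbdf | (($3 & 0x1f) << 11) | (($4 & 0x7) << 8) ))
    printf '0x%08x\n' "$sbdf"
}
```

For segment 0, bus 1, device 2, function 3, "einj_sbdf 0 1 2 3" prints
0x00011300. The result would then be written to param1, or to param4
when bit 2 of "flags" is set.
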
If in doubt, take a look at the code in drivers/acpi/apei/einj.c.

An ACPI 5.0 BIOS may also allow vendor-specific errors to be injected.
In this case a file named vendor will contain identifying information
from the BIOS that hopefully will allow an application wishing to use
the vendor-specific extension to tell that it is running on a BIOS
that supports it. All vendor extensions have the 0x80000000 bit set in
error_type. A file vendor_flags controls the interpretation of param1
and param2 (1 = PROCESSOR, 2 = MEMORY, 4 = PCI). See your BIOS vendor
documentation for details (and expect changes to this API if vendors'
creativity in using this feature expands beyond our expectations).


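A script that handles both standard and vendor-specific types can test
the 0x80000000 bit before deciding how to fill in the parameters. The
function name below is made up for illustration.

```shell
# Hypothetical helper: succeed if the given error_type value has the
# vendor-extension bit (0x80000000) set.
einj_is_vendor_type() {
    [ $(( $1 & 0x80000000 )) -ne 0 ]
}
```

For example, "einj_is_vendor_type 0x80000001" succeeds, while
"einj_is_vendor_type 0x8" fails.
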
An error injection example:

# cd /sys/kernel/debug/apei/einj
# cat available_error_type        # See which errors can be injected
0x00000002    Processor Uncorrectable non-fatal
0x00000008    Memory Correctable
0x00000010    Memory Uncorrectable non-fatal
# echo 0x12345000 > param1        # Set memory address for injection
# echo $((-1 << 12)) > param2     # Mask 0xfffffffffffff000 - anywhere in this page
# echo 0x8 > error_type           # Choose correctable memory error
# echo 1 > error_inject           # Inject now

You should see something like this in dmesg:

[22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
[22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
[22715.834759] EDAC sbridge MC3: TSC 0
[22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86
[22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
[22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)

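The injection steps from the example can be wrapped in a small
function. This is a sketch, not a tested tool: the name
einj_inject_mem_ce is hypothetical, a 4 KiB page size is assumed (as
implied by the mask used in the example), and the EINJ directory is
passed in so that the function is not tied to one debugfs mount point.

```shell
# Hypothetical helper: inject a correctable memory error (type 0x8) in
# the page containing the given physical address.
#   $1 - EINJ directory, e.g. /sys/kernel/debug/apei/einj
#   $2 - physical address
einj_inject_mem_ce() {
    dir=$1
    addr=$(( $2 & ~0xfff ))                  # round down to a 4 KiB page
    printf '0x%x\n' "$addr" > "$dir/param1"
    echo $((-1 << 12)) > "$dir/param2"       # same mask as in the example
    echo 0x8 > "$dir/error_type"
    echo 1 > "$dir/error_inject"             # must be the last write
}
```

Run as root, "einj_inject_mem_ce /sys/kernel/debug/apei/einj 0x12345678"
would perform the same writes as the example, with param1 rounded down
to 0x12345000.
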
For more information about EINJ, please refer to the ACPI specification
version 4.0, section 17.5 and ACPI 5.0, section 18.6.