.. Mauro Carvalho Chehab, commit 6634fbb, 2016-10-26 14:14:45 -0200

Error Detection And Correction (EDAC) Devices
=============================================

.. Mauro Carvalho Chehab, commit 6b1fb6f7, 2016-10-29 16:13:23 -0200

Main Concepts used in the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc.

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!). In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

  The individual DRAM chips on a memory stick. These devices commonly
  output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
  provides the number of bits that the memory controller expects:
  typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.

* Memory Stick

  A printed circuit board that aggregates multiple memory devices in
  parallel. In general, this is the Field Replaceable Unit (FRU) which
  gets replaced in the case of excessive errors. Most often it is also
  called DIMM (Dual Inline Memory Module).

* Memory Socket

  A physical connector on the motherboard that accepts a single memory
  stick. Also called a "slot" in several datasheets.

* Channel

  A memory controller channel, responsible for communicating with a group
  of DIMMs. Each channel has its own independent control (command) and data
  bus, and can be used independently or grouped with other channels.

* Branch

  Typically the highest level of the hierarchy in a Fully-Buffered DIMM
  memory controller. It usually contains two channels. Two channels on the
  same branch can be used in single mode or in lockstep mode. When
  lockstep is enabled, the cacheline is doubled, but it generally brings
  some performance penalty. Also, it is generally not possible to point to
  just one memory stick when an error occurs, as the error correction code
  is calculated using two DIMMs instead of one. Due to that, it is capable
  of correcting more errors than in single mode.

* Single-channel

  The data accessed by the memory controller is contained in one DIMM
  only. E.g. if the data is 64 bits wide, the data flows to the CPU using
  one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
  memories. FB-DIMM and RAMBUS use a different concept for channel, so
  this concept doesn't apply there.

* Double-channel

  The data accessed by the memory controller is interleaved across two
  DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72
  bits with ECC), the data flows to the CPU using a 128-bit parallel
  access.

* Chip-select row

  This is the name of the DRAM signal used to select the DRAM ranks to be
  accessed. Common chip-select rows for single channel are 64 bits, for
  dual channel 128 bits. It may not be visible to the memory controller,
  as some DIMM types have a memory buffer that can hide direct access to
  it from the Memory Controller.

* Single-Ranked stick

  A single-ranked stick has one chip-select row of memory. Motherboards
  commonly drive two chip-select pins to a memory stick. A single-ranked
  stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

  A double-ranked stick has two chip-select rows which access different
  sets of memory devices. The two rows cannot be accessed concurrently.

* Double-sided stick

  **DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

  A double-sided stick has two chip-select rows which access different sets
  of memory devices. The two rows cannot be accessed concurrently.
  "Double-sided" is irrespective of the memory devices being mounted on
  both sides of the memory stick.

* Socket set

  All of the memory sticks that are required for a single memory access or
  all of the memory sticks spanned by a chip-select row. A single socket
  set has two chip-select rows and, if double-sided sticks are used, these
  will occupy those chip-select rows.

* Bank

  This term is avoided because it is unclear when needing to distinguish
  between chip-select rows and socket sets.

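The width arithmetic behind the *Memory devices*, *Single-channel* and
*Double-channel* entries can be illustrated with a minimal Python sketch.
The helper names ``devices_in_parallel`` and ``access_width_bits`` are
hypothetical and are not part of any EDAC API. The figures are the typical
64 data bits + 8 ECC bits quoted in the definitions, and real platforms may
use other widths:

.. code-block:: python

   def devices_in_parallel(bus_width_bits, device_width_bits):
       """Return how many DRAM devices are needed to fill the bus width."""
       if bus_width_bits % device_width_bits:
           raise ValueError("bus width must be a multiple of the device width")
       return bus_width_bits // device_width_bits


   def access_width_bits(dimm_data_bits, dimms_in_parallel):
       """Return the data bits moved per access across parallel DIMMs."""
       return dimm_data_bits * dimms_in_parallel


   # Typical ECC bus from the text: 64 data bits + 8 ECC bits = 72 bits.
   ECC_BUS_BITS = 64 + 8

   x4_devices = devices_in_parallel(ECC_BUS_BITS, 4)   # 18 devices
   x8_devices = devices_in_parallel(ECC_BUS_BITS, 8)   # 9 devices
   single_channel = access_width_bits(64, 1)           # 64-bit access
   double_channel = access_width_bits(64, 2)           # 128-bit access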

Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
The :c:func:`edac_mc_alloc` function internally uses struct ``mem_ctl_info``
to describe the memory controllers, which is an opaque struct for the EDAC
drivers. Only the EDAC core is allowed to touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It uses the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation at sysfs.

This set of structures, and the code that implements the APIs for them,
provide for registering EDAC-type devices which are NOT standard memory or
PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In such a case, those can be represented via the following sysfs
nodes::

        /sys/devices/system/edac/..

        pci/            <existing pci directory (if available)>
        mc/             <existing memory device directory>
        cpu/cpu0/..     <L1 and L2 block directory>
                /L1-cache/ce_count
                         /ue_count
                /L2-cache/ce_count
                         /ue_count
        cpu/cpu1/..     <L1 and L2 block directory>
                /L1-cache/ce_count
                         /ue_count
                /L2-cache/ce_count
                         /ue_count
        ...

        the L1 and L2 directories would be "edac_device_block's"

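As a rough illustration of how userspace could consume such a tree, the
following Python sketch walks a two-level instance/block layout and collects
the ``ce_count`` and ``ue_count`` values. It assumes the example layout shown
above (``cpu/cpuN/<block>/``). The function name ``read_block_counts`` is
hypothetical, and the directories actually present depend on which EDAC
drivers are loaded, so entries without readable counters are simply skipped:

.. code-block:: python

   import os


   def read_block_counts(device_dir):
       """Collect ce_count/ue_count for every block under each instance.

       Returns {instance: {block: {"ce_count": int, "ue_count": int}}}.
       """
       counts = {}
       for instance in sorted(os.listdir(device_dir)):
           instance_dir = os.path.join(device_dir, instance)
           if not os.path.isdir(instance_dir):
               continue
           for block in sorted(os.listdir(instance_dir)):
               block_dir = os.path.join(instance_dir, block)
               if not os.path.isdir(block_dir):
                   continue
               block_counts = {}
               for name in ("ce_count", "ue_count"):
                   try:
                       with open(os.path.join(block_dir, name)) as f:
                           block_counts[name] = int(f.read().strip())
                   except (OSError, ValueError):
                       # Counter missing or unreadable: skip it.
                       continue
               if block_counts:
                   counts.setdefault(instance, {})[block] = block_counts
       return counts


   # Possible usage, assuming the example "cpu" directory exists:
   #   read_block_counts("/sys/devices/system/edac/cpu")
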
.. kernel-doc:: drivers/edac/edac_device.h