blob: 70d96a62e5e12e2dcdcd58a64e623e0f7678c106 [file] [log] [blame]
Alan Coxda9bb1d2006-01-18 17:44:13 -08001
2
3EDAC - Error Detection And Correction
4
5Written by Doug Thompson <norsk5@xmission.com>
67 Dec 2005
7
8
9EDAC was written by:
10 Thayne Harbaugh,
11 modified by Dave Peterson, Doug Thompson, et al,
12 from the bluesmoke.sourceforge.net project.
13
14
15============================================================================
16EDAC PURPOSE
17
18The 'edac' kernel module goal is to detect and report errors that occur
19within the computer system. In the initial release, memory Correctable Errors
20(CE) and Uncorrectable Errors (UE) are the primary errors being harvested.
21
22Detecting CE events, then harvesting those events and reporting them,
23CAN be a predictor of future UE events. With CE events, the system can
Dave Petersonf3479812006-03-26 01:38:53 -080024continue to operate, but with less safety. Preventive maintenance and
Alan Coxda9bb1d2006-01-18 17:44:13 -080025proactive part replacement of memory DIMMs exhibiting CEs can reduce
26the likelihood of the dreaded UE events and system 'panics'.
27
28
29In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
30in order to determine if errors are occurring on data transfers.
31The presence of PCI Parity errors must be examined with a grain of salt.
Dave Petersonf3479812006-03-26 01:38:53 -080032There are several add-in adapters that do NOT follow the PCI specification
Alan Coxda9bb1d2006-01-18 17:44:13 -080033with regards to Parity generation and reporting. The specification says
34the vendor should tie the parity status bits to 0 if they do not intend
35to generate parity. Some vendors do not do this, and thus the parity bit
36can "float" giving false positives.
37
Dave Petersonf3479812006-03-26 01:38:53 -080038The PCI Parity EDAC device has the ability to "skip" known flaky
Alan Coxda9bb1d2006-01-18 17:44:13 -080039cards during the parity scan. These are set by the parity "blacklist"
40interface in the sysfs for PCI Parity. (See the PCI section in the sysfs
41section below.) There is also a parity "whitelist" which is used as
42an explicit list of devices to scan, while the blacklist is a list
43of devices to skip.
44
45EDAC will have future error detectors that will be added or integrated
46into EDAC in the following list:
47
48 MCE Machine Check Exception
49 MCA Machine Check Architecture
50 NMI NMI notification of ECC errors
51 MSRs Machine Specific Register error cases
52 and other mechanisms.
53
54These errors are usually bus errors, ECC errors, thermal throttling
55and the like.
56
57
58============================================================================
59EDAC VERSIONING
60
61EDAC is composed of a "core" module (edac_mc.ko) and several Memory
62Controller (MC) driver modules. On a given system, the CORE
63is loaded and one MC driver will be loaded. Both the CORE and
64the MC driver have individual versions that reflect current release
65level of their respective modules. Thus, to "report" on what version
66a system is running, one must report both the CORE's and the
67MC driver's versions.
68
69
70LOADING
71
72If 'edac' was statically linked with the kernel then no loading is
73necessary. If 'edac' was built as modules then simply modprobe the
74'edac' pieces that you need. You should be able to modprobe
75hardware-specific modules and have the dependencies load the necessary core
76modules.
77
78Example:
79
80$> modprobe amd76x_edac
81
82loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
83core module.
84
85
86============================================================================
87EDAC sysfs INTERFACE
88
89EDAC presents a 'sysfs' interface for control, reporting and attribute
90reporting purposes.
91
92EDAC lives in the /sys/devices/system/edac directory. Within this directory
93there currently reside 2 'edac' components:
94
95 mc memory controller(s) system
96 pci PCI status system
97
98
99============================================================================
100Memory Controller (mc) Model
101
102First a background on the memory controller's model abstracted in EDAC.
103Each mc device controls a set of DIMM memory modules. These modules are
Dave Petersonf3479812006-03-26 01:38:53 -0800104laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
Alan Coxda9bb1d2006-01-18 17:44:13 -0800105be multiple csrows and two channels.
106
107Memory controllers allow for several csrows, with 8 csrows being a typical value.
108Yet, the actual number of csrows depends on the electrical "loading"
109of a given motherboard, memory controller and DIMM characteristics.
110
111Dual channels allows for 128 bit data transfers to the CPU from memory.
112
113
114 Channel 0 Channel 1
115 ===================================
116 csrow0 | DIMM_A0 | DIMM_B0 |
117 csrow1 | DIMM_A0 | DIMM_B0 |
118 ===================================
119
120 ===================================
121 csrow2 | DIMM_A1 | DIMM_B1 |
122 csrow3 | DIMM_A1 | DIMM_B1 |
123 ===================================
124
125In the above example table there are 4 physical slots on the motherboard
126for memory DIMMs:
127
128 DIMM_A0
129 DIMM_B0
130 DIMM_A1
131 DIMM_B1
132
133Labels for these slots are usually silk screened on the motherboard. Slots
Dave Petersonf3479812006-03-26 01:38:53 -0800134labeled 'A' are channel 0 in this example. Slots labeled 'B'
Alan Coxda9bb1d2006-01-18 17:44:13 -0800135are channel 1. Notice that there are two csrows possible on a
136physical DIMM. These csrows are allocated their csrow assignment
137based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
138is placed in each Channel, the csrows cross both DIMMs.
139
140Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
141Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
142will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
Dave Petersonf3479812006-03-26 01:38:53 -0800143when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
Alan Coxda9bb1d2006-01-18 17:44:13 -0800144csrow1 will be populated. The pattern repeats itself for csrow2 and
145csrow3.
146
147The representation of the above is reflected in the directory tree
148in EDAC's sysfs interface. Starting in directory
149/sys/devices/system/edac/mc each memory controller will be represented
150by its own 'mcX' directory, where 'X" is the index of the MC.
151
152
153 ..../edac/mc/
154 |
155 |->mc0
156 |->mc1
157 |->mc2
158 ....
159
160Under each 'mcX' directory each 'csrowX' is again represented by a
161'csrowX', where 'X" is the csrow index:
162
163
164 .../mc/mc0/
165 |
166 |->csrow0
167 |->csrow2
168 |->csrow3
169 ....
170
171Notice that there is no csrow1, which indicates that csrow0 is
172composed of a single ranked DIMMs. This should also apply in both
173Channels, in order to have dual-channel mode be operational. Since
174both csrow2 and csrow3 are populated, this indicates a dual ranked
175set of DIMMs for channels 0 and 1.
176
177
178Within each of the 'mc','mcX' and 'csrowX' directories are several
179EDAC control and attribute files.
180
181
182============================================================================
183DIRECTORY 'mc'
184
185In directory 'mc' are EDAC system overall control and attribute files:
186
187
188Panic on UE control file:
189
190 'panic_on_ue'
191
192 An uncorrectable error will cause a machine panic. This is usually
193 desirable. It is a bad idea to continue when an uncorrectable error
194 occurs - it is indeterminate what was uncorrected and the operating
195 system context might be so mangled that continuing will lead to further
196 corruption. If the kernel has MCE configured, then EDAC will never
197 notice the UE.
198
199 LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
200
201 RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue
202
203
204Log UE control file:
205
206 'log_ue'
207
208 Generate kernel messages describing uncorrectable errors. These errors
209 are reported through the system message log system. UE statistics
210 will be accumulated even when UE logging is disabled.
211
212 LOAD TIME: module/kernel parameter: log_ue=[0|1]
213
214 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue
215
216
217Log CE control file:
218
219 'log_ce'
220
221 Generate kernel messages describing correctable errors. These
222 errors are reported through the system message log system.
223 CE statistics will be accumulated even when CE logging is disabled.
224
225 LOAD TIME: module/kernel parameter: log_ce=[0|1]
226
227 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce
228
229
230Polling period control file:
231
232 'poll_msec'
233
234 The time period, in milliseconds, for polling for error information.
235 Too small a value wastes resources. Too large a value might delay
236 necessary handling of errors and might loose valuable information for
237 locating the error. 1000 milliseconds (once each second) is about
238 right for most uses.
239
240 LOAD TIME: module/kernel parameter: poll_msec=[0|1]
241
242 RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
243
244
245Module Version read-only attribute file:
246
247 'mc_version'
248
Dave Petersonf3479812006-03-26 01:38:53 -0800249 The EDAC CORE module's version and compile date are shown here to
Alan Coxda9bb1d2006-01-18 17:44:13 -0800250 indicate what EDAC is running.
251
252
253
254============================================================================
255'mcX' DIRECTORIES
256
257
258In 'mcX' directories are EDAC control and attribute files for
259this 'X" instance of the memory controllers:
260
261
262Counter reset control file:
263
264 'reset_counters'
265
266 This write-only control file will zero all the statistical counters
267 for UE and CE errors. Zeroing the counters will also reset the timer
268 indicating how long since the last counter zero. This is useful
269 for computing errors/time. Since the counters are always reset at
270 driver initialization time, no module/kernel parameter is available.
271
272 RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/counter_reset
273
274 This resets the counters on memory controller 0
275
276
277Seconds since last counter reset control file:
278
279 'seconds_since_reset'
280
281 This attribute file displays how many seconds have elapsed since the
282 last counter reset. This can be used with the error counters to
283 measure error rates.
284
285
286
287DIMM capability attribute file:
288
289 'edac_capability'
290
291 The EDAC (Error Detection and Correction) capabilities/modes of
292 the memory controller hardware.
293
294
295DIMM Current Capability attribute file:
296
297 'edac_current_capability'
298
299 The EDAC capabilities available with the hardware
300 configuration. This may not be the same as "EDAC capability"
301 if the correct memory is not used. If a memory controller is
302 capable of EDAC, but DIMMs without check bits are in use, then
303 Parity, SECDED, S4ECD4ED capabilities will not be available
304 even though the memory controller might be capable of those
305 modes with the proper memory loaded.
306
307
308Memory Type supported on this controller attribute file:
309
310 'supported_mem_type'
311
312 This attribute file displays the memory type, usually
313 buffered and unbuffered DIMMs.
314
315
316Memory Controller name attribute file:
317
318 'mc_name'
319
320 This attribute file displays the type of memory controller
321 that is being utilized.
322
323
324Memory Controller Module name attribute file:
325
326 'module_name'
327
328 This attribute file displays the memory controller module name,
329 version and date built. The name of the memory controller
330 hardware - some drivers work with multiple controllers and
331 this field shows which hardware is present.
332
333
334Total memory managed by this memory controller attribute file:
335
336 'size_mb'
337
338 This attribute file displays, in count of megabytes, of memory
339 that this instance of memory controller manages.
340
341
342Total Uncorrectable Errors count attribute file:
343
344 'ue_count'
345
346 This attribute file displays the total count of uncorrectable
347 errors that have occurred on this memory controller. If panic_on_ue
348 is set this counter will not have a chance to increment,
349 since EDAC will panic the system.
350
351
352Total UE count that had no information attribute fileY:
353
354 'ue_noinfo_count'
355
356 This attribute file displays the number of UEs that
357 have occurred have occurred with no informations as to which DIMM
358 slot is having errors.
359
360
361Total Correctable Errors count attribute file:
362
363 'ce_count'
364
365 This attribute file displays the total count of correctable
366 errors that have occurred on this memory controller. This
367 count is very important to examine. CEs provide early
368 indications that a DIMM is beginning to fail. This count
369 field should be monitored for non-zero values and report
370 such information to the system administrator.
371
372
373Total Correctable Errors count attribute file:
374
375 'ce_noinfo_count'
376
377 This attribute file displays the number of CEs that
378 have occurred wherewith no informations as to which DIMM slot
379 is having errors. Memory is handicapped, but operational,
380 yet no information is available to indicate which slot
381 the failing memory is in. This count field should be also
382 be monitored for non-zero values.
383
384Device Symlink:
385
386 'device'
387
388 Symlink to the memory controller device
389
390
391
392============================================================================
393'csrowX' DIRECTORIES
394
395In the 'csrowX' directories are EDAC control and attribute files for
396this 'X" instance of csrow:
397
398
399Total Uncorrectable Errors count attribute file:
400
401 'ue_count'
402
403 This attribute file displays the total count of uncorrectable
404 errors that have occurred on this csrow. If panic_on_ue is set
405 this counter will not have a chance to increment, since EDAC
406 will panic the system.
407
408
409Total Correctable Errors count attribute file:
410
411 'ce_count'
412
413 This attribute file displays the total count of correctable
414 errors that have occurred on this csrow. This
415 count is very important to examine. CEs provide early
416 indications that a DIMM is beginning to fail. This count
417 field should be monitored for non-zero values and report
418 such information to the system administrator.
419
420
421Total memory managed by this csrow attribute file:
422
423 'size_mb'
424
425 This attribute file displays, in count of megabytes, of memory
Dave Petersonf3479812006-03-26 01:38:53 -0800426 that this csrow contains.
Alan Coxda9bb1d2006-01-18 17:44:13 -0800427
428
429Memory Type attribute file:
430
431 'mem_type'
432
433 This attribute file will display what type of memory is currently
434 on this csrow. Normally, either buffered or unbuffered memory.
435
436
437EDAC Mode of operation attribute file:
438
439 'edac_mode'
440
441 This attribute file will display what type of Error detection
442 and correction is being utilized.
443
444
445Device type attribute file:
446
447 'dev_type'
448
449 This attribute file will display what type of DIMM device is
450 being utilized. Example: x4
451
452
453Channel 0 CE Count attribute file:
454
455 'ch0_ce_count'
456
457 This attribute file will display the count of CEs on this
458 DIMM located in channel 0.
459
460
461Channel 0 UE Count attribute file:
462
463 'ch0_ue_count'
464
465 This attribute file will display the count of UEs on this
466 DIMM located in channel 0.
467
468
469Channel 0 DIMM Label control file:
470
471 'ch0_dimm_label'
472
473 This control file allows this DIMM to have a label assigned
474 to it. With this label in the module, when errors occur
475 the output can provide the DIMM label in the system log.
476 This becomes vital for panic events to isolate the
477 cause of the UE event.
478
479 DIMM Labels must be assigned after booting, with information
480 that correctly identifies the physical slot with its
481 silk screen label. This information is currently very
482 motherboard specific and determination of this information
483 must occur in userland at this time.
484
485
486Channel 1 CE Count attribute file:
487
488 'ch1_ce_count'
489
490 This attribute file will display the count of CEs on this
491 DIMM located in channel 1.
492
493
494Channel 1 UE Count attribute file:
495
496 'ch1_ue_count'
497
498 This attribute file will display the count of UEs on this
499 DIMM located in channel 0.
500
501
502Channel 1 DIMM Label control file:
503
504 'ch1_dimm_label'
505
506 This control file allows this DIMM to have a label assigned
507 to it. With this label in the module, when errors occur
508 the output can provide the DIMM label in the system log.
509 This becomes vital for panic events to isolate the
510 cause of the UE event.
511
512 DIMM Labels must be assigned after booting, with information
513 that correctly identifies the physical slot with its
514 silk screen label. This information is currently very
515 motherboard specific and determination of this information
516 must occur in userland at this time.
517
518
519============================================================================
520SYSTEM LOGGING
521
522If logging for UEs and CEs are enabled then system logs will have
523error notices indicating errors that have been detected:
524
525MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
526channel 1 "DIMM_B1": amd76x_edac
527
528MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
529channel 1 "DIMM_B1": amd76x_edac
530
531
532The structure of the message is:
533 the memory controller (MC0)
534 Error type (CE)
535 memory page (0x283)
536 offset in the page (0xce0)
537 the byte granularity (grain 8)
538 or resolution of the error
539 the error syndrome (0xb741)
540 memory row (row 0)
541 memory channel (channel 1)
542 DIMM label, if set prior (DIMM B1
543 and then an optional, driver-specific message that may
544 have additional information.
545
546Both UEs and CEs with no info will lack all but memory controller,
547error type, a notice of "no info" and then an optional,
548driver-specific error message.
549
550
551
552============================================================================
553PCI Bus Parity Detection
554
555
556On Header Type 00 devices the primary status is looked at
557for any parity error regardless of whether Parity is enabled on the
558device. (The spec indicates parity is generated in some cases).
559On Header Type 01 bridges, the secondary status register is also
Dave Petersonf3479812006-03-26 01:38:53 -0800560looked at to see if parity occurred on the bus on the other side of
Alan Coxda9bb1d2006-01-18 17:44:13 -0800561the bridge.
562
563
564SYSFS CONFIGURATION
565
566Under /sys/devices/system/edac/pci are control and attribute files as follows:
567
568
569Enable/Disable PCI Parity checking control file:
570
571 'check_pci_parity'
572
573
574 This control file enables or disables the PCI Bus Parity scanning
575 operation. Writing a 1 to this file enables the scanning. Writing
576 a 0 to this file disables the scanning.
577
578 Enable:
579 echo "1" >/sys/devices/system/edac/pci/check_pci_parity
580
581 Disable:
582 echo "0" >/sys/devices/system/edac/pci/check_pci_parity
583
584
585
586Panic on PCI PARITY Error:
587
588 'panic_on_pci_parity'
589
590
Dave Petersonf3479812006-03-26 01:38:53 -0800591 This control files enables or disables panicking when a parity
Alan Coxda9bb1d2006-01-18 17:44:13 -0800592 error has been detected.
593
594
595 module/kernel parameter: panic_on_pci_parity=[0|1]
596
597 Enable:
598 echo "1" >/sys/devices/system/edac/pci/panic_on_pci_parity
599
600 Disable:
601 echo "0" >/sys/devices/system/edac/pci/panic_on_pci_parity
602
603
604Parity Count:
605
606 'pci_parity_count'
607
608 This attribute file will display the number of parity errors that
609 have been detected.
610
611
612
613PCI Device Whitelist:
614
615 'pci_parity_whitelist'
616
617 This control file allows for an explicit list of PCI devices to be
618 scanned for parity errors. Only devices found on this list will
Dave Petersonf3479812006-03-26 01:38:53 -0800619 be examined. The list is a line of hexadecimal VENDOR and DEVICE
Alan Coxda9bb1d2006-01-18 17:44:13 -0800620 ID tuples:
621
622 1022:7450,1434:16a6
623
Dave Petersonf3479812006-03-26 01:38:53 -0800624 One or more can be inserted, separated by a comma.
Alan Coxda9bb1d2006-01-18 17:44:13 -0800625
626 To write the above list doing the following as one command line:
627
628 echo "1022:7450,1434:16a6"
629 > /sys/devices/system/edac/pci/pci_parity_whitelist
630
631
632
633 To display what the whitelist is, simply 'cat' the same file.
634
635
636PCI Device Blacklist:
637
638 'pci_parity_blacklist'
639
640 This control file allows for a list of PCI devices to be
641 skipped for scanning.
Dave Petersonf3479812006-03-26 01:38:53 -0800642 The list is a line of hexadecimal VENDOR and DEVICE ID tuples:
Alan Coxda9bb1d2006-01-18 17:44:13 -0800643
644 1022:7450,1434:16a6
645
Dave Petersonf3479812006-03-26 01:38:53 -0800646 One or more can be inserted, separated by a comma.
Alan Coxda9bb1d2006-01-18 17:44:13 -0800647
648 To write the above list doing the following as one command line:
649
650 echo "1022:7450,1434:16a6"
651 > /sys/devices/system/edac/pci/pci_parity_blacklist
652
653
Dave Petersonf3479812006-03-26 01:38:53 -0800654 To display what the whitelist currently contains,
Alan Coxda9bb1d2006-01-18 17:44:13 -0800655 simply 'cat' the same file.
656
657=======================================================================
658
659PCI Vendor and Devices IDs can be obtained with the lspci command. Using
660the -n option lspci will display the vendor and device IDs. The system
Dave Petersonf3479812006-03-26 01:38:53 -0800661administrator will have to determine which devices should be scanned or
Alan Coxda9bb1d2006-01-18 17:44:13 -0800662skipped.
663
664
665
666The two lists (white and black) are prioritized. blacklist is the lower
667priority and will NOT be utilized when a whitelist has been set.
668Turn OFF a whitelist by an empty echo command:
669
670 echo > /sys/devices/system/edac/pci/pci_parity_whitelist
671
Dave Petersonf3479812006-03-26 01:38:53 -0800672and any previous blacklist will be utilized.
Alan Coxda9bb1d2006-01-18 17:44:13 -0800673