EDAC - Error Detection And Correction

Written by Doug Thompson <norsk5@xmission.com>
7 Dec 2005


EDAC was written by:
        Thayne Harbaugh,
        modified by Dave Peterson, Doug Thompson, et al,
        from the bluesmoke.sourceforge.net project.


============================================================================
EDAC PURPOSE

The goal of the 'edac' kernel module is to detect and report errors that
occur within the computer system. In the initial release, memory Correctable
Errors (CE) and Uncorrectable Errors (UE) are the primary errors being
harvested.

Detecting CE events, then harvesting those events and reporting them,
CAN be a predictor of future UE events. With CE events, the system can
continue to operate, but with less safety. Preventive maintenance and
proactive part replacement of memory DIMMs exhibiting CEs can reduce
the likelihood of the dreaded UE events and system 'panics'.


In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
in order to determine if errors are occurring on data transfers.
The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do NOT follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float", giving false positives.

[There are patches in the kernel queue which will allow for storage of
quirks of PCI devices reporting false parity positives. The 2.6.18
kernel should have those patches included. When that becomes available,
then EDAC will be patched to utilize that information to "skip" such
devices.]

Future error detectors will be integrated with EDAC or added to it,
including:

        MCE     Machine Check Exception
        MCA     Machine Check Architecture
        NMI     NMI notification of ECC errors
        MSRs    Machine Specific Register error cases
        and other mechanisms.

These errors are usually bus errors, ECC errors, thermal throttling
and the like.


============================================================================
EDAC VERSIONING

EDAC is composed of a "core" module (edac_mc.ko) and several Memory
Controller (MC) driver modules. On a given system, the CORE
is loaded and one MC driver will be loaded. Both the CORE and
the MC driver have individual versions that reflect the current release
level of their respective modules. Thus, to "report" on what version
a system is running, one must report both the CORE's and the
MC driver's versions.


LOADING

If 'edac' was statically linked with the kernel then no loading is
necessary. If 'edac' was built as modules then simply modprobe the
'edac' pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary core
modules.

Example:

$> modprobe amd76x_edac

loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
core module.


============================================================================
EDAC sysfs INTERFACE

EDAC presents a 'sysfs' interface for control, reporting and attribute
reporting purposes.

EDAC lives in the /sys/devices/system/edac directory. Within this directory
there currently reside 2 'edac' components:

        mc      memory controller(s) system
        pci     PCI control and status system


============================================================================
Memory Controller (mc) Model

First a background on the memory controller's model abstracted in EDAC.
Each 'mc' device controls a set of DIMM memory modules. These modules are
laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
be multiple csrows and multiple channels.

Memory controllers allow for several csrows, with 8 csrows being a typical
value. Yet, the actual number of csrows depends on the electrical "loading"
of a given motherboard, memory controller and DIMM characteristics.

Dual channels allow for 128 bit data transfers to the CPU from memory.
Some newer chipsets, such as those using Fully Buffered DIMMs (FB-DIMMs),
allow for more than 2 channels. The following example will assume 2 channels:


                Channel 0       Channel 1
        ===================================
csrow0  | DIMM_A0       | DIMM_B0       |
csrow1  | DIMM_A0       | DIMM_B0       |
        ===================================

        ===================================
csrow2  | DIMM_A1       | DIMM_B1       |
csrow3  | DIMM_A1       | DIMM_B1       |
        ===================================

In the above example table there are 4 physical slots on the motherboard
for memory DIMMs:

        DIMM_A0
        DIMM_B0
        DIMM_A1
        DIMM_B1

Labels for these slots are usually silk screened on the motherboard. Slots
labeled 'A' are channel 0 in this example. Slots labeled 'B'
are channel 1. Notice that there are two csrows possible on a
physical DIMM. These csrows are allocated their csrow assignment
based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
is placed in each Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.

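The slot assignment in the example table above can be sketched as a small
shell function. This is only an illustration of that 2-channel, 4-slot
example: the function name edac_example_slot is hypothetical, and a real
motherboard needs its own mapping.

```shell
# Map a csrow index and a channel index to the slot label used in the
# example table (an assumption for illustration; real boards differ).
edac_example_slot() {
        csrow=$1
        channel=$2
        case "$channel" in
                0) bank=A ;;
                1) bank=B ;;
                *) echo "unknown channel: $channel" >&2; return 1 ;;
        esac
        # Two csrows (ranks) share one physical DIMM, so csrow0/csrow1
        # map to slot 0 and csrow2/csrow3 map to slot 1.
        echo "DIMM_${bank}$((csrow / 2))"
}

edac_example_slot 3 1
```

The final line prints the slot holding csrow3 on channel 1 in the example
table.
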
The representation of the above is reflected in the directory tree
in EDAC's sysfs interface. Starting in directory
/sys/devices/system/edac/mc each memory controller will be represented
by its own 'mcX' directory, where 'X' is the index of the MC.


        ..../edac/mc/
                   |
                   |->mc0
                   |->mc1
                   |->mc2
                   ....

Under each 'mcX' directory each csrow is again represented by its own
'csrowX' directory, where 'X' is the csrow index:


        .../mc/mc0/
                |
                |->csrow0
                |->csrow2
                |->csrow3
                ....

Notice that there is no csrow1, which indicates that csrow0 is
composed of single ranked DIMMs. This should also apply in both
Channels, in order to have dual-channel mode be operational. Since
both csrow2 and csrow3 are populated, this indicates a dual ranked
set of DIMMs for channels 0 and 1.


Within each of the 'mc', 'mcX' and 'csrowX' directories are several
EDAC control and attribute files.

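As a rough sketch, the populated csrows can be discovered by walking the
directory tree described above. The function name edac_list_csrows and its
optional base-directory argument are hypothetical conveniences, not part of
EDAC itself.

```shell
# Print "mcX/csrowY" for every csrow directory found under each memory
# controller. The optional argument overrides the sysfs base directory,
# which allows trying the function against a copy of the tree.
edac_list_csrows() {
        base=${1:-/sys/devices/system/edac/mc}
        for mc in "$base"/mc[0-9]*; do
                [ -d "$mc" ] || continue
                for csrow in "$mc"/csrow[0-9]*; do
                        [ -d "$csrow" ] || continue
                        echo "${mc##*/}/${csrow##*/}"
                done
        done
}

edac_list_csrows
```

On a system without a loaded MC driver the function prints nothing.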

============================================================================
DIRECTORY 'mc'

In directory 'mc' are EDAC system overall control and attribute files:


Panic on UE control file:

        'panic_on_ue'

        When enabled, an uncorrectable error will cause a machine panic. This
        is usually desirable. It is a bad idea to continue when an
        uncorrectable error occurs - it is indeterminate what was uncorrected
        and the operating system context might be so mangled that continuing
        will lead to further corruption. If the kernel has MCE configured,
        then EDAC will never notice the UE.

        LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]

        RUN TIME:  echo "1" >/sys/devices/system/edac/mc/panic_on_ue


Log UE control file:

        'log_ue'

        Generate kernel messages describing uncorrectable errors. These errors
        are reported through the system message log system. UE statistics
        will be accumulated even when UE logging is disabled.

        LOAD TIME: module/kernel parameter: log_ue=[0|1]

        RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue


Log CE control file:

        'log_ce'

        Generate kernel messages describing correctable errors. These
        errors are reported through the system message log system.
        CE statistics will be accumulated even when CE logging is disabled.

        LOAD TIME: module/kernel parameter: log_ce=[0|1]

        RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce


Polling period control file:

        'poll_msec'

        The time period, in milliseconds, for polling for error information.
        Too small a value wastes resources. Too large a value might delay
        necessary handling of errors and might lose valuable information for
        locating the error. 1000 milliseconds (once each second) is the current
        default. Systems which require all the bandwidth they can get may
        increase this.

        LOAD TIME: module/kernel parameter: poll_msec=<milliseconds>

        RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec

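The current values of these control files can be inspected before changing
them. The following is a minimal sketch; the function name
edac_show_mc_controls and its optional base-directory argument are
hypothetical.

```shell
# Print the current value of each 'mc' control file described above,
# or "unavailable" when a file cannot be read.
edac_show_mc_controls() {
        base=${1:-/sys/devices/system/edac/mc}
        for name in panic_on_ue log_ue log_ce poll_msec; do
                if [ -r "$base/$name" ]; then
                        echo "$name=$(cat "$base/$name")"
                else
                        echo "$name=unavailable"
                fi
        done
}

edac_show_mc_controls
```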

============================================================================
'mcX' DIRECTORIES


In 'mcX' directories are EDAC control and attribute files for
this 'X' instance of the memory controllers:


Counter reset control file:

        'reset_counters'

        This write-only control file will zero all the statistical counters
        for UE and CE errors. Zeroing the counters will also reset the timer
        indicating how long since the last counter zero. This is useful
        for computing errors/time. Since the counters are always reset at
        driver initialization time, no module/kernel parameter is available.

        RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters

        This resets the counters on memory controller 0.


Seconds since last counter reset attribute file:

        'seconds_since_reset'

        This attribute file displays how many seconds have elapsed since the
        last counter reset. This can be used with the error counters to
        measure error rates.

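An error rate can be derived by dividing a counter by the elapsed time. The
sketch below combines 'seconds_since_reset' with the 'ce_count' attribute
file of the same 'mcX' directory. The function name edac_ce_per_hour and the
per-hour unit are illustrative choices, not something EDAC provides.

```shell
# Print the correctable error rate, in errors per hour, for one memory
# controller directory (default: mc0).
edac_ce_per_hour() {
        mcdir=${1:-/sys/devices/system/edac/mc/mc0}
        [ -r "$mcdir/ce_count" ] && [ -r "$mcdir/seconds_since_reset" ] || return 1
        count=$(cat "$mcdir/ce_count")
        secs=$(cat "$mcdir/seconds_since_reset")
        # Avoid dividing by zero right after a counter reset.
        [ "$secs" -gt 0 ] || { echo "0.00"; return 0; }
        awk -v c="$count" -v s="$secs" 'BEGIN { printf "%.2f\n", c * 3600 / s }'
}

edac_ce_per_hour || echo "no EDAC memory controller found"
```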

Memory Controller name attribute file:

        'mc_name'

        This attribute file displays the type of memory controller
        that is being utilized.


Total memory managed by this memory controller attribute file:

        'size_mb'

        This attribute file displays, in megabytes, the amount of memory
        that this instance of memory controller manages.


Total Uncorrectable Errors count attribute file:

        'ue_count'

        This attribute file displays the total count of uncorrectable
        errors that have occurred on this memory controller. If panic_on_ue
        is set this counter will not have a chance to increment,
        since EDAC will panic the system.


Total UE count that had no information attribute file:

        'ue_noinfo_count'

        This attribute file displays the number of UEs that have occurred
        with no information as to which DIMM slot is having errors.


Total Correctable Errors count attribute file:

        'ce_count'

        This attribute file displays the total count of correctable
        errors that have occurred on this memory controller. This
        count is very important to examine. CEs provide early
        indications that a DIMM is beginning to fail. This count
        field should be monitored for non-zero values, and any such
        values should be reported to the system administrator.


Total CE count that had no information attribute file:

        'ce_noinfo_count'

        This attribute file displays the number of CEs that
        have occurred with no information as to which DIMM slot
        is having errors. Memory is handicapped, but operational,
        yet no information is available to indicate which slot
        the failing memory is in. This count field should also
        be monitored for non-zero values.

Device Symlink:

        'device'

        Symlink to the memory controller device.

SDRAM memory scrubbing rate:

        'sdram_scrub_rate'

        Read/Write attribute file that controls memory scrubbing. The scrubbing
        rate is set by writing a minimum bandwidth in bytes/sec to the attribute
        file. The rate will be translated to an internal value that gives at
        least the specified rate.

        Reading the file will return the actual scrubbing rate employed.

        If configuration fails or memory scrubbing is not implemented, the value
        of the attribute file will be -1.

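The write-then-read-back behaviour of 'sdram_scrub_rate' can be sketched as
follows. The function name edac_set_scrub_rate is hypothetical, and the
example rate in the usage note is an arbitrary value, not a recommendation.

```shell
# Request a minimum scrub bandwidth (bytes/sec) on one memory controller
# directory (default: mc0) and print the rate actually employed. Returns
# non-zero when the file is not writable or when it reads back -1.
edac_set_scrub_rate() {
        rate=$1
        mcdir=${2:-/sys/devices/system/edac/mc/mc0}
        file=$mcdir/sdram_scrub_rate
        [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }
        printf '%s\n' "$rate" >"$file" || return 1
        actual=$(cat "$file")
        if [ "$actual" = "-1" ]; then
                echo "memory scrubbing not configured" >&2
                return 1
        fi
        echo "$actual"
}
```

For example, "edac_set_scrub_rate 1000000" asks memory controller 0 for at
least 1000000 bytes/sec.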

============================================================================
'csrowX' DIRECTORIES

In the 'csrowX' directories are EDAC control and attribute files for
this 'X' instance of csrow:


Total Uncorrectable Errors count attribute file:

        'ue_count'

        This attribute file displays the total count of uncorrectable
        errors that have occurred on this csrow. If panic_on_ue is set
        this counter will not have a chance to increment, since EDAC
        will panic the system.


Total Correctable Errors count attribute file:

        'ce_count'

        This attribute file displays the total count of correctable
        errors that have occurred on this csrow. This
        count is very important to examine. CEs provide early
        indications that a DIMM is beginning to fail. This count
        field should be monitored for non-zero values, and any such
        values should be reported to the system administrator.


Total memory managed by this csrow attribute file:

        'size_mb'

        This attribute file displays, in megabytes, the amount of memory
        that this csrow contains.


Memory Type attribute file:

        'mem_type'

        This attribute file will display what type of memory is currently
        on this csrow. Normally, either buffered or unbuffered memory.
        Examples:
                Registered-DDR
                Unbuffered-DDR


EDAC Mode of operation attribute file:

        'edac_mode'

        This attribute file will display what type of Error detection
        and correction is being utilized.


Device type attribute file:

        'dev_type'

        This attribute file will display what type of DRAM device is
        being utilized on this DIMM.
        Examples:
                x1
                x2
                x4
                x8


Channel 0 CE Count attribute file:

        'ch0_ce_count'

        This attribute file will display the count of CEs on this
        DIMM located in channel 0.


Channel 0 UE Count attribute file:

        'ch0_ue_count'

        This attribute file will display the count of UEs on this
        DIMM located in channel 0.


Channel 0 DIMM Label control file:

        'ch0_dimm_label'

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.


Channel 1 CE Count attribute file:

        'ch1_ce_count'

        This attribute file will display the count of CEs on this
        DIMM located in channel 1.


Channel 1 UE Count attribute file:

        'ch1_ue_count'

        This attribute file will display the count of UEs on this
        DIMM located in channel 1.


Channel 1 DIMM Label control file:

        'ch1_dimm_label'

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.

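Since labels have to be assigned from userland after booting, a boot script
might use a helper along these lines. This is a sketch only: the function
name edac_set_dimm_label is hypothetical, and the correct label for each
csrow and channel must come from knowledge of the particular motherboard.

```shell
# Write a label to the chN_dimm_label control file of one csrow directory.
# Arguments: csrow directory, channel index, label text.
edac_set_dimm_label() {
        csrowdir=$1
        channel=$2
        label=$3
        file=$csrowdir/ch${channel}_dimm_label
        [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }
        printf '%s\n' "$label" >"$file"
}
```

For example, "edac_set_dimm_label /sys/devices/system/edac/mc/mc0/csrow0 1
DIMM_B0" labels channel 1 of csrow0 on memory controller 0.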

============================================================================
SYSTEM LOGGING

If logging for UEs and CEs is enabled then system logs will have
error notices indicating errors that have been detected:

EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac

EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac


The structure of the message is:
        the memory controller                   (MC0)
        Error type                              (CE)
        memory page                             (0x283)
        offset in the page                      (0xce0)
        the byte granularity                    (grain 8)
                or resolution of the error
        the error syndrome                      (0x6ec3)
        memory row                              (row 0)
        memory channel                          (channel 1)
        DIMM label, if set prior                (DIMM_B1)
        and then an optional, driver-specific message that may
                have additional information.

Both UEs and CEs with no info will lack all but memory controller,
error type, a notice of "no info" and then an optional,
driver-specific error message.

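The message structure above lends itself to simple field extraction. The
sketch below assumes each notice arrives as a single log line (the examples
above are wrapped only for display) and that any syslog prefix precedes the
word EDAC. The function name edac_parse_line and the key=value output format
are illustrative choices.

```shell
# Read log lines on stdin and print the fields of each EDAC CE/UE notice
# as key=value pairs. Lines that do not match are silently skipped.
edac_parse_line() {
        sed -n 's/^.*EDAC \(MC[0-9]*\): \([CU]E\) page \(0x[0-9a-f]*\), offset \(0x[0-9a-f]*\), grain \([0-9]*\), syndrome \(0x[0-9a-f]*\), row \([0-9]*\), channel \([0-9]*\) "\([^"]*\)": .*$/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8 label=\9/p'
}

echo 'EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac' |
        edac_parse_line
```

Notices with "no info" do not carry these fields and are therefore not
matched by this pattern.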

============================================================================
PCI Bus Parity Detection


On Header Type 00 devices the primary status is looked at
for any parity error regardless of whether Parity is enabled on the
device. (The spec indicates parity is generated in some cases.)
On Header Type 01 bridges, the secondary status register is also
looked at to see if a parity error occurred on the bus on the other
side of the bridge.


SYSFS CONFIGURATION

Under /sys/devices/system/edac/pci are control and attribute files as follows:


Enable/Disable PCI Parity checking control file:

        'check_pci_parity'


        This control file enables or disables the PCI Bus Parity scanning
        operation. Writing a 1 to this file enables the scanning. Writing
        a 0 to this file disables the scanning.

        Enable:
                echo "1" >/sys/devices/system/edac/pci/check_pci_parity

        Disable:
                echo "0" >/sys/devices/system/edac/pci/check_pci_parity



Panic on PCI PARITY Error:

        'panic_on_pci_parity'


        This control file enables or disables panicking when a parity
        error has been detected.


        module/kernel parameter: panic_on_pci_parity=[0|1]

        Enable:
                echo "1" >/sys/devices/system/edac/pci/panic_on_pci_parity

        Disable:
                echo "0" >/sys/devices/system/edac/pci/panic_on_pci_parity


Parity Count:

        'pci_parity_count'

        This attribute file will display the number of parity errors that
        have been detected.

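A monitoring script might check the parity counter periodically. The
following is a minimal sketch; the function name edac_pci_parity_report, its
optional base-directory argument and the message wording are hypothetical.

```shell
# Report whether any PCI parity errors have been counted so far.
edac_pci_parity_report() {
        base=${1:-/sys/devices/system/edac/pci}
        if [ ! -r "$base/pci_parity_count" ]; then
                echo "PCI parity count unavailable"
                return 0
        fi
        count=$(cat "$base/pci_parity_count")
        if [ "$count" -gt 0 ]; then
                echo "WARNING: $count PCI parity error(s) detected"
        else
                echo "no PCI parity errors detected"
        fi
}

edac_pci_parity_report
```

Given the false positives described earlier, a non-zero count is a reason
to investigate, not proof of a failing device.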

=======================================================================