WhyAmIgettingMemoryErrors

From EdacWiki

Latest revision as of 14:51, 9 November 2010

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

You may well have been experiencing these errors for a while; nothing was checking for them until you enabled the EDAC module. If you are getting UEs (uncorrectable errors), your system is probably experiencing data corruption, so you should investigate promptly (this is why EDAC is set to `panic()` on UEs by default).
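To see how many errors EDAC has recorded so far, the counters it exposes through sysfs can be read directly. The following is a minimal sketch, assuming the EDAC driver publishes `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mc*/`; the function name `read_edac_counts` is made up for this illustration.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller directories in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # EDAC module not loaded, or no sysfs support
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A non-zero uncorrected count is the case that warrants immediate attention; a slowly growing corrected count is what the rest of this page helps to diagnose.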

The reason that you are seeing problems is very likely to be one of:

  • Your RAM is bad.
  • Your motherboard is bad.
  • Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64).
  • The connection between your motherboard and your CPU or memory module is bad.
  • Some of your hardware is being operated outside of its design specification, for example:
    • Things are being run too hot.
    • Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation).
    • Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop").
  • You have seen one or more "Single Event Upsets" - see SoftErrors.
  • Memory ECC check bits are not properly initialised by the BIOS prior to Linux boot. See Uninitialized ECC bits.
  • The EDAC module is buggy.
  • Memory loading is exceeded.
  • The power supply is insufficient.

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

Problem | Error Addresses | Error Slot or Row | Error Frequency
Bad Memory Module(s) | Single/Few | Probably only 1 | May vary if part is marginally out of spec
Bad Motherboard | Probably many | Maybe 1, maybe many | ?
Bad Connection | Probably many | 1 (bad mem), prob all (bad CPU) | ?
Temp out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp
Timings out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp
Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts | Usually higher at higher system load
Bad BIOS check bit init | Probably random | Probably all | High/very high (stops after a while for systems with background scrub)
Single event upsets | Random | Varies with effective "cross-section" of part | Rare - more common with some part designs, and at high altitude etc.
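Whether errors land on one row or on many is one of the more useful symptoms above. The sketch below shows one way to summarise that spread, assuming the driver exposes per-row `csrow*/ce_count` files under a controller directory such as `/sys/devices/system/edac/mc/mc0`; both function names are hypothetical, and the wording of the summary is only illustrative.

```python
from pathlib import Path


def read_csrow_ce_counts(mc_dir):
    """Return {csrow_name: corrected_error_count} for one memory controller."""
    rows = {}
    for row in sorted(Path(mc_dir).glob("csrow[0-9]*")):
        try:
            rows[row.name] = int((row / "ce_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable on this row
    return rows


def describe_spread(row_counts):
    """Summarise which rows have recorded at least one corrected error."""
    affected = [name for name, n in row_counts.items() if n > 0]
    if not affected:
        return "no errors recorded"
    if len(affected) == 1:
        return f"errors confined to {affected[0]}"
    if len(affected) == len(row_counts):
        return "errors on all rows"
    return "errors on several rows: " + ", ".join(affected)
```

For example, `describe_spread(read_csrow_ce_counts("/sys/devices/system/edac/mc/mc0"))` gives a one-line summary that can be compared against the "Error Slot or Row" column.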

Things to try to isolate the problem

General:

  • Get a second opinion, e.g. from Memtest86+ (http://www.memtest.org/) or memtest86 (http://www.memtest86.com/). You should be sure that either:
    • The memory testing software knows how to disable ECC on your system, or
    • You have disabled ECC before running the memory tester (note that memtest86 currently displays "ECC: No" on chipsets which have ECC, but which it doesn't know about!).
  • Note that a memory tester may not catch problems, such as power-supply issues, that do not occur while the tester is running.
  • Use a system stress tester such as "burnbx" from http://pages.sbcglobal.net/redelm/.
  • Put your system under stress by (e.g.) running a parallelised Linux kernel build, whilst doing some heavy 3D graphics display, and a lot of disk I/O.
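When comparing an idle system with one under stress, it helps to turn two counter readings into a rate. This is a small sketch of that arithmetic only; the function name `errors_per_hour` is hypothetical, and how the counts are sampled (for instance from the EDAC `ce_count` files) is left to the reader.

```python
def errors_per_hour(count_start, count_end, seconds):
    """Convert two error-counter readings taken `seconds` apart into a rate."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    if count_end < count_start:
        # The counter went backwards, e.g. after a reboot or module reload.
        raise ValueError("counter was reset between samples")
    return (count_end - count_start) * 3600.0 / seconds
```

For instance, a counter that moves from 10 to 16 over half an hour gives `errors_per_hour(10, 16, 1800)`, i.e. 12 errors per hour. A rate that rises clearly under load points towards the temperature, timing or voltage rows of the symptoms table.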

Suspected bad module:

  • Remove the module.
  • Move the module to a different slot (do the errors move with the module?).
  • Move the module to a different machine.
  • See "suspected temp out of spec".
  • See "suspected timings out of spec".
  • See "suspected voltage out of spec".
  • Clean connections.
  • Check memory loading:
    • Some memory controllers can only support so many 'ranks' of memory at a given speed.
    • For example, Opterons/Athlon64s can support only 4 ranks of 2 GB at PC3200.
    • See http://www.valueram.com/memoryranks/default.asp for definitions.

Suspected bad motherboard:

  • Check motherboard docs for memory module compatibility.
  • Move modules to different slots.
  • Clean connections.
  • Upgrade BIOS.
  • Select BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause.

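Suspected bad connection:

  • Visually check connectors, pins, modules etc.
  • Clean the edge connectors (see HowToCleanEdgeConnectors).
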
Suspected temp out of spec:

  • Measure temp, compare to published specs:
    • Use internal machine sensors (motherboard, hard drive etc.) if possible.
    • Use a temperature probe or infra-red thermometer.
  • Check airflow.
  • De-dust.
  • Lower temp:
    • Lower room temp.
    • Increase cooling.
    • Improve airflow (tidy cables etc.).
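Where the machine has internal sensors, their readings are usually available through the kernel's hwmon sysfs interface. The following is a minimal sketch, assuming sensor drivers are loaded and expose `temp*_input` files (in millidegrees Celsius) under `/sys/class/hwmon/hwmon*/`; the function name `read_hwmon_temps` is made up for this example.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN/tempM_input": degrees_celsius} for every readable sensor."""
    readings = {}
    for f in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(f.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor present but not readable right now
        readings[f"{f.parent.name}/{f.name}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for name, celsius in read_hwmon_temps().items():
        print(f"{name}: {celsius:.1f} C")
```

Which sensor corresponds to which component varies by board, so the readings still need to be matched against the published specs by hand.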

Suspected timings out of spec:

  • Try different BIOS version.
  • Set pessimistic memory timings in BIOS.
  • Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project.
  • Try disabling "spread spectrum" in the BIOS (easy if available), or by using an i2c driver for your board's clock generator (hard).

Suspected voltage out of spec:

  • Check PSU specs vs. total demand of system components.
  • Swap power supply with another machine.
  • Fit voltage regulator/spike suppressor to machine power supply.
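If rail voltages can be measured (with a multimeter or from board sensors), checking them against a tolerance band is simple arithmetic. The sketch below assumes a tolerance of plus or minus 5%, the figure commonly quoted for the positive ATX rails; that default and the function name `out_of_tolerance` are assumptions for illustration, so check the actual spec for your supply.

```python
def out_of_tolerance(nominal, measured, tolerance=0.05):
    """Return True if `measured` deviates from `nominal` by more than `tolerance`.

    `tolerance` is a fraction of the nominal voltage (0.05 means 5%, an
    assumed default rather than a value taken from any particular spec).
    """
    return abs(measured - nominal) > abs(nominal) * tolerance
```

For example, `out_of_tolerance(12.0, 11.3)` is true (0.7 V low against a 0.6 V allowance), while `out_of_tolerance(5.0, 5.1)` is false. A steady reading within tolerance does not rule out brief spikes or droops, which need an oscilloscope or a meter with peak capture to observe.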

Suspected single event upsets:

  • Fit less susceptible components.
  • Move to a lower altitude, or area with lower cosmic radiation.
  • Move your data centre underground.
  • Improve error-reporting utilities to ignore them.

Suspected bad check-bit init:

  • Upgrade BIOS.
  • Don't enable BIOS "quick boot".
  • Don't manually skip BIOS memory check.

Suspected insufficient power supply:

  • Try detaching some devices that are hardly used. Start with USB devices.
  • If the problems stop, either permanently reduce the number of devices, or get a higher-capacity power supply.
  • Use a DC current clamp (preferably one with a peak/inrush measurement function) to check whether a particular voltage rail is being loaded beyond its capacity.
  • This is closely related to voltage out of spec, which can also be caused simply by a broken supply.
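Once per-device currents on a rail have been measured, comparing them with the rail's rated capacity is a matter of adding them up. This is a minimal sketch of that budget; the function name `rail_headroom` and the figures in the example are hypothetical, not measurements from any real system.

```python
def rail_headroom(capacity_amps, device_amps):
    """Return the spare current (in amps) on one supply rail.

    `device_amps` maps a device label to its measured draw on that rail.
    A negative result means the rail is loaded beyond its rated capacity.
    """
    return capacity_amps - sum(device_amps.values())
```

For example, `rail_headroom(18.0, {"cpu": 8.0, "disks": 4.5, "fans": 1.0})` leaves 4.5 A spare. Use peak rather than average readings where the clamp provides them, since brief inrush currents are what tend to cause a droop.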