Difference between revisions of "WhyAmIgettingMemoryErrors"

From EdacWiki
 

Revision as of 12:33, 9 March 2006

Why Am I Getting Memory Errors?

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

The reason that you are seeing problems is very likely to be one of:

  • Your RAM is bad
  • Your Motherboard is bad
  • Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64)
  • The connection between your motherboard and your CPU or memory module is bad
  • Some of your hardware is being operated outside of its design specification, such as:
    • Things are being run too hot
    • Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation)
    • Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop")
  • You have seen one or more "Single Event Upsets" - see SoftErrors
  • The EDAC module is buggy

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

Problem              | Error Addresses | Error Slot or Row                                       | Error Frequency
---------------------+-----------------+---------------------------------------------------------+-------------------------------------------
Bad Memory Module(s) | Single/Few      | Probably only 1                                         | May vary if part is marginally out of spec
Bad Motherboard      | Probably many   | Maybe 1, maybe many                                     | ?
Bad Connection       | Probably many   | 1 (bad mem), prob all (bad CPU)                         | ?
Temp out of spec     | Probably few    | Maybe one, if different mem mfrs/parts                  | Usually higher, with higher temp
Timings out of spec  | Probably few    | Maybe one, if different mem mfrs/parts                  | Usually higher, with higher temp
Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts                  | Usually higher at higher system load
Single event upsets  | Random          | Varies with effective size, and "cross-section" of part | Very rare
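
To fill in the "Error Slot or Row" column for your own machine, it helps to see how the corrected-error counts are spread across rows. The sketch below is one possible way to do that, not an official EDAC tool. It assumes a kernel that exposes per-csrow counters under /sys/devices/system/edac/mc/mcN/csrowM/ce_count; the function names and the wording of the summaries are made up for illustration.

```python
from pathlib import Path


def read_csrow_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {(controller, csrow): corrected_error_count}.

    Assumes the per-csrow EDAC sysfs layout (mcN/csrowM/ce_count).
    A missing directory simply yields an empty dict.
    """
    counts = {}
    for path in sorted(Path(edac_root).glob("mc*/csrow*/ce_count")):
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed counter: skip it
        counts[(path.parent.parent.name, path.parent.name)] = value
    return counts


def summarize_rows(counts):
    """Describe how corrected errors are spread across rows."""
    affected = sorted(key for key, n in counts.items() if n > 0)
    if not affected:
        return "no corrected errors recorded"
    if len(affected) == 1:
        return "errors confined to one row: %s/%s" % affected[0]
    return "errors spread over %d rows" % len(affected)
```

Calling summarize_rows(read_csrow_ce_counts()) gives a one-line summary. Errors confined to one row point more towards a single module; errors spread over many rows point more towards the motherboard, a connection, or the CPU.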

Things to try

Suspected bad module:

  • Remove Module
  • Move Module to different slot (do the errors move with the module?)
  • Move Module to different machine
  • See "suspected temp out of spec"
  • See "suspected timings out of spec"
  • See "suspected voltage out of spec"
  • Clean connections
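
The "do the errors move with the module" test can be written down as a small decision rule. This is only a sketch under some assumptions: you have noted the number of new errors per slot over an observation period before and after the move, and the slot names and function name here are hypothetical.

```python
def errors_followed_module(before, after, old_slot, new_slot):
    """Decide whether errors moved with a module that was relocated.

    before/after map a slot name to the number of new errors seen
    in that slot during each observation period.
    Returns "module", "slot" or "inconclusive".
    """
    if before.get(old_slot, 0) == 0:
        return "inconclusive"  # nothing was wrong in the old slot to begin with
    old_after = after.get(old_slot, 0)
    new_after = after.get(new_slot, 0)
    if new_after > 0 and old_after == 0:
        return "module"  # errors travelled with the module
    if old_after > 0 and new_after == 0:
        return "slot"  # errors stayed behind: suspect slot or motherboard
    return "inconclusive"
```

For example, errors_followed_module({"DIMM0": 12}, {"DIMM2": 9}, "DIMM0", "DIMM2") returns "module". An "inconclusive" result usually means you need a longer observation period, or that more than one thing is wrong.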

Suspected bad motherboard:

  • Check motherboard docs for memory module compatibility
  • Move modules to different slots
  • Clean connections
  • Upgrade BIOS
  • Select BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause

Suspected bad connection:

  • Visually check connectors, pins, modules etc.
  • Clean things up!

Suspected temp out of spec:

  • Measure temp, compare to published specs:
    • Use internal machine sensors (motherboard, hard drive etc.) if possible
    • Use a temperature probe
  • Check airflow
  • De-dust
  • Lower temp:
    • Lower room temp
    • Increase cooling
    • Improve airflow (tidy cables etc.)
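
For the "use internal machine sensors" step, one possible approach on Linux is to read the hwmon sysfs files directly. The sketch below assumes that a sensor driver is loaded and that readings appear as /sys/class/hwmon/hwmonN/tempM_input in millidegrees Celsius (some older kernels place them elsewhere). The function names and the limit value are illustrative; take the real limit from the published specs of your parts.

```python
from pathlib import Path


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/tempM_input": degrees_celsius} for readable sensors.

    Assumes the hwmon sysfs convention of reporting millidegrees Celsius.
    """
    readings = {}
    for path in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now: skip it
        readings["%s/%s" % (path.parent.name, path.name)] = millidegrees / 1000.0
    return readings


def over_limit(readings, limit_c):
    """Return only the sensors reading above limit_c degrees Celsius."""
    return {name: temp for name, temp in readings.items() if temp > limit_c}
```

Something like over_limit(read_temperatures(), 85.0) lists the sensors above an assumed 85 °C limit. Logging these readings alongside the error counts over time makes it easier to see whether errors become more frequent as the temperature rises.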

Suspected timings out of spec:

  • Try different BIOS version
  • Set pessimistic memory timings
  • Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project

Suspected voltage out of spec:

  • Check PSU specs vs. total demand of system components
  • Swap power supply with another machine
  • Fit voltage regulator/spike suppressor to machine power supply

Suspected single event upsets:

  • Fit less susceptible components
  • Move to a lower altitude
  • Move your data centre underground
  • Improve detection utilities to ignore them