Difference between revisions of "WhyAmIgettingMemoryErrors"

From EdacWiki

Revision as of 13:38, 9 March 2006

Why Am I Getting Memory Errors?

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

The reason that you are seeing problems is very likely to be one of:

  • Your RAM is bad
  • Your Motherboard is bad
  • Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64)
  • The connection between your motherboard and your CPU or memory module is bad
  • Some of your hardware is being operated outside of its design specification, such as:
    • Things are being run too hot
    • Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation)
    • Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop")
  • You have seen one or more "Single Event Upsets" - see SoftErrors
  • The EDAC module is buggy

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

Problem              | Error Addresses | Error Slot or Row                             | Error Frequency
Bad Memory Module(s) | Single/Few      | Probably only 1                               | May vary if part is marginally out of spec
Bad Motherboard      | Probably many   | Maybe 1, maybe many                           | ?
Bad Connection       | Probably many   | 1 (bad mem), prob all (bad CPU)               | ?
Temp out of spec     | Probably few    | Maybe one, if different mem mfrs/parts        | Usually higher, with higher temp
Timings out of spec  | Probably few    | Maybe one, if different mem mfrs/parts        | Usually higher, with higher temp
Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts        | Usually higher at higher system load
Single event upsets  | Random          | Varies with effective "cross-section" of part | Very rare
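
As a rough aid to matching your own logs against the symptoms listed above, the sketch below counts how many distinct addresses and slots/rows a set of logged errors touches. The record format (an `(address, row)` tuple you have extracted from your logs yourself), the function name `summarise_errors` and the cut-off `few=4` are all assumptions made for illustration; none of them come from EDAC.

```python
from collections import Counter


def summarise_errors(records, few=4):
    """Summarise a list of (address, row) error records.

    `few` is an arbitrary cut-off between "few" and "many" distinct
    addresses; it is an assumption, not an EDAC-defined value.
    """
    addresses = Counter(addr for addr, _row in records)
    rows = Counter(row for _addr, row in records)
    return {
        "total_errors": len(records),
        "distinct_addresses": len(addresses),
        "distinct_rows": len(rows),
        "address_spread": "few" if len(addresses) <= few else "many",
        "row_spread": "one" if len(rows) <= 1 else "several",
    }


if __name__ == "__main__":
    # Hypothetical records, transcribed by hand from a log.
    sample = [(0x1F3A000, 2), (0x1F3A000, 2), (0x1F3A040, 2)]
    print(summarise_errors(sample))
```

A result of few addresses in one row points towards the "bad memory module" symptoms, while many addresses across several rows looks more like a motherboard or connection problem. Treat this as a hint only; the symptoms overlap.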

Things to try to isolate the problem

Suspected bad module:

  • Remove Module
  • Move Module to different slot (do the errors move with the module?)
  • Move Module to different machine
  • See "suspected temp out of spec"
  • See "suspected timings out of spec"
  • See "voltages out of spec"
  • Clean connections
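
To see whether errors follow a module when you move it, it helps to note which rows are reporting errors before and after the move. The sketch below reads the corrected-error counters per chip-select row, assuming a Linux kernel whose EDAC driver exposes them as `mc*/csrow*/ce_count` under `/sys/devices/system/edac/mc`; the layout on your kernel may differ, so check the path first. The function names are made up for this example. The counters live in the kernel, so they start again from zero after a reboot.

```python
import glob
import os


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {"mcN/csrowM": corrected-error count} for every csrow found.

    The default path is an assumption about the kernel's sysfs layout.
    """
    counts = {}
    pattern = os.path.join(edac_root, "mc*", "csrow*", "ce_count")
    for path in sorted(glob.glob(pattern)):
        csrow_dir = os.path.dirname(path)
        mc_name = os.path.basename(os.path.dirname(csrow_dir))
        csrow_name = os.path.basename(csrow_dir)
        with open(path) as f:
            counts[f"{mc_name}/{csrow_name}"] = int(f.read().strip())
    return counts


def rows_with_errors(counts):
    """Return the sorted names of rows whose counter is above zero."""
    return sorted(name for name, count in counts.items() if count > 0)


if __name__ == "__main__":
    print(rows_with_errors(read_ce_counts()))
```

Run it before powering down, write down the rows it lists, move the module, and run it again once errors have had time to reappear. If the listed row changes to match the new slot, the module is the prime suspect.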

Suspected bad motherboard:

  • Check motherboard docs for memory module compatibility
  • Move modules to different slots
  • Clean connections
  • Upgrade BIOS
  • Select BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause

Suspected bad connection:

  • Visually check connectors, pins, modules etc.

Suspected temp out of spec:

  • Measure temp, compare to published specs:
    • Use internal machine sensors (motherboard, hard drive etc.) if possible
    • Use a temperature probe
  • Check airflow
  • De-dust
  • Lower temp:
    • Lower room temp
    • Increase cooling
    • Improve airflow (tidy cables etc.)
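
For the "use internal machine sensors" step, here is a minimal sketch that reads whatever temperature sensors the kernel exposes. It assumes the Linux hwmon sysfs layout (`/sys/class/hwmon/hwmon*/temp*_input`, values in thousandths of a degree Celsius); some kernels place the files one directory deeper, and some machines expose no sensors at all. The function names are hypothetical, and `limit_c` must come from the published spec of the part you are checking.

```python
import glob
import os


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {sensor file path: temperature in degrees Celsius}."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # hwmon reports millidegrees Celsius
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # unreadable or malformed sensor: skip it
    return temps


def over_limit(temps, limit_c):
    """Return the sensors reading above limit_c (taken from the part's spec)."""
    return {path: t for path, t in temps.items() if t > limit_c}


if __name__ == "__main__":
    for path, t in read_temperatures().items():
        print(f"{path}: {t:.1f} C")
```

Logging these readings alongside the error timestamps should show whether the error rate rises with temperature, which is the pattern the symptoms section describes for this case.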

Suspected timings out of spec:

  • Try different BIOS version
  • Set pessimistic memory timings in BIOS
  • Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project

Suspected voltage out of spec:

  • Check PSU specs vs. total demand of system components
  • Swap power supply with another machine
  • Fit voltage regulator/spike suppressor to machine power supply
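
The "PSU specs vs. total demand" check is simple arithmetic, sketched below. Every wattage in the example is a made-up placeholder; take the real figures from your PSU label and component datasheets. The sketch only compares totals, so it says nothing about the limits of individual supply rails, which you should check separately.

```python
def psu_headroom(psu_rating_w, component_draw_w):
    """Return (total demand, remaining headroom) in watts.

    component_draw_w maps a component name to its worst-case draw in watts.
    """
    total = sum(component_draw_w.values())
    return total, psu_rating_w - total


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    demand = {"cpu": 90, "memory": 20, "disks": 30, "motherboard": 40}
    total, headroom = psu_headroom(400, demand)
    print(f"demand {total} W, headroom {headroom} W")
```

A small or negative headroom suggests the supply may droop under load, which fits the "usually higher at higher system load" symptom.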

Suspected single event upsets:

  • Fit less susceptible components
  • Move to a lower altitude, or area with lower cosmic radiation
  • Move your data centre underground
  • Improve error-reporting utilities to ignore them