Difference between revisions of "WhyAmIgettingMemoryErrors"

From EdacWiki

Revision as of 13:10, 9 March 2006

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

The reason that you are seeing problems is very likely to be one of:

  • Your RAM is bad
  • Your Motherboard is bad
  • Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64)
  • The connection between your motherboard and your CPU or memory module is bad
  • Some of your hardware is being operated outside of its design specification, such as:
    • Things are being run too hot
    • Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation)
    • Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop")
  • You have seen one or more "Single Event Upsets" - see SoftErrors
  • Memory ECC check bits are not properly initialised by BIOS prior to Linux boot
  • The EDAC module is buggy

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

Problem              | Error Addresses | Error Slot or Row                             | Error Frequency
---------------------|-----------------|-----------------------------------------------|----------------
Bad Memory Module(s) | Single/Few      | Probably only 1                               | May vary if part is marginally out of spec
Bad Motherboard      | Probably many   | Maybe 1, maybe many                           | ?
Bad Connection       | Probably many   | 1 (bad mem), prob all (bad CPU)               | ?
Temp out of spec     | Probably few    | Maybe one, if different mem mfrs/parts        | Usually higher, with higher temp
Timings out of spec  | Probably few    | Maybe one, if different mem mfrs/parts        | Usually higher, with higher temp
Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts        | Usually higher at higher system load
Bad check bit init   | Probably random | Probably all                                  | High/very high (stops after a while for systems with background scrub)
Single event upsets  | Random          | Varies with effective "cross-section" of part | Very rare
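
To fill in the "Error Slot or Row" column for your own machine, it helps to see which rows have actually logged errors. The following is a minimal sketch, assuming the EDAC driver exposes per-row counters as mc*/csrow*/ce_count and mc*/csrow*/ue_count under /sys/devices/system/edac/mc (the layout may differ between kernel versions, and some kernels expose per-DIMM directories instead). The function names are made up for this example.

```python
from pathlib import Path


def read_count(path: Path) -> int:
    """Return the integer stored in a sysfs counter file, or 0 if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def csrow_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Map 'mcN/csrowM' to (correctable, uncorrectable) error counts.

    Assumes the csrow-based sysfs layout; returns {} if nothing matches.
    """
    counts = {}
    for csrow in sorted(Path(edac_root).glob("mc*/csrow*")):
        if not csrow.is_dir():
            continue
        key = f"{csrow.parent.name}/{csrow.name}"
        counts[key] = (read_count(csrow / "ce_count"),
                       read_count(csrow / "ue_count"))
    return counts


def rows_with_errors(counts):
    """Return the rows that have logged at least one error."""
    return sorted(row for row, (ce, ue) in counts.items() if ce or ue)


if __name__ == "__main__":
    counts = csrow_error_counts()
    for row, (ce, ue) in sorted(counts.items()):
        print(f"{row}: correctable={ce} uncorrectable={ue}")
    bad = rows_with_errors(counts)
    print(f"{len(bad)} of {len(counts)} rows report errors")
```

Read against the table, errors confined to one row point towards a bad module, while errors on every row point towards the motherboard, the CPU connection, or check-bit initialisation.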

Things to try to isolate the problem

Suspected bad module:

  • Remove Module
  • Move Module to different slot (do the errors move with the module?)
  • Move Module to different machine
  • See "suspected temp out of spec"
  • See "suspected timings out of spec"
  • See "suspected voltage out of spec"
  • Clean connections

Suspected bad motherboard:

  • Check motherboard docs for memory module compatibility
  • Move modules to different slots
  • Clean connections
  • Upgrade BIOS
  • Select BIOS "fail-safe defaults" (or equivalent), then change settings from there to isolate the cause

Suspected bad connection:

Suspected temp out of spec:

  • Measure temp, compare to published specs:
    • Use internal machine sensors (motherboard, hard drive etc.) if possible
    • Use a temperature probe
  • Check airflow
  • De-dust
  • Lower temp:
    • Lower room temp
    • Increase cooling
    • Improve airflow (tidy cables etc.)
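
For the "use internal machine sensors" step, here is a rough sketch, assuming your sensor driver publishes readings through the Linux hwmon sysfs interface as hwmon*/temp*_input files holding millidegrees Celsius. Some drivers place these files elsewhere (for example under a device/ subdirectory), so treat the path as an assumption. The limit you compare against has to come from the published spec of the part; the function names are made up for this example.

```python
from pathlib import Path


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {sensor_path: degrees_celsius} for every readable temp*_input file."""
    temps = {}
    for sensor in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable or returned non-numeric data; skip it
        temps[str(sensor)] = millidegrees / 1000.0
    return temps


def over_limit(temps, limit_c):
    """Return the sensors whose reading exceeds limit_c (from the part's published spec)."""
    return {name: t for name, t in temps.items() if t > limit_c}


if __name__ == "__main__":
    for name, celsius in read_temperatures().items():
        print(f"{name}: {celsius:.1f} C")
```

Logging these readings alongside the error counts over time shows whether the error frequency rises with temperature.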

Suspected timings out of spec:

  • Try different BIOS version
  • Set pessimistic memory timings in BIOS
  • Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project
  • Try disabling "spread spectrum" in the BIOS (easy), or by using an i2c driver for your board's clock generator (hard)

Suspected voltage out of spec:

  • Check PSU specs vs. total demand of system components
  • Swap power supply with another machine
  • Fit voltage regulator/spike suppressor to machine power supply
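
The "PSU specs vs. total demand" check is simple arithmetic, but it is best done per supply rail rather than on total wattage alone. A minimal sketch follows; all the rail ratings and demand figures are hypothetical placeholders, to be replaced with values from your PSU label and component datasheets.

```python
def rail_headroom(rail_ratings_a, rail_demand_a):
    """Return {rail: rated_amps - demanded_amps}; a negative value means the rail is overloaded."""
    return {rail: rail_ratings_a[rail] - rail_demand_a.get(rail, 0.0)
            for rail in rail_ratings_a}


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    ratings = {"+12V": 18.0, "+5V": 30.0, "+3.3V": 28.0}
    demand = {"+12V": 14.5, "+5V": 32.0, "+3.3V": 10.0}
    for rail, spare in rail_headroom(ratings, demand).items():
        status = "OVERLOADED" if spare < 0 else "ok"
        print(f"{rail}: {spare:+.1f} A spare ({status})")
```

A rail with little or no headroom is a candidate for the load-dependent errors described in the symptoms table.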

Suspected single event upsets:

  • Fit less susceptible components
  • Move to a lower altitude, or area with lower cosmic radiation
  • Move your data centre underground
  • Improve error-reporting utilities to ignore them

Suspected bad check-bit init:

  • Upgrade BIOS
  • Don't enable BIOS "quick boot"
  • Don't manually skip BIOS memory check