Difference between revisions of "WhyAmIgettingMemoryErrors"

From EdacWiki

Revision as of 20:03, 23 March 2006

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

You may well have been experiencing these errors for a while; it's just that nothing was checking for them until you enabled the EDAC module. Note that if you are getting UEs (uncorrectable errors), your system is probably experiencing data corruption, so you should really check this out. This is why EDAC is set to `panic()` on UEs by default.
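
To see how many errors have been logged so far, a minimal sketch along the following lines may help. It assumes the EDAC driver exposes per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`; the function name `read_edac_counts` is made up for this example, and the path may differ on your kernel.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller sysfs tree; pass a
# different root if your kernel exposes it elsewhere.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            counter = mc_dir / name
            # A missing counter file is reported as None rather than guessed.
            entry[key] = int(counter.read_text().strip()) if counter.is_file() else None
        counts[mc_dir.name] = entry
    return counts
```

Calling `read_edac_counts()` with no arguments reads the live counters; an empty result suggests that no EDAC memory controller is registered at the assumed path.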

The reason that you are seeing problems is very likely to be one of:

  • Your RAM is bad
  • Your Motherboard is bad
  • Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64)
  • The connection between your motherboard and your CPU or memory module is bad
  • Some of your hardware is being operated outside of its design specification, such as:
    • Things are being run too hot
    • Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation)
    • Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop")
  • You have seen one or more "Single Event Upsets" - see SoftErrors
  • Memory ECC check bits are not properly initialised by BIOS prior to Linux boot
  • The EDAC module is buggy
  • Memory loading is exceeded
  • The power supply is insufficient

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

| Problem              | Error Addresses | Error Slot or Row                             | Error Frequency                                                            |
|----------------------|-----------------|-----------------------------------------------|----------------------------------------------------------------------------|
| Bad Memory Module(s) | Single/Few      | Probably only 1                               | May vary if part is marginally out of spec                                 |
| Bad Motherboard      | Probably many   | Maybe 1, maybe many                           | ?                                                                          |
| Bad Connection       | Probably many   | 1 (bad mem), prob all (bad CPU)               | ?                                                                          |
| Temp out of spec     | Probably few    | Maybe one, if different mem mfrs/parts        | Usually higher, with higher temp                                           |
| Timings out of spec  | Probably few    | Maybe one, if different mem mfrs/parts        | Usually higher, with higher temp                                           |
| Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts        | Usually higher at higher system load                                       |
| Bad check bit init   | Probably random | Probably all                                  | High/very high (stops after a while for systems with background scrub)     |
| Single event upsets  | Random          | Varies with effective "cross-section" of part | Very rare                                                                  |
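
To compare your own errors against the table above, it can help to count how many distinct addresses and rows are involved. The following is only a sketch: `summarise_errors` is a hypothetical helper, and it assumes you have already extracted `(address, row)` pairs from your logs by whatever means suits your system.

```python
from collections import Counter


def summarise_errors(errors):
    """Summarise (address, row) pairs taken from logged memory errors.

    `errors` is an iterable of (address, row) tuples; extracting them from
    the logs is not shown here.
    """
    errors = list(errors)
    addresses = Counter(addr for addr, _ in errors)
    rows = Counter(row for _, row in errors)
    return {
        "total": len(errors),
        "distinct_addresses": len(addresses),
        "distinct_rows": len(rows),
        # Row with the most errors, or None when nothing was logged.
        "busiest_row": rows.most_common(1)[0][0] if rows else None,
    }
```

A few distinct addresses confined to one row would be consistent with the "Bad Memory Module(s)" entry, whereas many addresses spread over all rows points elsewhere in the table. The summary is a starting point for reading the table, not a diagnosis.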

Things to try to isolate the problem

General:

  • Get a second opinion e.g. from [[1]] or [[2]] - note that you should be sure that either:
    • The memory testing software knows how to disable ECC on your system, or
    • You have disabled ECC before running memory tester (note that memtest86 currently displays "ECC: No" on chipsets which have ECC, but which it doesn't know about!)
  • This may not catch problems like power-supply related problems, which don't occur when the memory tester is running
  • Use a system stress tester such as "burnbx" from [[3]]
  • Put your system under stress by (e.g.) running a parallelised Linux kernel build, whilst doing some heavy 3D graphics display, and a lot of disk I/O
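
When stressing the system, it is useful to know whether the error rate actually changes with load. A rough sketch of the arithmetic, assuming you note the error count and the time at the start and end of each run (the helper name `errors_per_hour` is invented for this example):

```python
def errors_per_hour(count_start, count_end, seconds_elapsed):
    """Convert two readings of an error counter into an hourly rate."""
    if seconds_elapsed <= 0:
        raise ValueError("seconds_elapsed must be positive")
    if count_end < count_start:
        # The counter went backwards, e.g. it was reset between readings.
        raise ValueError("counter decreased between readings")
    return (count_end - count_start) * 3600.0 / seconds_elapsed
```

For instance, a counter that goes from 10 to 16 over half an hour of load gives `errors_per_hour(10, 16, 1800)`, i.e. 12 errors per hour, which can then be compared with the rate measured while the machine is idle.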

Suspected bad module:

  • Remove Module
  • Move Module to different slot (do the errors move with the module?)
  • Move Module to different machine
  • See "suspected temp out of spec"
  • See "suspected timings out of spec"
  • See "suspected voltage out of spec"
  • Clean connections
  • Check Memory Loading
    • Some memory controllers can only support so many 'ranks' of memory at a given speed. For example, Opterons/Athlon64s can support only 4 ranks of 2 GB at PC3200. See http://www.valueram.com/memoryranks/default.asp for definitions.

Suspected bad motherboard:

  • Check motherboard docs for memory module compatibility
  • Move modules to different slots
  • Clean connections
  • Upgrade BIOS
  • Select BIOS "fail-safe defaults" (or equivalent), then change settings from there to isolate the cause

Suspected bad connection:

Suspected temp out of spec:

  • Measure temp, compare to published specs:
    • Use internal machine sensors (motherboard, hard drive etc.) if possible
    • Use a temperature probe
  • Check airflow
  • De-dust
  • Lower temp:
    • Lower room temp
    • Increase cooling
    • Improve airflow (tidy cables etc.)
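
If your machine's internal sensors are exposed through the Linux hwmon interface, the readings can be collected with something like the sketch below. It assumes `temp*_input` files under `/sys/class/hwmon`, which report millidegrees Celsius; the function names are made up for this example, and the limit you compare against must come from your own parts' published specs.

```python
from pathlib import Path

# Assumed location of the hwmon sysfs tree.
HWMON_ROOT = Path("/sys/class/hwmon")


def read_temperatures(root=HWMON_ROOT):
    """Return {"hwmonN/tempM": degrees_celsius} for every temp*_input file."""
    readings = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # Skip sensors that cannot be read or parsed.
        label = f"{sensor.parent.name}/{sensor.name[:-len('_input')]}"
        readings[label] = millidegrees / 1000.0
    return readings


def over_limit(readings, limit_c):
    """Return only the readings above limit_c (a limit you supply)."""
    return {name: temp for name, temp in readings.items() if temp > limit_c}
```

Which sensor corresponds to which component varies between boards, so check the labels against your motherboard documentation before drawing conclusions.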

Suspected timings out of spec:

  • Try different BIOS version
  • Set pessimistic memory timings in BIOS
  • Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project
  • Try disabling "spread spectrum" in the BIOS (easy), or by using an i2c driver for your board's clock generator (hard)

Suspected voltage out of spec:

  • Check PSU specs vs. total demand of system components
  • Swap power supply with another machine
  • Fit voltage regulator/spike suppressor to machine power supply
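
The comparison of PSU rating against total demand is simple arithmetic, sketched below. The helper `psu_headroom` and the 20% safety margin are assumptions made for illustration, and the wattages must come from your own components' specs.

```python
def psu_headroom(psu_rating_w, component_draw_w, margin=0.2):
    """Return spare watts after total demand plus a safety margin.

    component_draw_w maps component names to their peak draw in watts.
    The default margin of 20% is an arbitrary illustrative choice.
    """
    total = sum(component_draw_w.values())
    return psu_rating_w - total * (1.0 + margin)
```

A negative result suggests that the supply is undersized for the stated demand. A positive result does not rule out brief spikes or droops, nor does it say anything about the limits of individual supply rails.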

Suspected single event upsets:

  • Fit less susceptible components
  • Move to a lower altitude, or area with lower cosmic radiation
  • Move your data centre underground
  • Improve error-reporting utilities to ignore them

Suspected bad check-bit init:

  • Upgrade BIOS
  • Don't enable BIOS "quick boot"
  • Don't manually skip BIOS memory check