WhyAmIgettingMemoryErrors

From EdacWiki
Revision as of 14:51, 9 November 2010 by Rew (talk | contribs)

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

You may well have been experiencing these errors for a while; nothing was checking for them until you enabled the EDAC module. If you are getting UEs (uncorrectable errors), your system is probably experiencing data corruption, so you should investigate promptly. This is why EDAC is set to `panic()` on UEs by default.
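To see how many errors EDAC has recorded so far, the per-controller counters can be read from sysfs. The following is a minimal sketch, assuming the counters are exposed as `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/`; the function name `read_edac_counts` is made up for this example, and the layout may differ on your kernel.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int or None, "ue": int or None}}.

    Assumes each memory controller directory (mc0, mc1, ...) contains
    ce_count (corrected errors) and ue_count (uncorrected errors).
    """
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A non-zero uncorrected count is the case the paragraph above warns about; a slowly growing corrected count is less urgent but still worth tracking over time.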

The reason that you are seeing problems is very likely to be one of:

  • Your RAM is bad.
  • Your Motherboard is bad.
  • Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64).
  • The connection between your motherboard and your CPU or memory module is bad.
  • Some of your hardware is being operated outside of its design specification, such as:
    • Things are being run too hot.
    • Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation).
    • Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop").
  • You have seen one or more "Single Event Upsets" - see SoftErrors.
  • Memory ECC check bits are not properly initialised by the BIOS prior to Linux boot. See Uninitialized ECC bits.
  • The EDAC module is buggy.
  • The memory controller's loading limit is exceeded (too many ranks of memory for the chosen speed).
  • The power supply is insufficient.

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

Problem | Error Addresses | Error Slot or Row | Error Frequency
Bad Memory Module(s) | Single/Few | Probably only 1 | May vary if part is marginally out of spec
Bad Motherboard | Probably many | Maybe 1, maybe many | ?
Bad Connection | Probably many | 1 (bad mem), prob all (bad CPU) | ?
Temp out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp
Timings out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp
Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts | Usually higher at higher system load
Bad BIOS check bit init | Probably random | Probably all | High/very high (stops after a while for systems with background scrub)
Single event upsets | Random | Varies with effective "cross-section" of part | Rare - more common with some part designs, and at high altitude etc.
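The "Error Slot or Row" column is easier to judge with per-row counts in hand. The sketch below assumes the EDAC driver exposes a `ce_count` file per chip-select row under `mcN/csrowM/` in sysfs; the helper names `csrow_ce_counts` and `spread_hint` are hypothetical. The hint strings only restate the table and are not a diagnosis.

```python
from pathlib import Path


def csrow_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {(controller, csrow): corrected_error_count}.

    Assumes a per-row layout of mcN/csrowM/ce_count; rows whose
    counter cannot be read are skipped.
    """
    counts = {}
    pattern = "mc[0-9]*/csrow[0-9]*/ce_count"
    for f in sorted(Path(edac_root).glob(pattern)):
        try:
            value = int(f.read_text().strip())
        except (OSError, ValueError):
            continue
        counts[(f.parent.parent.name, f.parent.name)] = value
    return counts


def spread_hint(counts):
    """Summarise how widely corrected errors are spread across rows."""
    affected = sorted(k for k, v in counts.items() if v > 0)
    if not affected:
        return "no corrected errors recorded"
    if len(affected) == 1:
        mc, row = affected[0]
        return f"errors confined to {mc}/{row}: suspect that module or slot"
    if len(affected) == len(counts):
        return "errors on every row: suspect board, CPU, BIOS init or power"
    return "errors on several rows: compare the parts fitted in those rows"


if __name__ == "__main__":
    print(spread_hint(csrow_ce_counts()))
```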

Things to try to isolate the problem

General:

  • Get a second opinion e.g. from [[1]] or [[2]] - note that you should be sure that either:
    • The memory testing software knows how to disable ECC on your system, or
    • You have disabled ECC before running memory tester (note that memtest86 currently displays "ECC: No" on chipsets which have ECC, but which it doesn't know about!).
  • Note that a memory tester may not catch problems, such as power-supply faults, which don't occur while the memory tester is running.
  • Use a system stress tester such as "burnbx" from [[3]].
  • Put your system under stress by (e.g.) running a parallelised Linux kernel build, whilst doing some heavy 3D graphics display, and a lot of disk I/O.

Suspected bad module:

  • Remove Module.
  • Move Module to different slot (do the errors move with the module?).
  • Move Module to different machine.
  • See "suspected temp out of spec".
  • See "suspected timings out of spec".
  • See "suspected voltage out of spec".
  • Clean connections.
  • Check memory loading:
    • Some memory controllers can only support so many 'ranks' of memory at a given speed. For example, Opterons/Athlon64s can support only 4 ranks of 2 GB at PC3200. See http://www.valueram.com/memoryranks/default.asp for definitions.
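The memory loading check is simple arithmetic: add up the ranks on all fitted modules and compare the total with what the controller supports at the configured speed. A minimal sketch follows; the limit of 4 comes from the Opteron/Athlon64 example above, while the slot names and per-module rank counts are invented for illustration (take the real ones from the module datasheets).

```python
def total_ranks(module_ranks):
    """Sum the ranks of all fitted modules ({slot: ranks})."""
    return sum(module_ranks.values())


def loading_ok(module_ranks, max_ranks):
    """True if the fitted modules stay within the controller's rank limit."""
    return total_ranks(module_ranks) <= max_ranks


# Hypothetical configuration: three dual-rank DIMMs.
fitted = {"DIMM0": 2, "DIMM1": 2, "DIMM2": 2}
MAX_RANKS_AT_PC3200 = 4  # limit quoted in the text for Opteron/Athlon64

if __name__ == "__main__":
    print(total_ranks(fitted), "ranks fitted; within limit:",
          loading_ok(fitted, MAX_RANKS_AT_PC3200))
```

If the total exceeds the limit, removing a module or lowering the memory speed in the BIOS are the obvious things to try.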

Suspected bad motherboard:

  • Check motherboard docs for memory module compatibility.
  • Move modules to different slots.
  • Clean connections.
  • Upgrade BIOS.
  • Select BIOS "fail-safe defaults", or equivalent, then change settings from there to isolate the cause.

Suspected bad connection:

Suspected temp out of spec:

  • Measure temp, compare to published specs:
    • Use internal machine sensors (motherboard, hard drive etc.) if possible.
    • Use a temperature probe or infra-red thermometer.
  • Check airflow.
  • De-dust.
  • Lower temp:
    • Lower room temp.
    • Increase cooling.
    • Improve airflow (tidy cables etc.).
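For the internal-sensor route, Linux exposes many motherboard and CPU sensors through the hwmon sysfs interface, where `temp*_input` files hold millidegrees Celsius. The sketch below assumes a suitable sensor driver is loaded; the function names and the 50 °C limit are placeholders, so substitute the limit from your component's published spec.

```python
from pathlib import Path


def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees_celsius} for readable sensors."""
    temps = {}
    for f in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            chip = (f.parent / "name").read_text().strip()
        except OSError:
            chip = "unknown"
        sensor = f.name[: -len("_input")]
        try:
            # hwmon reports temperatures in millidegrees Celsius.
            temps[f"{f.parent.name}:{chip}/{sensor}"] = int(f.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor present but not readable right now
    return temps


def over_limit(temps, limit_c):
    """Return only the sensors reading above limit_c degrees Celsius."""
    return {k: v for k, v in temps.items() if v > limit_c}


if __name__ == "__main__":
    LIMIT_C = 50.0  # placeholder; use the published spec for your part
    for sensor, value in over_limit(read_hwmon_temps(), LIMIT_C).items():
        print(f"{sensor}: {value:.1f} C exceeds {LIMIT_C:.1f} C")
```

Logging these readings alongside the EDAC error counts makes it easier to see whether the error rate rises with temperature.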

Suspected timings out of spec:

  • Try different BIOS version.
  • Set pessimistic memory timings in BIOS.
  • Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project.
  • Try disabling "spread spectrum" in the BIOS (easy if available), or by using an i2c driver for your board's clock generator (hard).

Suspected voltage out of spec:

  • Check PSU specs vs. total demand of system components.
  • Swap power supply with another machine.
  • Fit voltage regulator/spike suppressor to machine power supply.
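Checking PSU specs against total demand is a per-rail sum. The sketch below shows the arithmetic only: every rating and current figure in it is invented, and real values should come from the PSU label and the component datasheets (or from measurement).

```python
def rail_headroom(rail_ratings_a, demands_a):
    """Return {rail: rated amps minus summed demand} for each PSU rail.

    rail_ratings_a: {rail: rated current in amps}
    demands_a:      {component: {rail: current drawn in amps}}
    """
    headroom = {}
    for rail, rating in rail_ratings_a.items():
        total = sum(d.get(rail, 0.0) for d in demands_a.values())
        headroom[rail] = rating - total
    return headroom


# Hypothetical figures, for illustration only.
psu = {"+12V": 18.0, "+5V": 20.0}
parts = {
    "cpu": {"+12V": 8.0},
    "disks": {"+12V": 4.0, "+5V": 2.0},
    "board+ram": {"+12V": 2.0, "+5V": 10.0},
}

if __name__ == "__main__":
    for rail, spare in rail_headroom(psu, parts).items():
        status = "OK" if spare > 0 else "OVERLOADED"
        print(f"{rail}: {spare:+.1f} A spare ({status})")
```

Steady-state sums do not capture brief spikes or droops, so a supply with little headroom on one rail is still a suspect even when the totals fit.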

Suspected single event upsets:

  • Fit less susceptible components.
  • Move to a lower altitude, or area with lower cosmic radiation.
  • Move your data centre underground.
  • Improve error-reporting utilities to ignore them.

Suspected bad check-bit init:

  • Upgrade BIOS.
  • Don't enable BIOS "quick boot".
  • Don't manually skip BIOS memory check.

Suspected insufficient power supply:

  • Try detaching some devices that are hardly used. Start with USB devices.
  • If the problems stop, either permanently reduce the number of devices, or get a higher-capacity power supply.
  • Use a DC current clamp (preferably one with a peak/inrush measurement function) to check for over-capacity at a particular voltage.
  • This is closely related to voltage out of spec, which can also be caused by a power supply that is simply broken.