Difference between revisions of "WhyAmIgettingMemoryErrors"

Revision as of 14:23, 9 March 2006

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

You may well have been experiencing these errors for a while; it's just that nothing was checking for them until you enabled the EDAC module. Note that if you are getting UEs (uncorrectable errors), your system is probably experiencing data corruption, so you should investigate this promptly (this is why EDAC is set to `panic()` on UEs by default).
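
To see how many errors have been logged since boot, the counters that the EDAC driver exposes in sysfs can be read directly. The following is a minimal Python sketch, assuming the `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/` (as described in the kernel's EDAC documentation) are present on your kernel. The function names are illustrative only and are not part of any EDAC tool.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller entries in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_counter(path):
    """Return the integer stored in a sysfs counter file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_mc_totals(root=EDAC_MC_ROOT):
    """Map each memory controller (mc0, mc1, ...) to its CE and UE totals."""
    totals = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        totals[mc_dir.name] = {
            "ce": read_counter(mc_dir / "ce_count"),  # correctable errors
            "ue": read_counter(mc_dir / "ue_count"),  # uncorrectable errors
        }
    return totals


if __name__ == "__main__":
    # Prints nothing if no EDAC driver is loaded (no mcN directories exist).
    for name, counts in read_mc_totals().items():
        print(f"{name}: CE={counts['ce']} UE={counts['ue']}")
```

A non-zero UE total is the case that the paragraph above warns about; a slowly growing CE total is less urgent but still worth tracking.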

The reason that you are seeing problems is very likely to be one of:

Your RAM is bad
Your Motherboard is bad
Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64)
The connection between your motherboard and your CPU, or memory module is bad
Some of your hardware is being operated outside of its design specification, such as:
- Things are being run too hot
- Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation)
- Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop")
You have seen one or more "Single Event Upsets" - see SoftErrors
Memory ECC check bits are not properly initialised by BIOS prior to Linux boot
The EDAC module is buggy

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

Problem	Error Addresses	Error Slot or Row	Error Frequency
Bad Memory Module(s)	Single/Few	Probably only 1	May vary if part is marginally out of spec
Bad Motherboard	Probably many	Maybe 1, maybe many	?
Bad Connection	Probably many	1 (bad mem), prob all (bad CPU)	?
Temp out of spec	Probably few	Maybe one, if different mem mfrs/parts	Usually higher, with higher temp
Timings out of spec	Probably few	Maybe one, if different mem mfrs/parts	Usually higher, with higher temp
Voltages out of spec	Probably random	Maybe one, if different mem mfrs/parts	Usually higher at higher system load
Bad check bit init	Probably random	Probably all	High/very high (stops after a while for systems with background scrub)
Single event upsets	Random	Varies with effective "cross-section" of part	Very rare
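
As a rough illustration of how the table might be applied, the sketch below summarises a list of logged errors by how many distinct addresses and rows they touch, then names the causes whose table entries fit that spread. It is only a heuristic under stated assumptions: the `(address, row)` record format, the function names, and the `few` threshold are all hypothetical, and the result is a starting point for the tests below, not a diagnosis.

```python
from collections import Counter


def summarize_errors(errors):
    """Summarise (address, row) error records by how widely they are spread."""
    addresses = Counter(addr for addr, _row in errors)
    rows = Counter(row for _addr, row in errors)
    return {
        "total": len(errors),
        "distinct_addresses": len(addresses),
        "distinct_rows": len(rows),
        "busiest_row": rows.most_common(1)[0][0] if rows else None,
    }


def first_suspects(summary, few=4):
    """Very rough first guess following the symptom table (assumed threshold)."""
    if summary["total"] == 0:
        return []
    few_addresses = summary["distinct_addresses"] <= few
    one_row = summary["distinct_rows"] == 1
    if few_addresses and one_row:
        # Single/few addresses confined to one slot or row.
        return ["bad memory module", "temp out of spec", "timings out of spec"]
    if one_row:
        # Many addresses, but still confined to one slot or row.
        return ["bad connection (memory side)", "bad motherboard",
                "voltages out of spec"]
    # Errors spread over several rows.
    return ["bad motherboard", "bad connection (CPU side)", "bad check bit init"]
```

For example, `first_suspects(summarize_errors([(0x1000, 2), (0x1000, 2), (0x1040, 2)]))` puts "bad memory module" first, because the errors hit only two addresses in a single row.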

Things to try to isolate the problem

General:

Get a second opinion e.g. from [[1]] or [[2]] - note that you should be sure that either:
- The memory testing software knows how to disable ECC on your system, or
- You have disabled ECC before running the memory tester (note that memtest86 currently displays "ECC: No" on chipsets which have ECC, but which it doesn't know about!)
A memory tester may not catch problems (power-supply related ones, for example) that don't occur while the tester is running
Use a system stress tester such as "burnbx" from [[3]]
Put your system under stress by (e.g.) running a parallelised Linux kernel build, whilst doing some heavy 3D graphics display, and a lot of disk I/O

Suspected bad module:

Remove Module
Move Module to different slot (do errors move with module?)
Move Module to different machine
See "suspected temp out of spec"
See "suspected timings out of spec"
See "suspected voltage out of spec"
Clean connections
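
To check whether errors follow a module from slot to slot, it helps to record which chip-select row accumulates errors during a test run, then repeat the run after moving the module. The sketch below assumes the per-row `mcN/csrowM/ce_count` files of the EDAC sysfs interface are available; the function names are illustrative. Note that the counters start again from zero after a reboot, so both snapshots of a pair must be taken within the same boot, and the mapping from csrow to physical slot depends on the board.

```python
from pathlib import Path


def snapshot_csrow_ce(root="/sys/devices/system/edac/mc"):
    """Return {"mc0/csrow1": ce_count, ...} for every readable chip-select row."""
    counts = {}
    for counter in sorted(Path(root).glob("mc[0-9]*/csrow[0-9]*/ce_count")):
        try:
            value = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # skip rows whose counter cannot be read
        csrow = counter.parent
        counts[f"{csrow.parent.name}/{csrow.name}"] = value
    return counts


def new_errors(before, after):
    """Rows whose CE count grew between two snapshots, with the increase."""
    return {
        row: after[row] - before.get(row, 0)
        for row in after
        if after[row] > before.get(row, 0)
    }
```

Take one snapshot before starting a stress run and another afterwards; if the row reported by `new_errors` changes when the module changes slot, the module itself is the likely culprit.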

Suspected bad motherboard:

Check motherboard docs for memory module compatibility
Move modules to different slots
Clean connections
Upgrade BIOS
Select BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause

Suspected bad connection:

Visually check connectors, pins, modules etc.
HowToCleanEdgeConnectors

Suspected temp out of spec:

Measure temp, compare to published specs:
- Use internal machine sensors (motherboard, hard drive etc.) if possible
- Use a temperature probe
Check airflow
De-dust
Lower temp:
- Lower room temp
- Increase cooling
- Improve airflow (tidy cables etc.)
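
Where the machine has internal sensors with a Linux hwmon driver, the readings can be collected with a short script. This is a minimal sketch, assuming the sensors appear as `temp*_input` files (in millidegrees Celsius) under `/sys/class/hwmon/hwmon*/`; some systems expose them elsewhere or not at all. The function names are illustrative, and the limit passed to `over_limit` must come from the published specs of your own parts.

```python
from pathlib import Path


def chip_name(hwmon_dir):
    """Return the driver name of a hwmon device, or its directory name."""
    try:
        return (hwmon_dir / "name").read_text().strip()
    except OSError:
        return hwmon_dir.name


def read_temperatures(root="/sys/class/hwmon"):
    """Return {"hwmonN/<chip>/<sensor>": degrees Celsius} for readable sensors."""
    temps = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor present but not readable
        label = sensor.name[: -len("_input")]  # "temp1_input" -> "temp1"
        key = f"{sensor.parent.name}/{chip_name(sensor.parent)}/{label}"
        temps[key] = millidegrees / 1000.0
    return temps


def over_limit(temps, limit_c):
    """Return only the sensors reading above limit_c degrees Celsius."""
    return {name: t for name, t in temps.items() if t > limit_c}
```

Logging these readings alongside the EDAC error counts makes it easier to see whether the error rate rises with temperature.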

Suspected timings out of spec:

Try different BIOS version
Set pessimistic memory timings in BIOS
Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project
Try disabling "spread spectrum" in the BIOS (easy), or by using an i2c driver for your board's clock generator (hard)

Suspected voltage out of spec:

Check PSU specs vs. total demand of system components
Swap power supply with another machine
Fit voltage regulator/spike suppressor to machine power supply
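
The PSU check is simple arithmetic: for each supply rail, add up the current that the components may draw and compare it with the rail's rated current. The sketch below shows that calculation; the function name is illustrative, and every figure in the usage example is hypothetical, so substitute the numbers from your own PSU label and component datasheets.

```python
def rail_headroom(rated_amps, loads):
    """Return {rail: rated minus demanded current in amps}; negative = overloaded.

    rated_amps: {"12V": 18.0, ...} taken from the PSU label.
    loads: {"component": {"12V": amps, ...}, ...} taken from datasheets.
    """
    demand = {rail: 0.0 for rail in rated_amps}
    for draws in loads.values():
        for rail, amps in draws.items():
            demand[rail] = demand.get(rail, 0.0) + amps
    return {rail: rated_amps.get(rail, 0.0) - total
            for rail, total in demand.items()}
```

For instance, with hypothetical ratings of 18 A on the 12 V rail and 30 A on the 5 V rail, and loads of `{"cpu": {"12V": 7.5}, "disks": {"12V": 2.0, "5V": 1.5}, "gpu": {"12V": 4.0}}`, the function reports 4.5 A of headroom at 12 V and 28.5 A at 5 V. A small or negative margin on any rail suggests that brief droops under load are plausible.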

Suspected single event upsets:

Fit less susceptible components
Move to a lower altitude, or area with lower cosmic radiation
Move your data centre underground
Improve error-reporting utilities to ignore them

Suspected bad check-bit init:

Upgrade BIOS
Don't enable BIOS "quick boot"
Don't manually skip BIOS memory check

@@ Line 1: / Line 1: @@
 To help understand why you are seeing memory errors, please have a look at [[HowMemoryEdacHardwareWorks]].
-You may well have been experiencing these errors for a while, it's just that nothing was checking them until you enabled the EDAC module.  You may be experiencing dataloss (if you are getting UEs - uncorrectable errors), so you should really check this out.
+You may well have been experiencing these errors for a while, it's just that nothing was checking them until you enabled the EDAC module.  Note that your system is probably experiencing data corruption (if you are getting UEs - uncorrectable errors), so you should really check this out (this is why EDAC is set to `panic()` on UEs by default).
 The reason that you are seeing problems is very likely to be one of:
