Revision as of 12:33, 9 March 2006

Why Am I Getting Memory Errors?

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

The reason that you are seeing problems is very likely to be one of:

Your RAM is bad
Your Motherboard is bad
Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64)
The connection between your motherboard and your CPU or memory module is bad
Some of your hardware is being operated outside its design specification, such as:
- Things are being run too hot
- Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation)
- Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop")
You have seen one or more "Single Event Upsets" - see SoftErrors
The EDAC module is buggy

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

Problem	Error Addresses	Error Slot or Row	Error Frequency
Bad Memory Module(s)	Single/Few	Probably only 1	May vary if part is marginally out of spec
Bad Motherboard	Probably many	Maybe 1, maybe many	?
Bad Connection	Probably many	1 (bad memory connection), probably all (bad CPU connection)	?
Temp out of spec	Probably few	Maybe one, if modules differ in manufacturer or part	Usually higher, with higher temp
Timings out of spec	Probably few	Maybe one, if modules differ in manufacturer or part	Usually higher, with higher temp
Voltages out of spec	Probably random	Maybe one, if modules differ in manufacturer or part	Usually higher at higher system load
Single event upsets	Random	Varies with effective size, and "cross-section" of part	Very rare
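
To compare your own errors against the first two columns of the table, it can help to count how many distinct addresses and slots are involved. The following is only a minimal sketch: the function name and the (address, slot) record format are assumptions for illustration, and you would have to fill the list by hand or from your own logs.

```python
from collections import Counter


def summarize_errors(records):
    """Summarize error records for comparison with the symptoms table.

    `records` is a hypothetical list of (address, slot) tuples; no
    particular log format is assumed here.
    """
    addresses = Counter(address for address, _slot in records)
    slots = Counter(slot for _address, slot in records)
    return {
        "total_errors": len(records),
        "distinct_addresses": len(addresses),
        "distinct_slots": len(slots),
        "errors_per_slot": dict(slots),
    }
```

A small number of distinct addresses confined to a single slot points towards the "Bad Memory Module(s)" row; many addresses spread over several slots points further down the table.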

Things to try

Suspected bad module:

Remove Module
Move Module to different slot (do the errors move with the module?)
Move Module to different machine
See "suspected temp out of spec"
See "suspected timings out of spec"
See "suspected voltage out of spec"
Clean connections
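
To see which row is reporting errors before and after you move a module, you can read the per-row corrected error counters. This is a sketch only, assuming a Linux kernel that exposes the csrow-style EDAC layout under /sys/devices/system/edac/mc (some kernels present a different layout); the function names are made up for illustration.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return corrected-error counts keyed by "mcX/csrowY".

    Assumes the csrow-style sysfs layout; returns an empty dict if
    no matching files are found.
    """
    counts = {}
    for path in sorted(Path(edac_root).glob("mc*/csrow*/ce_count")):
        row = f"{path.parent.parent.name}/{path.parent.name}"
        try:
            counts[row] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or malformed counters
    return counts


def rows_with_new_errors(before, after):
    """Return rows whose count grew between two snapshots, with the increase."""
    return {
        row: after[row] - before.get(row, 0)
        for row in after
        if after[row] > before.get(row, 0)
    }
```

The counters normally start again from zero after a reboot, so take both snapshots within one boot, then repeat after moving the module and compare which rows show up.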

Suspected bad motherboard:

Check motherboard docs for memory module compatibility
Move modules to different slots
Clean connections
Upgrade BIOS
Select BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause

Suspected bad connection:

Visually check connectors, pins, modules etc.
Clean things up!

Suspected temp out of spec:

Measure temp, compare to published specs:
- Use internal machine sensors (motherboard, hard drive etc.) if possible
- Use a temperature probe
Check airflow
De-dust
Lower temp:
- Lower room temp
- Increase cooling
- Improve airflow (tidy cables etc.)
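
If the machine's internal sensors are exposed through the Linux hwmon sysfs interface, the measurement step above can be sketched as below. This assumes temp*_input files directly under /sys/class/hwmon/hwmon*, which report millidegrees Celsius; some drivers place them elsewhere. The function names and the limit you pass in are illustrative, and the limit should come from the published specs of your parts.

```python
from pathlib import Path


def read_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return sensor readings in degrees Celsius, keyed by "hwmonN/tempM_input"."""
    readings = {}
    for path in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now
        readings[f"{path.parent.name}/{path.name}"] = millidegrees / 1000.0
    return readings


def over_limit(readings, limit_celsius):
    """Return only the readings above the given limit."""
    return {name: temp for name, temp in readings.items() if temp > limit_celsius}
```

Logging these readings alongside the error counts makes it easier to see whether the error frequency rises with temperature, as the symptoms table suggests.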

Suspected timings out of spec:

Try different BIOS version
Set pessimistic memory timings
Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project

Suspected voltage out of spec:

Check PSU specs vs. total demand of system components
Swap power supply with another machine
Fit voltage regulator/spike suppressor to machine power supply
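
The first check above is simple arithmetic: add up the demand of each component and compare it with the PSU rating. A rough sketch, where the function name and the component breakdown are assumptions and the wattages must come from your own component datasheets:

```python
def power_headroom(psu_rating_watts, component_watts):
    """Return PSU rating minus total component demand, in watts.

    `component_watts` is a hypothetical mapping of component name to
    its worst-case draw. A negative result means the PSU is undersized.
    """
    return psu_rating_watts - sum(component_watts.values())
```

This compares total wattage only. A PSU also has limits on each supply rail, so a positive result here does not rule out a voltage problem.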

Suspected single event upsets:

Fit less susceptible components
Move to a lower altitude
Move your data centre underground
Improve detection utilities to ignore them

