Difference between revisions of "WhyAmIgettingMemoryErrors"

Latest revision as of 13:51, 9 November 2010

To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.

You may well have been experiencing these errors for a while; it's just that nothing was checking for them until you enabled the EDAC module. Note that if you are getting UEs (uncorrectable errors), your system is probably experiencing data corruption, so you should really investigate (this is why EDAC is set to `panic()` on UEs by default).

The reason that you are seeing problems is very likely to be one of:

Your RAM is bad.
Your Motherboard is bad.
Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64).
The connection between your motherboard and your CPU or memory module is bad.
Some of your hardware is being operated outside of its design specification, such as:
- Things are being run too hot.
- Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation).
- Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop").
You have seen one or more "Single Event Upsets" - see SoftErrors.
Memory ECC check bits are not properly initialised by the BIOS prior to Linux boot. See Uninitialized ECC bits.
The EDAC module is buggy.
The memory controller's loading limit is exceeded (too many ranks of memory for the configured speed).
The power supply is insufficient.

So Which One Is It Then?

Good question. Time to try some things:

Symptoms

Here are the most likely symptoms.

Problem	Error Addresses	Error Slot or Row	Error Frequency
Bad Memory Module(s)	Single/Few	Probably only 1	May vary if part is marginally out of spec
Bad Motherboard	Probably many	Maybe 1, maybe many	?
Bad Connection	Probably many	1 (bad mem), prob all (bad CPU)	?
Temp out of spec	Probably few	Maybe one, if different mem mfrs/parts	Usually higher, with higher temp
Timings out of spec	Probably few	Maybe one, if different mem mfrs/parts	Usually higher, with higher temp
Voltages out of spec	Probably random	Maybe one, if different mem mfrs/parts	Usually higher at higher system load
Bad BIOS check bit init	Probably random	Probably all	High/very high (stops after a while for systems with background scrub)
Single event upsets	Random	Varies with effective "cross-section" of part	Rare - more common with some part designs, and at high altitude etc.
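To compare your own system against the table above, it helps to know which slot or row the errors are being counted against. The following is a minimal sketch, assuming the EDAC driver exposes the legacy per-csrow sysfs layout (`mc*/csrow*/ce_count` and `ue_count` under `/sys/devices/system/edac/mc`); newer kernels may use a different layout. The function names are illustrative, not part of EDAC.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {(controller, csrow): {"ce": int, "ue": int}} from EDAC sysfs.

    Assumes the legacy per-csrow layout; adjust the glob pattern if your
    kernel exposes dimm*/ or rank*/ directories instead.
    """
    counts = {}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        entry = {}
        for key in ("ce", "ue"):
            counter = csrow / f"{key}_count"
            # A missing counter file is treated as zero rather than an error.
            entry[key] = int(counter.read_text().strip()) if counter.is_file() else 0
        counts[(csrow.parent.name, csrow.name)] = entry
    return counts


def rows_with_errors(counts):
    """Keep only the rows that have logged at least one error."""
    return {row: c for row, c in counts.items() if c["ce"] or c["ue"]}
```

Calling `rows_with_errors(read_edac_counts())` before and after moving a module shows whether the errors stay with the slot or follow the module.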

Things to try to isolate the problem

General:

Get a second opinion, e.g. from http://www.memtest.org/ or http://www.memtest86.com/ - note that you should be sure that either:
- The memory testing software knows how to disable ECC on your system, or
- You have disabled ECC before running the memory tester (note that memtest86 currently displays "ECC: No" on chipsets which have ECC, but which it doesn't know about!).
Note that a memory tester may not catch problems, such as power-supply faults, that don't occur while the tester is running.
Use a system stress tester such as "burnbx" from http://pages.sbcglobal.net/redelm/.
Put your system under stress by (e.g.) running a parallelised Linux kernel build, whilst doing some heavy 3D graphics display, and a lot of disk I/O.

Suspected bad module:

Remove Module.
Move Module to different slot (do the errors move with the module?).
Move Module to different machine.
See "suspected temp out of spec".
See "suspected timings out of spec".
See "suspected voltage out of spec".
Clean connections.
Check memory loading:
- Some memory controllers can only support so many 'ranks' of memory at a given speed.

    For example, Opterons/Athlon64s can support only 4 ranks of 2 GB at PC3200. 
    See http://www.valueram.com/memoryranks/default.asp for definitions.
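The loading check amounts to adding up the ranks on all fitted modules and comparing the total with the controller's limit at the configured speed. A minimal sketch follows; the limit of 4 ranks at PC3200 is taken from the example above, while the module list and function names are made up for illustration. Check your own controller's documentation for the real limit.

```python
# Limit quoted above for Opteron/Athlon64 at PC3200; other controllers
# and speeds differ, so treat this constant as an example value.
MAX_RANKS_AT_PC3200 = 4


def total_ranks(modules):
    """Sum the rank counts of (slot_name, ranks) pairs."""
    return sum(ranks for _slot, ranks in modules)


def loading_ok(modules, max_ranks):
    """True if the fitted modules stay within the controller's rank limit."""
    return total_ranks(modules) <= max_ranks


# Hypothetical configuration: three dual-rank DIMMs.
fitted = [("DIMM0", 2), ("DIMM1", 2), ("DIMM2", 2)]
print(total_ranks(fitted), loading_ok(fitted, MAX_RANKS_AT_PC3200))
```

Here the three dual-rank modules add up to 6 ranks, which exceeds the limit, so either a module has to come out or the memory speed has to be lowered.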

Suspected bad motherboard:

Check motherboard docs for memory module compatibility.
Move modules to different slots.
Clean connections.
Upgrade BIOS.
Select BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause.

Suspected bad connection:

Visually check connectors, pins, modules etc.
Clean the connections (see HowToCleanEdgeConnectors).

Suspected temp out of spec:

Measure temp, compare to published specs:
- Use internal machine sensors (motherboard, hard drive etc.) if possible.
- Use a temperature probe or infra-red thermometer.
Check airflow.
De-dust.
Lower temp:
- Lower room temp.
- Increase cooling.
- Improve airflow (tidy cables etc.).
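Reading the internal machine sensors can be scripted. The sketch below assumes the sensors are exposed through the Linux hwmon sysfs interface (`/sys/class/hwmon/hwmon*/temp*_input`, in millidegrees Celsius) and that a suitable sensor driver is loaded. The function names and the limit you pass in are illustrative; the limit must come from the published spec of the part in question.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_celsius} for every temp*_input file."""
    temps = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        chip_file = sensor.parent / "name"
        chip = chip_file.read_text().strip() if chip_file.is_file() else sensor.parent.name
        # Prefer the driver's human-readable label when one is provided.
        label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
        if label_file.is_file():
            label = label_file.read_text().strip()
        else:
            label = sensor.name[: -len("_input")]
        try:
            # hwmon reports millidegrees Celsius.
            temps[f"{chip}/{label}"] = int(sensor.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # some sensors fail to read; skip them
    return temps


def over_limit(temps, limit_c):
    """Return the sensors reading above limit_c degrees Celsius."""
    return {name: t for name, t in temps.items() if t > limit_c}
```

Logging these readings alongside the EDAC error counts makes it easier to see whether the error rate rises with temperature. Note that two chips with the same name and label would overwrite each other in this simple dictionary.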

Suspected timings out of spec:

Try different BIOS version.
Set pessimistic memory timings in BIOS.
Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project.
Try disabling "spread spectrum" in the BIOS (easy if available), or by using an i2c driver for your board's clock generator (hard).

Suspected voltage out of spec:

Check PSU specs vs. total demand of system components.
Swap power supply with another machine.
Fit voltage regulator/spike suppressor to machine power supply.
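Checking PSU specs against total demand is simple arithmetic per supply rail. The sketch below uses made-up figures throughout: the rail names, capacities, and component draws are placeholders, to be replaced with values from your PSU label and component datasheets.

```python
def rail_headroom(capacity_amps, draws_amps):
    """Return {rail: capacity - total draw} in amps; negative means overloaded."""
    totals = {rail: 0.0 for rail in capacity_amps}
    for component, draw in draws_amps.items():
        for rail, amps in draw.items():
            if rail not in totals:
                raise KeyError(f"{component} draws from unknown rail {rail!r}")
            totals[rail] += amps
    return {rail: capacity_amps[rail] - totals[rail] for rail in capacity_amps}


# Hypothetical PSU label and component demands (amps per rail).
psu = {"+12V": 18.0, "+5V": 30.0, "+3.3V": 28.0}
draws = {
    "cpu": {"+12V": 8.0},
    "disks": {"+12V": 6.0, "+5V": 3.0},
    "gpu": {"+12V": 5.5, "+3.3V": 2.0},
}

for rail, spare in rail_headroom(psu, draws).items():
    status = "OVERLOADED" if spare < 0 else "ok"
    print(f"{rail}: {spare:+.1f} A headroom ({status})")
```

In this example the +12V rail comes out 1.5 A over capacity even though the other rails have plenty to spare, which is why the check needs to be done per rail rather than on total wattage alone. Datasheet figures are usually steady-state; brief peaks can be higher.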

Suspected single event upsets:

Fit less susceptible components.
Move to a lower altitude, or area with lower cosmic radiation.
Move your data centre underground.
Improve error-reporting utilities to ignore them.

Suspected bad check-bit init:

Upgrade BIOS.
Don't enable BIOS "quick boot".
Don't manually skip BIOS memory check.

Suspected insufficient power supply:

Try detaching some devices that are hardly used. Start with USB devices.
If the problems stop, either permanently reduce the number of devices, or get a higher-capacity power supply.
Use a DC current clamp (preferably one with a peak/inrush measurement function) to check for over-capacity at a particular voltage.
This is closely related to voltage out of spec, which can also be caused simply by a broken supply.

