Difference between revisions of "WhyAmIgettingMemoryErrors"
Latest revision as of 13:51, 9 November 2010
To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.
You may well have been experiencing these errors for a while; it's just that nothing was checking for them until you enabled the EDAC module. Note that if you are getting UEs (uncorrectable errors), your system is probably experiencing data corruption, so you should really investigate. This is why EDAC is set to `panic()` on UEs by default.
The reason that you are seeing problems is very likely to be one of:
- Your RAM is bad.
- Your Motherboard is bad.
- Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64).
- The connection between your motherboard and your CPU or memory module is bad.
- Some of your hardware is being operated outside of its design specification, such as:
  - Things are being run too hot.
  - Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation).
  - Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop").
- You have seen one or more "Single Event Upsets" - see SoftErrors.
- Memory ECC check bits are not properly initialised by BIOS prior to Linux boot. See Uninitialized ECC bits.
- The EDAC module is buggy.
- Memory loading is exceeded (more ranks of memory are fitted than the memory controller supports at the configured speed).
- The power supply is insufficient.
So Which One Is It Then?
Good question. Time to try some things:
Symptoms
Here are the most likely symptoms.
Problem | Error Addresses | Error Slot or Row | Error Frequency |
Bad Memory Module(s) | Single/Few | Probably only 1 | May vary if part is marginally out of spec |
Bad Motherboard | Probably many | Maybe 1, maybe many | ? |
Bad Connection | Probably many | 1 (bad mem), prob all (bad CPU) | ? |
Temp out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp |
Timings out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp |
Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts | Usually higher at higher system load |
Bad BIOS check bit init | Probably random | Probably all | High/very high (stops after a while for systems with background scrub) |
Single event upsets | Random | Varies with effective "cross-section" of part | Rare - more common with some part designs, and at high altitude etc. |
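To compare your system against the table above, it helps to see which slot or row the errors are being counted against. The following is a minimal sketch, assuming your kernel exposes the csrow-style EDAC sysfs layout (`mc*/csrow*/ce_count` and `ue_count`); some kernels expose dimm or rank directories instead, in which case the glob pattern would need adjusting. The function names are made up for this example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {(controller, csrow): (ce_count, ue_count)} read from EDAC sysfs.

    Assumes the csrow-style layout; returns an empty dict if nothing matches.
    """
    counts = {}
    for row in sorted(Path(root).glob("mc*/csrow*")):
        try:
            ce = int((row / "ce_count").read_text())
            ue = int((row / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable; skip this row
        counts[(row.parent.name, row.name)] = (ce, ue)
    return counts


def summarise(counts):
    """Return only the rows that have logged errors, busiest first."""
    bad = [(key, ce, ue) for key, (ce, ue) in counts.items() if ce or ue]
    return sorted(bad, key=lambda item: item[1] + item[2], reverse=True)


if __name__ == "__main__":
    for (mc, row), ce, ue in summarise(read_edac_counts()):
        print(f"{mc}/{row}: CE={ce} UE={ue}")
```

If all the errors land on one row, that points towards a bad module or a bad connection to it; errors spread over every row point more towards the motherboard, CPU, or check-bit initialisation.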
Things to try to isolate the problem
General:
- Get a second opinion, e.g. from http://www.memtest.org/ or http://www.memtest86.com/. Note that you should be sure that either:
  - The memory testing software knows how to disable ECC on your system, or
  - You have disabled ECC before running the memory tester (note that memtest86 currently displays "ECC: No" on chipsets which have ECC, but which it doesn't know about!).
- A memory tester may not catch problems, such as power-supply related ones, which don't occur while the memory tester is running.
- Use a system stress tester such as "burnbx" from http://pages.sbcglobal.net/redelm/.
- Put your system under stress by (e.g.) running a parallelised Linux kernel build, whilst doing some heavy 3D graphics display, and a lot of disk I/O.
Suspected bad module:
- Remove Module.
- Move Module to different slot (do the errors move with the module?).
- Move Module to different machine.
- See "suspected temp out of spec".
- See "suspected timings out of spec".
- See "voltages out of spec".
- Clean connections.
- Check memory loading.
  - Some memory controllers can only support so many 'ranks' of memory at a given speed. For example, Opterons/Athlon64s can support only 4 ranks of 2 GB at PC3200. See http://www.valueram.com/memoryranks/default.asp for definitions.
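The "move the module" test above boils down to comparing error counts per slot before and after the swap. Here is a rough sketch of that comparison; the function name and the dictionary format are invented for illustration, and it assumes both sets of counts were collected over a comparable run time under similar load.

```python
def diagnose_move(before, after, old_slot, new_slot):
    """Decide whether errors followed a module that was moved between slots.

    before/after map a slot name to the error count seen while the suspect
    module was in old_slot (before) and in new_slot (after).
    Returns "module", "slot", or "inconclusive".
    """
    had_errors = before.get(old_slot, 0) > 0
    old_now = after.get(old_slot, 0) > 0
    new_now = after.get(new_slot, 0) > 0
    if not had_errors:
        return "inconclusive"  # nothing was wrong in the first place
    if new_now and not old_now:
        return "module"  # the errors moved with the module
    if old_now and not new_now:
        return "slot"  # the errors stayed behind: suspect slot or board
    return "inconclusive"  # errors in both or neither; test for longer


# Hypothetical counts: errors were on csrow0, then appeared on csrow1.
print(diagnose_move({"csrow0": 12, "csrow1": 0},
                    {"csrow0": 0, "csrow1": 9},
                    "csrow0", "csrow1"))
```

An "inconclusive" result usually just means the test needs to run for longer, or that more than one thing is wrong.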
Suspected bad motherboard:
- Check motherboard docs for memory module compatibility.
- Move modules to different slots.
- Clean connections.
- Upgrade BIOS.
- Select BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause.
Suspected bad connection:
- Visually check connectors, pins, modules etc.
- HowToCleanEdgeConnectors.
Suspected temp out of spec:
- Measure temp, compare to published specs:
  - Use internal machine sensors (motherboard, hard drive etc.) if possible.
  - Use a temperature probe or infra-red thermometer.
- Check airflow.
- De-dust.
- Lower temp:
  - Lower room temp.
  - Increase cooling.
  - Improve airflow (tidy cables etc.).
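Reading the internal machine sensors can be scripted. The sketch below assumes a Linux system with hwmon drivers loaded, where each `temp*_input` file under `/sys/class/hwmon/hwmon*/` holds a temperature in millidegrees Celsius. The function names and the 70 degree limit are only examples; use the limits from your parts' published specs.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN/chip/tempM": degrees_celsius} for every readable sensor."""
    readings = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            milli = int(sensor.read_text())
        except (OSError, ValueError):
            continue  # sensor not readable right now; skip it
        name_file = sensor.parent / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        label = sensor.name[: -len("_input")]
        readings[f"{sensor.parent.name}/{chip}/{label}"] = milli / 1000.0
    return readings


def over_limit(readings, limit_c):
    """Return only the sensors reading above limit_c degrees Celsius."""
    return {key: temp for key, temp in readings.items() if temp > limit_c}


if __name__ == "__main__":
    # 70.0 is an arbitrary example limit, not a value from any datasheet.
    for key, temp in over_limit(read_hwmon_temps(), 70.0).items():
        print(f"{key}: {temp:.1f} C")
```

Logging these readings alongside the EDAC error counts makes it easier to see whether the error rate climbs with temperature.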
Suspected timings out of spec:
- Try different BIOS version.
- Set pessimistic memory timings in BIOS.
- Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project.
- Try disabling "spread spectrum" in the BIOS (easy if available), or by using an i2c driver for your board's clock generator (hard).
Suspected voltage out of spec:
- Check PSU specs vs. total demand of system components.
- Swap power supply with another machine.
- Fit voltage regulator/spike suppressor to machine power supply.
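Checking PSU specs against total demand is simple arithmetic per supply rail. A minimal sketch follows; every figure in it is a made-up placeholder, so substitute the ratings from your PSU label and the current draw from your components' datasheets.

```python
def rail_headroom(capacity_a, loads_a):
    """Return {rail: spare current in amps}; a negative value means overload.

    capacity_a maps a rail name to its rated current in amps.
    loads_a maps a rail name to a list of per-device current draws in amps.
    """
    return {
        rail: capacity_a[rail] - sum(loads_a.get(rail, []))
        for rail in capacity_a
    }


# Hypothetical figures for illustration only.
capacity = {"+12V": 18.0, "+5V": 30.0}
loads = {"+12V": [6.0, 8.0, 5.5], "+5V": [10.0, 4.0]}

for rail, spare in rail_headroom(capacity, loads).items():
    status = "OK" if spare >= 0 else "OVERLOADED"
    print(f"{rail}: {spare:+.1f} A spare ({status})")
```

Remember that datasheet figures are often typical rather than peak values, so a rail with little headroom on paper may still droop briefly under load.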
Suspected single event upsets:
- Fit less susceptible components.
- Move to a lower altitude, or area with lower cosmic radiation.
- Move your data centre underground.
- Improve error-reporting utilities to ignore them.
Suspected bad check-bit init:
- Upgrade BIOS.
- Don't enable BIOS "quick boot".
- Don't manually skip BIOS memory check.
Suspected insufficient power supply:
- Try detaching some devices that are hardly used. Start with USB devices.
- If the problems stop, either permanently reduce the number of devices, or get a higher-capacity power supply.
- Use a DC current clamp (preferably one with a peak/inrush measurement function) to check for over-capacity at a particular voltage.
- This is closely related to voltage out of spec, which can also be caused simply by a broken supply.