Difference between revisions of "WhyAmIgettingMemoryErrors"
Revision as of 12:33, 9 March 2006
Why Am I Getting Memory Errors?
To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.
The reason that you are seeing problems is very likely to be one of:
- Your RAM is bad
- Your Motherboard is bad
- Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64)
- The connection between your motherboard and your CPU or memory module is bad
- Some of your hardware is being operated outside of its design specification, such as:
  - Things are being run too hot
  - Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation)
  - Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop")
- You have seen one or more "Single Event Upsets" - see SoftErrors
- The EDAC module is buggy
So Which One Is It Then?
Good question. Time to try some things:
Symptoms
Here are the most likely symptoms.
Problem | Error Addresses | Error Slot or Row | Error Frequency
Bad Memory Module(s) | Single/Few | Probably only 1 | May vary if part is marginally out of spec
Bad Motherboard | Probably many | Maybe 1, maybe many | ?
Bad Connection | Probably many | 1 (bad mem), prob all (bad CPU) | ?
Temp out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp
Timings out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp
Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts | Usually higher at higher system load
Single event upsets | Random | Varies with effective size, and "cross-section" of part | Very rare
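To see which slot or row the errors are landing in, you can tally the per-row counters that the EDAC drivers export through sysfs. The following is only a rough sketch. It assumes the csrow-based layout under /sys/devices/system/edac/mc (some kernels expose dimm* entries instead), and the function names are made up for illustration.

```python
from pathlib import Path


def read_count(path):
    """Return the integer stored in a sysfs counter file, or 0 if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return 0


def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Map 'mcN/csrowM' to a (corrected, uncorrected) error count pair.

    Assumes each csrow directory holds ce_count and ue_count files.
    """
    counts = {}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        key = f"{csrow.parent.name}/{csrow.name}"
        counts[key] = (
            read_count(csrow / "ce_count"),
            read_count(csrow / "ue_count"),
        )
    return counts


def rows_with_errors(counts):
    """Return the rows that have logged at least one error, sorted by name."""
    return sorted(key for key, (ce, ue) in counts.items() if ce or ue)
```

If rows_with_errors(edac_error_counts()) keeps returning a single row, that fits the "Probably only 1" entries in the table. Errors spread over many rows point more towards the motherboard, the CPU connection, or the supply voltages.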
Things to try
Suspected bad module:
- Remove the module
- Move the module to a different slot (do the errors move with the module?)
- Move the module to a different machine
- See "suspected temp out of spec"
- See "suspected timings out of spec"
- See "suspected voltage out of spec"
- Clean connections
Suspected bad motherboard:
- Check motherboard docs for memory module compatibility
- Move modules to different slots
- Clean connections
- Upgrade BIOS
- Select the BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause
Suspected bad connection:
- Visually check connectors, pins, modules etc.
- Clean things up!
Suspected temp out of spec:
- Measure temp, compare to published specs:
  - Use internal machine sensors (motherboard, hard drive etc.) if possible
  - Use a temperature probe
- Check airflow
- De-dust
- Lower temp:
  - Lower room temp
  - Increase cooling
  - Improve airflow (tidy cables etc.)
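If the machine has working internal sensors, the readings can be pulled from the Linux hwmon sysfs files and compared against a limit taken from the part's published specs. This is a sketch, not a finished tool. It assumes the sensors appear as temp*_input files (in millidegrees Celsius) under /sys/class/hwmon, and the function names and the limit value are placeholders.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Map 'hwmonN/tempM_input' to a temperature in degrees Celsius."""
    temps = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor missing or not readable right now
        temps[f"{sensor.parent.name}/{sensor.name}"] = millidegrees / 1000.0
    return temps


def over_limit(temps, limit_c):
    """Return only the sensors reading above limit_c degrees Celsius."""
    return {name: t for name, t in temps.items() if t > limit_c}
```

Logging over_limit(read_hwmon_temps(), limit_c) alongside the error counts over time should show whether the error rate climbs with temperature, which is the pattern the symptoms table suggests for this case.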
Suspected timings out of spec:
- Try different BIOS version
- Set pessimistic memory timings
- Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project
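The comparison itself is simple arithmetic: a DIMM states its minimum timings in nanoseconds, the memory controller is programmed in whole clock cycles, and a setting is too aggressive if cycles times clock period comes out shorter than the DIMM's minimum. Here is a small sketch of that check. The timing names and numbers in the usage note are illustrative only, not taken from any real DIMM or from decode-dimms.pl output.

```python
import math


def clock_period_ns(clock_mhz):
    """Length of one memory clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def min_cycles(min_time_ns, clock_mhz):
    """Smallest whole number of clock cycles that covers min_time_ns."""
    # The tiny epsilon keeps float rounding from bumping exact multiples up.
    return math.ceil(min_time_ns / clock_period_ns(clock_mhz) - 1e-9)


def timing_violations(configured_cycles, dimm_min_ns, clock_mhz):
    """Return {name: (configured, required)} for settings that are too short."""
    violations = {}
    for name, cycles in configured_cycles.items():
        required = min_cycles(dimm_min_ns[name], clock_mhz)
        if cycles < required:
            violations[name] = (cycles, required)
    return violations
```

For example, with a 200 MHz memory clock (5 ns per cycle) and a hypothetical DIMM needing 15 ns for tRP, a controller setting of 2 cycles gives only 10 ns, so timing_violations reports it as needing 3.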
Suspected voltage out of spec:
- Check PSU specs vs. total demand of system components
- Swap power supply with another machine
- Fit a voltage regulator/spike suppressor to the machine's power supply
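The PSU check in the first item is a plain power budget: add up what the components can draw and see how much of the supply's rating is left. A rough sketch follows. The component figures in the usage note are invented, the 20% safety margin is just an assumed default, and a real check should also look at the per-rail limits in the PSU specs.

```python
def power_headroom_watts(psu_rated_watts, component_watts, margin=0.2):
    """Rated PSU output minus total demand padded by a safety margin.

    A negative result means the padded demand exceeds the PSU rating.
    """
    demand = sum(component_watts.values())
    return psu_rated_watts - demand * (1.0 + margin)
```

For example, power_headroom_watts(400, {"cpu": 90, "board": 40, "disks": 30, "memory": 20}, margin=0.25) pads a 180 W demand to 225 W and leaves 175 W spare. A result near or below zero makes an overloaded supply a more likely culprit.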
Suspected single event upsets:
- Fit less susceptible components
- Move to a lower altitude
- Move your data centre underground
- Improve detection utilities to ignore them