Difference between revisions of "WhyAmIgettingMemoryErrors"
Revision as of 13:10, 9 March 2006
To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.
The reason that you are seeing problems is very likely to be one of:
- Your RAM is bad
- Your Motherboard is bad
- Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64)
- The connection between your motherboard and your CPU or memory module is bad
- Some of your hardware is being operated outside its design specification, such as:
  - Things are being run too hot
  - Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation)
  - Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop")
- You have seen one or more "Single Event Upsets" - see SoftErrors
- Memory ECC check bits are not properly initialised by BIOS prior to Linux boot
- The EDAC module is buggy
So Which One Is It Then?
Good question. Time to try some things:
Symptoms
Here are the most likely symptoms.
Problem | Error Addresses | Error Slot or Row | Error Frequency
Bad Memory Module(s) | Single/Few | Probably only 1 | May vary if part is marginally out of spec
Bad Motherboard | Probably many | Maybe 1, maybe many | ?
Bad Connection | Probably many | 1 (bad mem), prob all (bad CPU) | ?
Temp out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp
Timings out of spec | Probably few | Maybe one, if different mem mfrs/parts | Usually higher, with higher temp
Voltages out of spec | Probably random | Maybe one, if different mem mfrs/parts | Usually higher at higher system load
Bad check bit init | Probably random | Probably all | High/very high (stops after a while for systems with background scrub)
Single event upsets | Random | Varies with effective "cross-section" of part | Very rare
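To compare a machine against the "Error Slot or Row" column above, it can help to tally the error counters per row. The following is a minimal sketch, assuming the kernel exposes the per-csrow EDAC sysfs layout (`mc*/csrow*/ce_count` and `ue_count` under `/sys/devices/system/edac/mc`); some kernels present the counters differently, so treat the paths as an assumption. The helper names `read_csrow_counts` and `describe_spread` are made up for this illustration, and the classification is only a rough heuristic.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller tree in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_csrow_counts(root=EDAC_ROOT):
    """Return {(mc, csrow): (ce_count, ue_count)} for every row found."""
    counts = {}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        try:
            ce = int((csrow / "ce_count").read_text().strip())
            ue = int((csrow / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable; skip this row
        counts[(csrow.parent.name, csrow.name)] = (ce, ue)
    return counts


def describe_spread(counts):
    """Roughly classify how errors are spread across rows."""
    failing = [key for key, (ce, ue) in counts.items() if ce + ue > 0]
    if not failing:
        return "no errors recorded"
    if len(failing) == 1:
        return "errors confined to one row"
    if len(failing) == len(counts):
        return "errors on all rows"
    return "errors on several rows"


if __name__ == "__main__":
    counts = read_csrow_counts()
    for (mc, csrow), (ce, ue) in counts.items():
        print(f"{mc}/{csrow}: corrected={ce} uncorrected={ue}")
    print(describe_spread(counts))
```

Errors confined to one row point towards a single module or slot, while errors on all rows fit better with the motherboard, CPU or check-bit initialisation rows of the table.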
Things to try to isolate the problem
Suspected bad module:
- Remove Module
- Move Module to different slot (do the errors move with the module?)
- Move Module to different machine
- See "suspected temp out of spec"
- See "suspected timings out of spec"
- See "suspected voltage out of spec"
- Clean connections
Suspected bad motherboard:
- Check motherboard docs for memory module compatibility
- Move modules to different slots
- Clean connections
- Upgrade BIOS
- Select BIOS "fail-safe defaults" or equivalent, then change settings from there to isolate the cause
Suspected bad connection:
- Visually check connectors, pins, modules etc.
- Clean the connections: see HowToCleanEdgeConnectors
Suspected temp out of spec:
- Measure temp, compare to published specs:
  - Use internal machine sensors (motherboard, hard drive etc.) if possible
  - Use a temperature probe
- Check airflow
- De-dust
- Lower temp:
  - Lower room temp
  - Increase cooling
  - Improve airflow (tidy cables etc.)
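Reading the internal machine sensors can be sketched as follows. This assumes the board's sensor driver is loaded and exposes the Linux hwmon sysfs files (`/sys/class/hwmon/hwmon*/temp*_input`, in millidegrees Celsius); on some kernels the files sit in a `device/` subdirectory instead, which this sketch does not handle. The function names and the `limit_c` threshold are hypothetical; the limit should come from the published spec of the part in question.

```python
from pathlib import Path

# Assumed location of hardware-monitoring sensors in sysfs.
HWMON_ROOT = Path("/sys/class/hwmon")


def read_temperatures(root=HWMON_ROOT):
    """Return {"hwmonN/chip/tempM": degrees Celsius} for each sensor found."""
    temps = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            millideg = int(sensor.read_text().strip())
            name_file = sensor.parent / "name"
            chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        except (OSError, ValueError):
            continue  # sensor unreadable; skip it
        label = sensor.name[: -len("_input")]  # "temp1_input" -> "temp1"
        temps[f"{sensor.parent.name}/{chip}/{label}"] = millideg / 1000.0
    return temps


def over_limit(temps, limit_c):
    """Return only the readings above limit_c (taken from the part's spec)."""
    return {key: value for key, value in temps.items() if value > limit_c}


if __name__ == "__main__":
    for key, value in read_temperatures().items():
        print(f"{key}: {value:.1f} C")
```

Logging these readings alongside the error counts over time makes it easier to see whether the error rate rises with temperature.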
Suspected timings out of spec:
- Try different BIOS version
- Set pessimistic memory timings in BIOS
- Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project
- Try disabling "spread spectrum" in the BIOS (easy), or by using an i2c driver for your board's clock generator (hard)
Suspected voltage out of spec:
- Check PSU specs vs. total demand of system components
- Swap power supply with another machine
- Fit voltage regulator/spike suppressor to machine power supply
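The PSU check in the list above is simple arithmetic, sketched here. The function name, the 20% safety margin and all wattage figures are placeholders chosen for illustration, not measurements; substitute the rated figures from your own PSU and component datasheets. The sketch only compares total power, so a supply can still be overloaded on an individual rail even when the total looks fine.

```python
def power_headroom(psu_rated_watts, component_watts, margin=0.2):
    """Return spare watts after reserving `margin` of the PSU rating.

    A negative result suggests the supply is undersized for the load.
    """
    usable = psu_rated_watts * (1.0 - margin)
    return usable - sum(component_watts.values())


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    demand = {"cpu": 95, "dimms": 24, "disks": 40, "board_and_fans": 60}
    spare = power_headroom(450, demand)
    print(f"spare capacity: {spare:.0f} W")
```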
Suspected single event upsets:
- Fit less susceptible components
- Move to a lower altitude, or area with lower cosmic radiation
- Move your data centre underground
- Improve error-reporting utilities to ignore them
Suspected bad check-bit init:
- Upgrade BIOS
- Don't enable BIOS "quick boot"
- Don't manually skip BIOS memory check