To help understand why you are seeing memory errors, please have a look at HowMemoryEdacHardwareWorks.
You may well have been experiencing these errors for a while; nothing was checking for them until you enabled the EDAC module. If you are getting UEs (uncorrectable errors), your system is probably experiencing data corruption, so you should investigate promptly. This is why EDAC is set to `panic()` on UEs by default.
The reason that you are seeing problems is very likely to be one of:
- Your RAM is bad.
- Your Motherboard is bad.
- Your CPU is bad (for CPUs which have the memory controller built into the CPU core, such as the AMD Opteron/Athlon-64).
- The connection between your motherboard and your CPU or memory module is bad.
- Some of your hardware is being operated outside of its design specification, such as:
  - Things are being run too hot.
  - Timings are being violated (e.g. running memory too fast, or bad DRAM clock generation).
  - Supply voltages to the critical components are too high/low (this may even happen very briefly, as a supply "spike" or "droop").
- You have seen one or more "Single Event Upsets" - see SoftErrors.
- Memory ECC check bits are not properly initialised by the BIOS prior to Linux boot. See Uninitialized ECC bits.
- The EDAC module is buggy.
- The memory controller's loading limit is exceeded (too many ranks of memory for the configured speed).
- The power supply is insufficient.
So Which One Is It Then?
Good question. Time to try some things:
Here are the most likely symptoms.
|Problem||Error Addresses||Error Slot or Row||Error Frequency|
|Bad Memory Module(s)||Single/Few||Probably only 1||May vary if part is marginally out of spec|
|Bad Motherboard||Probably many||Maybe 1, maybe many||?|
|Bad Connection||Probably many||1 (bad mem), prob all (bad CPU)||?|
|Temp out of spec||Probably few||Maybe one, if different mem mfrs/parts||Usually higher, with higher temp|
|Timings out of spec||Probably few||Maybe one, if different mem mfrs/parts||Usually higher, with higher temp|
|Voltages out of spec||Probably random||Maybe one, if different mem mfrs/parts||Usually higher at higher system load|
|Bad BIOS check bit init||Probably random||Probably all||High/very high (stops after a while for systems with background scrub)|
|Single event upsets||Random||Varies with effective "cross-section" of part||Rare - more common with some part designs, and at high altitude etc.|
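To compare your system against the "Error Slot or Row" column, it helps to see how the error counts are spread across rows. The following is a minimal sketch, assuming your kernel's EDAC driver exposes the csrow-based sysfs layout (`mc*/csrow*/ce_count` and `ue_count` under `/sys/devices/system/edac/mc`); the exact layout depends on kernel version and driver, and the function names here are purely illustrative.

```python
from pathlib import Path

# Usual location of the EDAC sysfs tree on Linux (assumed; pass another root to test).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_count(path):
    """Return the integer stored in a sysfs counter file, or 0 if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def csrow_error_counts(root=EDAC_ROOT):
    """Map 'mcN/csrowM' to (ce_count, ue_count) for every csrow found under root."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC module not loaded, or a different sysfs layout
    for csrow in sorted(root.glob("mc*/csrow*")):
        key = f"{csrow.parent.name}/{csrow.name}"
        counts[key] = (read_count(csrow / "ce_count"),
                       read_count(csrow / "ue_count"))
    return counts


def rows_with_errors(counts):
    """Return the rows that have logged at least one CE or UE."""
    return [row for row, (ce, ue) in counts.items() if ce or ue]


if __name__ == "__main__":
    counts = csrow_error_counts()
    for row, (ce, ue) in counts.items():
        print(f"{row}: CE={ce} UE={ue}")
    affected = rows_with_errors(counts)
    print(f"{len(affected)} of {len(counts)} rows report errors: {affected}")
```

If only one row reports errors, the table above points towards a bad module or a bad connection to it; errors on every row point more towards the motherboard, CPU, BIOS check-bit initialisation, or something system-wide such as voltage.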
Things to try to isolate the problem
- Get a second opinion e.g. from [] or [] - note that you should be sure that either:
- The memory testing software knows how to disable ECC on your system, or
- You have disabled ECC before running memory tester (note that memtest86 currently displays "ECC: No" on chipsets which have ECC, but which it doesn't know about!).
- This may not catch problems like power-supply related problems, which don't occur when the memory tester is running.
- Use a system stress tester such as "burnbx" from [].
- Put your system under stress by (e.g.) running a parallelised Linux kernel build, whilst doing some heavy 3D graphics display, and a lot of disk I/O.
Suspected bad module:
- Remove Module.
- Move Module to different slot (do the errors move with the module?).
- Move Module to different machine.
- See "suspected temp out of spec".
- See "suspected timings out of spec".
- See "suspected voltage out of spec".
- Clean connections.
- Check Memory Loading
- Some memory controllers can only support so many 'ranks' of memory at a given speed.
For example, Opterons/Athlon64s can support only 4 ranks of 2 GB at PC3200. See http://www.valueram.com/memoryranks/default.asp for definitions.
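The memory loading check is simple arithmetic: add up the ranks on every installed module and compare the total with what the controller supports at the configured speed. Here is a small sketch of that check. The module list is hypothetical, and the limit of 4 is only the figure quoted above for Opterons/Athlon64s at PC3200; look up the real limit for your own controller.

```python
def total_ranks(modules):
    """Sum the rank counts of all installed modules.

    `modules` is a list of (slot_label, ranks) pairs.
    """
    return sum(ranks for _slot, ranks in modules)


def exceeds_rank_limit(modules, max_ranks):
    """True if the installed modules present more ranks than the controller supports."""
    return total_ranks(modules) > max_ranks


# Hypothetical system: three dual-rank modules.
installed = [("DIMM0", 2), ("DIMM1", 2), ("DIMM2", 2)]

# Limit taken from the Opteron/Athlon64 PC3200 example in the text (an assumption
# for any other platform).
MAX_RANKS_AT_PC3200 = 4

if exceeds_rank_limit(installed, MAX_RANKS_AT_PC3200):
    print(f"{total_ranks(installed)} ranks installed, limit is {MAX_RANKS_AT_PC3200}: "
          "remove a module or lower the memory speed")
```

The rank count of a module is normally given in its datasheet; it is not the same thing as the number of sides or the capacity.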
Suspected bad motherboard:
- Check motherboard docs for memory module compatibility.
- Move modules to different slots.
- Clean connections.
- Upgrade BIOS.
- Select BIOS "fail-safe defaults", or equivalent, then change settings from there to isolate the cause.
Suspected bad connection:
- Visually check connectors, pins, modules etc.
Suspected temp out of spec:
- Measure temp, compare to published specs:
  - Use internal machine sensors (motherboard, hard drive etc.) if possible.
  - Use a temperature probe or infra-red thermometer.
- Check airflow.
- Lower temp:
  - Lower room temp.
  - Increase cooling.
  - Improve airflow (tidy cables etc.).
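If your board's sensors have a Linux hwmon driver, the internal readings can be collected without extra tools. The sketch below assumes the standard hwmon sysfs interface, where each `temp*_input` file under `/sys/class/hwmon/hwmon*/` holds a value in millidegrees Celsius. The function names and the limit you compare against are illustrative; take the real limit from the published specs of the part in question.

```python
from pathlib import Path

# Standard hwmon sysfs location on Linux (pass another root to test).
HWMON_ROOT = Path("/sys/class/hwmon")


def read_temperatures(root=HWMON_ROOT):
    """Return {'hwmonN:chip/tempM': degrees Celsius} for every readable sensor."""
    root = Path(root)
    readings = {}
    if not root.is_dir():
        return readings  # no hwmon drivers loaded
    for sensor in sorted(root.glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor present but not readable right now
        try:
            chip = (sensor.parent / "name").read_text().strip()
        except OSError:
            chip = "unknown"
        label = sensor.name[: -len("_input")]  # e.g. 'temp1'
        readings[f"{sensor.parent.name}:{chip}/{label}"] = millidegrees / 1000.0
    return readings


def over_limit(readings, limit_c):
    """Return only the sensors reading above limit_c degrees Celsius."""
    return {name: temp for name, temp in readings.items() if temp > limit_c}


if __name__ == "__main__":
    for name, temp in read_temperatures().items():
        print(f"{name}: {temp:.1f} C")
```

Logging these readings alongside the EDAC error counts over time is one way to see whether the error frequency rises with temperature.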
Suspected timings out of spec:
- Try different BIOS version.
- Set pessimistic memory timings in BIOS.
- Compare memory controller timings to DIMM specs, using decode-dimms.pl from the Linux i2c project.
- Try disabling "spread spectrum" in the BIOS (easy if available), or by using an i2c driver for your board's clock generator (hard).
Suspected voltage out of spec:
- Check PSU specs vs. total demand of system components.
- Swap power supply with another machine.
- Fit voltage regulator/spike suppressor to machine power supply.
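Checking PSU specs against total demand comes down to summing the current each component draws on each rail and comparing that with the rating on the PSU label. A rough sketch follows. All component figures and PSU limits are made up for illustration, and the 80% headroom margin is an assumption rather than a rule; use the numbers from your own datasheets.

```python
def rail_demand(components):
    """Sum the current draw (amps) per supply rail.

    `components` is a list of (name, rail, amps) tuples.
    """
    totals = {}
    for _name, rail, amps in components:
        totals[rail] = totals.get(rail, 0.0) + amps
    return totals


def overloaded_rails(components, psu_limits, margin=0.8):
    """Return {rail: (demand, limit)} where demand exceeds margin * limit.

    margin=0.8 leaves 20% headroom (an assumed safety factor).
    Rails missing from psu_limits are treated as having a limit of 0.
    """
    result = {}
    for rail, amps in rail_demand(components).items():
        limit = psu_limits.get(rail, 0.0)
        if amps > margin * limit:
            result[rail] = (amps, limit)
    return result


# Hypothetical system and PSU label values.
components = [
    ("CPU", "+12V", 10.0),
    ("graphics card", "+12V", 6.0),
    ("disks", "+5V", 4.0),
]
psu_limits = {"+12V": 18.0, "+5V": 20.0}

for rail, (amps, limit) in overloaded_rails(components, psu_limits).items():
    print(f"{rail}: demand {amps} A against a rating of {limit} A")
```

Datasheet figures are usually steady-state values, so short spikes or droops under load can still occur even when the sums look fine.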
Suspected single event upsets:
- Fit less susceptible components.
- Move to a lower altitude, or area with lower cosmic radiation.
- Move your data centre underground.
- Improve error-reporting utilities to ignore them.
Suspected bad check-bit init:
- Upgrade BIOS.
- Don't enable BIOS "quick boot".
- Don't manually skip BIOS memory check.
Suspected insufficient power supply:
- Try detaching some devices that are hardly used. Start with USB devices.
- If the problems stop, either permanently reduce the number of devices, or get a higher capacity power supply.
- Use a DC current clamp (preferably one with a peak/inrush measurement function) to check for over-capacity at a particular voltage.
- This is closely related to voltage out of spec, which can also be caused by a faulty supply.